
Updated 2 years ago


Is anyone here using Pinecone? I'm a little stuck. I've been using the local JSON option and am trying to shift to Pinecone for a few reasons. So I have two versions of a script working and am trying to merge them: one using gpt_index with SimpleDirectoryReader and SimpleVectorIndex, and the other manually upserting into Pinecone. I looked at the gpt_index Pinecone example and don't understand how to set up the id_to_text_map 🤔
6 comments
Python
import os
import re
import uuid
from datetime import datetime

import openai

# Assumes `cs` (a settings module), `regex` (a pattern of text to strip),
# and `index` (a pinecone.Index) are defined elsewhere in the script.

def get_recent_files(dir_path, after_date):
    # Convert the input string to a datetime object
    after_date = datetime.strptime(after_date, '%Y-%m-%d')
    # Walk all files in the directory tree
    for root, dirs, files in os.walk(dir_path):
        for file in files:
            file_path = os.path.join(root, file)
            # Only process Markdown files
            if not file.lower().endswith(".md"):
                continue
            # Get the modification time of the file
            mod_time = datetime.fromtimestamp(os.path.getmtime(file_path))
            # Check if the file was modified after the given date
            if mod_time <= after_date:
                continue
            with open(file_path, encoding="utf8") as f:
                text = f.read()
            text = re.sub(regex, '', text, 0, re.DOTALL)
            # Drop blank lines
            text = os.linesep.join([s for s in text.splitlines() if s])
            print("\nProcessing " + file + " ...\n")
            for chunk in text.splitlines():
                # Skip empty and very short chunks
                if len(chunk) <= 20:
                    continue
                vector_id = str(uuid.uuid4())
                meta = [{'text': chunk, 'filename': file}]
                print("Creating and indexing embed with id " + vector_id)
                res = openai.Embedding.create(input=chunk, engine=cs.embedding_model)
                embeds = [record['embedding'] for record in res['data']]
                to_upsert = zip([vector_id], embeds, meta)
                index.upsert(vectors=list(to_upsert))

get_recent_files(cs.directory, cs.date_modified)
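Since the upsert script generates a fresh UUID for every chunk, one way to get the id_to_text_map that gpt_index's PineconeReader expects is to record the mapping while upserting. Here is a minimal sketch of that idea; `embed_fn` is a placeholder for whatever embedding call you use (e.g. `openai.Embedding.create`), not a real API:

```python
import uuid

def build_upsert_batch(chunks, embed_fn, min_len=20):
    """Embed each chunk and return (vectors, id_to_text_map).

    embed_fn is a stand-in for a real embedding call; it should
    map a string to a list of floats.
    """
    vectors = []
    id_to_text_map = {}
    for chunk in chunks:
        if len(chunk) <= min_len:
            continue  # skip trivially short chunks, as the original script does
        vector_id = str(uuid.uuid4())
        vectors.append((vector_id, embed_fn(chunk), {'text': chunk}))
        id_to_text_map[vector_id] = chunk  # the map PineconeReader needs later
    return vectors, id_to_text_map

# Example with a toy embedding function (NOT a real model):
fake_embed = lambda text: [float(len(text)), 0.0]
vectors, id_map = build_upsert_batch(
    ["a short chunk that is long enough to keep", "tiny"], fake_embed)
```

Persisting id_to_text_map (for example as JSON next to the index) means a later query process can still resolve Pinecone IDs back to their text.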
Python
# Assumes gpt_index is installed and that the Tkinter widgets
# (q_entry, n_entry, run_embed) are defined elsewhere in the script.
from gpt_index import SimpleDirectoryReader, GPTSimpleVectorIndex

def my_function():

    directory = 'Test MD Files'

    # Retrieve settings from user input
    question = q_entry.get()
    number_of_nodes = n_entry.get()
    run_embedding = run_embed.get()

    # Load directory of Markdown files
    documents = SimpleDirectoryReader(directory, recursive=True, required_exts=[".md"]).load_data(concatenate=False)

    # Re-embed and save the index only if the user asked for it
    if run_embedding == 1:
        index = GPTSimpleVectorIndex(documents)
        index.save_to_disk('index_test.json')

    index = GPTSimpleVectorIndex.load_from_disk('index_test.json')

    response = index.query(question, response_mode="compact", similarity_top_k=int(number_of_nodes))

    print(response)
    # print(response.get_formatted_sources())
@arminta7 oh the id_to_text_map is just a map from your Pinecone ID to the underlying text. Here's an example screenshot of how to use the PineconeReader on some test data! Notice how my IDs are "A",...,"E", and the id_to_text_map just contains some simple text.

You can specify the query_vector in reader.load_data; you would then feed these documents into a GPT Index data structure.
Attachment: Screen_Shot_2023-01-05_at_1.47.01_PM.png
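In the spirit of the screenshot, a rough sketch of the id_to_text_map shape with IDs "A" through "E". The strings are placeholders, and the commented reader calls below the dict (index name, environment, load_data arguments) are assumptions about the gpt_index PineconeReader of that era, so check them against your installed version:

```python
# id_to_text_map: Pinecone vector ID -> the text embedded under that ID
id_to_text_map = {
    "A": "some simple text for id A",
    "B": "some simple text for id B",
    "C": "some simple text for id C",
    "D": "some simple text for id D",
    "E": "some simple text for id E",
}

# Hypothetical reader usage (names and arguments assumed, not verified):
# reader = PineconeReader(api_key="...", environment="...")
# documents = reader.load_data(
#     index_name="...",
#     id_to_text_map=id_to_text_map,
#     vector=query_vector,   # the embedding to search with
#     top_k=3,
# )
# Then feed `documents` into a GPT Index data structure, as noted above.
```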
@jerryjliu0 Is there something like SimpleVectorIndex to upload to Pinecone?
@arminta7 ah not yet (we have one for Weaviate and one for Faiss, but I can add one for Pinecone too)
I may just switch to Weaviate if there already is one. Plus it's open source...