
Updated 2 years ago


Is anyone here using Pinecone? I'm a little stuck. I've been using the local JSON option and am trying to shift to Pinecone for a few reasons. So I have two versions of a script working and am trying to merge them: one using gpt_index with SimpleDirectoryReader and SimpleVectorIndex, and the other manually upserting into Pinecone. I looked at the gpt_index Pinecone example and don't understand how to set up the id_to_text_map 🤔
6 comments
Python
import os
import re
import uuid
from datetime import datetime

import openai

# Assumes `cs` (a settings module), `regex` (a pattern of text to strip),
# and `index` (a pinecone.Index) are defined elsewhere in the script.

def get_recent_files(dir_path, after_date):
    # Convert the input string to a datetime object
    after_date = datetime.strptime(after_date, '%Y-%m-%d')
    # Walk all files in the directory tree
    for root, dirs, files in os.walk(dir_path):
        for file in files:
            file_path = os.path.join(root, file)
            # Only process Markdown files
            if not file.lower().endswith(".md"):
                continue
            # Get the modification time of the file
            mod_time = datetime.fromtimestamp(os.path.getmtime(file_path))
            # Check if the file was modified after the given date
            if mod_time <= after_date:
                continue
            with open(file_path, encoding="utf8") as f:
                text = f.read()
            text = re.sub(regex, '', text, 0, re.DOTALL)
            # Drop blank lines
            text = os.linesep.join([s for s in text.splitlines() if s])
            print("\nProcessing " + file + " ...\n")
            for chunk in text.splitlines():
                # Skip empty and very short chunks
                if len(chunk) <= 20:
                    continue
                vector_id = str(uuid.uuid4())
                meta = [{'text': chunk, 'filename': file}]
                print("Creating and indexing embed with id " + vector_id)
                res = openai.Embedding.create(input=chunk, engine=cs.embedding_model)
                embeds = [record['embedding'] for record in res['data']]
                to_upsert = zip([vector_id], embeds, meta)
                index.upsert(vectors=list(to_upsert))

get_recent_files(cs.directory, cs.date_modified)
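Since the upsert script generates a fresh UUID for every chunk, one way to get the id_to_text_map that gpt_index's PineconeReader expects is to record the mapping while upserting. Here is a minimal sketch of that idea; `embed_fn` is a placeholder for whatever embedding call you use (e.g. `openai.Embedding.create`), not a real API:

```python
import uuid

def build_upsert_batch(chunks, embed_fn, min_len=20):
    """Embed each chunk and return (vectors, id_to_text_map).

    embed_fn is a stand-in for a real embedding call; it should
    map a string to a list of floats.
    """
    vectors = []
    id_to_text_map = {}
    for chunk in chunks:
        if len(chunk) <= min_len:
            continue  # skip trivially short chunks, as the original script does
        vector_id = str(uuid.uuid4())
        vectors.append((vector_id, embed_fn(chunk), {'text': chunk}))
        id_to_text_map[vector_id] = chunk  # the map PineconeReader needs later
    return vectors, id_to_text_map

# Example with a toy embedding function (NOT a real model):
fake_embed = lambda text: [float(len(text)), 0.0]
vectors, id_map = build_upsert_batch(
    ["a short chunk that is long enough to keep", "tiny"], fake_embed)
```

Persisting id_to_text_map (for example as JSON next to the index) means a later query process can still resolve Pinecone IDs back to their text.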
Python
# Assumes gpt_index is installed and that the Tkinter widgets
# (q_entry, n_entry, run_embed) are defined elsewhere in the script.
from gpt_index import SimpleDirectoryReader, GPTSimpleVectorIndex

def my_function():

    directory = 'Test MD Files'

    # Retrieve settings from user input
    question = q_entry.get()
    number_of_nodes = n_entry.get()
    run_embedding = run_embed.get()

    # Load directory of Markdown files
    documents = SimpleDirectoryReader(directory, recursive=True, required_exts=[".md"]).load_data(concatenate=False)

    # Re-embed and save the index only if the user asked for it
    if run_embedding == 1:
        index = GPTSimpleVectorIndex(documents)
        index.save_to_disk('index_test.json')

    index = GPTSimpleVectorIndex.load_from_disk('index_test.json')

    response = index.query(question, response_mode="compact", similarity_top_k=int(number_of_nodes))

    print(response)
    # print(response.get_formatted_sources())
@arminta7 oh the id_to_text_map is just a map from your Pinecone ID to the underlying text. Here's an example screenshot of how to use the PineconeReader on some test data! Notice how my IDs are "A",...,"E", and the id_to_text_map just contains some simple text.

You can specify the query_vector in reader.load_data; you would then feed these documents into a GPT Index data structure.
Attachment: Screen_Shot_2023-01-05_at_1.47.01_PM.png
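In the spirit of the screenshot, a rough sketch of the id_to_text_map shape with IDs "A" through "E". The strings are placeholders, and the commented reader calls below the dict (index name, environment, load_data arguments) are assumptions about the gpt_index PineconeReader of that era, so check them against your installed version:

```python
# id_to_text_map: Pinecone vector ID -> the text embedded under that ID
id_to_text_map = {
    "A": "some simple text for id A",
    "B": "some simple text for id B",
    "C": "some simple text for id C",
    "D": "some simple text for id D",
    "E": "some simple text for id E",
}

# Hypothetical reader usage (names and arguments assumed, not verified):
# reader = PineconeReader(api_key="...", environment="...")
# documents = reader.load_data(
#     index_name="...",
#     id_to_text_map=id_to_text_map,
#     vector=query_vector,   # the embedding to search with
#     top_k=3,
# )
# Then feed `documents` into a GPT Index data structure, as noted above.
```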
@jerryjliu0 Is there something like SimpleVectorIndex to upload to Pinecone?
@arminta7 ah not yet (we have one for Weaviate and one for Faiss, but I can add one for Pinecone too)
I may just switch to Weaviate if there already is one. Plus it's open source...