Missing nodes?

Maybe some newbie questions, but I'm having issues with the accuracy of the responses when querying my generated index. I've narrowed it down to what looks like the query only checking the first node of the index.

In my case the first node mentions that there are 22 players in a team and names the first player, and subsequent nodes contain information about each of the other players. But the query response is only aware of context from the first node: it can't name all the players (that are in subsequent nodes).

Anything I'm doing incorrectly?

Plain Text
# Imports assume the era's llama_index (gpt-index) package
from llama_index import (
    GPTSimpleVectorIndex,
    PromptHelper,
    SimpleDirectoryReader,
)

# Define prompt_helper and settings
max_input_size = 4096
num_outputs = 1
max_chunk_overlap = 20
embedding_limit = 10000
chunk_size_limit = 120
prompt_helper = PromptHelper(
    max_input_size, num_outputs, max_chunk_overlap,
    embedding_limit, chunk_size_limit)

# Load the data to build the index (no model training happens here)
directory_path = './data'
documents = SimpleDirectoryReader(directory_path).load_data()

# Create the index from the data
# (llm_predictor is assumed to be defined earlier, e.g. an LLMPredictor)
index = GPTSimpleVectorIndex(
    documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
index.save_to_disk('index.json')

Plain Text
# Flask imports assumed; predict() is presumably registered as a route elsewhere
from flask import jsonify, request

# Take the user's query and generate a response from the saved index
def predict():
    query = request.json['query']
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    response = index.query(
        'For the Bishops Stortford team ' + query +
        '. Explain how you arrived at that response',
        mode="default", response_mode="default")
    return jsonify({'response': response.response})
Attachments: Screenshot_2023-03-02_at_12.30.01.png, Screenshot_2023-03-02_at_12.28.44.png
27 comments
The vector index only returns the top 1 node, based on comparing the query embedding to the node embeddings

Try setting similarity_top_k=2 or larger in your query call

This will send the top 2 nodes to the model (which will cost a bit more, btw)
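Something like this, for example (a minimal sketch reusing the index from the snippets above; the query string is just illustrative):

Plain Text
# Sketch: retrieve more than one node per query so the answer can
# draw on context beyond the single best-matching node
index = GPTSimpleVectorIndex.load_from_disk('index.json')
response = index.query(
    'Name all the players in the team',  # illustrative query
    similarity_top_k=2)  # send the top 2 nodes instead of the default 1
print(response.response)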
Thanks @Logan M, I guess the other option is to make the nodes larger? With chunk_size?
Might have to try a few options and see what you like 💪
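For instance, a larger chunk_size_limit when building the index (a sketch reusing the PromptHelper settings from the original snippet; 1024 is just an illustrative value):

Plain Text
# Sketch: a larger chunk size packs more related facts into each node
chunk_size_limit = 1024  # the original snippet used 120
prompt_helper = PromptHelper(
    max_input_size, num_outputs, max_chunk_overlap,
    embedding_limit, chunk_size_limit)
index = GPTSimpleVectorIndex(
    documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)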
Am I missing some fundamental understanding of the index? That a query doesn't iterate all nodes?
Correct, the vector index only looks at the top_k matching nodes using vector similarity

A list index will always go through every node (which is good for summarization)
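For illustration (a sketch assuming the same documents variable and the era's llama_index API):

Plain Text
# Sketch: a list index visits every node at query time, so facts
# spread across many nodes are all available to the model
from llama_index import GPTListIndex

list_index = GPTListIndex(documents)
response = list_index.query('Name all 22 players in the team')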
Interesting... so I could change from vector to GPTListIndex and see the results?
@hamish with GPTListIndex or GPTSimpleVectorIndex, use mode = "embedding" and similarity_top_k = 2 (or more) when querying to get a better answer.
response = index.query('<your-query>', mode="embedding", similarity_top_k=2)
You could alternatively try just using GPTListIndex which will iterate over all the available chunks.
Thanks, I tried GPTListIndex but was getting a doc type error: ValueError: doc_type list not found in type_to_struct. Make sure that it was registered in the index registry
Will try playing with the settings for GPTSimpleVectorIndex in the meantime
Interesting. What kind of documents do you have in the folder?
There is one .txt
Did you try to load the old index json with the list type? You'll need to build a new index json for each index type
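Something like this (a sketch; the list_index.json filename is just an example, kept separate from the vector index's file):

Plain Text
# Sketch: each index type needs its own saved JSON; loading the old
# vector-index file via GPTListIndex raises the doc_type error above
list_index = GPTListIndex(documents)
list_index.save_to_disk('list_index.json')  # not the vector index.json
list_index = GPTListIndex.load_from_disk('list_index.json')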
Yea, I regenerate the index on each save, so the index was regenerated after changing from the Vector to the List index
Totally unrelated, but any tips on how to make the model better at comparing numbers? It doesn't seem great at understanding how to order numbers asc/desc. The data is being returned correctly for each player, but not in the correct order, and not the highest.
Attachments: Screenshot_2023-03-02_at_15.31.30.png, Screenshot_2023-03-02_at_15.28.00.png
@hamish ohh, this is structured data. How about looking into this and seeing if it helps: https://gpt-index.readthedocs.io/en/latest/guides/sql_guide.html
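Roughly along the lines of that guide (a sketch; the database, table name, and query are hypothetical, not from this thread):

Plain Text
# Sketch: keep the structured player stats in SQL so ordering and
# comparisons are done by the database rather than the LLM
from sqlalchemy import create_engine
from llama_index import GPTSQLStructStoreIndex, SQLDatabase

engine = create_engine('sqlite:///players.db')  # hypothetical DB
sql_database = SQLDatabase(engine, include_tables=['player_stats'])
index = GPTSQLStructStoreIndex(
    [], sql_database=sql_database, table_name='player_stats')
response = index.query('Which player has the highest score?')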
Sorry, that screenshot is from Sheets, but I load the data in a more conversational form in a .txt
Even in a .txt it should probably be comma-separated values, I guess?
Loaded like this
Attachment: Screenshot_2023-03-02_at_15.39.07.png
Okay. I guess for this to be answered you need to iterate over all the chunks?
or are you using similarity_top_k ?
Yea, have tried a few numbers for similarity_top_k. I've set my chunk size so that all the context is in one node of the index
Attachments: Screenshot_2023-03-02_at_15.48.17.png, Screenshot_2023-03-02_at_15.45.52.png
Let me understand it correctly: did you include all the context in one chunk? And how many words/tokens are in that single chunk?
Yea, I had to include it all in one chunk after I realised that the vector index doesn't iterate all nodes. 3100 max chunk size, max input size 4096.
Yeah, that's the reason it is better to use GPTListIndex for your use case.