Missing nodes?

Maybe some newbie questions, but I'm having issues with the accuracy of the responses when querying my generated index. I've narrowed it down to what looks like the query only checking the first node of the index.

In my case the first node mentions that there are 22 players in a team and names the first player, and subsequent nodes contain information about each of the other players. But the query response is only aware of context from the first node: it can't name all the players (that are in subsequent nodes).

Anything I'm doing incorrectly?

Plain Text
# Imports assume the era's llama_index (gpt-index) package
from llama_index import (
    GPTSimpleVectorIndex,
    PromptHelper,
    SimpleDirectoryReader,
)

# Define prompt_helper and settings
max_input_size = 4096
num_outputs = 1
max_chunk_overlap = 20
embedding_limit = 10000
chunk_size_limit = 120
prompt_helper = PromptHelper(
    max_input_size, num_outputs, max_chunk_overlap,
    embedding_limit, chunk_size_limit)

# Load the data to build the index (no model training happens here)
directory_path = './data'
documents = SimpleDirectoryReader(directory_path).load_data()

# Create the index from the data
# (llm_predictor is assumed to be defined earlier, e.g. an LLMPredictor)
index = GPTSimpleVectorIndex(
    documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
index.save_to_disk('index.json')

Plain Text
# Flask imports assumed; predict() is presumably registered as a route elsewhere
from flask import jsonify, request

# Take the user's query and generate a response from the saved index
def predict():
    query = request.json['query']
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    response = index.query(
        'For the Bishops Stortford team ' + query +
        '. Explain how you arrived at that response',
        mode="default", response_mode="default")
    return jsonify({'response': response.response})
Attachments: Screenshot_2023-03-02_at_12.30.01.png, Screenshot_2023-03-02_at_12.28.44.png
27 comments
The vector index only returns the top 1 node, based on comparing the query embedding to the node embeddings

Try setting similarity_top_k=2 or larger in your query call

This will send the top 2 nodes to the model (which will cost a bit more, btw)
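Something like this, for example (a minimal sketch reusing the index from the snippets above; the query string is just illustrative):

Plain Text
# Sketch: retrieve more than one node per query so the answer can
# draw on context beyond the single best-matching node
index = GPTSimpleVectorIndex.load_from_disk('index.json')
response = index.query(
    'Name all the players in the team',  # illustrative query
    similarity_top_k=2)  # send the top 2 nodes instead of the default 1
print(response.response)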
Thanks @Logan M, I guess the other option is to make the nodes larger? With chunk_size?
Might have to try a few options and see what you like 💪
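For instance, a larger chunk_size_limit when building the index (a sketch reusing the PromptHelper settings from the original snippet; 1024 is just an illustrative value):

Plain Text
# Sketch: a larger chunk size packs more related facts into each node
chunk_size_limit = 1024  # the original snippet used 120
prompt_helper = PromptHelper(
    max_input_size, num_outputs, max_chunk_overlap,
    embedding_limit, chunk_size_limit)
index = GPTSimpleVectorIndex(
    documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)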
Am I missing some fundamental understanding of the index? That a query doesn't iterate all nodes?
Correct, the vector index only looks at the top_k matching nodes using vector similarity

A list index will always go through every node (which is good for summarization)
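For illustration (a sketch assuming the same documents variable and the era's llama_index API):

Plain Text
# Sketch: a list index visits every node at query time, so facts
# spread across many nodes are all available to the model
from llama_index import GPTListIndex

list_index = GPTListIndex(documents)
response = list_index.query('Name all 22 players in the team')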
Interesting... so I could change from vector to GPTListIndex and see the results?
@hamish with GPTListIndex or GPTSimpleVectorIndex, use mode = "embedding" and similarity_top_k = 2 (or more) when querying to get a better answer.
response = index.query('<your-query>', mode="embedding", similarity_top_k=2)
You could alternatively try just using GPTListIndex which will iterate over all the available chunks.
Thanks, I tried GPTListIndex but was getting a doc type error: ValueError: doc_type list not found in type_to_struct. Make sure that it was registered in the index registry
Will try playing with the settings for GPTSimpleVectorIndex in the meantime
Interesting. What kind of documents do you have in the folder?
There is one .txt
Did you try to load the old index json with the list type? You'll need to build a new index json for each index type
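Something like this (a sketch; the list_index.json filename is just an example, kept separate from the vector index's file):

Plain Text
# Sketch: each index type needs its own saved JSON; loading the old
# vector-index file via GPTListIndex raises the doc_type error above
list_index = GPTListIndex(documents)
list_index.save_to_disk('list_index.json')  # not the vector index.json
list_index = GPTListIndex.load_from_disk('list_index.json')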
Yea, I regenerate the index on each save, so the index was regenerated after changing from the Vector to the List index
Totally unrelated, but any tips on how to make the model better at comparing numbers? It doesn't seem great at understanding how to order numbers asc/desc. The data is being returned correctly for each player, but not in the correct order, and not the highest.
Attachments: Screenshot_2023-03-02_at_15.31.30.png, Screenshot_2023-03-02_at_15.28.00.png
@hamish ohh, this is structured data. How about looking into this and seeing if it helps: https://gpt-index.readthedocs.io/en/latest/guides/sql_guide.html
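Roughly along the lines of that guide (a sketch; the database, table name, and query are hypothetical, not from this thread):

Plain Text
# Sketch: keep the structured player stats in SQL so ordering and
# comparisons are done by the database rather than the LLM
from sqlalchemy import create_engine
from llama_index import GPTSQLStructStoreIndex, SQLDatabase

engine = create_engine('sqlite:///players.db')  # hypothetical DB
sql_database = SQLDatabase(engine, include_tables=['player_stats'])
index = GPTSQLStructStoreIndex(
    [], sql_database=sql_database, table_name='player_stats')
response = index.query('Which player has the highest score?')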
Sorry, that screenshot is from Sheets, but I load the data in a more conversational form in a .txt
Even in a .txt it should probably be comma-separated values, I guess?
Loaded like this
Attachment: Screenshot_2023-03-02_at_15.39.07.png
Okay. I guess for this to be answered you need to iterate over all the chunks?
or are you using similarity_top_k ?
Yea, have tried a few numbers for similarity_top_k. I've set my chunk size so that all the context is in one node of the index
Attachments: Screenshot_2023-03-02_at_15.48.17.png, Screenshot_2023-03-02_at_15.45.52.png
Let me understand it correctly: did you include all the context in one chunk? And how many words/tokens are in that single chunk?
Yea, I had to include it all in one chunk after I realised that the vector index doesn't iterate all nodes. 3100 max chunk size, max input size 4096.
Yeah, that's the reason it is better to use GPTListIndex for your use case.