The simple index is organized as nested dictionaries / JSON. You can access index.index_struct.nodes_dict, where you have your text chunks as an id: Node dictionary, and each Node has node_info, where start tells you the starting location in the document. Look for zero start indices.
yep! another thing you can do is just create a separate GPTSimpleVectorIndex per document if you want more granularity in the index
Basically I'm using SimpleDirectoryReader to read a folder of txt files, and then pass it to GPTSimpleVectorIndex to create the index
Is that what you meant @jerryjliu0 by creating a separate index per document?
to clarify, for your use case, you want to get the first text chunk for every source document right? and out of curiosity, what's the use case for this?
why send the first chunk of text from the document, as opposed to the most relevant chunk of text?
Because as a first step I want to tell the type of the document, and for this it's enough to only send the first chunk of each document. Later on I want the ability to ask questions about every part; that's why I'm indexing everything regardless.
Ah got it. For the first part you could try defining a GPTListIndex over each document, which stores nodes/text chunks in a sequential list. Then just fetch the first node.
list_index = GPTListIndex(documents)
list_index.index_struct.nodes[0]
we should offer other abstractions (e.g. composability) that should help you with your use case, but as a start you can try using this
also, try setting chunk_size_limit to a smaller value (e.g. 512) when building the index
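e.g. something like this (assuming chunk_size_limit is passed straight to the constructor, as in the versions from around this thread):

list_index = GPTListIndex(documents, chunk_size_limit=512)  # smaller chunks -> more granular nodes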
What happens in the next part where I want to ask questions on every part of the documents? Will I need to re-index the documents using the SimpleVectorIndex, or can I use the same ListIndex for this task as well?
you could use ListIndex for both, but you can probably get better results by using the list index for the first step and the simple vector index for the second (yes, it will reindex). indexing with the list index is fast and doesn't cost any $$ though
it's effectively just a text chunking operation
Isn't there an easy way to get the first chunk from each source document using the SimpleVectorIndex (i.e. based on Mikko's suggestion above)?
Because I just need the raw text, I'm not querying anything in the first part
yes that's also possible - but since we just use a dict under the hood you will have to traverse every node to find the first node of each document with start_idx=0.
up to you as to which option you choose
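if you go that route, the traversal could look roughly like this (a sketch; I'm assuming node_info carries the start offset as described earlier, and that ref_doc_id links a node back to its source document; names may vary by version):

first_chunks = {}
for node_id, node in index.index_struct.nodes_dict.items():
    # keep the chunk that starts at offset 0 of its source document
    if node.node_info.get("start") == 0:
        first_chunks[node.ref_doc_id] = node.get_text()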
I see, so it's less efficient when dealing with big data
yeah, using gpt list index during construction is basically just a text splitter
question: if you want to tell the type of document, and you mentioned you're not doing any querying, then why not do this outside of gpt index?
I want to ask GPT to tell me the type based on the first chunk
The type can be derived from the first chunk
As for using ListIndex, can I use the text splitter once for both the ListIndex and the SimpleVectorIndex, so I don't have to split the text twice when creating the two indices?
atm nope :/
Though for that use case you could manually split the text yourself and use the list index to query for the type
from gpt_index import Document, GPTListIndex

index = GPTListIndex([Document(document.text[:1000])])
index.query("What is the type of this document?")
I deleted my previous question, as I understand that both ListIndex and SimpleVectorIndex use an LLM to get the final answer.
I guess I have some gaps in my understanding of the difference between ListIndex and SimpleVectorIndex. I understand that one is a list and the other a dict, but in terms of efficiency, all keys in the dict have to be compared against, right? So overall it seems to me that they both have the same quality of results and the same efficiency, but I'm probably missing something here
When you query against a vector index, two models are used: an embedding model to embed your query, and an LLM to synthesize an answer. The vector index embeddings are used to get the similarity_top_k best-matching chunks, which are sent to the LLM as context. The list index doesn't do any similarity search and uses all your text chunks to synthesize an answer. This may actually make more calls to the LLM than a vector index.
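For example (a sketch with the query API from this thread; the question string is just a placeholder):

from gpt_index import GPTSimpleVectorIndex

index = GPTSimpleVectorIndex(documents)
# embeds the query, retrieves the 3 most similar chunks,
# then calls the LLM to synthesize an answer from them
response = index.query("What is this document about?", similarity_top_k=3)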
Hi @Mikko, circling back on your last answer, can you please elaborate on how querying a ListIndex actually works? Since it doesn't use any embeddings, will it run the LLM on my query for every chunk? (Unless I require specific keywords to appear in the chunk)
that's basically how that works
Thanks @jerryjliu0 !
One more question: how does it work when I query a ListIndex built on top of SimpleVector indices? Will it run the LLM for every summarization? How does it know which underlying vector index to query?
yeah so if you build a list index on top of simple vector indices, each node in the list will correspond to a simple vector index. so a query will go through every node in the list, but instead of just using the "text" in the node, the query will actually go down to the corresponding simple vector index and query it. you'll end up querying all n simple vector indices. does that make sense?
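roughly like this, if it helps (a sketch from memory of the composability API around this version; set_text and mode="recursive" are assumptions that may differ in your version, and you may need query_configs to tune per-index query params):

from gpt_index import GPTListIndex, GPTSimpleVectorIndex

# one vector index per source document
doc_indices = []
for i, doc in enumerate(documents):
    sub_index = GPTSimpleVectorIndex([doc])
    # each sub-index needs summary text to serve as its node in the list
    sub_index.set_text(f"Document {i}")
    doc_indices.append(sub_index)

list_index = GPTListIndex(doc_indices)
# mode="recursive" makes the query descend into each underlying vector index
response = list_index.query("my question", mode="recursive")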
So basically the LLM will run for each top_k result, meaning (number of SimpleVector indices) * top_k times? Is there a way to only get the top_k results without running the LLM?
yep you can set response_mode="no_text" in the query parameters
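e.g. something like this (a sketch; I'm assuming the retrieved chunks come back on response.source_nodes, per the API around this version):

# embeds the query and retrieves the top chunks,
# but skips the answer-synthesis LLM call
response = index.query("my question", similarity_top_k=3, response_mode="no_text")
for source_node in response.source_nodes:
    print(source_node.source_text)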
when using response_mode="no_text", will no LLM be called at all?
For some reason I cannot find any documentation for this mode ("no text")
Well the query needs to be embedded, but it won't call an LLM to synthesize an answer
@yoelk it's response_mode="no_text"
my bad for not including it in the docs!
that's a good catch, i'll add it in