Which example are you looking at?
You can use the SimpleDirectoryReader to load your text from any directory, and put that into an index.
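A minimal sketch of that (this assumes the pre-0.6 `llama_index` API and an `OPENAI_API_KEY` in your environment; the "data" folder name is just an example):

```python
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Load every file in ./data into Document objects
documents = SimpleDirectoryReader("data").load_data()

# Build a vector index over those documents
index = GPTSimpleVectorIndex.from_documents(documents)
```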
Or are you trying to do something different?
So if I just replace the paul_graham_essay.txt in the "data" folder it should read that instead?
My data appears to be too long, I think, since some of the stuff halfway through the file doesn't seem to be searchable. Can I split it into two files in the data folder? :ThisIsFine:
LlamaIndex should be handling any splitting under the hood.
Can you give an example of a query that isn't working well? There are some settings that can be adjusted 💪
(If your document has clear sections though, splitting beforehand can help)
Weird, now it's not really working at all. It was kind of working before, but wasn't providing results past last names starting with L; now it won't provide data past the first couple of names. Here is the data:
https://share.getcloudapp.com/NQuWj4Z8 (I copy/pasted it into a .txt file instead, but the format is the same), and here is the interface you can query
I didn't change anything, so I don't understand why all of a sudden it won't return any results except Alice Ahmad
It seems to work for names that are at the bottom of this list
Yah I just realized it's returning some names but not others.
Do you think I should just go take all this data and format it better?
For example, if I ask who Desmond M. Balakrishnan is, it doesn't know; I guess for the same reason, when I ask who works on Capital Markets, his name isn't included.
Yeah, there could be a better input format 🤔 If you are using a vector index, you can also set how many nodes are retrieved
By default, it fetches the top 1 closest matching node according to vector similarity
You can increase this by doing something like `index.query(..., similarity_top_k=3, response_mode="compact")`
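For example (again assuming the pre-0.6 `llama_index` API with `index.query(...)`, an `OPENAI_API_KEY` in your environment, and your .txt files in ./data):

```python
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex.from_documents(documents)

# Fetch the 3 closest nodes instead of the default 1, and combine
# them into as few LLM calls as possible ("compact" response mode)
response = index.query(
    "Who works on Capital Markets?",
    similarity_top_k=3,
    response_mode="compact",
)
print(response)
```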
You can also play around with the chunk size using the service context object. By default, documents are split into overlapping chunks of 3900 tokens; sometimes a small chunk size + larger top-k works well
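Something like this (a sketch against the pre-0.6 `llama_index` API; the 512-token chunk size and top-k of 5 are just illustrative values to tune):

```python
from llama_index import GPTSimpleVectorIndex, ServiceContext, SimpleDirectoryReader

# Shrink the chunk size from the ~3900-token default to 512 tokens
service_context = ServiceContext.from_defaults(chunk_size_limit=512)

documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex.from_documents(
    documents, service_context=service_context
)

# Smaller chunks pair well with a larger top-k
response = index.query("Who is Desmond M. Balakrishnan?", similarity_top_k=5)
print(response)
```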
But going back to the document format, it's even a little confusing for me to read haha
Yeah, it's just a copy paste from a law firms website lol, I should fix it up. I'll play around with it a bit more and also play with your suggestions. Thank you fellow Canadian (even though I'm in Seoul! ;))
Haha no worries! Good luck 💪💪
Really useful guide, for "compact" mode, does that mean the retrieved nodes will be combined and set as context for the query?
Exactly. And if the combined nodes are too big for the model, it will break them up into sizes that fit
If the combined nodes are split, could you explain a little more about how it chooses the chunks?
It just splits into chunks with some token overlap (20 tokens)
It needs to make sure that your query + prompt template + context (plus optional existing answer) all fit within 4097 tokens
For the most part, the overlap works pretty well for this process
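The sliding-window idea behind overlapping chunks can be sketched in plain Python (illustrative only, not LlamaIndex's actual splitter; real splitting works on tokens, and the sizes here are tiny just to show the overlap):

```python
def split_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into chunks where each chunk shares
    `overlap` tokens with the previous one."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = list(range(10))
chunks = split_with_overlap(tokens, chunk_size=4, overlap=2)
# chunks -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The overlap is why a fact that falls on a chunk boundary still tends to appear whole in at least one chunk.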