llama_index/examples/paul_graham_essay/d...

In the Paul Graham essay example on GitHub https://github.com/jerryjliu/llama_index/tree/main/examples/paul_graham_essay/data I notice that the ASCII/text data is pretty neatly organized into lines, each of which has a \n\n after it. In general, if we do have the luxury of controlling the contents of the txt file, what's the best way to structure the .txt files for ingest (VectorDB) to optimize subsequent performance (search and otherwise)? How long should groups of lines be? Newlines between them? etc.
The newlines aren't too important tbh, unless you are dividing sections.

I think the best way to organize your data is to separate logical elements into their own document objects (if you can)

For example, if you have a document with many sections, each section could be its own document object, just by pre-processing the data a bit.

Obviously, most real-world examples aren't this easy to split up, lol
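
A minimal sketch of that pre-splitting idea. The `essay.txt` filename and the `"\n## "` section delimiter are just placeholders for whatever marks sections in your own files, and the import path assumes a recent llama_index release where these live under `llama_index.core` (older versions import from `llama_index` directly):

```python
# Pre-split one text file into per-section Document objects before indexing.
# "essay.txt" and the "\n## " delimiter are hypothetical; adapt to your data.
from llama_index.core import Document, VectorStoreIndex

with open("essay.txt") as f:
    raw = f.read()

# One Document per logical section instead of one mono-document.
sections = [s.strip() for s in raw.split("\n## ") if s.strip()]
documents = [Document(text=s) for s in sections]

index = VectorStoreIndex.from_documents(documents)
```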
I see, so you're implying that when reading training data the notion of the "file" is a sort of semantic indicator (at least in a small way), so that's why a set of txt docs (broken up from a single mono-doc) may yield better results... (?)
Yea in a sense

When you input a document into the index, it will get broken into chunks and each chunk is embedded.

So if you pre-split the documents a bit ahead of time, you can help ensure the embeddings better represent specific sections
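
Even without pre-splitting, you can control how that automatic chunking happens. A sketch assuming the `SentenceSplitter` node parser from recent llama_index versions (earlier releases used `SimpleNodeParser`); the `chunk_size`/`chunk_overlap` values are illustrative, not recommendations:

```python
# Control the automatic chunking instead of accepting the defaults.
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

documents = [Document(text=open("essay.txt").read())]  # hypothetical file

# Each resulting node gets its own embedding; smaller chunks tend to give
# embeddings that represent a narrower, more specific slice of the text.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes)
```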