
At a glance

The community member is asking about the best way to structure text files for ingestion into a vector database to optimize performance for search and other operations. They note that the example data from a Paul Graham essay on GitHub is neatly organized into lines with double newlines. The community members in the comments suggest that the notion of a "file" can be a semantic indicator, and that pre-splitting the documents into logical sections can help ensure the embeddings better represent specific sections. The overall recommendation is to separate logical elements into their own document objects, if possible, to improve the performance of the vector database.

In the Paul Graham essay example on GitHub https://github.com/jerryjliu/llama_index/tree/main/examples/paul_graham_essay/data I notice that the ASCII/text data is pretty neatly organized into lines, each of which has a \n\n after it. In general, if we do have the luxury of controlling the contents of the txt file, what's the best way to structure the .txt files for ingest (VectorDB) to optimize subsequent performance (search and otherwise)? How long should groups of lines be? Newlines between them? etc.
4 comments
The newlines aren't too important tbh, unless you are dividing sections.

I think the best way to organize your data is to separate logical elements into their own document objects (if you can)

For example, if you have a document with many sections, each section could be its own document object, just by pre-processing the data a bit.

Obviously, most real-world examples aren't this easy to do lol
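For anyone wanting to try that pre-splitting step, here is a minimal sketch in Python. It assumes current llama_index imports (llama_index.core), a hypothetical file name, and that sections are separated by blank-line delimiters; adjust the split logic to whatever actually marks section boundaries in your data.

from llama_index.core import Document, VectorStoreIndex

# Hypothetical input file; replace with your own mono-doc.
with open("my_long_doc.txt") as f:
    raw = f.read()

# Assumption: logical sections are separated by two consecutive blank lines.
sections = [s.strip() for s in raw.split("\n\n\n") if s.strip()]

# One Document object per logical section, with a little metadata for traceability.
documents = [
    Document(text=section, metadata={"section": i})
    for i, section in enumerate(sections)
]

index = VectorStoreIndex.from_documents(documents)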
I see, so you're implying that when reading training data the notion of the "file" is a sort of semantic indicator (at least in a small way), so that's why a set of txt docs (broken up from a single mono-doc) may yield better results.. (?)
Yea in a sense

When you input a document into the index, it will get broken into chunks and each chunk is embedded.

So if you pre-split the documents a bit ahead of time, you can help ensure the embeddings better represent specific sections.
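To make the chunk-then-embed step concrete, here is a small sketch using llama_index's SentenceSplitter node parser; the chunk_size/chunk_overlap values and the file name are illustrative assumptions, not recommendations from the thread.

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# Hypothetical pre-split section saved as its own file.
doc = Document(text=open("section_03.txt").read())

# Break the document into nodes; each node is embedded separately, so keeping
# a node inside one logical section keeps its embedding focused on that section.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents([doc])

print(len(nodes), "chunks will be embedded")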