
Updated 3 months ago


Hello guys! I have a question about managing multi-level node structure within a vector database.

Imagine the following situation:

I have a markdown file. It has headings ranging from h1 to theoretically h4-h5. I want to create a vector database that stores different chunks of these .md files. My embedding model, bge-large, has a passage length limit of 512 tokens, so to embed my documents effectively I will probably need to split them.

I'm planning to split my documents recursively by header sections: first the h1 section; if that is longer than the embedding model's max length, the h2 sections within that h1 section; and so on until each piece fits. If there are no deeper headings, I would just fall back to something like TokenTextSplitter.
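The recursive split described above could be sketched roughly like this in plain Python. This is only an illustration under stated assumptions, not llama_index code: `Node`, `build_nodes`, and the other helper names are hypothetical, whitespace-separated words stand in for real tokenizer tokens, and a simple word window stands in for TokenTextSplitter. It also ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

MAX_TOKENS = 512  # passage limit quoted for bge-large in the question


def count_tokens(text: str) -> int:
    # Assumption: whitespace words approximate tokenizer tokens.
    return len(text.split())


@dataclass
class Node:
    node_id: int
    text: str
    level: int                       # heading level that produced the node (0 = whole file)
    parent_id: Optional[int] = None
    is_leaf: bool = False            # only leaves would be embedded


def split_by_heading(text: str, level: int) -> List[str]:
    # Cut right before every line that starts with exactly `level` '#' characters.
    pattern = re.compile(rf"(?m)^(?=#{{{level}}} )")
    return [part for part in pattern.split(text) if part.strip()]


def split_by_tokens(text: str, max_tokens: int) -> List[str]:
    # Fallback when no deeper heading exists: fixed-size word windows.
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def build_nodes(text: str, level: int = 0, parent_id: Optional[int] = None,
                nodes: Optional[List[Node]] = None,
                max_tokens: int = MAX_TOKENS, max_level: int = 6) -> List[Node]:
    if nodes is None:
        nodes = []
    node = Node(node_id=len(nodes), text=text, level=level, parent_id=parent_id)
    nodes.append(node)
    if count_tokens(text) <= max_tokens:
        node.is_leaf = True          # small enough to embed as-is
        return nodes
    # Try successively deeper heading levels until one actually splits the text.
    for deeper in range(level + 1, max_level + 1):
        parts = split_by_heading(text, deeper)
        if len(parts) > 1:
            for part in parts:
                build_nodes(part, deeper, node.node_id, nodes, max_tokens, max_level)
            return nodes
    # No deeper headings left: fall back to token windows.
    for part in split_by_tokens(text, max_tokens):
        nodes.append(Node(len(nodes), part, level, node.node_id, is_leaf=True))
    return nodes
```

Every oversized section stays in the list as a non-leaf node, and each child keeps a `parent_id` pointing back to it, so the hierarchy is preserved even though only the leaves are small enough to embed.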

Now, since I want to be able to reconstruct my larger sections using parent node information, I would want to store the larger, unembeddable nodes in my collection (nodes without embeddings).

Is it possible to store nodes with and without embeddings side by side in one vector DB? Does it make sense at all? Or are there better approaches that I'm not aware of?

Or do I need to store them at all? Maybe it is possible to just reconstruct them from the child nodes using the parent node info?

Appreciate any help!
3 comments
For most vector dbs it doesn't really make sense to store things without embeddings. You'd want some external storage or table to put them in, I think?
Yeah, this is probably the best option. Do we have something for this in llama_index out of the box?

I would love to store the full docs and maybe some intermediate splits, so I can refer back to them after retrieving the chunks.
this would be a usecase for the docstore yea 👍
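To make the idea from the replies concrete, here is a plain-Python sketch of the pattern: all node text lives in an external key-value store, and only the leaf chunks get vectors. This is not the llama_index docstore API. `TwoTierStore` and its methods are hypothetical names, and `embed` is a toy stand-in for a real model such as bge-large.

```python
import math
from typing import Dict, List, Optional, Tuple


def embed(text: str, dim: int = 16) -> List[float]:
    # Assumption: toy bag-of-words hashing, NOT a real embedding model.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(map(ord, word)) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class TwoTierStore:
    """Hypothetical helper: text for every node, vectors for leaves only."""

    def __init__(self) -> None:
        # node_id -> (text, parent_id); plays the role of the external store
        self.docs: Dict[str, Tuple[str, Optional[str]]] = {}
        # leaf node_id -> embedding; plays the role of the vector DB
        self.vectors: Dict[str, List[float]] = {}

    def add(self, node_id: str, text: str,
            parent_id: Optional[str] = None, embed_it: bool = False) -> None:
        self.docs[node_id] = (text, parent_id)
        if embed_it:
            self.vectors[node_id] = embed(text)

    def search(self, query: str) -> str:
        # Return the id of the leaf with the highest cosine similarity.
        q = embed(query)
        return max(self.vectors,
                   key=lambda nid: sum(a * b for a, b in zip(q, self.vectors[nid])))

    def ancestors(self, node_id: str) -> List[str]:
        # Walk parent links from a retrieved chunk up to the full document.
        chain: List[str] = []
        parent = self.docs[node_id][1]
        while parent is not None:
            chain.append(parent)
            parent = self.docs[parent][1]
        return chain


store = TwoTierStore()
store.add("doc", "# Fruit\n## A\napples oranges\nbananas kiwis")        # too large: text only
store.add("sec-a", "## A\napples oranges\nbananas kiwis", parent_id="doc")
store.add("a1", "apples oranges", parent_id="sec-a", embed_it=True)     # leaf: embedded
store.add("a2", "bananas kiwis", parent_id="sec-a", embed_it=True)      # leaf: embedded

hit = store.search("apples oranges")
parents = store.ancestors(hit)   # ids of the enclosing section and the full document
```

After retrieving a chunk, the larger sections are looked up by id in `docs`, so the vector side never has to hold a node without an embedding.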