Hello guys! I have a question about managing a multi-level node structure within a vector database.

Imagine the following situation:

I have a markdown file with headings ranging from h1 down to, in theory, h4 or h5. I want to create a vector database that stores chunks of these .md files. My embedding model, bge-large, has a passage length limit of 512 tokens, so to embed my documents effectively I will probably need to split them.

I'm planning to split my documents recursively by header sections: first take an h1 section; if it is longer than the embedding model's max length, split it into the h2 sections within it, and so on until every piece fits. If there are no deeper headings left, I would just fall back to TokenTextSplitter, for example.
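A rough sketch of the splitting I have in mind, in plain Python. Everything here is an assumption for illustration: `Node`, `build_nodes`, and the other names are hypothetical, whitespace-separated words stand in for real tokenizer tokens, and `split_by_tokens` is only a stand-in for a token-based splitter such as TokenTextSplitter.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

MAX_TOKENS = 512  # passage limit of bge-large mentioned above


def count_tokens(text: str) -> int:
    # Assumption: whitespace words approximate tokenizer tokens.
    return len(text.split())


@dataclass
class Node:
    node_id: int
    parent_id: Optional[int]  # None for the whole-file root
    level: int                # heading level that produced this node, 0 for the root
    text: str
    embeddable: bool          # True if the text fits within the limit


def split_by_heading(text: str, level: int) -> List[str]:
    # Split in front of every line starting with exactly `level` '#' characters.
    # Note: a '#' comment line inside a fenced code block would also match.
    pattern = re.compile(rf"(?m)^(?={'#' * level} )")
    return [part for part in pattern.split(text) if part.strip()]


def split_by_tokens(text: str, limit: int) -> List[str]:
    # Fixed-size windows; original whitespace and line breaks are not preserved.
    words = text.split()
    return [" ".join(words[i:i + limit]) for i in range(0, len(words), limit)]


def build_nodes(text: str, max_level: int = 5, limit: int = MAX_TOKENS) -> List[Node]:
    nodes: List[Node] = []

    def visit(chunk: str, level: int, parent_id: Optional[int]) -> None:
        node_id = len(nodes)
        fits = count_tokens(chunk) <= limit
        nodes.append(Node(node_id, parent_id, level, chunk, fits))
        if fits:
            return
        # Try deeper heading levels until one actually splits the chunk.
        for next_level in range(level + 1, max_level + 1):
            parts = split_by_heading(chunk, next_level)
            if len(parts) > 1:
                for part in parts:
                    visit(part, next_level, node_id)
                return
        # No deeper headings: fall back to token windows under this node.
        for part in split_by_tokens(chunk, limit):
            nodes.append(Node(len(nodes), node_id, level, part, True))

    visit(text, 0, None)
    return nodes
```

With this layout, every node that is too long ends up as a parent with `embeddable=False`, and only the leaves would get embeddings.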

Now, since I want to be able to reconstruct the larger sections using parent node information, I would want to store the larger nodes that are too long to embed in my collection as well (nodes without embeddings).

Is it possible to store nodes with and without embeddings side by side in one vector DB? Does that make sense at all? Or are there better approaches that I'm not aware of?

Or do I need to store them at all? Maybe it is possible to just reconstruct them from the child nodes using parent node info?
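To make that second option concrete, here is a minimal sketch of rebuilding a parent from its children. It assumes each child carries a `parent_id` and its `position` among siblings as metadata. A plain dict stands in for the vector DB, and `Record`, `reconstruct`, the ids, and the embedding values are all hypothetical placeholders, not any particular library's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    node_id: str
    parent_id: Optional[str]                 # None for a top-level node
    position: int                            # order among siblings
    text: str                                # may be empty for parent nodes
    embedding: Optional[List[float]] = None  # None for nodes that are not embedded


def reconstruct(node_id: str, records: Dict[str, Record]) -> str:
    """Rebuild a node's text by concatenating its children in order."""
    children = sorted(
        (r for r in records.values() if r.parent_id == node_id),
        key=lambda r: r.position,
    )
    if not children:
        # Leaf node: its own text is all there is.
        return records[node_id].text
    return "".join(reconstruct(child.node_id, records) for child in children)


# Placeholder data: the parent holds no text and no embedding.
records = {
    "doc": Record("doc", None, 0, ""),
    "doc/0": Record("doc/0", "doc", 0, "# Title\nintro\n", [0.1, 0.2]),
    "doc/1": Record("doc/1", "doc", 1, "## A\nbody\n", [0.3, 0.4]),
}

hit = records["doc/1"]  # pretend this came back from a similarity search
context = reconstruct(hit.parent_id, records)
```

In this sketch the parent record only exists as an id to hang children on, so its text never has to be stored. The catch is that it needs a way to fetch all children by `parent_id`, and any piece produced by a lossy splitter will not concatenate back to the exact original text.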

Appreciate any help!
