Parsed nodes

Hi! Just getting started with LlamaIndex. Are there good debugging guides for understanding how the vector index behaves? I'm trying to work out whether my documents aren't being parsed into nodes well, whether the node relationships are off, or whether something else is going on. What I'm finding is that the generated context just isn't quite what I'd expect. Any guidance here? Happy to do some background reading if that's needed as well.
In every index, ingested documents are broken into chunks of up to 1024 tokens, with some overlap; the splitting is done by token count. In a vector index, each chunk is then embedded.

At query time, the query text is embedded, and the top-k most similar nodes are retrieved (the default is 2) and used to answer the query.
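For reference, that top-k can be raised when building the query engine. A minimal sketch, assuming an index built the way the snippets later in this thread do, with a made-up example query:

Plain Text
# retrieve the 5 most similar nodes instead of the default 2
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("How do I split a notebook into nodes?")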
What kinds of issues are you facing?
I'm feeding ipynb notebooks in as documents, which include both text and code. I'm finding that with some prompts not enough cells are included in the context. E.g. asking for example code, or text, doesn't yield enough; sometimes I think several sequential cells from the notebook (or at least partial content from each cell) should be included.
So I'm thinking I either have to set node relationships, or add more metadata to the documents.
But it's hard to know how to proceed.
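For the metadata route mentioned above, a rough sketch of attaching extra fields to a document, assuming a llama_index version where Document accepts a metadata dict (some older releases call this keyword extra_info); the field names here are purely illustrative:

Plain Text
from llama_index import Document

# hypothetical example: tag a cell's text with where it came from
doc = Document(
    text="some cell text",
    metadata={"notebook": "analysis.ipynb", "cell_index": 3},
)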
Hmm. Yea code is tough for sure, because you kind of want to avoid splitting in the middle of a code block

Maybe you can split your notebooks into document objects where each document is a code block, plus optionally any text above it?

Or another quick hack might be to just increase the chunk size a bit in the service context
Yea can definitely try this, any examples that show this level of tweaking?
mmmm not really πŸ˜… A lot of it would be manually parsing the ipynb json stuff, which I think is pretty unique

But once you have the text/code in the chunks you want, creating the documents is easy

Plain Text
from llama_index import Document, VectorStoreIndex

# wrap each pre-split text/code chunk in a Document
text_chunks = [...]
documents = [Document(text=t) for t in text_chunks]

index = VectorStoreIndex.from_documents(documents)
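
For the manual-parsing step upstream of that, a rough sketch of splitting a notebook into one document per code cell, plus any markdown above it. notebook_to_documents is a made-up helper name, and the layout assumed is the standard nbformat v4 JSON (a top-level cells list, each cell with cell_type and source):

Plain Text
import json

from llama_index import Document

def notebook_to_documents(path):
    # nbformat v4: a notebook is JSON with a "cells" list; each cell
    # has "cell_type" ("markdown" or "code") and "source" (lines of text)
    with open(path) as f:
        nb = json.load(f)

    documents, pending_markdown = [], []
    for cell in nb.get("cells", []):
        source = "".join(cell.get("source", []))
        if cell.get("cell_type") == "markdown":
            # hold prose so it travels with the next code cell
            pending_markdown.append(source)
        elif cell.get("cell_type") == "code":
            documents.append(Document(text="\n\n".join(pending_markdown + [source])))
            pending_markdown = []

    if pending_markdown:
        # trailing markdown with no code cell after it
        documents.append(Document(text="\n\n".join(pending_markdown)))
    return documents

documents = notebook_to_documents("analysis.ipynb")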


Adjusting the chunk size is easy though!

Plain Text
from llama_index import ServiceContext, VectorStoreIndex

# raise the chunk size from the 1024-token default
service_context = ServiceContext.from_defaults(..., chunk_size=2048)

# pass the service context in, otherwise the new chunk size is never applied
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
Thanks, I'll give that a whirl