Updated 3 months ago

Start char idx

I've noticed that when I try to access a node's start_char_idx in version 0.8.29 it is always None, making it impossible to map the node text back to a location in the original document. With version 0.7.13, it works fine and I get a character idx. Is this known behavior and is there another way to map the node text back to its location in a document? For example when I run this code in both versions:
Plain Text
from llama_index import (
    Document,
    OpenAIEmbedding,
    ServiceContext,
    VectorStoreIndex,
)
from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser


with open("test.txt", "r") as f:
    text = f.read()

documents = [Document(text=text, metadata={"doc_id": "1234"})]
node_parser = SimpleNodeParser.from_defaults(chunk_size=250, chunk_overlap=20)
service_context = ServiceContext.from_defaults(
    embed_model=OpenAIEmbedding(),
    node_parser=node_parser,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.7, max_tokens=500),
)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query = "what are the liability limits?"
retriever = index.as_retriever()
results = retriever.retrieve(query)

results[0].node.__dict__
I get that results[0].node.start_char_idx is None in 0.8.29 and equal to 11058 in version 0.7.13.
2 comments
This was actually removed in newer versions -- it was extremely hard to track properly, and was often very wrong. It was also making new text splitters hard to implement.

In SECInsights (our demo full stack app), we just fuzzy-searched the text from the source nodes in the original document for highlighting. Not too hard, and it works well.

https://github.com/run-llama/sec-insights/blob/b74965f2ae7c0edfc5011b80e9036b5f2d302d8c/frontend/src/utils/multi-line-highlight.tsx#L18
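
For anyone who wants the same idea on the Python side, here is a rough sketch of mapping a node's text back to character offsets. It is not the SECInsights code (that lives in the TypeScript frontend linked above), and locate_chunk and min_ratio are made-up names for illustration. It assumes the node text is either an exact substring of the document or differs only slightly (for example in whitespace), and it only uses difflib from the standard library.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def locate_chunk(
    document_text: str, chunk_text: str, min_ratio: float = 0.8
) -> Optional[Tuple[int, int]]:
    """Return approximate (start, end) character offsets of chunk_text
    inside document_text, or None if no good match is found.

    Hypothetical helper: the name and the min_ratio threshold are
    illustrative and not part of llama_index.
    """
    chunk = chunk_text.strip()
    if not chunk:
        return None

    # Fast path: the chunk appears verbatim in the document.
    start = document_text.find(chunk)
    if start != -1:
        return start, start + len(chunk)

    # Fuzzy fallback: use the longest common block as an anchor, then
    # project the chunk's start back from that anchor.
    matcher = SequenceMatcher(None, document_text, chunk, autojunk=False)
    match = matcher.find_longest_match(0, len(document_text), 0, len(chunk))
    if match.size == 0:
        return None

    est_start = max(0, match.a - match.b)
    est_end = min(len(document_text), est_start + len(chunk))

    # Accept the window only if it is similar enough to the chunk overall.
    window = document_text[est_start:est_end]
    ratio = SequenceMatcher(None, window, chunk, autojunk=False).ratio()
    if ratio < min_ratio:
        return None
    return est_start, est_end
```

With the example from the question, this would be called with the original text and the retrieved node's text (something like results[0].node.text, assuming a text node). The offsets from the fuzzy path are estimates, not exact boundaries, and SequenceMatcher can get slow on very long documents, so narrowing the search to one page or section first is probably worth it.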
Ok thank you for the response