I've noticed that when I access a node's start_char_idx in version 0.8.29, it is always None, which makes it impossible to map the node text back to a location in the original document. In version 0.7.13 it works fine and I get a character index. Is this known behavior, and is there another way to map the node text back to its location in the document? For example, when I run this code in both versions:
from llama_index import (
    Document,
    OpenAIEmbedding,
    ServiceContext,
    VectorStoreIndex,
)
from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser

# Load the source document as a single string.
with open("test.txt", "r") as f:
    text = f.read()

documents = [Document(text=text, metadata={"doc_id": "1234"})]

# Split into 250-token chunks with a 20-token overlap.
node_parser = SimpleNodeParser.from_defaults(chunk_size=250, chunk_overlap=20)
service_context = ServiceContext.from_defaults(
    embed_model=OpenAIEmbedding(),
    node_parser=node_parser,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.7, max_tokens=500),
)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Retrieve the top-matching nodes for the query.
query = "what are the liability limits?"
retriever = index.as_retriever()
results = retriever.retrieve(query)
results[0].node.__dict__
I find that results[0].node.start_char_idx is None in version 0.8.29 and equal to 11058 in version 0.7.13.
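
Until the offsets are populated again, one possible fallback is to search for the node's text in the original document string. The following is only a minimal sketch: it assumes the node text appears verbatim in the source, which may not hold if the parser strips or normalizes whitespace. `locate_chunk` is a hypothetical helper, not part of llama_index, and the sample strings are made up for illustration.

```python
from typing import Optional, Tuple


def locate_chunk(
    document_text: str, chunk_text: str, search_from: int = 0
) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of chunk_text in document_text.

    Hypothetical helper: returns None when the chunk is empty or cannot be
    found verbatim (for example, if whitespace was normalized during parsing).
    """
    chunk = chunk_text.strip()
    if not chunk:
        return None
    start = document_text.find(chunk, search_from)
    if start == -1:
        return None
    return start, start + len(chunk)


# Illustrative data only; in the script above these would be the contents of
# test.txt and the text of a retrieved node.
document_text = (
    "Section 1. Scope.\n"
    "Section 2. Liability limits are capped at $1M.\n"
    "Section 3. Term."
)
chunk_text = "Liability limits are capped at $1M."

span = locate_chunk(document_text, chunk_text)
if span is not None:
    start, end = span
    print(start, end, repr(document_text[start:end]))
```

With the script above, the same idea would be applied as locate_chunk(text, results[0].node.get_content()). If a chunk's text occurs more than once in the document, this returns only the first match; passing search_from can narrow the search.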