Is it possible to create an index

Is it possible to create an index without chunking, so that a MetadataExtractor can operate on an entire document (one that fits within the context window)? I've tried setting chunk_size to a large number and chunk_overlap to 0 in both the text_splitter and the node_parser, and setting a large context_window in the prompt_helper, but my input files are still split into small chunks. What am I missing here?
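
(For reference, the setup described above would look roughly like the sketch below. The values are illustrative, and the legacy ServiceContext API is assumed.)

Python
from llama_index import ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import TokenTextSplitter

# Illustrative values: a chunk larger than any input file, with no overlap
text_splitter = TokenTextSplitter(chunk_size=4096, chunk_overlap=0)
node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)
service_context = ServiceContext.from_defaults(
    node_parser=node_parser,
    context_window=8192,  # forwarded to the PromptHelper
)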
Not sure what the exact issue is here πŸ€”

Modifying the chunk size is pretty easy

(Minor size discrepancy, maybe due to spaces)

Python
>>> from llama_index import Document
>>> from llama_index.node_parser import SimpleNodeParser
>>> node_parser = SimpleNodeParser.from_defaults(chunk_size=3000)
>>> nodes = node_parser.get_nodes_from_documents([Document.example()])
>>> len(nodes[0].text)
1290
>>> len(Document.example().text)
1292
>>> node_parser = SimpleNodeParser.from_defaults(chunk_size=100)
>>> nodes = node_parser.get_nodes_from_documents([Document.example()])
>>> len(nodes[0].text)
341
Counting characters here, just to show that the size is changing (chunk_size itself is measured in tokens, not characters).
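
One thing worth double-checking (a guess, not confirmed in this thread): the customized node_parser only takes effect at index-build time if it is actually passed to the index, e.g. via a ServiceContext in the legacy API:

Python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser

# "data" is a placeholder directory
documents = SimpleDirectoryReader("data").load_data()

node_parser = SimpleNodeParser.from_defaults(chunk_size=3000, chunk_overlap=0)
service_context = ServiceContext.from_defaults(node_parser=node_parser)

# Without the service_context, the index falls back to the default chunk size
index = VectorStoreIndex.from_documents(documents, service_context=service_context)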
Thanks for answering! If I'm using SimpleDirectoryReader to create the documents/nodes, does it always use a small chunk size?
If I ingest a single markdown file in my directory, I get 19 nodes instead of 1
Aha, the markdown file has exactly 19 headings!
Is there a way to override this behavior and just get one node per Markdown file?
Could it be the way you're parsing the nodes, i.e. the reader/TextSplitter you're using?
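
For what it's worth, SimpleDirectoryReader routes .md files through a Markdown-aware reader that emits one Document per heading, which matches the 19-headings/19-nodes observation above. One possible workaround (a sketch, assuming the legacy API) is to build the Documents yourself so each file stays whole:

Python
from pathlib import Path
from llama_index import Document
from llama_index.node_parser import SimpleNodeParser

# One Document per file, bypassing the per-heading Markdown splitting;
# "docs" is a placeholder directory
documents = [Document(text=p.read_text()) for p in Path("docs").glob("*.md")]

# With a chunk_size larger than any file, each Document stays a single node
node_parser = SimpleNodeParser.from_defaults(chunk_size=4096, chunk_overlap=0)
nodes = node_parser.get_nodes_from_documents(documents)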