Is it possible to create an index

Is it possible to create an index without chunking, so that a MetadataExtractor can operate on an entire document (one that fits within the context window)? I've tried setting chunk_size to a large number and chunk_overlap to 0 in both the text_splitter and the node_parser, and setting a large context_window in the prompt_helper, but my input files are still split into small chunks. What am I missing here?
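
(For reference, the setup described above would look roughly like the sketch below. The values are illustrative, and the legacy ServiceContext API is assumed.)

Python
from llama_index import ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import TokenTextSplitter

# Illustrative values: a chunk larger than any input file, with no overlap
text_splitter = TokenTextSplitter(chunk_size=4096, chunk_overlap=0)
node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)
service_context = ServiceContext.from_defaults(
    node_parser=node_parser,
    context_window=8192,  # forwarded to the PromptHelper
)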
Not sure what the exact issue is here πŸ€”

Modifying the chunk size is pretty easy

(Minor size discrepancy, maybe due to spaces)

Python
>>> from llama_index import Document
>>> from llama_index.node_parser import SimpleNodeParser
>>> node_parser = SimpleNodeParser.from_defaults(chunk_size=3000)
>>> nodes = node_parser.get_nodes_from_documents([Document.example()])
>>> len(nodes[0].text)
1290
>>> len(Document.example().text)
1292
>>> node_parser = SimpleNodeParser.from_defaults(chunk_size=100)
>>> nodes = node_parser.get_nodes_from_documents([Document.example()])
>>> len(nodes[0].text)
341
Counting characters here, just to show that the size is changing (chunk_size itself is measured in tokens, not characters).
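
One thing worth double-checking (a guess, not confirmed in this thread): the customized node_parser only takes effect at index-build time if it is actually passed to the index, e.g. via a ServiceContext in the legacy API:

Python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser

# "data" is a placeholder directory
documents = SimpleDirectoryReader("data").load_data()

node_parser = SimpleNodeParser.from_defaults(chunk_size=3000, chunk_overlap=0)
service_context = ServiceContext.from_defaults(node_parser=node_parser)

# Without the service_context, the index falls back to the default chunk size
index = VectorStoreIndex.from_documents(documents, service_context=service_context)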
Thanks for answering! If I'm using SimpleDirectoryReader to create the documents/nodes, does it always use a small chunk size?
If I ingest a single markdown file in my directory, I get 19 nodes instead of 1
Aha, the markdown file has exactly 19 headings!
Is there a way to override this behavior and just get one node per Markdown file?
Could it be the way you're parsing the nodes, i.e. the reader/TextSplitter you're using?
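
For what it's worth, SimpleDirectoryReader routes .md files through a Markdown-aware reader that emits one Document per heading, which matches the 19-headings/19-nodes observation above. One possible workaround (a sketch, assuming the legacy API) is to build the Documents yourself so each file stays whole:

Python
from pathlib import Path
from llama_index import Document
from llama_index.node_parser import SimpleNodeParser

# One Document per file, bypassing the per-heading Markdown splitting;
# "docs" is a placeholder directory
documents = [Document(text=p.read_text()) for p in Path("docs").glob("*.md")]

# With a chunk_size larger than any file, each Document stays a single node
node_parser = SimpleNodeParser.from_defaults(chunk_size=4096, chunk_overlap=0)
nodes = node_parser.get_nodes_from_documents(documents)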