
Default chunking and tokenization in VectorStoreIndex.from_documents()

It isn't clear to me what default chunking and tokenization are being performed under VectorStoreIndex.from_documents(). Usually I can figure this sort of thing out on my own, but I'm having difficulty here. Is this documented somewhere?
2 comments
SentenceSplitter() using chunk_size=1024 and the gpt-3.5 tokenizer
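To make the default concrete, here is a minimal plain-Python sketch of token-window chunking in the style of SentenceSplitter. It is illustrative only: it uses whitespace "tokens" as a stand-in for the real gpt-3.5 BPE tokenizer (tiktoken), it skips sentence-boundary handling, and the chunk_overlap default shown here is an assumption, not something stated in this thread.

```python
def chunk_by_tokens(text, chunk_size=1024, chunk_overlap=200):
    """Split text into windows of at most chunk_size tokens,
    with consecutive windows sharing chunk_overlap tokens.

    Whitespace split is a stand-in for the real tokenizer;
    the real SentenceSplitter also respects sentence boundaries.
    """
    tokens = text.split()
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(" ".join(tokens[start:end]))
        if end == len(tokens):
            break
        # Back up by the overlap so adjacent chunks share context.
        start = end - chunk_overlap
    return chunks
```

With the defaults, a 2,000-token document would yield three overlapping chunks (0-1024, 824-1848, 1648-2000), which is why retrieved nodes often repeat a little text at their edges.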

I agree it's a bit opaque -- the IngestionPipeline is generally preferred, since it's much more transparent about what's happening:

https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/root.html
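For reference, a short sketch of the explicit-pipeline approach the docs describe, where you name each transformation yourself instead of relying on hidden defaults. Import paths here assume a recent llama-index-core layout and may differ in older releases; check the linked docs for your version.

```python
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# The chunking is now stated explicitly rather than implied by
# from_documents(); chunk_overlap here is illustrative.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=200),
    ]
)

nodes = pipeline.run(documents=[Document(text="Some long document text...")])

# Build the index from the pre-chunked nodes.
index = VectorStoreIndex(nodes)
```

The upside is that swapping the splitter, changing chunk_size, or adding extra transformations (metadata extractors, embeddings) is a visible, one-line change instead of a buried default.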
Awesome, thank you @Logan M