Ingesting a pdf with llamaindex ingestion pipeline creates multiple vectors in chromadb collection

At a glance

The community member has a PDF with 307 tokens and is ingesting it into ChromaDB using the LlamaIndex IngestionPipeline. They set the chunk size to 1024, but after ingestion, they see that the collection has 2 vectors, which they were not expecting. When they set the chunk size to 210, the collection has 7 vectors.

The community members discuss the issue, and some suggest that it could be a problem with ChromaDB. They also discuss using the SentenceSplitter and SentenceWindowNodeParser together in the pipeline, which seems to create more vectors than expected.

The community members explore different solutions, such as combining documents and then splitting them, and using an embedding model in the transformation step. The community member also asks if the chunk size limitation applies to other data loaders like Google Docs, Drive, Slack, Notion, and Confluence.

The community member provides a code snippet to solve the issue, and another community member, Logan M, confirms that the approach seems fine.

There is no explicitly marked answer, but the community members seem to have found a solution through their discussion.

Is this a bug or am I missing something? I have a PDF of 307 tokens and I am ingesting it into ChromaDB with the LlamaIndex IngestionPipeline. My chunk size is 1024, but after ingestion I see it creates 2 vectors in the Chroma collection. I am expecting one vector, because the PDF is 307 tokens and my chunk size is 1024. When I set the chunk size to 210 it creates 7 vectors (I mean chroma_collection.count() == 7).
30 comments
Could it be an issue on the Chroma side?

You can try creating nodes locally and see how many are being formed.
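
For example, a minimal local check might look like this (a sketch assuming the newer llama_index.core imports and a hypothetical my_file.pdf path):

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# hypothetical path; the default PDF loader returns one Document per page
documents = SimpleDirectoryReader(input_files=["my_file.pdf"]).load_data()
print(len(documents))  # number of Document objects produced by the loader

splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=30)
nodes = splitter.get_nodes_from_documents(documents)
print(len(nodes))  # number of nodes (and therefore vectors) that would be created
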
The code is OK and I can expect 1 vector, right?
Yes, with the chunk size set to 1024 it should give you only one node.
@WhiteFang_Jr is it allowed to use SentenceSplitter and SentenceWindowNodeParser together in the pipeline?
like this:
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap),
        SentenceWindowNodeParser.from_defaults(
            window_size=1,
            window_metadata_key="window",
            original_text_metadata_key="original_text",
        ),
        embed_model,
    ],
)
When I use them together the vector count goes from 2 to 11.
1. The default PDF loader will split PDFs into a Document per page (so a minimum of two vectors)
2. The SentenceWindowNodeParser with a window size of one will create a node for every sentence; not sure you want to do that
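
To see the second point in isolation, a minimal sketch (with made-up sentences, and imports assuming the llama_index.core package layout):

from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

# same parser settings as in the pipeline above
parser = SentenceWindowNodeParser.from_defaults(
    window_size=1,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

doc = Document(text="First sentence. Second sentence. Third sentence.")
nodes = parser.get_nodes_from_documents([doc])
print(len(nodes))  # one node per sentence, so 3 for this toy document
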
Thank you @Logan M
What could be the solution for loading a PDF with a chunk size of 1024 or 2048 or more? We are using the Google Docs loader, Drive loader, Slack loader, Notion loader, and Confluence loader. Does the chunk size limitation apply to these loaders also?
Chunk sizes apply to any data loader; it doesn't matter which one.

The main factor here is that you are chunking each Document object into nodes, so if that Document object has more tokens than the chunk size, it will be chunked into more than one node.
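
To illustrate: each Document is split independently, so two short Documents (e.g. two PDF pages) always produce at least two nodes, no matter how large the chunk size is. A minimal sketch, assuming the llama_index.core imports:

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# two Document objects, each far smaller than the chunk size
docs = [Document(text="Page one text."), Document(text="Page two text.")]

splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=30)
nodes = splitter.get_nodes_from_documents(docs)
print(len(nodes))  # 2 -- at least one node per Document, regardless of chunk size
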
Ok thank you @Logan M
Hi @Logan M, is this the correct way to solve that?
def IndexDocuments(project, reader_type: str, documents):
    splitter_o = SentenceSplitter(
        separator=" ", paragraph_separator="\n", chunk_size=1024, chunk_overlap=30
    )

    combined_text = " ".join(document.text for document in documents)
    combined_metadata = {key: value for document in documents for key, value in document.metadata.items()}

    text_chunks = splitter_o.split_text(combined_text)
    all_doc_chunks = [Document(text=t, metadata=combined_metadata) for t in text_chunks]

    nodes = project.vector.pipeline.run(documents=all_doc_chunks, show_progress=True)
Here the argument documents is from loader and we are using ingestion pipeline
@Logan M Could you please give me feedback here? We are thinking about launching the project next week.
I feel like I'm maybe lost, what's the issue? Your code seems fine to me.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap),
        embed_model,
    ],
)
In this code the chunk size was not working as expected;
the chunking was effectively coming from the PDF reader.
Then I did this:
splitter_o = SentenceSplitter(
    separator=" ", paragraph_separator="\n", chunk_size=1024, chunk_overlap=30
)

combined_text = " ".join(document.text for document in documents)
combined_metadata = {key: value for document in documents for key, value in document.metadata.items()}

text_chunks = splitter_o.split_text(combined_text)
all_doc_chunks = [Document(text=t, metadata=combined_metadata) for t in text_chunks]

nodes = project.vector.pipeline.run(documents=all_doc_chunks, show_progress=True)
Is this the proper way of chunking when using the PDF reader?
Yeah, that seems fine. The PDF reader by default will load a Document object per page. Here you are combining the documents and then splitting from there.
pipeline = IngestionPipeline(
    transformations=[
        embed_model,
    ],
    docstore=RedisDocumentStore.from_host_and_port(
        REDIS_LLM_CACHE_CONFIG.host, REDIS_LLM_CACHE_CONFIG.port, namespace="document_store"
    ),
    vector_store=vector_store,
    cache=cache,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)

index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store,
    embed_model=embed_model,
)
Hi @Logan M

I’ve added an embedding model to the transformation step in the ingestion pipeline while sending pre-split text, and it seems to work well. However, I wanted to confirm something: if I remove the transformation step, the pipeline falls back to using the sentence splitter and embedding model by default.

Would it be fine if I add just the embedding model in the transformation and handle sentence splitting separately elsewhere? I want to ensure this approach won’t cause any issues.

Thanks!
You totally can, assuming things are split before embedding
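
In other words, something like this sketch (imports assume llama_index.core, with OpenAIEmbedding standing in for whatever embed_model you use):

from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding  # stand-in for your embed model

# split the text yourself, outside the pipeline
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=30)
text_chunks = splitter.split_text("... your combined document text ...")
docs = [Document(text=t) for t in text_chunks]

# the pipeline then only embeds; it never re-chunks the pre-split documents
pipeline = IngestionPipeline(transformations=[OpenAIEmbedding()])
nodes = pipeline.run(documents=docs, show_progress=True)
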
@Logan M Thank you