Ingesting a PDF with the LlamaIndex ingestion pipeline creates multiple vectors in a ChromaDB collection

Is this a bug or am I missing something? I have a PDF of 307 tokens that I am ingesting into ChromaDB with the LlamaIndex IngestionPipeline. My chunk size is 1024, but after ingestion I see that it creates 2 vectors in the Chroma collection. I am expecting one vector, because the PDF is 307 tokens and my chunk size is 1024. When I set the chunk size to 210 it creates 7 vectors (i.e., chroma_collection.count() == 7).
29 comments
Could it be an issue on the Chroma side?

You can try creating nodes locally and see how many are being formed.
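For example, a minimal sketch (the file name is a placeholder) that counts the nodes locally before they ever reach Chroma:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# the default PDF reader yields one Document per page
documents = SimpleDirectoryReader(input_files=["my.pdf"]).load_data()

splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=30)
nodes = splitter.get_nodes_from_documents(documents)

# each node becomes one vector in the collection after ingestion
print(len(documents), "documents ->", len(nodes), "nodes")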
The code is OK, and I can expect 1 vector, right?
Yes, with the chunk size set to 1024 it should give you one node only.
@WhiteFang_Jr is it allowed to use SentenceSplitter and SentenceWindowNodeParser together in a pipeline?
Like this:

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap),
        SentenceWindowNodeParser.from_defaults(
            window_size=1,
            window_metadata_key="window",
            original_text_metadata_key="original_text",
        ),
        embed_model,
    ],
)
When I use them together, the vector count goes from 2 to 11.
  1. The default PDF loader will split PDFs into a Document per page (so a minimum of two vectors).
  2. The SentenceWindowNodeParser with a window size of one will create a node for every sentence; not sure you want to do that (see the sketch below).
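To illustrate the second point, a quick sketch (the sample text is made up) showing that the window parser emits one node per sentence:

from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

parser = SentenceWindowNodeParser.from_defaults(
    window_size=1,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

doc = Document(text="First sentence. Second sentence. Third sentence.")
nodes = parser.get_nodes_from_documents([doc])

# one node per sentence, each carrying its neighbours in the "window" metadata
print(len(nodes))  # 3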
Thank you @Logan M
What could be the solution for loading a PDF with a chunk size of 1024 or 2048 or more? We are using the Google Docs loader, Drive loader, Slack loader, Notion loader, and Confluence loader. Does the chunk size limitation apply to these loaders as well?
Chunk sizes apply to any data loader; it doesn't matter which one.

The main factor here is that you are chunking each Document object into nodes, so if a Document object has more tokens than the chunk size, it will be chunked into more than one node.
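As a quick illustration (the texts are made up), two short Documents always produce at least two nodes, however large the chunk size:

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

docs = [Document(text="Page one text."), Document(text="Page two text.")]

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(docs)

# splitting never merges across Document boundaries,
# so two Documents -> at least two nodes
print(len(nodes))  # 2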
Ok thank you @Logan M
Hi @Logan M , is this the correct way to solve that?
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

def IndexDocuments(project, reader_type: str, documents):
    splitter_o = SentenceSplitter(
        separator=" ", paragraph_separator="\n", chunk_size=1024, chunk_overlap=30
    )

    # merge the per-page Documents into one text, keeping all metadata
    combined_text = " ".join(document.text for document in documents)
    combined_metadata = {
        key: value
        for document in documents
        for key, value in document.metadata.items()
    }

    text_chunks = splitter_o.split_text(combined_text)
    all_doc_chunks = [Document(text=t, metadata=combined_metadata) for t in text_chunks]

    project.vector.pipeline.run(documents=all_doc_chunks, show_progress=True)
Here the documents argument comes from the loader, and we are using the ingestion pipeline.
@Logan M could you please give me some feedback here? We are thinking about launching the project next week.
I feel like I'm maybe lost; what's the issue? Your code seems fine to me?
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap),
        embed_model,
    ]
)
In this code the chunk size was not working; the chunking was effectively coming from the PDF reader (one Document per page). Then I did this:
splitter_o = SentenceSplitter(
    separator=" ", paragraph_separator="\n", chunk_size=1024, chunk_overlap=30
)

combined_text = " ".join(document.text for document in documents)
combined_metadata = {
    key: value
    for document in documents
    for key, value in document.metadata.items()
}

text_chunks = splitter_o.split_text(combined_text)
all_doc_chunks = [Document(text=t, metadata=combined_metadata) for t in text_chunks]

project.vector.pipeline.run(documents=all_doc_chunks, show_progress=True)
Now it seems to work.
Is this the proper way of chunking when using the PDF reader?
Yeah, that seems fine. The PDF reader by default will load a Document object per page; here you are combining the documents and then splitting from there.
from llama_index.core import VectorStoreIndex
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.storage.docstore.redis import RedisDocumentStore

pipeline = IngestionPipeline(
    transformations=[
        embed_model,
    ],
    docstore=RedisDocumentStore.from_host_and_port(
        REDIS_LLM_CACHE_CONFIG.host, REDIS_LLM_CACHE_CONFIG.port, namespace="document_store"
    ),
    vector_store=vector_store,
    cache=cache,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)

index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store,
    embed_model=embed_model,
)
Hi @Logan M

I've added just an embedding model to the transformations in the ingestion pipeline, and I'm sending in already-split text; it seems to work well. However, I wanted to confirm something: if I omit the transformations entirely, the pipeline falls back to its default sentence splitter and embedding model.

Would it be fine if I add just the embedding model in the transformation and handle sentence splitting separately elsewhere? I want to ensure this approach won’t cause any issues.

Thanks!
You totally can, assuming things are split before embedding
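For example, a minimal sketch of that pattern (combined_text and embed_model are assumed to be defined as earlier in the thread):

from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# split elsewhere first, with the settings used earlier in the thread
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=30)
text_chunks = splitter.split_text(combined_text)  # combined_text as above
docs = [Document(text=t) for t in text_chunks]

# the pipeline then only embeds; nothing re-splits the nodes
pipeline = IngestionPipeline(transformations=[embed_model])  # embed_model as above
nodes = pipeline.run(documents=docs, show_progress=True)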