Could it be an issue on the Chroma side?
You can try creating nodes locally and see how many are being formed.
The code is okay, and I can expect 1 vector, right?
Yes, with the chunk size set to 1024 it should give you only one node.
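For that local check, a minimal sketch (assuming llama_index is installed and documents is the list your loader returned):

from llama_index.core.node_parser import SentenceSplitter

# Split locally with the same settings and count the nodes,
# without involving Chroma at all.
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
print(len(nodes))  # expect 1 when the text fits within 1024 tokens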
@WhiteFang_Jr is it allowed to use SentenceSplitter and SentenceWindowNodeParser together in a pipeline?
like this " pipeline = IngestionPipeline(
transformations=[
SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap),
SentenceWindowNodeParser.from_defaults(
window_size=1,
window_metadata_key="window",
original_text_metadata_key="original_text",
),
embed_model,
],
When I use them together, the vector count becomes 2 to 11.
- The default PDF loader will split PDFs into a Document per page, so a minimum of two vectors (quick sketch below)
- The sentence window node parser with a window size of 1 will create a node for every sentence; I'm not sure you want to do that
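To see that per-page behavior directly (the file name here is just a placeholder):

from llama_index.core import SimpleDirectoryReader

docs = SimpleDirectoryReader(input_files=["your_file.pdf"]).load_data()
print(len(docs))  # one Document per PDF page with the default reader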
What could be the solution for loading a PDF with a chunk size of 1024, 2048, or more? We are using the Google Docs loader, Drive loader, Slack loader, Notion loader, and Confluence loader. Does the chunk size limitation apply to these loaders as well?
Chunk sizes apply to any data loader; the loader itself doesn't matter. The main factor here is that you are chunking each Document object into nodes, so if that Document object has more tokens than the chunk size, it will be chunked into more than one node.
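As a rough illustration of that (the text here is synthetic):

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# One Document whose text is far longer than the chunk size.
doc = Document(text="This is a sentence. " * 2000)
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=30)
nodes = splitter.get_nodes_from_documents([doc])
print(len(nodes))  # several nodes, since the text exceeds 1024 tokens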
Hi @Logan M, is this the correct way to solve that?
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

def IndexDocuments(project, reader_type: str, documents):
    splitter_o = SentenceSplitter(
        separator=" ", paragraph_separator="\n", chunk_size=1024, chunk_overlap=30
    )
    # Merge all loaded documents into one text and keep the merged metadata.
    combined_text = " ".join(document.text for document in documents)
    combined_metadata = {key: value for document in documents for key, value in document.metadata.items()}
    text_chunks = splitter_o.split_text(combined_text)
    all_doc_chunks = [Document(text=t, metadata=combined_metadata) for t in text_chunks]
    project.vector.pipeline.run(documents=all_doc_chunks, show_progress=True)
Here the documents argument comes from the loader, and we are using the ingestion pipeline.
@Logan M Could you please give me feedback here? We are thinking about launching the project next week.
I feel like maybe I'm lost; what's the issue? Your code seems fine to me.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap),
        embed_model,
    ],
)
In this code the chunk size was not working; the effective chunk sizes were coming from the PDF reader, since it loads one Document per page.
splitter_o = SentenceSplitter(
    separator=" ", paragraph_separator="\n", chunk_size=1024, chunk_overlap=30
)
combined_text = " ".join(document.text for document in documents)
combined_metadata = {key: value for document in documents for key, value in document.metadata.items()}
text_chunks = splitter_o.split_text(combined_text)
all_doc_chunks = [Document(text=t, metadata=combined_metadata) for t in text_chunks]
project.vector.pipeline.run(documents=all_doc_chunks, show_progress=True)
Is this the proper way of chunking when using the PDF reader?
Yeah, that seems fine. The PDF reader by default will load a Document object per page; here you are combining the documents and then splitting from there.
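For what it's worth, a small sketch of why the combining step matters (the page texts are made up): the splitter works per Document, so per-page Documents can never yield a chunk that crosses a page boundary, while merged text can be packed up to the full chunk size.

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

pages = [Document(text="Short page one."), Document(text="Short page two.")]
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=30)

per_page = splitter.get_nodes_from_documents(pages)    # 2 nodes, one per page
merged = Document(text=" ".join(p.text for p in pages))
combined = splitter.get_nodes_from_documents([merged]) # 1 node for the merged text
print(len(per_page), len(combined))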
pipeline = IngestionPipeline(
    transformations=[
        embed_model,
    ],
    docstore=RedisDocumentStore.from_host_and_port(
        REDIS_LLM_CACHE_CONFIG.host, REDIS_LLM_CACHE_CONFIG.port, namespace="document_store"
    ),
    vector_store=vector_store,
    cache=cache,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store,
    embed_model=embed_model,
)
Hi @Logan M
I’ve added an embedding model to the transformations in the ingestion pipeline while sending in already-split text, and it seems to work well. However, I wanted to confirm something: if I leave out the transformations, the pipeline falls back to using the sentence splitter and the embedding model by default.
Would it be fine if I add just the embedding model to the transformations and handle sentence splitting separately elsewhere? I want to ensure this approach won’t cause any issues.
Thanks!
You totally can, assuming things are split before embedding.
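A minimal sketch of that flow (embed_model, vector_store, and combined_text are assumed to exist already): split outside the pipeline, then let the pipeline only embed and store.

from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# Splitting is handled here, outside the pipeline.
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=30)
chunks = [Document(text=t) for t in splitter.split_text(combined_text)]

# The pipeline only embeds; nothing re-splits the already-chunked text.
pipeline = IngestionPipeline(
    transformations=[embed_model],
    vector_store=vector_store,
)
pipeline.run(documents=chunks, show_progress=True)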