
Updated 4 months ago

I'm embarrassed to even ask this, but

I'm embarrassed to even ask this, but here goes. 😰

I have a very strange issue. I recursively load a directory full of HTML using
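# Imports assume the llama-index >= 0.10 package layout; older releases use different paths.
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import UnstructuredReader

documents = SimpleDirectoryReader(
    input_dir=source_directory,
    file_extractor={".html": UnstructuredReader()},
    file_metadata=lambda x: {"biz_id": int(biz_id)},  # same biz_id attached to every file
    required_exts=[".html"],
    recursive=True,  # descend into subdirectories
).load_data()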


It loads all 193 documents and the data look correct. BUT, when I run the ingestion pipeline off the loaded docs, I always only get 7 nodes! Furthermore, if I change up the transformations in the pipeline, swapping params and even different transformers, I still always only get 7 nodes back!

There's a person with a very unique name in the docs. I can search the doc text and find it, but it's not in the transformed nodes, so I'm missing data. What am I doing wrong?

Here's the pipeline. (The commented out code was me trying different variants. It makes no difference.):
# Imports assume the llama-index >= 0.10 package layout; older releases use different paths.
from llama_index.core.ingestion import IngestionCache, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[
        # Option 1: Use SemanticSplitterNodeParser for semantic splitting
        # SemanticSplitterNodeParser(
        #     buffer_size=512,
        #     breakpoint_percentile_threshold=95,
        #     embed_model=embed_model,
        #     verbose=True,
        # ),
        # Option 2: Use SentenceSplitter for sentence-level splitting
        SentenceSplitter(),
        # Option 3: Use UnstructuredElementNodeParser for custom parsing
        # UnstructuredElementNodeParser(),
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
    cache=IngestionCache(),
)
nodes = pipeline.run(documents=documents, show_progress=True, in_place=True)
23 comments
tbh I would remove the cache and docstore. If you instantiate them new every time, they aren't doing anything

I would even remove the vector store as well and see if it returns the proper nodes
Trying that now. I thought if you provided vector_store= then the pipeline will store the embeddings -- so, it's a sink. Either way, trying.
Yea it will, but just for sanity I like to start simple 😅
There we go.

(Pdb) len(nodes)
246
... and my missing text appears.
ok cool, so that works. So then, if we add JUST the vector store, does it return the same?
Reading my mind. Running.
I bet it's the docstore.
Yea either the docstore or cache

The docstore does some deduplicating/upserting based on document ids. So if your input document ids are not unique across all documents, that could cause some issues 🤔
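To make the concern concrete, here is a toy model of an ID-keyed upsert. It is only a sketch of the general idea, not LlamaIndex's actual docstore code: `FakeDoc` and `upsert_all` are hypothetical names, and the plain dict stands in for any store that keys entries by document ID.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a loaded document."""
    doc_id: str
    text: str


def upsert_all(store, documents):
    """Insert each document under its ID; a repeated ID overwrites the earlier entry."""
    for doc in documents:
        store[doc.doc_id] = doc
    return store


docs = [
    FakeDoc("page", "first file"),
    FakeDoc("page", "second file"),  # same ID as the first document
    FakeDoc("other", "third file"),
]
store = upsert_all({}, docs)
print(len(docs), "documents in,", len(store), "entries kept")
```

In this toy model the text of the first document is silently lost, which is the same kind of symptom as text that is present in the loaded documents but missing from the resulting nodes.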
Adding back the vector_store I get the same number.
What happens if you do this?

ids = [document.doc_id for document in documents]
print("Unique doc ids: ", len(set(ids)))
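If the unique count comes back lower than len(documents), a small extension of that check can show which IDs collide. This is a minimal sketch: `FakeDoc` is a hypothetical stand-in so the snippet runs without LlamaIndex, `duplicate_ids` is an illustrative helper, and the only assumption about real documents is that they expose `doc_id` as in the snippet above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a loaded document; only `doc_id` matters here."""
    doc_id: str
    text: str = ""


def duplicate_ids(documents):
    """Return {doc_id: count} for every ID that appears more than once."""
    counts = Counter(doc.doc_id for doc in documents)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


docs = [FakeDoc("a"), FakeDoc("b"), FakeDoc("a"), FakeDoc("c"), FakeDoc("a")]
dupes = duplicate_ids(docs)
print("Total docs:", len(docs))
print("Unique doc ids:", len({d.doc_id for d in docs}))
print("Colliding ids:", dupes)
```

An empty result means every ID is unique within the batch, so duplicates inside the batch would not be the explanation.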
It's the docstore.
I added it back and got 7 nodes again.
I mistakenly thought this

docstore=SimpleDocumentStore(),

would create a new one.
it will actually
which is why I think you might have duplicate ids
(which would be a bug with the unstructured reader i think?)
I am rerunning the pipeline over the same data. So, I expect the IDs to be the same. I just didn't expect that to persist cross-run when I re-created it.
I re-added the IngestionCache and removed the DocStore. I get what I need now. Thanks so much!

🙇‍♂️
Sorry, I meant within the same batch of documents, many documents may have the same ID (Someone made a PR for this in the unstructured reader recently, which is why I suspect this)
Either way I guess, glad it works
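For anyone who wants to keep the docstore and does find repeated IDs within a batch, one possible workaround is to make the IDs unique before running the pipeline. The sketch below is an assumption-laden illustration, not a fix confirmed in this thread: `FakeDoc` and `make_ids_unique` are hypothetical, the `_part_N` suffix format is arbitrary, and whether `doc_id` can be reassigned this way on real document objects should be checked against the installed LlamaIndex version.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a loaded document."""
    doc_id: str
    text: str = ""


def make_ids_unique(documents):
    """Append an occurrence counter to every repeated ID, in place.

    The first document with a given ID keeps it; later ones become
    "<id>_part_1", "<id>_part_2", and so on. This does not guard against
    an original ID that already ends in such a suffix.
    """
    seen = defaultdict(int)
    for doc in documents:
        original = doc.doc_id
        occurrence = seen[original]
        seen[original] += 1
        if occurrence:
            doc.doc_id = f"{original}_part_{occurrence}"
    return documents


docs = [FakeDoc("a"), FakeDoc("a"), FakeDoc("b"), FakeDoc("a")]
make_ids_unique(docs)
print([d.doc_id for d in docs])
```

Because the suffix depends only on the order of the documents, loading the same files in the same order yields the same IDs on each run, which matters if the docstore is later persisted and reused.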