I'm embarrassed to even ask this, but

I'm embarrassed to even ask this, but here goes. 😰

I have a very strange issue. I recursively load a directory full of HTML using

documents = SimpleDirectoryReader(
    input_dir=source_directory,
    file_extractor={".html": UnstructuredReader()},
    file_metadata=lambda x: {"biz_id": int(biz_id)},
    required_exts=[".html"],
    recursive=True,
).load_data()

It loads all 193 documents and the data look correct. BUT when I run the ingestion pipeline on the loaded docs, I always get only 7 nodes! Even if I change the transformations in the pipeline, swapping params and even trying different transformers, I still get only 7 nodes back!

There's a person with a very distinctive name in the docs. I can search the doc text and find it, but it's not in the transformed nodes, so I'm missing data. What am I doing wrong?

Here's the pipeline. (The commented-out code was me trying different variants; it makes no difference.)

pipeline = IngestionPipeline(
    transformations=[
        # Option 1: Use SemanticSplitterNodeParser for semantic splitting
        # SemanticSplitterNodeParser(
        #     buffer_size=512,
        #     breakpoint_percentile_threshold=95,
        #     embed_model=embed_model,
        #     verbose=True,
        # ),
        # Option 2: Use SentenceSplitter for sentence-level splitting
        SentenceSplitter(),
        # Option 3: Use UnstructuredElementNodeParser for custom parsing
        # UnstructuredElementNodeParser(),
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
    cache=IngestionCache(),
)
nodes = pipeline.run(documents=documents, show_progress=True, in_place=True)

23 comments

LLogan M

tbh I would remove the cache and docstore. If you instantiate them new every time, they aren't doing anything

I would even remove the vector store as well and see if it returns the proper nodes

JJasonV

Trying that now. I thought if you provided vector_store= then the pipeline will store the embeddings -- so, it's a sink. Either way, trying.

LLogan M

Yea it will, but just for sanity I like to start simple 😅

JJasonV

Agreed.

JJasonV

There we go.

(Pdb) len(nodes)
246

JJasonV

... and my missing text appears.

LLogan M

ok cool, so that works. So then, if we add JUST the vector store, does it return the same?

JJasonV

Reading my mind. Running.

JJasonV

I bet it's the docstore.

LLogan M

Yea either the docstore or cache

The docstore does some deduplicating/upserting based on document ids. So if your input document ids are not unique across all documents, that could cause some issues 🤔
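To make the deduplication point concrete, here is a toy model of upserting by document id. It is a simplified sketch, not LlamaIndex's actual docstore code; the `Doc` class and `dedupe_by_id` helper are hypothetical stand-ins invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (not a LlamaIndex class)."""
    doc_id: str
    text: str


def dedupe_by_id(docs):
    """Keep one document per doc_id; a later one with the same id replaces the earlier one."""
    store = {}
    for doc in docs:
        store[doc.doc_id] = doc  # upsert: same key overwrites the previous entry
    return list(store.values())


docs = [Doc("a", "first"), Doc("a", "second"), Doc("b", "third")]
kept = dedupe_by_id(docs)
print(len(docs), "->", len(kept))  # prints: 3 -> 2
```

If many input documents shared a handful of ids, a store keyed this way would keep only a handful of them, which would look a lot like 193 documents shrinking to a few nodes.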

JJasonV

Adding back the vector_store I get the same number.

JJasonV

246

LLogan M

What happens if you do this?

ids = [document.doc_id for document in documents]
print("Unique doc ids: ", len(set(ids)))
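If the unique count comes back lower than the number of documents, it may also help to see which ids repeat and how often. A minimal sketch using only the standard library; `duplicated_ids` is a hypothetical helper name, and the sample ids are made up:

```python
from collections import Counter


def duplicated_ids(ids):
    """Return {doc_id: count} for every id that appears more than once."""
    counts = Counter(ids)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


sample_ids = ["a", "b", "a", "c", "a"]
print(duplicated_ids(sample_ids))  # prints: {'a': 3}
```

With the real documents, the same `ids` list built in the snippet above could be passed in directly.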

JJasonV

It's the docstore.

JJasonV

I added it back and got 7 nodes again.

JJasonV

I mistakenly thought this

docstore=SimpleDocumentStore(),

would create a new one.

LLogan M

it will actually

LLogan M

which is why I think you might have duplicate ids

LLogan M

(which would be a bug with the unstructured reader i think?)

JJasonV

I am rerunning the pipeline over the same data, so I expect the IDs to be the same. I just didn't expect that to persist across runs when I re-created it.

JJasonV

I re-added the IngestionCache and removed the DocStore. I get what I need now. Thanks so much!

🙇‍♂️

LLogan M

Sorry, I meant within the same batch of documents, many documents may have the same ID (Someone made a PR for this in the unstructured reader recently, which is why I suspect this)
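If duplicate ids within a batch do turn out to be the cause, one possible workaround is to make the ids unique before running the pipeline. This is only a sketch under assumptions: `Doc` is a hypothetical stand-in with a settable `doc_id`, and `make_ids_unique` is an invented helper, so check how your actual document objects expose their id before adapting it.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document with a settable id."""
    doc_id: str
    text: str


def make_ids_unique(docs):
    """Append '#<n>' to the 2nd, 3rd, ... occurrence of each repeated doc_id (in place)."""
    seen = {}
    for doc in docs:
        base = doc.doc_id
        n = seen.get(base, 0)  # how many times this id has appeared so far
        seen[base] = n + 1
        if n > 0:
            doc.doc_id = f"{base}#{n}"
    return docs


docs = make_ids_unique([Doc("a", "1"), Doc("a", "2"), Doc("b", "3"), Doc("a", "4")])
print([d.doc_id for d in docs])  # prints: ['a', 'a#1', 'b', 'a#2']
```

The suffixes depend on input order, so the ids stay stable across runs only if the documents load in the same order each time. An id that already ends in `#<n>` could also collide with a generated one.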

LLogan M

Either way I guess, glad it works
