I'm embarrassed to even ask this, but here goes.
I have a very strange issue. I recursively load a directory full of HTML using

documents = SimpleDirectoryReader(
    input_dir=source_directory,
    file_extractor={".html": UnstructuredReader()},
    file_metadata=lambda x: {"biz_id": int(biz_id)},
    required_exts=[".html"],
    recursive=True,
).load_data()
It loads all 193 documents and the data looks correct. BUT, when I run the ingestion pipeline over the loaded docs, I only ever get 7 nodes! Furthermore, if I change the transformations in the pipeline, swapping parameters and even trying different transformers, I still get exactly 7 nodes back every time!
There's a person with a very distinctive name in the docs. If I search the raw document text, I can find it; but it's nowhere in the transformed nodes, so I'm clearly losing data. What am I doing wrong?
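For what it's worth, this is roughly the check I'm running (a plain-Python sketch: the lists stand in for `[d.text for d in documents]` and `[n.text for n in nodes]`, and "Jane Doe" is a placeholder for the actual name):

```python
def mentions(texts, name):
    """Count how many text blobs contain the given name (case-insensitive)."""
    return sum(name.lower() in t.lower() for t in texts)

# Stand-ins for the real document and node texts.
doc_texts = ["...board member Jane Doe spoke...", "an unrelated page"]
node_texts = ["an unrelated chunk"]

print(mentions(doc_texts, "Jane Doe"))   # 1 -> present in the raw docs
print(mentions(node_texts, "Jane Doe"))  # 0 -> missing from the nodes
```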
Here's the pipeline. (The commented-out code is from me trying different variants; it makes no difference.)
pipeline = IngestionPipeline(
    transformations=[
        # Option 1: Use SemanticSplitterNodeParser for semantic splitting
        # SemanticSplitterNodeParser(
        #     buffer_size=512,
        #     breakpoint_percentile_threshold=95,
        #     embed_model=embed_model,
        #     verbose=True,
        # ),
        # Option 2: Use SentenceSplitter for sentence-level splitting
        SentenceSplitter(),
        # Option 3: Use UnstructuredElementNodeParser for custom parsing
        # UnstructuredElementNodeParser(),
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
    cache=IngestionCache(),
)
nodes = pipeline.run(documents=documents, show_progress=True, in_place=True)
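One thing I'm planning to check next, in case something is collapsing my documents before the transformations ever run, is whether the loaded documents actually have unique IDs. A plain-Python sketch of that check (with the real objects the list would be `[d.doc_id for d in documents]`):

```python
from collections import Counter

def duplicate_ids(ids):
    """Return {id: count} for every id that appears more than once."""
    return {i: n for i, n in Counter(ids).items() if n > 1}

# Stand-in for the real doc_id list.
ids = ["doc-1", "doc-2", "doc-1", "doc-3"]
print(duplicate_ids(ids))  # {'doc-1': 2}
```

If this turns up duplicates across my 193 documents, that would at least explain why the node count is so much smaller than the document count.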