Find answers from the community

JasonV
Joined September 25, 2024
Any idea when Python 3.13 will be supported across llama-index?
28 comments
Any practical tips for handling invalidly formatted JSON results from the model?
1 comment
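A common salvage pattern for this (a sketch, not llama-index's built-in handling; the function name is hypothetical) is to strip markdown code fences and fall back to the outermost brace span before giving up:

```python
import json
import re

def parse_loose_json(text):
    """Best-effort parse of model output that should be JSON.
    A sketch of common salvage steps, not an exhaustive repair."""
    # 1) Strip markdown code fences the model may have wrapped around the JSON.
    text = re.sub(r"```(?:json)?", "", text).strip()
    # 2) Try a straight parse first.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 3) Fall back to the outermost {...} span, in case of leading/trailing chatter.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        return json.loads(text[start : end + 1])
    raise ValueError("no JSON object found")
```

For harder cases (truncated output, single quotes), re-prompting the model with the parse error is usually more robust than ever-deeper string surgery.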
I see from llama_index.program.openai import OpenAIPydanticProgram -- is there an equivalent for Ollama? I can't seem to find one.
6 comments
Congrats on the workflows, I'll definitely kick the tires on that.
1 comment
Did instrumentation change? Arize Phoenix, which was totally happy, is now constantly warning me. Hmm.
2 comments
Anyone have best practices in mind?
22 comments
I love how many people are using tqdm now. It really took off.
1 comment
Quick question. My ingestion pipeline works just fine for building my vector store. But as my postgres DB expanded, I needed to embed a few other fields not related to the original ingestion. Would those here hand-roll a new embedding table, or just create a new VectorStoreIndex? I definitely don't want to add the new embeddings to the original. Any perspectives welcome.
6 comments
Has anyone used instructor yet?
2 comments
How are folks' experiences with Anthropic? I heard good things in a meeting today, but in my hands for the past few hours, it's been abysmal.
3 comments
Looks like I need a hand, if someone wise is around. 😎

I've been screwing up my node filtering all along. The only reason I've gotten such good results is that the queries embed a relevant cue and I'm getting lucky on filtering.

Here's my use case. I want to ingest lots of documents -- estimating close to 300,000 -- into pgvector. During ingestion, I set a metadata key business_id. I can verify that each node in the table has .metadata['business_id'] set to the correct value.

I need to, at query time, pull only those docs with the specific metadata['business_id'] == some_value and then take the top_k from that set, NOT pull top_k from all nodes and then return those matching. Make sense? I just need a where clause on my SQL query. 🙂
10 comments
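The distinction in question, filtering first and then taking top_k, can be illustrated with a toy ranking (hypothetical scores; in llama-index, passing MetadataFilters to the retriever is the usual way to get this pushed down as a WHERE clause by the pgvector store, worth verifying against the version in use):

```python
# Toy illustration (hypothetical data): why filtering BEFORE top-k matters.
nodes = [
    {"business_id": 1, "score": 0.9},
    {"business_id": 2, "score": 0.8},
    {"business_id": 2, "score": 0.7},
    {"business_id": 1, "score": 0.6},
    {"business_id": 1, "score": 0.5},
]
top_k = 2

# Wrong: rank ALL nodes, take top_k, then filter -- can return fewer
# than top_k (or zero) matches for the business you wanted.
post_filtered = [
    n for n in sorted(nodes, key=lambda n: -n["score"])[:top_k]
    if n["business_id"] == 1
]

# Right: restrict to the business first (the SQL WHERE clause),
# then take top_k from that subset.
pre_filtered = sorted(
    (n for n in nodes if n["business_id"] == 1),
    key=lambda n: -n["score"],
)[:top_k]
```

Here post_filtered ends up with a single node while pre_filtered returns the full top_k for business_id 1, which is exactly the gap a query-time filter closes.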
I'm embarrassed to even ask this, but here goes. 😰

I have a very strange issue. I recursively load a directory full of HTML using
Plain Text
documents = SimpleDirectoryReader(
    input_dir=source_directory,
    file_extractor={".html": UnstructuredReader()},
    file_metadata=lambda x: {"biz_id": int(biz_id)},
    required_exts=[".html"],
    recursive=True,
).load_data()


It loads all 193 documents and the data look correct. BUT, when I run the ingestion pipeline on the loaded docs, I only ever get 7 nodes! Furthermore, if I change the transformations in the pipeline, swapping params and even swapping in different transformers, I still only ever get 7 nodes back!

There's a person w/a very unique name in the docs. I can search the doc text and find it. But, it's not in the transformed nodes; I'm missing data. What am I doing wrong?

Here's the pipeline. (The commented-out code is me trying different variants; it makes no difference.):
Plain Text
pipeline = IngestionPipeline(
    transformations=[
        # Option 1: Use SemanticSplitterNodeParser for semantic splitting
        # SemanticSplitterNodeParser(
        #     buffer_size=512,
        #     breakpoint_percentile_threshold=95,
        #     embed_model=embed_model,
        #     verbose=True,
        # ),
        # Option 2: Use SentenceSplitter for sentence-level splitting
        SentenceSplitter(),
        # Option 3: Use UnstructuredElementNodeParser for custom parsing
        # UnstructuredElementNodeParser(),
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
    cache=IngestionCache(),
)
nodes = pipeline.run(documents=documents, show_progress=True, in_place=True)
23 comments
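One thing worth checking (an assumption, not a confirmed diagnosis): when a docstore is attached, IngestionPipeline applies upsert-style document management, so documents that share an id or hash collapse to one entry before the transformations ever run, which would make the node count independent of the splitter chosen. A toy model of that collapse (hypothetical ids):

```python
# Toy model of id-based upsert dedup, the kind of collapse a
# docstore-backed pipeline can apply before any transformation runs.
loaded = [
    ("page.html", "contents v1"),
    ("page.html", "contents v2"),   # same id: overwrites the first
    ("index.html", "other contents"),
]
docstore = {}
for doc_id, text in loaded:
    docstore[doc_id] = text  # later duplicates replace earlier ones

surviving = len(docstore)  # 2 documents survive, not 3
```

Printing the doc_id of each loaded document (and temporarily dropping docstore= and cache= from the pipeline) would confirm or rule this out quickly.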
Anyone else seeing duplicate OpenAI calls when using MultiStepQueryEngine?
3 comments
Can't I do a query-time metadata filter?

Let's say I indexed 5 documents, each from a different author. The node's metadata has the author on it. The docs seem to indicate I can only add a metadata filter to the retriever and then instantiate the query_engine. That means I have to keep re-creating the engine whenever the metadata I'm querying over changes, like searching for author1 in query1 and then author2 in query2.

Other frameworks allow me to filter at query time.
6 comments
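Rebuilding the retriever per query is one workaround, and the construction itself is cheap since it's just a thin object over the index. A toy factory (all names hypothetical) showing the shape of that pattern:

```python
# Sketch: build a lightweight per-query "engine" closed over the author
# filter, rather than keeping one long-lived engine per author.
def make_filtered_engine(nodes, author):
    corpus = [n for n in nodes if n["author"] == author]

    def engine(question):
        # Stand-in for real retrieval: just return the filtered texts.
        return [n["text"] for n in corpus]

    return engine

nodes = [
    {"author": "author1", "text": "a1 doc"},
    {"author": "author2", "text": "a2 doc"},
]
```

In llama-index itself, filters can be passed when calling index.as_retriever(filters=...), so recreating the retriever (not the index) per query is typically the intended pattern, though the exact API is worth verifying against the installed version.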