
Updated 4 months ago

Hello, is there a recommended way to rerun the ingestion pipeline in case of failure?

At a glance

The community member is facing an issue with their ingestion pipeline: 10K documents were inserted into the docstore, but embedding failed partway through. They are wondering whether there is a recommended way to rerun the pipeline, since the already-inserted documents are treated as duplicates and skipped on the rerun.

The comments suggest that running the documents in smaller batches may help limit the impact of failures, and that in this case the documents would have to be deleted from the docstore. Another community member notes that metadata can be excluded from embedding via the document.excluded_embed_metadata_keys attribute.

Hello is there a recommended way to rerun the ingestion pipeline in case of failure? 10K documents were inserted into the docstore but there was a failure during embedding and now rerunning it will skip them since they will be considered duplicates.

Is the solution to delete all from docstore or is there a better way?
4 comments
if I was running that many documents, I might run them in smaller batches to help with failures

But yea, would have to delete from the docstore in this case
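The batching advice above can be sketched in plain Python. This is an illustrative pattern, not a specific library API: `run_batch` stands in for whatever actually ingests a batch (e.g. a pipeline's run method), and the names and retry policy are hypothetical.

```python
# Sketch of running ingestion in smaller batches so one failure does
# not invalidate the whole run. `run_batch` is a stand-in for the real
# per-batch ingestion call; only the failed slices are kept for retry.

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def ingest_in_batches(documents, run_batch, batch_size=500):
    """Run `run_batch` over each slice; return the slices that failed."""
    failed = []
    for batch in chunked(documents, batch_size):
        try:
            run_batch(batch)
        except Exception:
            # Record the failed batch instead of aborting everything;
            # retry these after fixing the underlying embedding issue.
            failed.append(batch)
    return failed
```

With this shape, a rerun only needs to replay the failed batches, instead of deleting and re-ingesting all 10K documents.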
I see, will try to run in smaller batches. Also, is it possible to exclude the metadata from being embedded? I saw that the call to BaseEmbedding includes it without a way to exclude it.
Metadata can be excluded for embeddings (or llm calls) at the document/node level

document.excluded_embed_metadata_keys=["key", ...]
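Roughly, the mechanism is that keys listed in excluded_embed_metadata_keys are dropped from the metadata text that gets prepended to the node content before embedding. The stand-in class below only illustrates that behavior; it is not the real document class, whose actual formatting may differ.

```python
# Toy illustration of excluding metadata keys from the embedding text.
# The real attribute lives on LlamaIndex documents/nodes; this class
# mimics the idea: excluded keys never reach the embedding model.

class Doc:
    def __init__(self, text, metadata, excluded_embed_metadata_keys=()):
        self.text = text
        self.metadata = metadata
        self.excluded_embed_metadata_keys = list(excluded_embed_metadata_keys)

    def embed_text(self):
        """Build the text sent for embedding: metadata minus excluded keys."""
        kept = {k: v for k, v in self.metadata.items()
                if k not in self.excluded_embed_metadata_keys}
        meta_str = "\n".join(f"{k}: {v}" for k, v in kept.items())
        return f"{meta_str}\n\n{self.text}" if meta_str else self.text
```

For example, a document with `excluded_embed_metadata_keys=["file_path"]` would still carry file_path in its metadata for retrieval and filtering, but the path string would not influence the embedding.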
Awesome, thanks, I'll try this