I am using the following code to ingest a document into a vector store:

Plain Text
import hashlib
import os

import chromadb
import click
from llama_index.core.extractors import (
    KeywordExtractor,
    QuestionsAnsweredExtractor,
    SummaryExtractor,
    TitleExtractor,
)
from llama_index.core.ingestion import IngestionCache, IngestionPipeline
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.core.schema import MetadataMode
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PyMuPDFReader
from llama_index.vector_stores.chroma import ChromaVectorStore


def process_document(dbdir):
    chroma_client = chromadb.PersistentClient(path=dbdir)
    chroma_collection = chroma_client.get_or_create_collection("bitcoin")
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

    llm = OpenAI(model="gpt-4-0125-preview")

    loader = PyMuPDFReader()
    docs = loader.load_data(file_path=os.path.join(os.path.dirname(__file__), "..", "docs", "bitcoin.pdf"))
    # Derive a stable document ID from the content hash so re-ingesting
    # the same file produces the same doc_id.
    for doc in docs:
        doc.id_ = hashlib.sha256(doc.text.encode("utf-8")).hexdigest()
    click.echo(f"Loaded {len(docs)} documents")

    embed_model = OpenAIEmbedding()

    extractors = [
        SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model),
        TitleExtractor(nodes=5, llm=llm),
        SummaryExtractor(summaries=["prev", "self", "next"], llm=llm),
        QuestionsAnsweredExtractor(questions=10, metadata_mode=MetadataMode.EMBED, llm=llm),
        KeywordExtractor(keywords=5, llm=llm),
        embed_model,
    ]

    pipeline = IngestionPipeline(transformations=extractors, vector_store=vector_store, cache=IngestionCache())
    processed_nodes = pipeline.run(documents=docs, show_progress=True)
    click.echo(f"Processed {len(processed_nodes)} nodes")


How would I use refresh_ref_docs so that when I run the same document again it doesn't create duplicate entries but instead updates the associated metadata and embeddings? I use a hash of the content to create my doc_id, but whenever I try to add code that calls refresh I get the following error:

Plain Text
An error occurred: 'TextNode' object has no attribute 'get_doc_id'


Can I do a refresh as part of my ingestion pipeline?
6 comments
If you attach a docstore to the pipeline (and remember to persist it), it will remember the documents that have run through it. Assuming the input documents have consistent document IDs, it uses those IDs to line up and compare hashes.
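
For example, a minimal sketch (assuming SimpleDocumentStore and the UPSERTS docstore strategy, and reusing extractors, vector_store, and docs from your snippet; ./pipeline_storage is just a placeholder path):

Plain Text
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore

# Attach a docstore so the pipeline tracks seen documents by ID and hash.
# With a vector store attached, UPSERTS replaces the nodes of changed docs.
pipeline = IngestionPipeline(
    transformations=extractors,
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
pipeline.run(documents=docs, show_progress=True)

# Persist the docstore (and cache) so the next run can compare hashes.
pipeline.persist("./pipeline_storage")
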
So that means I don't need to assign the id_ manually as I do above?
I have attached a docstore, but it still seems to be creating new embeddings when I run the process multiple times. If the SemanticSplitterNodeParser was creating different chunks on each run, that would explain it. Would that be the case?
The document management is done before running any transformations.

Are you saving/loading the same docstore each time?
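
For example, a sketch of reloading it on subsequent runs (same setup and placeholder ./pipeline_storage directory as above):

Plain Text
import os

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=extractors,
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
)
# Reload the persisted docstore/cache before re-running; otherwise each
# run starts from an empty docstore and re-embeds everything.
if os.path.exists("./pipeline_storage"):
    pipeline.load("./pipeline_storage")

pipeline.run(documents=docs, show_progress=True)
pipeline.persist("./pipeline_storage")
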
Actually, probably not loading it correctly. Let me check that. Just confirming: if I use the pipeline, I don't need to explicitly refresh the docs; it's done in the ingestion pipeline?
Yeah, it's done in the pipeline 👍