
Updated 9 months ago

I am using the following code to ingest

I am using the following code to ingest a document into a vector store:

Plain Text
def process_document(dbdir):
    chroma_client = chromadb.PersistentClient(path=dbdir)
    chroma_collection = chroma_client.get_or_create_collection("bitcoin")
    vector_store = ChromaVectorStore(chroma_collection)

    llm = OpenAI(model="gpt-4-0125-preview")

    loader = PyMuPDFReader()
    docs = loader.load_data(file_path=os.path.join(os.path.dirname(__file__), "..", "docs", "bitcoin.pdf"))
    for doc in docs:
        doc.id_ = hashlib.sha256(doc.text.encode('utf-8')).hexdigest()
    click.echo(f"Loaded {len(docs)} documents")

    embed_model = OpenAIEmbedding()

    extractors = [
        SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model),
        TitleExtractor(nodes=5),
        SummaryExtractor(summaries=["prev", "self", "next"]),
        QuestionsAnsweredExtractor(questions=10, metadata_mode=MetadataMode.EMBED),
        KeywordExtractor(keywords=5),
        embed_model
    ]

    pipeline = IngestionPipeline(transformations=extractors, vector_store=vector_store, cache=IngestionCache())
    processed_nodes = pipeline.run(documents=docs, show_progress=True, store_doc_text=True, store_doc_metadata=True)
    click.echo(f"Processed {len(processed_nodes)} nodes")


How would I use refresh_ref_docs so that when I run the same document again it doesn't create duplicate entries but updates the associated metadata and embeddings? I use the hash of the content to create my doc_id, but whenever I try to add code that calls refresh I get the following error:

Plain Text
An error occurred: 'TextNode' object has no attribute 'get_doc_id'


Can I do a refresh as part of my ingest pipeline?
6 comments
If you attach a docstore to the pipeline (and remember to persist it), it will remember the documents that have run through it. This assumes the input documents have consistent document IDs: it uses those IDs to line up documents and compare hashes.
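To make the reply above concrete, here is a minimal sketch of the idea of lining up documents by ID and comparing hashes. It is plain Python and not LlamaIndex's actual implementation; the `upsert` helper, the dict standing in for the docstore, and the `bitcoin.pdf#page0` ID are all made up for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash the document text so changes can be detected."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert(docstore: dict, docs: dict) -> list:
    """Return the IDs that need (re)processing and record their hashes.

    docstore: doc_id -> content hash seen on a previous run (hypothetical stand-in)
    docs:     doc_id -> current text
    """
    to_process = []
    for doc_id, text in docs.items():
        new_hash = content_hash(text)
        if docstore.get(doc_id) == new_hash:
            continue  # same ID and same hash: nothing to do
        docstore[doc_id] = new_hash  # new ID, or known ID with changed text
        to_process.append(doc_id)
    return to_process


# Stable IDs: the ID stays the same when the text changes.
stable_store = {}
first_run = upsert(stable_store, {"bitcoin.pdf#page0": "original text"})
second_run = upsert(stable_store, {"bitcoin.pdf#page0": "original text"})
edited_run = upsert(stable_store, {"bitcoin.pdf#page0": "edited text"})

# Content-hash IDs: editing the text also changes the ID.
hashed_store = {}
for text in ("original text", "edited text"):
    upsert(hashed_store, {content_hash(text): text})

print(first_run, second_run, edited_run)
print(len(stable_store), len(hashed_store))
```

Note the last part: if the ID is itself the content hash, an edited document gets a brand-new ID, so nothing lines up with the old entry and the new version is stored alongside it rather than replacing it. That is an argument for stable IDs (a file path plus page number, say) when you want updates instead of duplicates.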
So that means I don't need to assign the id_ manually as I do above?
I have attached a docstore but it still seems to be creating new embeddings when I run the process multiple times. If the SemanticSplitterNodeParser was creating different chunks on each run then that would explain it. Would that be the case?
The document management is done before running any transformations.

Are you saving/loading the same docstore each time?
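Why saving and loading the same docstore matters can be shown with a small round-trip sketch. This is plain Python with a JSON file standing in for the persisted docstore; `load_docstore`, `save_docstore`, `run_once`, and the `doc-1` / `hash-a` values are hypothetical names for illustration only. In LlamaIndex I believe the equivalent is persisting the pipeline to a directory and loading it back before the next run, but check the docs for your version.

```python
import json
import os
import tempfile


def load_docstore(path: str) -> dict:
    """Load doc_id -> hash from disk, or start empty if nothing was saved."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_docstore(docstore: dict, path: str) -> None:
    """Write the doc_id -> hash mapping back to disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(docstore, f)


def run_once(path: str, hashes: dict) -> list:
    """One ingestion run: load state, find new or changed docs, save state."""
    docstore = load_docstore(path)
    changed = [doc_id for doc_id, h in hashes.items() if docstore.get(doc_id) != h]
    docstore.update(hashes)
    save_docstore(docstore, path)
    return changed


incoming = {"doc-1": "hash-a"}  # hypothetical ID and hash

with tempfile.TemporaryDirectory() as tmp:
    same_path = os.path.join(tmp, "docstore.json")
    # Same file on both runs: the second run sees nothing new.
    run_a = run_once(same_path, incoming)
    run_b = run_once(same_path, incoming)

    # A different (empty) location on the next run: everything looks new again.
    run_c = run_once(os.path.join(tmp, "other.json"), incoming)
```

If each run starts from an empty or different store, as in the last call, every document is treated as new and gets processed and embedded again, which matches the symptom described in this thread.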
Actually, I'm probably not loading it correctly. Let me check that. Just checking: if I use the pipeline, I don't need to explicitly refresh the docs, because it's done in the ingestion pipeline?
Yeah, it's done in the pipeline 👍