I understand that you're loading documents from text parsed out of a JSON file rather than using SimpleDirectoryReader. In that case you have more control over how your documents are created and managed. Here is some guidance on handling document IDs in this scenario:
- Creating Documents Manually:
When you create documents from parsed text, you can assign custom IDs to ensure consistency across ingestions. Here's an example (note that index-based IDs stay consistent only if the order of parsed_texts is stable across runs):
```python
from llama_index.core import Document

documents = []
for i, text in enumerate(parsed_texts):
    doc = Document(
        text=text,
        doc_id=f"custom_id_{i}"  # or any other unique identifier
    )
    documents.append(doc)
```
- Using Metadata for Identification:
If your JSON file contains metadata that can uniquely identify each document, you can use that for the doc_id. For example:
```python
for item in json_data:
    doc = Document(
        text=item['text'],
        doc_id=item['unique_identifier'],
        metadata=item.get('metadata', {})
    )
    documents.append(doc)
```
- Generating Consistent IDs:
If you don't have natural unique identifiers, you can generate deterministic IDs from the document content, for example by hashing it. Keep in mind that a content hash changes whenever the content changes, so this deduplicates identical re-ingestions but won't link an updated document back to its earlier ID:
```python
import hashlib

for text in parsed_texts:
    doc_id = hashlib.md5(text.encode()).hexdigest()
    doc = Document(text=text, doc_id=doc_id)
    documents.append(doc)
```
By assigning consistent doc_ids, you ensure that the UPSERTS strategy in your IngestionPipeline works as intended: when you ingest the same document again (perhaps with updated content), it updates the existing document rather than creating a duplicate.
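For completeness, here's a minimal sketch of wiring those documents into an upserting pipeline, assuming the core IngestionPipeline API; the SentenceSplitter transformation and in-memory SimpleDocumentStore are just illustrative choices:

```python
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore

# The docstore tracks doc_ids and content hashes, which is what lets
# the UPSERTS strategy detect re-ingested documents and update them
# in place rather than duplicating them.
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter()],
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)

# Re-running this with the same doc_ids updates existing entries
# instead of creating duplicates.
nodes = pipeline.run(documents=documents)
```

If you also attach a vector store to the pipeline, the upsert should replace the previously ingested nodes for an updated doc_id as well; with a docstore alone, deduplication is based on the doc_id and content hash.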