Anyone have index recommendations

rkhettry

Does anyone have index recommendations besides the vector store index? Maybe something that can encode high-level summaries across documents and can dynamically add documents to the index. I tried the vector store index, which seems a little too simple. I also tried RAPTOR, but it isn't optimized for dynamically adding documents (as emails come in).

8 comments

jerryjliu0

was about to suggest the raptor pack

the VectorStoreIndex has an insert_nodes function that you can use to dynamically add new documents

jerryjliu0

more generally check out our doc management guide: https://docs.llamaindex.ai/en/stable/module_guides/indexing/document_management/?h=document+managem

rkhettry

Yes, the raptor pack is definitely great, you guys did well implementing it! I think the main issue is the lack of low-level document control, which is the only thing that prevents it from being used in dynamic cases, but I'm excited to see how that progresses.

bidda7287

@jerryjliu0 i have some doubts

jerryjliu0

@rkhettry let me know if the insert / doc management resources are helpful

rkhettry

@jerryjliu0 I used index.insert from the documentation, but I'm not sure if inserting in bulk would be faster. I loop through a bunch of docs and insert them one by one. Is there another way to do this? Here's my code:

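import os

import chromadb
# import paths below assume llama-index >= 0.10
from llama_index.core import Document, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# get_existing_ids and extract_xml_content are helpers defined elsewhere (not shown)
def process_project_directory(project_code, network_path, vectordb_path, max_emails):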
    project_path = os.path.join(network_path, project_code)
    client = chromadb.PersistentClient(path=vectordb_path)
    collection_name = project_code
    collection = client.get_or_create_collection(collection_name)
    vector_store = ChromaVectorStore(chroma_collection=collection)
    index = VectorStoreIndex.from_vector_store(vector_store=vector_store, embed_model="local:BAAI/bge-large-en-v1.5")

    existing_ids = set(get_existing_ids(collection))
    # Loop through the XML files in project_path, skipping ones already indexed
    xml_files = [f for f in os.listdir(project_path) if f.endswith('.xml')]
    print(xml_files)
    count = 0
    for xml_file in xml_files:
        if count >= max_emails:
            break
        xml_file_path = os.path.join(project_path, xml_file)
        if xml_file not in existing_ids:
            email_content = extract_xml_content(xml_file_path, xml_file)
            document = Document(
                text=f"Body: {email_content.body}, Date: {email_content.date_sent}, From: {email_content.from_email}, To: {email_content.to_email}, Subject: {email_content.subject}",
                metadata={
                    "file_name": email_content.xml_file,
                    "id": email_content.id,
                    "subject": email_content.subject,
                    "date_sent": email_content.date_sent,
                    "from_email": email_content.from_email,
                    "to_email": email_content.to_email
                }
            )
            index.insert(document)
            count += 1

jerryjliu0

if you don't need to do any further chunking you could do insert_nodes
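
A rough sketch of what that suggestion could look like, not something posted in the thread: build one node per email and hand them to `index.insert_nodes` in batches instead of calling `index.insert` once per document. `insert_nodes` and `TextNode` are real LlamaIndex names; `batched`, `insert_in_batches`, `email_to_node`, and the batch size of 64 are hypothetical and only here for illustration. Whether this is actually faster depends on the embedding model and vector store, so it is worth timing both approaches.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


def batched(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_in_batches(index: Any, nodes: Iterable[Any], batch_size: int = 64) -> int:
    """Call index.insert_nodes once per batch and return how many nodes were sent.

    `index` is assumed to be a VectorStoreIndex (anything with insert_nodes works).
    """
    total = 0
    for batch in batched(nodes, batch_size):
        index.insert_nodes(batch)
        total += len(batch)
    return total


def email_to_node(email: Any) -> Any:
    """Turn one parsed email into a single TextNode, with no further chunking.

    `email` is assumed to expose the same fields as in the code above.
    """
    # Imported lazily so the batching helpers work without llama-index installed.
    from llama_index.core.schema import TextNode

    return TextNode(
        text=f"Body: {email.body}, Date: {email.date_sent}, Subject: {email.subject}",
        metadata={"file_name": email.xml_file, "subject": email.subject},
    )
```

In the loop above, that would mean collecting `email_to_node(email_content)` results into a list and calling `insert_in_batches(index, nodes)` once after the loop, rather than calling `index.insert(document)` on every iteration.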

bidda7287

@jerryjliu0 How do I use the auto-merging retriever alongside ChromaDB?
