hello, was wondering how I can manage documents that are ingested into the index. How can I keep track of which documents have already been embedded and remove them in the future if I want to? Also, is there a way to manage telemetry/evaluation for multiple users?
3 comments
Please stand by @RandomCitizen as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
Sure, I can help with that.

To manage documents that are ingested into the index, you can use the VectorStoreIndex class from the llama_index.core module. This class lets you create a vector index on top of your existing vector store and then query it, for example to check which documents it can see. Here's an example of how you can do this:

Plain Text
from llama_index.core import VectorStoreIndex

# `pipeline` is assumed to be the IngestionPipeline created earlier,
# and `embed_model` the embedding model used during ingestion
index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store, embed_model=embed_model
)

# query the index to check which documents it can see
print(
    index.as_query_engine(similarity_top_k=10).query(
        "What documents do you see?"
    )
)

(Source: GitHub)

To avoid document duplication and save embedding tokens when your source is constantly updating, you can manage documents so that you always have the latest data without re-embedding content you have already seen. For example, if you're using Discord as your source, you can ingest only new messages on each run. You can learn more about this from the Discover LlamaIndex video series episode on Discord document management.
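As a concrete illustration (a minimal sketch, not from the original answer; the directory path and splitter settings are placeholders), attaching a docstore to an IngestionPipeline makes it skip documents whose doc_id and content hash it has already seen:

Plain Text
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore

# filename_as_id=True gives each document a stable doc_id across runs
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter()],
    docstore=SimpleDocumentStore(),  # tracks doc_id -> content hash
)

# unchanged documents are skipped; changed ones are re-processed
nodes = pipeline.run(documents=documents)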

To update or delete an existing document in an index, you can use its doc_id. You can also add a new document to an existing index. More details can be found in the Document Management section of the LlamaIndex documentation.
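For example (a sketch assuming `index` is the VectorStoreIndex from above; the document text and ids are placeholders):

Plain Text
from llama_index.core import Document

# insert a new document under a stable id
index.insert(Document(text="hello world", doc_id="doc-1"))

# update it in place: the old nodes for doc-1 are replaced
index.update_ref_doc(Document(text="hello again", doc_id="doc-1"))

# delete it, along with its nodes
index.delete_ref_doc("doc-1", delete_from_docstore=True)

# inspect which documents the index is currently tracking
print(index.ref_doc_info)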

As for managing telemetry/evaluation for multiple users, one way to handle this is to use the BaseManager class from Python's multiprocessing.managers module to provide sequential access to the index through a separate server process and a lock. This can help handle concurrent uploads and ensure sequential inserts into the index. Here's a basic example of what your index_server.py might look like:
Plain Text
import os
from multiprocessing import Lock
from multiprocessing.managers import BaseManager
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

os.environ["OPENAI_API_KEY"] = "your key here"

index = None
lock = Lock()

def initialize_index():
    # build the global index; the lock serializes access to it
    global index
    with lock:
        documents = SimpleDirectoryReader("./documents").load_data()
        index = VectorStoreIndex.from_documents(documents)

def query_index(query_text):
    # query the shared index and return the response as a string
    global index
    with lock:
        response = index.as_query_engine().query(query_text)
    return str(response)

if __name__ == "__main__":
    print("initializing index...")
    initialize_index()

    # expose query_index to other processes; clients connect with
    # the same address and password and get sequential access
    manager = BaseManager(("", 5602), b"password")
    manager.register("query_index", query_index)
    server = manager.get_server()

    print("starting server...")
    server.serve_forever()

(Source: LlamaIndex Documentation)
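A client process could then connect like this (a sketch; the address, port, and password are the placeholder values from the server above):

Plain Text
from multiprocessing.managers import BaseManager

manager = BaseManager(("", 5602), b"password")
manager.register("query_index")  # register by name only on the client
manager.connect()

# the call runs on the server; _getvalue() unwraps the proxy result
response = manager.query_index("What documents do you see?")._getvalue()
print(response)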

I hope this helps! Let me know if you have any other questions.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------