Find answers from the community

Maxx
Joined September 25, 2024
Using this code to generate an index from scratch:
Python
import faiss

# Import paths below assume a recent llama-index release; older versions
# import these names from the top-level llama_index package instead.
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore


def nodes_to_faiss(nodes, persist_dir):

    # Generate an embedding for each node (OPENAI_KEY is defined elsewhere)
    embed_model = OpenAIEmbedding(api_key=OPENAI_KEY)

    for node in nodes:
        node_embedding = embed_model.get_text_embedding(
            node.get_content(metadata_mode="all")
        )
        node.embedding = node_embedding

    # Build the index from the nodes
    d = 1536  # dimensionality of OpenAI's text-embedding-ada-002 vectors
    faiss_index = faiss.IndexFlatL2(d)
    vector_store = FaissVectorStore(faiss_index=faiss_index)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    index = VectorStoreIndex(
        nodes=nodes,
        storage_context=storage_context,
    )

    # Save the resulting index to disk so that we can use it later
    print("Index created. Saving to disk...")
    index.storage_context.persist(persist_dir=persist_dir)

This used to work fine, but now I am getting this error:
Plain Text
RetryError: RetryError[<Future at 0x21da34162b0 state=finished raised InvalidRequestError>]
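The RetryError here is tenacity's wrapper around whatever the OpenAI call actually raised, so unwrapping it per node usually exposes the real InvalidRequestError; empty or over-long node text is a common cause. A minimal debugging sketch, assuming nodes, embed_model, and OPENAI_KEY as above:
Python
from tenacity import RetryError

embed_model = OpenAIEmbedding(api_key=OPENAI_KEY)
for node in nodes:
    text = node.get_content(metadata_mode="all")
    try:
        embed_model.get_text_embedding(text)
    except RetryError as e:
        # tenacity keeps the last failed attempt; .exception() is the real error
        print(f"Node {node.node_id!r} (length {len(text)}) failed:",
              e.last_attempt.exception())
        break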
2 comments
I want to be able to delimit my document with a string and then index it so that each node contains the tokens between the delimiter strings. How do I do this?
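One way, assuming you construct the nodes by hand rather than through a node parser: split the raw text on the delimiter and wrap each piece in a TextNode. A minimal sketch (the delimiter string is illustrative, and the import path assumes a recent llama-index):
Python
from llama_index.core.schema import TextNode

def nodes_from_delimited_text(text, delimiter="<<<SECTION>>>"):
    # One node per delimited segment; blank segments are skipped
    return [
        TextNode(text=chunk.strip())
        for chunk in text.split(delimiter)
        if chunk.strip()
    ]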
6 comments
Several Questions (sorry):
  1. I need to bring my index retrieval time down to less than 10 seconds consistently. Is this even possible?
  2. I switched from the in-memory vector store to Redis Vector Store, which helped a lot with speed, but it sometimes still takes upwards of 20 seconds. Is this just a feature of using Redis, or is there a good chance I am doing something wrong? My 3 indexes have ~2000, 1, and 1 documents in them respectively, but even a single-document index can sometimes time out.
  3. If not Redis, is there another fast, free external vector DB I could try? (One option is sketched below.)
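On question 3: Chroma is one free option that can run in-process with local persistence, which removes the network hop entirely. A minimal sketch, assuming the llama-index Chroma integration and chromadb are installed, with nodes built as elsewhere in this thread (the path and collection name are illustrative):
Python
import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Persistent local client; data lives on disk under ./chroma_db
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("cards")

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)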
5 comments
I have a dataset of ~2000 documents containing information about, and checklists for, various sports trading card sets. I need the index retrieval time to be about half of what it currently is (about a minute). What kind of considerations should I make when deciding what type of index to use? I am currently using a vector store index, which gives decent results but takes too long. Will I have to break up the index if I want to retrieve faster?
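Before restructuring the index, it may be worth confirming where the minute actually goes: vector search over ~2000 documents is normally very fast, so the time is often spent embedding the query or synthesizing the LLM answer rather than in retrieval itself. A minimal timing sketch, assuming an index loaded as elsewhere in this thread (the query string is illustrative):
Python
import time

retriever = index.as_retriever(similarity_top_k=3)

t0 = time.perf_counter()
nodes = retriever.retrieve("2021 Topps Chrome checklist")
print(f"Retrieval alone took {time.perf_counter() - t0:.2f}s")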
8 comments
FaissVectorStore saving/loading changes? My old code for generating/saving and loading an index no longer works.
Python
import os

import faiss
from dotenv import load_dotenv
from tqdm import tqdm

# Import paths below assume a recent llama-index release; older versions
# import these names from the top-level llama_index package instead.
from llama_index.core import (
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore


def construct_index_from_nodes(nodes, persist_dir):
    load_dotenv()
    openai_api_key = os.getenv('OPENAI_API_KEY')

    # Generate an embedding for each node
    embed_model = OpenAIEmbedding(api_key=openai_api_key)

    for node in tqdm(nodes, desc="Generating Node Embeddings"):
        node_embedding = embed_model.get_text_embedding(
            node.get_content(metadata_mode="all")
        )
        node.embedding = node_embedding

    # Build the index from the nodes
    print("Building Index from Nodes...")
    d = 1536  # dimensionality of OpenAI's text-embedding-ada-002 vectors
    faiss_index = faiss.IndexFlatL2(d)
    vector_store = FaissVectorStore(faiss_index=faiss_index)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    index = VectorStoreIndex(
        nodes=nodes,
        storage_context=storage_context,
    )

    # Save the resulting index to disk so that we can use it later
    print("Index created. Saving to disk...")
    index.storage_context.persist(persist_dir=persist_dir)
    print("Complete.")


def test_index(persist_dir, test_query, similarity_top_k=3):
    # Rebuild the storage context from disk and load the index
    vector_store = FaissVectorStore.from_persist_dir(persist_dir)
    storage_context = StorageContext.from_defaults(
        vector_store=vector_store, persist_dir=persist_dir)
    index = load_index_from_storage(storage_context=storage_context)
    retriever = index.as_retriever(similarity_top_k=similarity_top_k)

    nodes = retriever.retrieve(test_query)
    for i, node in enumerate(nodes):
        print("NODE", i, "[", round(node.get_score(), 2), "]", ":",
              node.node.get_content(), "\n\n")

Used to produce:
Plain Text
[docstore.json, graph_store.json, index_store.json, vector_store.json]

Now produces:
Plain Text
[default__vector_store.json, docstore.json, graph_store.json, image__vector_store.json, index_store.json]
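For what it's worth, newer llama-index releases persist one vector store per namespace (default for text, image for images), so the default__ and image__ prefixes are expected rather than a sign of corruption, and FaissVectorStore.from_persist_dir should pick up default__vector_store.json on its own. For an index saved under the old single-file layout, pointing from_persist_path at the legacy filename is one possible workaround; a sketch under that assumption (the storage path is illustrative):
Python
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.vector_stores.faiss import FaissVectorStore

# Legacy layout wrote the faiss index to vector_store.json
vector_store = FaissVectorStore.from_persist_path("./storage/vector_store.json")
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, persist_dir="./storage")
index = load_index_from_storage(storage_context=storage_context)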
8 comments
If I have lists in my documents which are longer than the node token count, does that mean that parts of my list are being separated from their original context? As in, no longer attributed to the header of the list or the description which comes before it? Would a good way to fix this be to change every entry in the list to a full sentence explaining the item's relation (x1 is in set y, x2 is in set y, etc.)?
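A chunk that starts mid-list does lose the header unless something re-attaches it (chunk overlap only helps near the boundary). A minimal sketch of the rewrite described above, assuming each list pairs a set header with its items (the parsing is illustrative and would need adapting to the real document format):
Python
def list_to_sentences(header, items):
    # "Set Y", ["x1", "x2"] -> "x1 is in Set Y. x2 is in Set Y."
    return " ".join(f"{item} is in {header}." for item in items)

print(list_to_sentences("Set Y", ["x1", "x2"]))
# x1 is in Set Y. x2 is in Set Y.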
5 comments
Does indexing with and without markdown make a difference with LlamaIndex?
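It can make a difference at parsing time: plain text falls back to generic sentence/token splitting, while markdown headings give llama-index's markdown-aware parser explicit boundaries to split on, so nodes line up with sections. A minimal sketch, assuming a recent llama-index (MarkdownNodeParser lives in its node_parser module):
Python
from llama_index.core import Document
from llama_index.core.node_parser import MarkdownNodeParser

doc = Document(text="# Set Y\n\n- x1\n- x2\n\n# Set Z\n\n- x3\n")
nodes = MarkdownNodeParser().get_nodes_from_documents([doc])
for node in nodes:
    # Each heading starts a new node, so list items stay with their header
    print(repr(node.get_content()))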
3 comments
My data is a collection of documents which contain long lists of items as well as descriptions of both the items and the sets that contain them (not organized in any particular way, since the data was scraped). When indexing, I always get this error:
Plain Text
Token indices sequence length is longer than the specified maximum sequence length for this model (1050 > 1024). Running this sequence through the model will result in indexing errors
However, even though the index is successfully created, I often get incorrect responses, most commonly something like "X is not in set Y" even when X appears in the list for that set. My assumption is that the documents are longer than the maximum chunk size and are being split into multiple chunks (with a little bit of overlap), so I end up with situations where X appears in the second chunk of a list, but that chunk has no context for what the set is called. Sorry if any part of this is confusing; I am trying to verify that I understand why I am getting bad responses from my data. Would the solution be to manipulate the data so that every document is under 1024 tokens? Or, instead of having lists formatted like "Set Y: -x1 -x2 ...", to use something like "x1 is in Set Y. x2 is in Set Y. ..."?
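Both of the fixes described above can work; another option that avoids rewriting the text is to attach the set name as document metadata, since llama-index copies metadata onto every chunk and, with metadata_mode="all" as in the code earlier in this thread, includes it in what gets embedded. A minimal sketch, assuming recent import paths; the set name, text, and chunk sizes are illustrative:
Python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

doc = Document(
    text="Checklist: -x1 -x2 -x3 ...",   # illustrative scraped content
    metadata={"set_name": "Set Y"},      # carried onto every chunk
)

# Even a chunk taken from the middle of the list still carries the set name
splitter = SentenceSplitter(chunk_size=256, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents([doc])
print(nodes[0].get_content(metadata_mode="all"))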
2 comments