Hello all - I'm doing a project using LlamaIndex to create text prediction for my documents. I'm looking to understand what is happening within the index and when it is queried. Before I jump into the code is there any background material/documentation that can help?
thanks. Is there a next layer down?

e.g. I'm wondering what a node is, and particularly what the generated query looks like.

The questions that come to mind are:

  1. does LlamaIndex start to answer the query? If so, what logic/algorithm is it performing on the query.
  2. what data is being sent to OpenAI? multi-dimensional matrices, text or both?
  3. Can I arrange the data being indexed in a way that improves the "matching"?
1/2 -- yea so when you create an index with a list of documents, each document gets broken up into node chunks (default is 4000 tokens per chunk). In a vector index, each of these nodes is embedded and a vector representing each chunk is saved.

Then, at query time (also for a vector index), the query is also embedded. The index fetches the top k closest matching nodes (default top k is 1). If the node + prompt instructions + query is longer than the LLM context length (4096 tokens), it gets broken down into more chunks so that it fits into the LLM context.

If, after all this, there is more than one chunk of text, the answer is refined across multiple LLM calls
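Roughly, in code (a hedged sketch: this uses the older gpt_index/llama_index API that the rest of this thread uses, so class names like GPTSimpleVectorIndex and arguments like similarity_top_k may differ in your version):

Python
    from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

    # each document is split into node chunks (~4000 tokens each by default, per above)
    # and each chunk is embedded and stored
    documents = SimpleDirectoryReader("./data").load_data()
    index = GPTSimpleVectorIndex(documents)

    # at query time the query is embedded, the closest node(s) are fetched,
    # and node text + prompt instructions + query are sent to the LLM
    response = index.query(
        "What does this document say about X?",
        similarity_top_k=1,  # default top k
    )
    print(response)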
3 -- oh for sure! Usually, if you can separate documents into clear topics or sections, this will help. You can also look into composable indexes, where you can wrap any number of indexes with another top level index: https://gpt-index.readthedocs.io/en/latest/how_to/composability.html
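A minimal sketch of wrapping two indexes with a top-level index (same older API as above; the docs_topic_1/docs_topic_2 variables and the set_text summaries are just placeholders):

Python
    from llama_index import GPTListIndex, GPTSimpleVectorIndex

    index_1 = GPTSimpleVectorIndex(docs_topic_1)
    index_2 = GPTSimpleVectorIndex(docs_topic_2)

    # give each sub-index a short description so the top-level index can route to it
    index_1.set_text("Documents about topic 1")
    index_2.set_text("Documents about topic 2")

    # top-level list index composed over the two vector indexes
    top_index = GPTListIndex([index_1, index_2])
    response = top_index.query("A question that spans both topics")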
If you know python, I recommend reading the code for the best understanding. I personally like to use a debugger to step through the functions and see what's going on when I call functions.
@Logan M a question on part 2: if the prompt (node+query+instructions) is longer than 4096, how is it split? I can't understand that, sorry 🙂
Ofc it's clear at the node level, but splitting the instructions doesn't make sense to me, that's why I asked
In that case, the node would be split into further chunks so that everything fits. Instructions/prompt template and query text is not touched
Now it makes complete sense! And it uses the separator of the text splitter to do so, right?
Almost. In that case, it will be using the prompt helper to split, which has its own separator (which is a space by default)
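If you want to see (or change) that separator, a rough sketch assuming the older PromptHelper constructor (argument names may vary between versions):

Python
    from llama_index import PromptHelper

    # the prompt helper re-splits node text so node + prompt + query fit the context window;
    # the separator used for that re-splitting defaults to a single space
    prompt_helper = PromptHelper(
        max_input_size=4096,   # LLM context length
        num_output=256,        # tokens reserved for the answer
        max_chunk_overlap=20,
        separator=" ",
    )
    # then pass it to the index, e.g.:
    # index = GPTSimpleVectorIndex(documents, prompt_helper=prompt_helper)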
I’ve always ignored that part. Now it’s completely unlocked!
thanks dude, I was missing that a node was a chunk of 4000 tokens. (doing an ElasticSearch upgrade and node means something completely different there :p)
I'm sorta talking to the 🦆 here.

vector representing each chunk <- this is generated by sending the text to an embedding API like OpenAI's ada-002 model.

from the usage docs for setting the response mode: mode="embedding" will synthesize an answer by fetching the top-k nodes by embedding similarity.

How does it know if the top-k nodes are similar? (e.g. a call to OpenAI, or does it do the maths on the vectors)

otherwise, I'm starting to see how composable indexes could work. it seems I can create a main one per user and add sub-indexes of objects they "own" (e.g. modes of transport (like car, bike and van) and storage (like shed, garage, parking space))

Then I query the user index something like "list the models and the storage for each mode of transport owned by this user".

That would traverse the two sub-indices and do its embedding similarity before sending the full query to OpenAI..?
fyi - I'm about to try a composable index and see what gets output 🙂
mode="embedding" is only for list and tree indexes. For those and vector indexes though, you can use similarity_top_k=X in the query to control the top_k

It uses cosine similarity under the hood to compare vectors.

And yea, if you wrap two vector indexes with a list index, that's how it will work 💪
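To make the "maths on the vectors" part concrete: the similarity comparison happens locally, nothing extra is sent to OpenAI for it. A minimal sketch in plain numpy (not LlamaIndex's actual helper):

Python
    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # cosine of the angle between two embedding vectors; closer to 1.0 = more similar
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    query_embedding = np.array([0.1, 0.3, 0.9])
    node_embedding = np.array([0.2, 0.25, 0.8])
    print(cosine_similarity(query_embedding, node_embedding))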
I am wondering about creating a data layer to track Documents, Indexes and Summaries.

Python
    # summarise the documents
    summary = index_1.query(
        "What is a summary of the content of this document?"
    )
    index_1.set_text(str(summary))
    summary = index_2.query(
        "What is a summary of the content of this document?"
    )
    index_2.set_text(str(summary))
    summary = index_3.query(
        "What is a summary of the content of this document?"
    )
    index_3.set_text(str(summary))

    index = GPTListIndex([index_1, index_2, index_3])
part of that is because it has to calculate the summary on each run
oh, gonna try saving the index after setting it 🤦‍♂️
yeah that worked a treat 🙊
Haha yes, nice! 💪

One thing that could be improved is using a temporary list index with response_mode="tree_summarize" in the query to generate summaries.

If all your indexes are vector indexes, then your summaries are only built using the top_k nodes (1 by default)

Or, for now, you can just increase the top_k when generating the summaries.

But, if your summaries are working fine, carry on! 😎
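For reference, the tree_summarize approach could look roughly like this (same older API as above; documents_1 and index_1 are placeholders for one document's nodes and its index):

Python
    from llama_index import GPTListIndex

    # temporary list index over one document's nodes, used only to build the summary;
    # tree_summarize builds the answer over all nodes rather than just the top_k matches
    tmp_index = GPTListIndex(documents_1)
    summary = tmp_index.query(
        "What is a summary of the content of this document?",
        response_mode="tree_summarize",
    )
    index_1.set_text(str(summary))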
seems good for the moment thanks. I need to get my head around what happens if I increase the top_k. also nodes, being chunks!! 😉
I've started to go through the code and get familiar with it, so I think I can 🖖
Awesome, glad to hear! 💪
I've also been reading up on the cosine function and ❤️ that it can do that locally.
which has got me thinking of other ways of generating embeddings.
I intend on using OpenAI for the text generation. With that in mind do I really need to use OpenAI to generate the embeddings?
could there be a more cost effective method to do it locally... 🤔
By default, it's using text-embedding-ada-002 to generate embeddings, and it's pretty dirt cheap tbh

You could do it locally, but everything open source either has a shorter context length than OpenAI, or isn't as good in general. At least that was my experience

Here's a page showing how you can use any model from huggingface (for example, you could use a sentence transformer model from huggingface)

https://gpt-index.readthedocs.io/en/latest/how_to/embeddings.html#custom-embeddings
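Roughly what that looks like, based on that docs page (hedged: the LangchainEmbedding wrapper and the model name here are illustrative and may differ in your version):

Python
    from langchain.embeddings.huggingface import HuggingFaceEmbeddings
    from llama_index import GPTSimpleVectorIndex, LangchainEmbedding

    # wrap a local sentence-transformer model so the text being embedded never leaves your machine
    embed_model = LangchainEmbedding(
        HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
    )
    index = GPTSimpleVectorIndex(documents, embed_model=embed_model)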
There is a question about data sovereignty in my company so I'm liking the idea of keeping as much as possible local ✌️
@Logan M apologies for picking up this old thread, but can you give any pointers to where in the code the logic is implemented that makes multiple LLM calls when node+prompt+query exceeds the context length?
thanks, i will have a look what happens in there.