
Top_k

Hi, I'd like to know if it's possible to customize the index used by a chat_engine. I need to retrieve more than the default 2 documents for each interaction, but it seems like it can't be done...
[Attachment: image.png]
48 comments
You'll need to pass the top_k value to the engine along with chat_mode and the rest.


By default it only picks two nodes.
chat_engine = index.as_chat_engine(similarity_top_k=10, ...)
Hi @WhiteFang_Jr, it certainly works. I was looking at the class code but I didn't see the similarity_top_k attribute:
https://github.com/run-llama/llama_index/blob/main/llama_index/chat_engine/condense_question.py

However, now I see that if I set k=24, for instance, the engine doesn't make additional calls to use all the nodes/documents. Is it possible to have this feature in chats as well?
This is because index.as_chat_engine(similarity_top_k=10, ...) will pass all kwargs to the chat engine, and also to the index that the underlying chat engine uses. similarity_top_k is an argument used by the underlying index.

If you are creating the chat engine yourself, without index.as_chat_engine, then you'll have to pass in similarity_top_k when you create the query engine
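For reference, a minimal sketch of that manual route, assuming the legacy CondenseQuestionChatEngine API and an index variable that already exists:

Python
from llama_index.chat_engine import CondenseQuestionChatEngine

# similarity_top_k lives on the query engine (via its retriever), not on the chat engine
query_engine = index.as_query_engine(similarity_top_k=10)

chat_engine = CondenseQuestionChatEngine.from_defaults(query_engine=query_engine)
response = chat_engine.chat("What do the documents say about X?")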
@Logan M and is it possible to specify that we want to apply "refine" when the number of tokens doesn't fit into a single prompt? By default, when k is big, I only get the first documents in the prompt and there are no additional calls for refinement.
It should be refining automatically, except for context and context_plus_condense chat engines (since refining doesn't quite make sense there)
Hi @Logan M, similarity_top_k is an attribute, but of the retriever class: https://github.com/run-llama/llama_index/blob/main/llama_index/indices/list/retrievers.py
If I look at the chat engine, it's perfectly possible to pass similarity_top_k as a parameter, but I don't see in the code how it then gets assigned to the retriever.
https://github.com/run-llama/llama_index/blob/main/llama_index/chat_engine/condense_question.py
It might be some advanced Python software-engineering concept...
@Logan M yes, it's working well. However, by looking at the prompt traces with Traceloop, I've found that the first completion is generally more accurate, whereas the second completion gives complementary information and is sometimes off the topic of the question.
Is there any way to return the first completion's result and then ask the user if they want more information? (This would trigger the second completion without the user needing to ask a second question.)
I'd say that would also be useful in case the first completion failed to retrieve precisely what the user wanted and the second, third, etc. could have the answer.
Another problem I've found with the condense_question chat engine is when the topic changes completely in the next question. The previous state still contains a lot of information about the former topic, and the generated answer has nothing to do with the new topic. Is this avoidable with some technique?
It gets passed as kwargs

So as_chat_engine() is defined here
https://github.com/run-llama/llama_index/blob/895c15f283c5c41d1ea43753e029283afa57cdc4/llama_index/indices/base.py#L353

similarity_top_k, if specified, will be in the kwargs dict

Notice that as_query_engine() passes in all kwargs
https://github.com/run-llama/llama_index/blob/895c15f283c5c41d1ea43753e029283afa57cdc4/llama_index/indices/base.py#L356

All those kwargs get passed to as_retriever() -- which will pass them into the retriever
https://github.com/run-llama/llama_index/blob/895c15f283c5c41d1ea43753e029283afa57cdc4/llama_index/indices/base.py#L346
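In other words, something like this toy, self-contained sketch of the relay (stand-in classes, not the real LlamaIndex code):

Python
class ToyRetriever:
    def __init__(self, similarity_top_k: int = 2):
        # this is where similarity_top_k is finally consumed
        self.similarity_top_k = similarity_top_k

class ToyQueryEngine:
    def __init__(self, retriever: ToyRetriever):
        self.retriever = retriever

class ToyChatEngine:
    def __init__(self, query_engine: ToyQueryEngine):
        self.query_engine = query_engine

def as_retriever(**kwargs) -> ToyRetriever:
    return ToyRetriever(**kwargs)

def as_query_engine(**kwargs) -> ToyQueryEngine:
    return ToyQueryEngine(as_retriever(**kwargs))    # kwargs ride along

def as_chat_engine(chat_mode: str = "best", **kwargs) -> ToyChatEngine:
    return ToyChatEngine(as_query_engine(**kwargs))  # and along again

engine = as_chat_engine(chat_mode="condense_question", similarity_top_k=10)
print(engine.query_engine.retriever.similarity_top_k)  # -> 10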
Not really possible? It's making two calls because the retrieved text doesn't fit into one LLM call. So you'd need to reduce similarity_top_k, reduce your chunk_size, or maybe try the compact_accumulate response mode? (That last one might just end up confusing the LLM, tbh.)
I think it's avoidable by not using the condense_question chat engine πŸ˜… In my opinion, it leads to the least natural-feeling conversations
Let me see if I understand it: the CondenseQuestionChatEngine class has an attribute "query_engine: BaseQueryEngine".
This BaseQueryEngine should have a retriever, and the retriever should have the similarity_top_k.
Just by creating an instance of CondenseQuestionChatEngine, will all the other instances (query engine and retriever) also be created as attributes of the CondenseQuestionChatEngine?
Frankly, I don't get how similarity_top_k ultimately ends up assigned to the retriever when it's passed to the CondenseQuestionChatEngine instance through the kwargs.
Let me draw it πŸ™‚
Sure; since not all the possible parameters are described in the documentation and it's somewhat advanced Python, I'm having a bit of trouble understanding it.
Yea it's just abusing kwargs a bit -- kwargs is just a python way of catching any keyword argument
[Attachment: image.png]
Probably it will make more sense if you try using it in a little test function
But I can guarantee that similarity_top_k is getting passed into the retriever
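A tiny test function along those lines, just to show the catching part:

Python
def catch_all(required, **kwargs):
    # any keyword arguments not named explicitly end up in the kwargs dict
    print(required, kwargs)

catch_all("hello", similarity_top_k=10, response_mode="tree_summarize")
# -> hello {'similarity_top_k': 10, 'response_mode': 'tree_summarize'}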
hmm, got the point. Thanks a lot Logan!
[Attachment: image.png]
In my case, because I'm loading from file to build the index, it should first pass through here
[Attachment: image.png]
so somehow the generated BaseIndex will have had its constructor called with the kwargs passed along as well
Yup exactly! So when you call load_index_from_storage(...) you can do things like passing the service context

index = load_index_from_storage(storage_context, service_context=service_context)
kwargs handles passing that all down into the base index constructor
Indeed, this was another doubt I had, but I let it pass.
[Attachment: image.png]
Here, when I restore the index, I don't pass the service_context. I was assuming that what I retrieve also contains the service context stored from when I saved the index.
[Attachment: image.png]
[Attachment: image.png]
Yea it does not save the service context to disk.

However, if you set a global service context in your code, you don't have to worry about passing it in
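A hedged sketch of that global route, assuming the legacy ServiceContext / set_global_service_context API:

Python
from llama_index import ServiceContext, set_global_service_context

# configure llm / embed_model / chunk_size here as needed
service_context = ServiceContext.from_defaults()
set_global_service_context(service_context)

# after this, load_index_from_storage(storage_context) picks up the global
# service context even if you don't pass it in explicitly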
I didn't know about that global service context line. Now I know why it's working. Thanks 😉
Actually yes: with condense_question, when the topic of the questions changes, at first it mixes in the former topic when rebuilding the question. The context mode doesn't suffer from that; instead it gets a bit confused by the mixture of former topics in the same context and tends to rely more on the LLM's own knowledge than on the retrieved documents. All in all, the context mode seems more advisable at first glance.
@Logan M one doubt about the tree_summarize parameter:
What's the meaning of it? It seems like it's defining a tree index, but in the previous line I'm just defining the index as a VectorStoreIndex (which allows retrieving the top-k most similar docs).
[Attachment: image.png]
Tree summarize is just a method for what to do after you retrieve text

In this case, it will build a bottom up "tree" of responses, using your query

However, this is really only noticeable with large top k values
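For example (hedged sketch, assuming the legacy as_chat_engine API, where response_mode is just another kwarg that rides down to the response synthesizer):

Python
chat_engine = index.as_chat_engine(
    chat_mode="condense_question",
    similarity_top_k=24,
    response_mode="tree_summarize",  # summarize the retrieved chunks bottom-up
)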
ok! Kind of a refine mode, but doing the LLM completions as a bottom-up tree. The more documents retrieved, the more levels the tree will need to have.
yup you got it
Hi @Logan, is there any way to append more documents/nodes to an already existing index? I have created one out of some PDFs and EPUBs and it took quite a while. I have stored the index on disk, but I'd like to append another folder of documents. Here's the code:
[Attachment: image.png]
The idea would be to append the documents in the variable documentsNassim
Yep, you can insert new docs into an existing index
What's the function? Because if I just use the from_documents function it will create a new index, no?
with something like this

Plain Text
documents = SimpleDirectoryReader(input_files=[files,...], filename_as_id=True).load_data()
for doc in documents:
  index.insert(doc)
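If the appended docs should also survive a restart, you'd presumably re-persist the index afterwards; a hedged one-liner, assuming the usual storage_context.persist API and that persist_dir matches the directory used originally:

index.storage_context.persist(persist_dir="./storage")  # hypothetical path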
it's ingesting. Let's see. Thanks @WhiteFang_Jr
[Attachment: image.png]
That was for making the embeddings; I'm doing it locally now with an Intel processor. Do you know if it's possible to make a call to Ollama like when calling the LLM? For LLM inference I have Ollama configured to send the request to a Mac with an M1 Pro chip, and it's much faster.

I've checked the documentation but I don't see such an option:
https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html
[Attachment: image.png]
Yeah, Ollama embeddings aren't present in LlamaIndex yet. But you can use a custom embedding class to hit the embedding server running on the Mac M1


https://docs.llamaindex.ai/en/stable/examples/embeddings/custom_embeddings.html#custom-embeddings
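A rough sketch of what such a custom embedding class could look like, assuming the legacy BaseEmbedding interface from that custom-embeddings guide and Ollama's /api/embeddings endpoint; the host and model names below are hypothetical:

Python
from typing import List

import requests
from llama_index.embeddings.base import BaseEmbedding

OLLAMA_URL = "http://my-mac.local:11434/api/embeddings"  # hypothetical remote host
OLLAMA_MODEL = "nomic-embed-text"                        # hypothetical embedding model


def _ollama_embed(text: str) -> List[float]:
    # Ollama's embeddings endpoint takes {"model", "prompt"} and returns {"embedding": [...]}
    resp = requests.post(OLLAMA_URL, json={"model": OLLAMA_MODEL, "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]


class OllamaServerEmbedding(BaseEmbedding):
    """Custom embedding model that calls a remote Ollama server over HTTP."""

    def _get_query_embedding(self, query: str) -> List[float]:
        return _ollama_embed(query)

    def _get_text_embedding(self, text: str) -> List[float]:
        return _ollama_embed(text)

    def _get_text_embeddings(self, texts: List[str]) -> List[List[float]]:
        return [_ollama_embed(t) for t in texts]

    async def _aget_query_embedding(self, query: str) -> List[float]:
        return self._get_query_embedding(query)

    async def _aget_text_embedding(self, text: str) -> List[float]:
        return self._get_text_embedding(text)

It would then be passed in via something like ServiceContext.from_defaults(embed_model=OllamaServerEmbedding()).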
Hmm, no idea how that would be adapted to Ollama, to be honest. Another option could be to use the LangChain embeddings, since I see they have an Ollama integration?
https://docs.llamaindex.ai/en/stable/examples/embeddings/Langchain.html
https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.ollama.OllamaEmbeddings.html#
There's a base_url attribute in the LangChain documentation. But on the LlamaIndex page for LangChain embeddings, I only see the HuggingFace embeddings used as the example.
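If the LangChain route looks simpler, a hedged sketch of wrapping LangChain's OllamaEmbeddings (which does expose base_url) with the legacy LangchainEmbedding wrapper; the host and model are again hypothetical:

Python
from langchain.embeddings import OllamaEmbeddings
from llama_index import ServiceContext
from llama_index.embeddings import LangchainEmbedding

lc_embeddings = OllamaEmbeddings(
    base_url="http://my-mac.local:11434",  # the Mac running Ollama (hypothetical)
    model="llama2",                        # hypothetical model name
)
embed_model = LangchainEmbedding(lc_embeddings)
service_context = ServiceContext.from_defaults(embed_model=embed_model)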