You'll need to pass the top_k value into the engine along with chat_mode and the rest of the kwargs.
By default it only picks two nodes.
chat_engine = index.as_chat_engine(similarity_top_k=10, ...)
This is because index.as_chat_engine(similarity_top_k=10, ...) passes all kwargs down to the chat engine, and also to the index that the underlying chat engine uses. similarity_top_k is an argument used by the underlying index.
If you are creating the chat engine yourself, without index.as_chat_engine, then you'll have to pass in similarity_top_k when you create the query engine.
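Roughly like this (a sketch assuming the pre-0.10 LlamaIndex API used in this thread, where CondenseQuestionChatEngine.from_defaults wraps an existing query engine):

from llama_index.chat_engine import CondenseQuestionChatEngine

# `index` is assumed to already exist; the top_k lives on the query engine
query_engine = index.as_query_engine(similarity_top_k=10)
chat_engine = CondenseQuestionChatEngine.from_defaults(query_engine=query_engine)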
@Logan M and is it possible to specify that we want to apply "refine" when the number of tokens doesn't fit into a single prompt? By default, when K is big, I only get the first documents in the prompt, but there are no additional calls for the refinement.
It should be refining automatically, except for the context and condense_plus_context chat engines (since refining doesn't quite make sense there)
@Logan M yes, it's doing it well. However, I've found by looking at the prompt traces with Traceloop that the first completion is more accurate in general, whereas the second completion gives complementary information and is sometimes off the topic of the question.
Is there any way to return the first completion result and then ask the user if they want more information? (This would trigger the second completion without the user needing to ask a second question.)
I'd say that would be useful as well in case the first completion failed to retrieve precisely what the user wanted and the second, third, etc. could have the answer
Another problem I've found using the condense_question chat engine is when one completely changes the topic in the next question. The previous state still contains a lot of information about the former topic, and the generated answer has nothing to do with the new topic. Is this avoidable with some technique?
Not really possible? It's making two calls because the retrieved text isn't fitting into one LLM call. So you'd need to reduce similarity_top_k, reduce your chunk_size, or maybe try the compact_accumulate response mode? (That last one might just end up confusing the LLM tbh.)
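Something like this, if you want to set it all in one place (a sketch with the legacy API; these kwargs end up on the underlying query engine):

chat_engine = index.as_chat_engine(
    chat_mode="condense_question",
    similarity_top_k=3,                  # fewer retrieved nodes per query
    response_mode="compact_accumulate",  # or "refine" / "compact"
)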
I think it's avoidable by not using the condense_question chat engine
In my opinion, it leads to the least natural-feeling conversations
Let me see if I understand it: the CondenseQuestionChatEngine class has an attribute "query_engine: BaseQueryEngine".
This BaseQueryEngine should have a retriever, and the retriever should have the similarity_top_k.
So just by creating an instance of CondenseQuestionChatEngine, will all the other objects (query engine and retriever) also be created, but as attributes of the CondenseQuestionChatEngine?
Frankly, I don't get how the similarity_top_k finally gets assigned to the retriever from the CondenseQuestionChatEngine instance through the kwargs.
Sure, since not all the possible parameters are described in the documentation and it's somewhat advanced Python, I'm having a bit of trouble understanding it.
Yea, it's just abusing kwargs a bit -- kwargs is just a Python way of catching any keyword argument.
It will probably make more sense if you try using it in a little test function.
But I can guarantee that similarity_top_k is getting passed into the retriever.
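A toy test function showing the pattern (plain Python, not the actual LlamaIndex internals):

def build_retriever(similarity_top_k=2, **kwargs):
    print("retriever got similarity_top_k =", similarity_top_k)

def build_query_engine(**kwargs):
    build_retriever(**kwargs)  # doesn't know about top_k itself, just forwards everything

def as_chat_engine(chat_mode="best", **kwargs):
    build_query_engine(**kwargs)

as_chat_engine(chat_mode="condense_question", similarity_top_k=10)
# -> retriever got similarity_top_k = 10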
hmm, got the point. Thanks a lot Logan!
In my case, because I'm loading from file to build the index, it should first pass through here.
So the BaseIndex that gets created will also have had its constructor called with the kwargs passed along.
Yup exactly! So when you call load_index_from_storage(...) you can do things like pass in the service context
index = load_index_from_storage(storage_context, service_context=service_context)
kwargs handles passing that all down into the base index constructor
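Putting it together, something like this (a sketch with the legacy pre-0.10 imports assumed in this thread):

from llama_index import StorageContext, ServiceContext, load_index_from_storage

service_context = ServiceContext.from_defaults()  # your LLM / embedding config
storage_context = StorageContext.from_defaults(persist_dir="./storage")
# extra kwargs here get forwarded down into the index constructor
index = load_index_from_storage(storage_context, service_context=service_context)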
Indeed, this was another doubt I had but let pass.
Here, when I restore the index, I don't pass the service_context. I was assuming that what I load also contains the service context stored from when I saved the index.
Yea it does not save the service context to disk.
However, if you set a global service context in your code, you don't have to worry about passing it in
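That line looks roughly like this (legacy pre-0.10 API assumed):

from llama_index import ServiceContext, set_global_service_context

service_context = ServiceContext.from_defaults()
set_global_service_context(service_context)
# later loads pick it up automatically, e.g.
# index = load_index_from_storage(storage_context)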
I didn't know about that global service context line. Now I know why it's working. Thanks!
Actually yes: with condense_question, when the topic of the questions changes, it at first mixes in the former topic when rebuilding the question. The context mode doesn't suffer from that; instead it gets a bit confused by the mix of former topics in the same context and tends to rely more on the LLM's own knowledge than on the retrieved documents. All in all, the context mode seems more recommendable at first glance.
@Logan M one doubt about the tree_summarize parameter:
what's the meaning of it? Because it seems as if it's defining a tree index, but in the previous line I'm just defining the index as a VectorStoreIndex (which allows retrieval of the top-k similar docs).
Tree summarize is just a method for what to do after you retrieve text
In this case, it will build a bottom up "tree" of responses, using your query
However, this is really only noticeable with large top k values
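For example, even on a plain vector index it's just a response mode (sketch, legacy API assumed):

query_engine = index.as_query_engine(
    similarity_top_k=10,
    response_mode="tree_summarize",  # summarize the retrieved chunks bottom-up
)
response = query_engine.query("...")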
ok! Kind of a refine mode, but doing the LLM completions as a bottom-up tree. The more documents retrieved, the more levels the tree will need.
Hi @Logan, is there any way to append more documents/nodes to an already existing index? I have created one out of some PDFs and EPUBs and it took a long time. I have stored the index on disk, but I'd like to append another folder of documents. Here's the code:
the idea would be to append the documents in the variable documentsNassim
Yep, you can insert new docs into an existing index
What's the function? Because if I just use the from_documents function, it will create a new index, no?
with something like this
documents = SimpleDirectoryReader(input_files=[files, ...], filename_as_id=True).load_data()
# insert each loaded document into the existing index
for doc in documents:
    index.insert(doc)
it's ingesting. Let's see. Thanks @WhiteFang_Jr
For making the embeddings, I'm doing it locally right now on an Intel processor. Do you know if it's possible to make a call to Ollama like when calling the LLM? Because for LLM inference I have Ollama configured to send the request to a Mac with an M1 Pro chip and it's much faster.
I've checked the documentation but I don't see such an option:
https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html
Hmm, no idea how that would be adapted to Ollama, to be honest. Another option could be to use the LangChain embeddings, which I see have an Ollama integration?
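If you go the LangChain route, a sketch could look like this (legacy pre-0.10 imports assumed; the host URL and model name are placeholders, not values from this thread):

from langchain.embeddings import OllamaEmbeddings
from llama_index.embeddings import LangchainEmbedding
from llama_index import ServiceContext

# point the embeddings at the remote Ollama server (placeholder host and model)
embed_model = LangchainEmbedding(
    OllamaEmbeddings(base_url="http://<m1-mac-host>:11434", model="llama2")
)
service_context = ServiceContext.from_defaults(embed_model=embed_model)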