@jerryjliu0 I pinged you over on the mlops slack channel to ask for help. there's no rush here, just looking for advice.
i tried several different index types, so either i'm doing something wrong, the data should be structured better, or the models I'm using (all local ones) aren't actually good enough for this.
what are the pain points that you're facing?
well, the query responses are very poor
so I don't know if it's constructing a bad context, the context plus prompt isn't good, or the model isn't doing a good job of using that information
one piece of general advice if you're using GPTSimpleVectorIndex is to set the chunk size to something smaller (the default is ~4000 tokens; try chunk_size_limit=512), and then set similarity_top_k during the query to something higher than 1.
index = GPTSimpleVectorIndex(docs, ..., chunk_size_limit=512)
index.query(..., similarity_top_k=4)
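in case it's useful, here's roughly the whole flow with those two settings (a sketch assuming the older llama_index API used in the snippets above; the "data" folder is just a placeholder path):

from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# load documents from a local folder
docs = SimpleDirectoryReader("data").load_data()

# smaller chunks -> more, finer-grained embeddings in the index
index = GPTSimpleVectorIndex(docs, chunk_size_limit=512)

# pull back several chunks per query instead of the default single one
response = index.query("who wrote LangChain?", similarity_top_k=4)
print(response)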
what llm model are you using?
Starting query: who wrote LangChain?
[query] Total LLM token usage: 256 tokens
[query] Total embedding token usage: 0 tokens
None
response = index.query("Notebook : A notebook walking")
> Starting query: Notebook : A notebook walking
> [query] Total LLM token usage: 261 tokens
> [query] Total embedding token usage: 0 tokens
>>> print(response)
A notebook.
python3 -m manifest.api.app --model_type huggingface --model_name_or_path google/flan-t5-xl --fp16 --device 0
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
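roughly how i'm wiring the local model + embeddings into the index (just a sketch; the Manifest/langchain glue and the host/port are assumptions and may differ by version):

from llama_index import GPTSimpleVectorIndex, LLMPredictor, LangchainEmbedding
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms.manifest import ManifestWrapper
from manifest import Manifest

# point a Manifest client at the local flan-t5-xl server started above
# (connection URL is an assumption -- use wherever manifest.api.app is listening)
manifest = Manifest(client_name="huggingface", client_connection="http://127.0.0.1:5000")
llm_predictor = LLMPredictor(llm=ManifestWrapper(client=manifest))

# local sentence-transformers embeddings instead of the OpenAI default
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())

# docs loaded the same way as before
index = GPTSimpleVectorIndex(docs, llm_predictor=llm_predictor, embed_model=embed_model)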
i've mostly tested with davinci tbh
ah. so then, i should try with those settings and reproduce
yeah see if that works, if not lemme know!
k. thanks. trying to do things locally (free) first
it does work much better with openAI embeddings and LLM.
I will see if it works OK with openAI LLM and the huggingface embeddings. a lot of the spend is probably the embedding part
the embeddings are reasonably cheap (though of course if you have a lot of data it'll add up). query-time cost is only the LLM call
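rough back-of-envelope (the per-1k prices below are placeholders, not quotes -- plug in whatever the current rates are):

total_doc_tokens = 2_000_000        # tokens across all docs, embedded once at index time
embed_price_per_1k = 0.0004         # assumed ada-style embedding rate, $/1k tokens
llm_price_per_1k = 0.02             # assumed davinci-style completion rate, $/1k tokens

one_time_index_cost = total_doc_tokens / 1000 * embed_price_per_1k
per_query_llm_cost = (512 * 4 + 256) / 1000 * llm_price_per_1k   # ~top_k chunks of context + answer
print(one_time_index_cost, per_query_llm_cost)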
Hi Jerry, I think chunk_size_limit is not documented for the GPTSimpleVectorIndex
Is there any better documentation for chunk_size_limit available, e.g. whether it's the total length of context sent to the LLM or the max size of each chunk embedded in the index? 4000 as the default sounds like it's the total context length, but maybe it's just 4000 because with similarity_top_k=1 only one chunk gets sent to the LLM.
yeah, by default we just "stuff" as much text into each chunk as can fit into the total prompt limit, which in the case of davinci is 4000
isn't there some kind of tradeoff here? smaller chunks mean more targeted doc retrieval and better context in the final prompt (unless they're too small)? And larger chunks mean fewer embed calls (cheaper) but also larger context chunks and potentially less precise indexing?
yeah there's def a tradeoff! but it's more like smaller chunks = cheaper/faster, but you lose more context per chunk
smaller chunks mean cheaper/faster on the query side. it's more expensive on the embed and index side.
actually, no: it's the same number of tokens, so openAI would cost the same for indexing
cohere would be more expensive
mm yeah you're right in that it's more expensive in terms of storing more embeddings + computing embeddings at query time
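concretely, for the same corpus the tradeoff looks something like this (illustrative numbers only):

corpus_tokens = 400_000

# big chunks, one chunk per query (roughly the defaults)
big_chunk_vectors = corpus_tokens // 4000     # ~100 vectors stored
big_context_per_query = 4000 * 1              # ~4000 tokens of context sent to the LLM

# smaller chunks, higher top_k
small_chunk_vectors = corpus_tokens // 512    # ~781 vectors stored (more storage + retrieval work)
small_context_per_query = 512 * 4             # ~2048 tokens of context sent to the LLM (cheaper/faster)

print(big_chunk_vectors, big_context_per_query, small_chunk_vectors, small_context_per_query)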
using local embeddings, thus far, looks to perform about as well as the ada ones. I'm using "sentence-transformers/all-mpnet-base-v1" since it has a longer max embed length than the default and a better sentence-similarity score.
it makes sense that the completion part requires a much better LLM.
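to swap that model in via langchain (a sketch; model_name is the standard HuggingFaceEmbeddings kwarg, everything else as before):

from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding

# use the mpnet model mentioned above instead of the langchain default
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v1")
)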
also interestingly, the cohere LLM appears to be much worse than the openAI one
Interesting. so the retrieval side can use simpler/free embeddings, but the actual generation part needs a much better LLM
were you able to resolve this? I've also been having issues with the retrieval step returning poor context
how you structure your docs seems to matter for retrieval. i'm breaking things into sections/paragraphs and getting pretty good results for similarity_top_k=4.
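roughly what that looks like (a sketch; `sections` is assumed to be a list of strings you've already split out of each doc):

from llama_index import Document, GPTSimpleVectorIndex

# one Document per section/paragraph instead of one big blob per file
docs = [Document(text) for text in sections]
index = GPTSimpleVectorIndex(docs, chunk_size_limit=512)
response = index.query("who wrote LangChain?", similarity_top_k=4)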