are you using pure vector search? Have you considered hybrid search as well?
hmm I'm not sure what those are. I haven't seen those terms referenced in the docs I've been using. Can you expand?
I use a pretty outdated re-ranker in that example -- I would recommend BAAI/bge-reranker-base these days
Gotcha. That makes sense. Does it matter what embedding mode or embedding model I use to generate the vectors that I then plan to use with hybrid search? I used OpenAI's Davinci model in text_search mode to generate my most effective vector store index so far. Could I store that vector in Weaviate and go from there (I might be misunderstanding the use of a "vector database", so apologies if my question doesn't make sense)?
Perfect. Super excited to dive into vector databases! I started exploring Pinecone the other day but wasn't quite sure what my use case would be and now I have that. Have you used Pinecone? Weaviate > Pinecone?
Both are pretty comparable I think, up to you. Both support hybrid search though
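For intuition, the core of hybrid search is just blending a vector-similarity score with a keyword score. Weaviate exposes this blend as `alpha` (1.0 = pure vector, 0.0 = pure keyword). A simplified sketch of the idea (real engines normalize/fuse the two score lists first; the doc names and scores here are made up):

```python
def hybrid_score(vector_score: float, keyword_score: float, alpha: float = 0.5) -> float:
    """Blend a normalized vector-similarity score with a keyword (BM25-style) score.
    alpha=1.0 -> pure vector search, alpha=0.0 -> pure keyword search."""
    return alpha * vector_score + (1 - alpha) * keyword_score

# Rank candidate documents by the blended score (hypothetical scores).
candidates = {
    "doc_a": (0.9, 0.10),   # strong semantic match, weak keyword match
    "doc_b": (0.4, 0.95),   # weak semantic match, strong keyword match
}

ranked = sorted(candidates, key=lambda d: hybrid_score(*candidates[d], alpha=0.5), reverse=True)
print(ranked)  # ['doc_b', 'doc_a'] -- the keyword-heavy doc wins at alpha=0.5
```

Sliding `alpha` toward 1.0 flips the ranking back toward the semantic match, which is why tuning it per-dataset matters.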
Ok so thanks to your great help, I was able to get hybrid search working with a Weaviate vector store. Unfortunately, it's actually less accurate than my default vector retriever. I've messed around with the alpha and that doesn't seem to help. I'm wondering what I can do to debug from here. I've looked at some of the results of the queries that are inaccurate and really don't understand why they aren't retrieving certain pages. I went ahead and tried using a query string that exactly matches a string in the document and it still didn't find the right page. Any idea how I can debug why it's ranking the pages the way it is?
hmm, tbh I'm really not sure. I'm not 100% sure how Weaviate hybrid search works
One suggestion could be increasing the top k and then adding a re-ranker?
ok I like the idea of a re-ranker. I guess I'll dive down that rabbit hole.
Ok, it took a bit of tinkering but I think this reranking update is huge. I'm experimenting with LLMRerank and SentenceTransformerRerank and may look at some others as well, but it's all promising. Do you have any resources you can share that explain how rerank algorithms work? They just seem like complete magic haha
Hmmm I think LLM rerank is just a prompt to the LLM asking it to re-order the nodes
Sentence transformers uses models specifically trained for re-ranking. I personally think BAAI/bge-reranker-base is an ideal option
I think they are also called cross-encoders
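The cross-encoder idea in a nutshell: score each (query, passage) pair jointly, then re-sort by that score. Here's a toy stand-in that uses word overlap instead of a trained model, just to show the flow (a real setup would score pairs with an actual cross-encoder model like bge-reranker-base):

```python
def toy_pair_score(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder: a real model reads the (query, passage)
    pair jointly; here we just use word overlap for illustration."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)

def rerank(query: str, passages: list[str], top_n: int = 3) -> list[str]:
    # Score every (query, passage) pair, sort descending, keep the best top_n.
    scored = sorted(passages, key=lambda p: toy_pair_score(query, p), reverse=True)
    return scored[:top_n]

results = rerank(
    "how do keyloggers capture input",
    [
        "Chapter on sorting algorithms and big-O notation.",
        "Keyloggers capture input by hooking keyboard events.",
        "Networking basics: TCP vs UDP.",
    ],
    top_n=1,
)
print(results[0])  # the keylogger passage ranks first
```

The usual pattern is: retrieve a generous top-k cheaply with embeddings, then let the (slower but smarter) cross-encoder pick the final ordering.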
Alright, Logan! I'm back. I spent a few days going down a re-ranker rabbit hole and gotta say, it wasn't as promising as I had hoped. Both the LLM and bge-reranker-base rerankers performed significantly worse than my most accurate configuration: Weaviate in hybrid mode (default alpha) with similarity_top_k: 5. With that default, the pages that the retriever grabs contain at least 1 correct page 96% of the time (correct page = I manually found the best page), which is pretty good, but the order is still a big issue. Any thoughts on what else I could do to improve these results?
Interesting, surprised it performed worse. I wonder if it has to do with the length of text (for example, bge cuts off text after 512 tokens)
Anyways -- not really sure how else to improve here.
Maybe can you clarify why the order is an issue if the proper nodes are still in the top 5?
hmm yeah, length could be an issue; I'm working with pretty large amounts of text. I can look into that.
The reason that the order is an issue is because I really want the list to only contain correct pages. At the moment, when 96% have at least 1 correct page, only 58% of those results list the correct page in the first spot. I'm worried that the issue might be too subjective. In other words, maybe the pages I think are correct, the AI actually doesn't agree with and maybe some of the AI's pages are "technically" better pages. However, from my reviewing of the results, it seems like the pages the retriever is selecting are objectively not the best?
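(For concreteness, the two numbers I'm tracking are basically hit-rate@k and precision@1. Roughly this, where each labeled result is the retrieved page list plus the page I marked correct; page ids here are made up:)

```python
def hit_rate_and_top1(results: list[tuple[list[int], int]]) -> tuple[float, float]:
    """results: (retrieved_page_ids, correct_page_id) per query.
    Returns (fraction with the correct page anywhere in the list,
             fraction with the correct page in the first slot)."""
    hits = sum(1 for pages, correct in results if correct in pages)
    top1 = sum(1 for pages, correct in results if pages and pages[0] == correct)
    n = len(results)
    return hits / n, top1 / n

# Tiny illustration with made-up page ids.
sample = [
    ([843, 12, 7], 843),   # correct page retrieved, and first
    ([12, 843, 7], 843),   # correct page retrieved, but not first
    ([5, 9, 31], 843),     # missed entirely
]
print(hit_rate_and_top1(sample))  # hit rate 2/3, top-1 rate 1/3
```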
I'm kinda throwing the kitchen sink at you haha
But you seem willing to try out these features
I am! And keep throwing it! It's definitely appreciated as I just don't have the background to know what to try next.
Where is llama index chunking my data before I pass it to the reranker? Or is llama_index doing that inside the reranker somewhere (if so, I'm not sure where to modify that chunking)?
The data gets chunked when you call from_documents() or insert()
You can adjust the chunk size in the service context (or node parser, but service context is easier)
from llama_index import ServiceContext
service_context = ServiceContext.from_defaults(chunk_size=512)
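The effect of chunk_size is easiest to see with a toy fixed-size splitter. This is purely illustrative of the idea -- LlamaIndex's actual node parser splits on tokens/sentences with smarter boundary handling:

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Naive character-based splitter: emit windows of chunk_size characters,
    stepping forward by (chunk_size - overlap) each time."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

doc = "abcdefghij" * 10  # 100 characters of dummy text
chunks = chunk_text(doc, chunk_size=40, overlap=10)
print(len(chunks), [len(c) for c in chunks])  # 4 chunks; last one is a remainder
```

The overlap matters for rerankers too: if the answer straddles a chunk boundary, some overlap keeps it intact in at least one chunk.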
So I set up the sentence window but I ended up with exactly the same results, which makes me think I'm doing it wrong. The docs only show how to pass the post processor in as_query_engine, but I'm not actually doing any querying; I'm just using the vector index retriever. I've passed the node_parser to the Service Context, but is there something else I need to do to pass it to the retriever?
Did you create the index from scratch?
Basically, it needs to parse the nodes when you call from_documents
Each node will be a single sentence (or at least it tries its best). Then the metadata contains a larger window around that sentence
During embeddings, only the single sentence is embedded (not the window). So retrieval will retrieve single sentences
But then, the metadata replacement node-postprocessor is run, which replaces each sentence with its wider window after retrieval.
In a normal query engine, this happens after retrieval, but before response synthesis
Since you are only using the retriever, you would also need to apply the postprocessor yourself
new_nodes = metadata_replacement.postprocess_nodes(retrieved_nodes)
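Conceptually the metadata replacement step is tiny: each retrieved node's text gets swapped for the wider window stored in its metadata. A mock of just that step, with a made-up Node class and example text (the real thing is LlamaIndex's metadata replacement postprocessor):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str                                      # the single embedded sentence
    metadata: dict = field(default_factory=dict)   # holds the wider "window"

def replace_with_window(nodes: list[Node], key: str = "window") -> list[Node]:
    """Swap each node's sentence for the surrounding window stored in metadata.
    Falls back to the original text if no window is present."""
    return [Node(text=n.metadata.get(key, n.text), metadata=n.metadata) for n in nodes]

retrieved = [Node(
    "Keyloggers capture input.",
    {"window": "Malware comes in many forms. Keyloggers capture input. They hook keyboard events."},
)]
new_nodes = replace_with_window(retrieved)
print(new_nodes[0].text)  # the full window, not just the sentence
```

So retrieval matches on the precise sentence embedding, but whatever consumes the nodes afterwards (reranker, LLM) sees the fuller context.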
ohhhh... ok that makes a lot more sense. I'll experiment with generating a new index tomorrow.
Thanks for all your help!
Hey Logan! I have a specific example of a retriever issue I'm seeing related to all of this that might help narrow down ways of improving the algorithm. So I have a CS textbook and the prompt talks about "key loggers". Page 843 talks about "keyloggers" but the space difference is causing the retriever to not find page 843 even with a top_k of 20. Thoughts on how tiny differences like that may be reconciled?
oof, that's tough haha
What setup were you testing with that example?
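One band-aid for near-misses like "key loggers" vs "keyloggers" is normalizing known compound terms on both the query and the indexed text before the keyword leg of hybrid search runs. A purely illustrative sketch with a hand-picked, hypothetical variant map:

```python
import re

# Hypothetical, hand-maintained map of spaced/hyphenated variants -> canonical form.
VARIANTS = {
    r"\bkey[\s-]?loggers?\b": "keylogger",
}

def normalize(text: str) -> str:
    """Collapse known multi-word variants to one canonical token so the
    keyword match sees the same term on both the query and document side."""
    out = text.lower()
    for pattern, canonical in VARIANTS.items():
        out = re.sub(pattern, canonical, out)
    return out

print(normalize("The prompt talks about key loggers."))   # ...about keylogger.
print(normalize("Page 843 talks about keyloggers."))      # ...about keylogger.
```

It doesn't scale past a curated list of terms, but for a fixed corpus like one textbook it can rescue exactly this kind of miss.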