Feedback request post: I'm working on a project to pull relevant page references from books based on a question. For example: if I asked "what spells has Harry Potter used?", the AI could respond "check pages 50, 100, and 120 for information on that". I'm using OpenAI text-search embeddings to create the vectors for my index and then passing the full question to the retriever. It works pretty well, but I would love feedback on what I could tweak/test to get better results. I've managed to get the retriever to be accurate about 88% of the time (i.e., 12% of the time it misses a key page I would have expected it to return).
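A minimal sketch of the kind of setup described, assuming llama_index with its legacy OpenAI text-search embedding mode; the directory path, the page_label metadata key, and the top-k value are illustrative, not from the post:

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.embeddings.openai import OpenAIEmbedding, OpenAIEmbeddingMode

# Text-search mode uses the asymmetric doc/query embedding variants
# (the post used the Davinci text-search model; the exact model argument
# depends on your llama_index version)
embed_model = OpenAIEmbedding(mode=OpenAIEmbeddingMode.TEXT_SEARCH_MODE)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# PDF loaders typically attach page_label metadata to each page's node
documents = SimpleDirectoryReader("books/").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

retriever = index.as_retriever(similarity_top_k=5)
for n in retriever.retrieve("what spells has Harry Potter used?"):
    print(n.node.metadata.get("page_label"), n.score)
```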
Are you using pure vector search? Have you considered hybrid search as well?
hmm I'm not sure what those are. I haven't seen those terms referenced in the docs I've been using. Can you expand?
Hybrid search is a term usually used for combining keyword search and vector search.

There are a few ways this can be implemented: either retrieving nodes with both approaches and applying a fusion scoring function (Weaviate does this), or fetching nodes through both methods and re-ranking them to return the true top k

Most popular vector dbs support some form of this, but if they don't you can take a custom approach as well. Using something like BM25 works well
https://gpt-index.readthedocs.io/en/stable/examples/retrievers/bm25_retriever.html#advanced-hybrid-retriever-re-ranking
I use a pretty outdated re-ranker in that example -- I would recommend BAAI/bge-reranker-base these days
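Roughly the pattern from that notebook: retrieve with both methods, de-duplicate by node id, and return the union. Names like `index` and `nodes` are assumed to come from your own indexing code:

```python
from llama_index.retrievers import BM25Retriever, BaseRetriever

vector_retriever = index.as_retriever(similarity_top_k=10)
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=10)

class HybridRetriever(BaseRetriever):
    def __init__(self, vector_retriever, bm25_retriever):
        self.vector_retriever = vector_retriever
        self.bm25_retriever = bm25_retriever
        super().__init__()

    def _retrieve(self, query, **kwargs):
        bm25_nodes = self.bm25_retriever.retrieve(query, **kwargs)
        vector_nodes = self.vector_retriever.retrieve(query, **kwargs)
        # Union of both result sets, de-duplicated by node id
        all_nodes, seen = [], set()
        for n in bm25_nodes + vector_nodes:
            if n.node.node_id not in seen:
                all_nodes.append(n)
                seen.add(n.node.node_id)
        return all_nodes

hybrid_retriever = HybridRetriever(vector_retriever, bm25_retriever)
```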
Gotcha. That makes sense. Does it matter which embedding mode or embedding model I use to generate the vectors that I then plan to use with a hybrid search? I used the Davinci model with OpenAI in text_search mode to generate my most effective vector store index so far. Could I store those vectors in Weaviate and go from there (I might be misunderstanding the use of a "vector database", so apologies if my question doesn't make sense)?
Perfect. Super excited to dive into vector databases! I started exploring Pinecone the other day but wasn't quite sure what my use case would be and now I have that. Have you used Pinecone? Weaviate > Pinecone?
Both are pretty comparable I think, up to you πŸ™‚ Both support hybrid search though
Ok so thanks to your great help, I was able to get hybrid search working with a Weaviate vector store. Unfortunately, it's actually less accurate than my default vector retriever 😒 I've messed around with the alpha and that doesn't seem to help. I'm wondering what I can do to debug from here. I've looked at some of the results of the queries that are inaccurate and really don't understand why they aren't retrieving certain pages. I went ahead and tried using a query string that exactly matches a string in the document and it still didn't find the right page. Any idea how I can debug why it's ranking the pages the way it is?
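For reference, a sketch of the Weaviate hybrid setup being described, assuming llama_index's WeaviateVectorStore and an existing weaviate.Client named `client`; the index name is a placeholder:

```python
from llama_index import VectorStoreIndex, StorageContext
from llama_index.vector_stores import WeaviateVectorStore

vector_store = WeaviateVectorStore(weaviate_client=client, index_name="BookPages")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# alpha=0 leans fully on keyword (BM25) scoring, alpha=1 fully on vectors
retriever = index.as_retriever(
    vector_store_query_mode="hybrid",
    alpha=0.5,
    similarity_top_k=5,
)
```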
hmm, tbh I'm really not sure πŸ€” I'm not 100% sure how weaviate hybrid search works

One suggestion could be increasing the top k and then adding a re-ranker?
ok I like the idea of a re-ranker. I guess I'll dive down that rabbit hole.
Ok, it took a bit of tinkering, but I think this reranking update is huge. I'm experimenting with LLMRerank and SentenceTransformerRerank and may look at some others as well, but it's all promising. Do you have any resources you can share that explain how rerank algorithms work? They just seem like complete magic haha
Hmmm I think LLM rerank is just a prompt to the LLM to re-order

Sentence transformers usually uses models specifically trained for re-ranking. I personally think bge-reranker-base is an ideal option

I think they are also called cross encoders
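To illustrate the two re-rankers being contrasted; the parameter values here are hypothetical, and depending on your llama_index version these classes may live under llama_index.indices.postprocessor instead:

```python
from llama_index import QueryBundle
from llama_index.postprocessor import LLMRerank, SentenceTransformerRerank

# LLM re-ranker: prompts an LLM to choose and order the most relevant chunks
llm_rerank = LLMRerank(top_n=5, choice_batch_size=5)

# Cross-encoder re-ranker: a model trained to score (query, passage) pairs
ce_rerank = SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=5)

query = "what spells has Harry Potter used?"
retrieved = retriever.retrieve(query)  # `retriever` from the earlier setup
reranked = ce_rerank.postprocess_nodes(retrieved, query_bundle=QueryBundle(query))
```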
Alright, Logan! I'm back. I spent a few days going down a re-ranker rabbit hole and gotta say, it wasn't as promising as I had hoped. Both the LLM and bge-reranker-base rerankers performed significantly worse than my most accurate configuration: Weaviate in hybrid mode (default alpha) with similarity_top_k: 5. With that default, the pages that the retriever grabs contain at least 1 correct page 96% of the time (correct page = I manually found the best page), which is pretty good, but the order is still a big issue. Any thoughts on what else I could do to improve these results?
Interesting, surprised it performed worse. I wonder if it has to do with the length of text (for example, bge cuts off text after 512 tokens)

Anyways -- not really sure how else to improve here.

Maybe can you clarify why the order is an issue if the proper nodes are still in the top 5?
hmm yeah, length could be an issue; I'm working with pretty large amounts of text. I can look into that.

The reason the order is an issue is that I really want the list to only contain correct pages. At the moment, when 96% have at least 1 correct page, only 58% of those results list the correct page in the first spot. I'm worried that the issue might be too subjective. In other words, maybe the pages I think are correct, the AI actually doesn't agree with, and maybe some of the AI's pages are "technically" better pages. However, from my review of the results, it seems like the pages the retriever is selecting are objectively not the best?
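The two numbers being tracked here are essentially hit rate and precision at rank 1. A hypothetical harness for measuring them, assuming a hand-labeled list of (question, expected_pages) pairs and page_label metadata on each node:

```python
def evaluate(retriever, examples, k=5):
    """Return (hit rate, precision@1) over (question, expected_pages) pairs."""
    hits = first = 0
    for question, expected_pages in examples:
        nodes = retriever.retrieve(question)[:k]
        pages = [n.node.metadata.get("page_label") for n in nodes]
        if any(p in expected_pages for p in pages):
            hits += 1  # at least one correct page in the top k
        if pages and pages[0] in expected_pages:
            first += 1  # correct page in the first spot
    return hits / len(examples), first / len(examples)

hit_rate, p_at_1 = evaluate(retriever, examples)  # e.g. 0.96, 0.58
```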
Llama index chunks your data (the default chunk size is 1024), which you could lower

And yea, that's fair. Other strategies like sentence window or auto-merging retrievers may help too (but for now, they only work with the base vector store, so not with Weaviate at the moment)

https://gpt-index.readthedocs.io/en/stable/examples/node_postprocessor/MetadataReplacementDemo.html

https://gpt-index.readthedocs.io/en/stable/examples/retrievers/auto_merging_retriever.html
I'm kinda throwing the kitchen sink at you haha
But you seem willing to try out these features πŸ˜…
I am! And keep throwing it! It's definitely appreciated as I just don't have the background to know what to try next.
Where is llama index chunking my data before I pass it to the reranker? Or is llama_index doing that inside the reranker somewhere (if so, I'm not sure where to modify that chunking)?
The data gets chunked when you call from_documents() or insert()

You can adjust the chunk size in the service context (or node parser, but service context is easier)

```python
ServiceContext.from_defaults(chunk_size=512)
```
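For context, the chunk size on the service context takes effect at ingestion, so the index has to be rebuilt for it to apply:

```python
from llama_index import ServiceContext, VectorStoreIndex

service_context = ServiceContext.from_defaults(chunk_size=512)
# Chunking happens here, when the documents are ingested
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
```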
So I set up the sentence window, but I ended up with exactly the same results, which makes me think I'm doing it wrong. The docs only show how to pass the post processor in as_query_engine, but I'm not actually doing any querying; I'm just using the vector index retriever. I've passed the node_parser to the Service Context, but is there something else I need to do to pass it to the retriever?
Did you create the index from scratch?

Basically, it needs to parse the nodes when you call from_documents

Each node will be a single sentence (or at least it tries its best). Then the metadata contains a larger window around that sentence

During embeddings, only the single sentence is embedded (not the window). So retrieval will retrieve single sentences

But then, the metadata replacement node-postprocessor is run, which replaces each sentence with its wider window after retrieval.

In a normal query engine, this happens after retrieval, but before response synthesis

Since you are only using the retriever, you would also need to apply the postprocessor yourself

```python
new_nodes = metadata_replacement.postprocess_nodes(retrieved_nodes)
```
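Putting those steps together, a sketch based on the linked MetadataReplacementDemo, with the postprocessor applied by hand since only the retriever is being used; window_size and the metadata keys follow the docs' defaults:

```python
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
service_context = ServiceContext.from_defaults(node_parser=node_parser)

# Rebuild the index from scratch so each node is a single sentence
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
retriever = index.as_retriever(similarity_top_k=5)

retrieved_nodes = retriever.retrieve("what spells has Harry Potter used?")

# Swap each retrieved sentence for its wider window
metadata_replacement = MetadataReplacementPostProcessor(target_metadata_key="window")
new_nodes = metadata_replacement.postprocess_nodes(retrieved_nodes)
```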
ohhhh... ok that makes a lot more sense. I'll experiment with generating a new index tomorrow.
Thanks for all your help!
Hey Logan! I have a specific example of a retriever issue I'm seeing related to all of this that might help narrow down ways of improving the algorithm. So I have a CS textbook and the prompt talks about "key loggers". Page 843 talks about "keyloggers" but the space difference is causing the retriever to not find page 843 even with a top_k of 20. Thoughts on how tiny differences like that may be reconciled?
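A toy illustration of the mismatch, using the rank_bm25 package as a stand-in for the keyword side (not the actual Weaviate index): whitespace tokenization gives "key loggers" and "keyloggers" no tokens in common, so keyword scoring can't connect them.

```python
from rank_bm25 import BM25Okapi

corpus = [
    "keyloggers capture every keystroke on the machine".split(),
    "firewalls filter inbound network traffic".split(),
]
bm25 = BM25Okapi(corpus)

print(bm25.get_scores("key loggers".split()))  # no token overlap: both scores 0
print(bm25.get_scores("keyloggers".split()))   # matches the first document
```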
oof that's tough haha

What setup were you testing with that example?