
RAG

I'm less than thrilled with my RAG results and looking to see if anyone has some suggested reads they found useful around metrics, root-causing, etc. I'm reading this at the moment - https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5 - which has good information, but admittedly I'm not a RAG expert, so there could be much better reads I'm overlooking. The RAG implementation is basically: I scraped a bunch of websites related to a topic. When asking it questions - ones where I know the data exists, in some cases using the exact title from the metadata - it's not finding them and instead returns stuff from other, unrelated blog texts.
It might be helpful if you shared a few more details.

How many websites? Did you put it all in a vector index without changing any settings? Any other customization?
Sure. So, in total my database (where I send all the parsed info) has 14,837 unique URLs.

Metadata:

Plain Text
{"date": "<data article posted - 2022-09-13>", "url": "<car_review_url>", "section": "SUV", "source": "Car and Driver", "title": "<title of the article>"}


Then I have the raw summary (text-tag extraction via BeautifulSoup). As far as "cleaning" goes, at the moment I'm just lowercasing everything; I'm looking at best practices for this also.
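The extraction itself is roughly this (a simplified sketch - the real parsing pulls more tags):

Plain Text
from bs4 import BeautifulSoup

def extract_text(html: str) -> str:
    # Grab the visible text from the page and lowercase it
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator=" ", strip=True).lower()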

The text + metadata are then combined to produce a Document.

Plain Text
import chromadb

# llama-index v0.9-style imports (ServiceContext was removed in v0.10)
from llama_index import ServiceContext, StorageContext
from llama_index.llms import Ollama
from llama_index.embeddings import MistralAIEmbedding
from llama_index.vector_stores import ChromaVectorStore

class Configuration:
    def __init__(self):
        self.initialize()

    def initialize(self):
        # Local LLM served by Ollama; MISTRAL_API_KEY is assumed to be defined elsewhere
        self.llm = Ollama(model="mixtral:8x7b-instruct-v0.1-q6_K", base_url="http://192.168.0.105:1234")

        self.embed_model = MistralAIEmbedding(model_name="mistral-embed",
                                              api_key=MISTRAL_API_KEY,
                                              embed_batch_size=8)

        # Persistent Chroma collection backing the vector index
        self.client = chromadb.PersistentClient(path="./dbs/vector_dbs_test/cars/")
        self.chroma_collection = self.client.get_or_create_collection(name="cars_rag_test")
        self.vector_store = ChromaVectorStore(chroma_collection=self.chroma_collection)
        self.storage_context = StorageContext.from_defaults(vector_store=self.vector_store)
        self.service_context = ServiceContext.from_defaults(llm=self.llm,
                                                            chunk_size=1024,
                                                            chunk_overlap=25,
                                                            embed_model=self.embed_model)
Building the index:

Plain Text
import json
import sqlite3

# llama-index v0.9-style imports
from llama_index import VectorStoreIndex
from llama_index.schema import Document, MetadataMode

def extract_and_store_articles_info(db_path):
    # Pull the metadata (a JSON string) and raw summary for every processed article
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute('SELECT a.metadata, a.original_summary FROM articles_processed a')
    rows = cursor.fetchall()
    conn.close()

    return rows

def load_data(metadata, text):
    document = Document(text=text, metadata=metadata)

    return [document]

def main():
    config = Configuration()

    document_list = []

    rows = extract_and_store_articles_info("./dbs/processed_data_test.db")

    for row in rows[:500]:
        metadata, text = json.loads(row[0]), row[1]
        # Lowercase all string metadata values and the body text
        metadata = {key.lower(): value.lower() if isinstance(value, str) else value for key, value in metadata.items()}
        text = text.lower()

        documents = load_data(metadata, text)
        document_list.append(documents)

    for document in document_list:
        #document[0].excluded_llm_metadata_keys = ["url"]
        #print(document[0].get_content(metadata_mode=MetadataMode.LLM))

        try:
            # Each call embeds this document's chunks into the shared Chroma collection
            VectorStoreIndex.from_documents(documents=document,
                                            service_context=config.service_context,
                                            storage_context=config.storage_context,
                                            show_progress=True)
        except Exception as e:
            print("Error:", e, document[0].get_content(metadata_mode=MetadataMode.LLM))
            continue

if __name__ == "__main__":
    main()
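(Side note: each from_documents call above writes into the same Chroma collection, so passing the whole list in a single call should be equivalent - a sketch:)

Plain Text
all_documents = [doc for docs in document_list for doc in docs]
VectorStoreIndex.from_documents(documents=all_documents,
                                service_context=config.service_context,
                                storage_context=config.storage_context,
                                show_progress=True)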
Then, on the search side, with the same service context, storage context, etc.:

Plain Text
index = VectorStoreIndex.from_vector_store(vector_store=config.vector_store,
                                           service_context=config.service_context)

query_engine = index.as_query_engine(verbose=True)

USER_PROMPT = """
Can you give me the Pricing and Specs for the 2024 Toyota RAV4 Review article.

Cite the URL references you used to determine your answer.

Think through the steps before responding.
"""
response = query_engine.query(USER_PROMPT)

print(response)
"Pricing and Specs for the 2024 Toyota RAV4 Review article" -> this matches the title of the article, which is also in the metadata (and summary): 2024 Toyota RAV4 Review, Pricing, and Specs

I just get really weird responses; it grabs information about unrelated things and includes it, and then seems to completely ignore what I asked for.
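For root-causing, dumping response.source_nodes shows which chunks the engine actually retrieved (a sketch):

Plain Text
# Each source node carries the retrieved chunk plus its similarity score
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.metadata.get("title"))
    print(node_with_score.node.get_content()[:200])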
@Logan M hopefully that helps. Sorry for not including that initially.
No worries! That helps immensely!

So, with 14,000+ webpages, it makes sense this approach doesn't work 😅

index.as_query_engine() only uses vector retrieval to return the top 2 most similar chunks. With that amount of data, you can maybe see how a top k of 2 is too small.

Instead, what I might do to improve results here is crank up the top k and then also use a reranker.

For example (I'm going to use v0.10.x code; not sure what you are on right now):

Plain Text
pip install llama-index-postprocessor-flag-embedding-reranker


Plain Text
from llama_index.postprocessor.flag_embedding_reranker import (
    FlagEmbeddingReranker,
)

rerank = FlagEmbeddingReranker(model="BAAI/bge-reranker-base", top_n=3)

query_engine = index.as_query_engine(similarity_top_k=20, node_postprocessors=[rerank])
Another thing you could introduce, on top of the reranking, is hybrid search.
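For the hybrid piece, one option is fusing the vector retriever with BM25 (a sketch, v0.10.x; it assumes you still have the parsed nodes in memory, since BM25 needs them and Chroma alone won't do keyword search here):

Plain Text
pip install llama-index-retrievers-bm25


Plain Text
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.retrievers.bm25 import BM25Retriever

# nodes = the parsed nodes that were indexed
vector_retriever = index.as_retriever(similarity_top_k=20)
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=20)

retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=20,
    num_queries=1,  # skip LLM query generation, just fuse the two result sets
    mode="reciprocal_rerank",
)
query_engine = RetrieverQueryEngine.from_args(retriever, node_postprocessors=[rerank])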
Thank you @Logan M - made the rerank changes (looking into hybrid search). Using reranker-large. It seems to be providing better responses, but there's still room for improvement.
Nice! What top k did you use for the initial similarity retrieval? 20? (Tweaking that may help, at the cost of runtime.)
k = 20 and n = 6, but still playing with it a bit and trying some different rerankers (Cohere, etc.).
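Swapping in Cohere's reranker looks like this, for reference (v0.10.x sketch; COHERE_API_KEY is an illustrative name for wherever your key lives):

Plain Text
pip install llama-index-postprocessor-cohere-rerank


Plain Text
from llama_index.postprocessor.cohere_rerank import CohereRerank

# COHERE_API_KEY: illustrative name for your Cohere API key
rerank = CohereRerank(api_key=COHERE_API_KEY, top_n=6)
query_engine = index.as_query_engine(similarity_top_k=20, node_postprocessors=[rerank])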