Hi All

At a glance

The community member has a large website that they crawled and built a vector database for, but the answers from LlamaIndex are not polished enough. They experimented with two approaches to improve answer quality:

a. The "Synthesis over Heterogeneous Data" approach yielded good answers, but had a slow response time of 13 seconds per query.

b. Adding extra information to the documents during the crawl also improved the answer quality.

The community member is now wondering if switching to Pinecone could provide any gains. The comments suggest that the only potential gain would be reduced memory usage, but that's not an issue unless the index is very large.

The comments also provide the following suggestions to improve the responses:

  • Perform text cleaning to remove weird characters and adjust formatting
  • Split the webpages into smaller sections and create a document for each section to improve the embeddings
  • Experiment with increasing the similarity_top_k parameter in the query engine
  • Customize the prompt templates if the LLM is ignoring relevant information in the source nodes
  • Implement streaming to make the responses feel faster
Hi All!

I had a quick question. I have a large website that I crawled, along with some local docs, to build the vector database, and the LlamaIndex answers are acceptable but not polished.

I am experimenting in 2 areas to improve answer quality:
  1. the way I am querying over the data
  2. the way I am building the index
I found 2 ways to improve it so far.
a. The best improvement in answer quality came from the Synthesis over Heterogeneous Data approach, but it caused a 13 second/query response time, which is way too slow.
b. I also gained answer quality by adding a bit of extra_info to the Document during the crawl, before building the index.


Now to the question

I am using the built-in LlamaIndex index = GPTVectorStoreIndex(nodes). Are there any gains to be had by switching over to Pinecone?
6 comments
The only gains by using pinecone would be reduced memory usage (but that's not an issue unless your index is very big)
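If you did want to try it, a minimal sketch of pointing the same index at Pinecone (assuming the legacy llama_index 0.6.x API and pinecone-client v2; the API key, environment, and index name are placeholders) would look something like this:

Plain Text
import pinecone
from llama_index import GPTVectorStoreIndex, StorageContext
from llama_index.vector_stores import PineconeVectorStore

# connect to an existing Pinecone index (placeholder credentials and index name)
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
pinecone_index = pinecone.Index("website-docs")

# back the vector store with Pinecone instead of the default in-memory store
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# same nodes as before, now stored in Pinecone
index = GPTVectorStoreIndex(nodes, storage_context=storage_context)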

To improve responses, any text cleaning you can do will really help (i.e. removing weird characters, adjusting formatting).

Furthermore, if you can split the webpage into smaller sections and create a document for each section, that may also help make the embeddings more representative.
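For example, a rough sketch of that splitting idea (assuming the legacy llama_index Document class; splitting on blank lines is just one possible heuristic):

Plain Text
from llama_index import Document

def split_item_into_documents(item):
    # one Document per section instead of one Document per page
    url = str(item.get("url") or "")
    title = str(item.get("metadata", {}).get("title") or "")
    sections = [s.strip() for s in (item.get("text") or "").split("\n\n") if s.strip()]
    return [
        Document(section, extra_info={"url": url, "title": title})
        for section in sections
    ]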
Thanks for the quick response @Logan M

I am new to this, but I think I am already doing what you said about splitting the web page into smaller sections.

The JSON file from the scrape is an array of objects; each object has this shape:

Plain Text
[{
  "url": "",
  "metadata": {
    "canonicalUrl": "",
    "title": "",
    "description": "",
    "author": null,
    "keywords": null,
    "languageCode": ""
  },
  "text": ""
},...]


the "text", from each object is about 3 paragraphs long, its not extremely long. then im running each object through this tranformer.

Plain Text
from llama_index import Document  # legacy llama_index import (assumed to be at the top of the script)

def transform_dataset_item(item):
    url = item.get("url") or ""  # default to an empty string if url is None
    title = item.get("metadata", {}).get("title") or ""  # same for title
    description = (
        item.get("metadata", {}).get("description") or ""
    )  # same for description

    # Ensure that all the values are strings
    url, title, description = str(url), str(title), str(description)

    # one Document per crawled object, with url/title/description attached as extra_info
    return Document(
        item.get("text"),
        extra_info={
            "url": url,
            "title": title,
            "description": description,
        },
    )
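
For completeness, a minimal sketch of how these documents could feed into the index (the scrape.json filename is a placeholder; from_documents is the legacy builder):

Plain Text
import json
from llama_index import GPTVectorStoreIndex

with open("scrape.json") as f:
    items = json.load(f)

documents = [transform_dataset_item(item) for item in items]
index = GPTVectorStoreIndex.from_documents(documents)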



So I think that each document is manageable. If this is all we can do, do you know if there are any ways to speed up the "Synthesis over Heterogeneous Data" approach? It's taking 13 seconds and it, by far, had the best answers.
And is the text fairly clean? (i.e. not a bunch of newlines and other noisy characters everywhere?)

Tbh that approach you referenced is largely limited by LLM calls. That approach requires 2 LLM calls at a *minimum* (and each LLM call takes around 3-6s with OpenAI)

I think we can improve results without resorting to that. When you get a response that is not as high quality as you want, if you check response.source_nodes, is the answer actually in the source nodes?

You could try increasing the top k: index.as_query_engine(similarity_top_k=3). If the answer is in the source nodes, but the LLM is ignoring it for whatever reason, you can also try customizing the prompt templates.
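
A minimal sketch of combining both of those (assuming the legacy Prompt class; the template wording is only an example):

Plain Text
from llama_index import Prompt

# example custom question-answering template (wording is illustrative)
qa_template = Prompt(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only the context above, answer the question concisely: {query_str}\n"
)

query_engine = index.as_query_engine(
    similarity_top_k=3,
    text_qa_template=qa_template,
)
response = query_engine.query("What products do you have?")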

Lastly, you can implement streaming to make the responses feel a little faster. You can also enable streaming for the Synthesis over Heterogeneous Data approach, but it's a little tricky to set up there (and it will only stream the last LLM call).
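
A minimal streaming sketch (again assuming the legacy query engine API):

Plain Text
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)
response = query_engine.query("What products do you have?")

# print tokens as they arrive instead of waiting for the full answer
response.print_response_stream()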
Thank you for the help! I checked response.source_nodes and the answers were in there; it was just also including irrelevant answers in the final result. If I ask "what products do you have", it would list a general answer that was right, then go WAY into the details on a random product. I went off and tried similarity_top_k=1 just for fun, and that did not help. Giving it more context helped the answers be somewhat balanced.

I tested each thing and here are my findings.

Configuring a Query Engine

  • similarity_top_k=3 made an improvement; going to k=5 and k=1 made it worse
  • response_mode="tree_summarize" didn't do anything
  • node_postprocessors=[reranker] made it worse
These findings are based on a rough testing workflow that I built. It's hard to get a hard "pass/fail" with natural language answers. I basically have arrays of questions, and it outputs results from each test to a main CSV sheet, along with comments on what was changed. That way you can see how the answers evolve and what caused the change. Do you have any other standardized testing frameworks for this?
We do have some evaluation stuff here:
https://gpt-index.readthedocs.io/en/latest/how_to/evaluation/evaluation.html

I would also recommend checking out this nifty package too
https://github.com/explodinggradients/ragas
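
For reference, a rough sketch of the response evaluator from that evaluation guide (class names and defaults may differ slightly between versions):

Plain Text
from llama_index.evaluation import ResponseEvaluator

evaluator = ResponseEvaluator()
response = query_engine.query("What products do you have?")

# returns "YES" if the answer is supported by the retrieved source nodes, "NO" otherwise
print(evaluator.evaluate(response))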
Oh! This is going to be super helpful! Thank you.