Thanks - let me delve into this approach, and I'll get back to you with how I got on.
Thanks @Logan M - I have got this working, but I'm unsure how this is different from a straightforward simple query engine approach, which I can then add to a sub-query approach. Let me clarify:
This is the custom query engine approach:
I'm using a question about Sam Altman's sacking, because this is specifically not contained within the LLM's training data.
The webSearchDocs function simply returns the content as documents that can be indexed.
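(For reference, webSearchDocs is just my own helper, roughly along these lines - search_web here is a stand-in for whichever search backend/scraper you use:)

from llama_index import Document

def webSearchDocs(query: str, maxhits: int = 2) -> list:
    """Run a web search and wrap each hit's page text in a Document for indexing."""
    # search_web is a placeholder: assumed to return (url, page_text) pairs
    hits = search_web(query, num_results=maxhits)
    return [Document(text=text, metadata={"source": url}) for url, text in hits]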
from llama_index.query_engine import CustomQueryEngine
from llama_index.retrievers import BaseRetriever
from llama_index.response_synthesizers import BaseSynthesizer, get_response_synthesizer

class RAGQueryEngine(CustomQueryEngine):
    """RAG Query Engine."""

    retriever: BaseRetriever
    response_synthesizer: BaseSynthesizer

    def custom_query(self, query_str: str):
        nodes = self.retriever.retrieve(query_str)
        response_obj = self.response_synthesizer.synthesize(query_str, nodes)
        return response_obj
qry = "When and why was Sam Altman sacked from the OpenAI board?. Who left with him? Who temporarily replaced him?"
googleQueryDocs = webSearchDocs(query=qry, maxhits=2)
googleIndex = VectorStoreIndex.from_documents(googleQueryDocs, service_context=service_context)
retriever = googleIndex.as_retriever()
synthesizer = get_response_synthesizer(response_mode="compact")
query_engine = RAGQueryEngine(
    retriever=retriever, response_synthesizer=synthesizer
)
response = query_engine.query(qry)
print(response)
If I use a standard query engine, then this is how I would do it:
qry = "When and why was Sam Altman sacked from the OpenAI board?. Who left with him? Who temporarily replaced him?"
googleQueryDocs = webSearchDocs(query=qry, maxhits=2)
googleIndex = VectorStoreIndex.from_documents(googleQueryDocs, service_context=service_context)
query_engine = googleIndex.as_query_engine()
response = query_engine.query(qry)
print(response)
I get back the same response.
The custom query engine gives me more flexibility, but I need to create a retriever and a synthesizer. Am I missing a trick? Would a sub-query engine be better, or does this more sophisticated angle give me more options?
As a general point, I wouldn't have been able to come up with the second approach without having looked at the first. So in that sense, the first one was very useful. Now I'm trying to understand the specific additional benefit of the first approach
See, now with the custom query engine, you have a query engine that can, on the fly, look something up on the internet and respond.
If you use that in a sub-question query engine using that custom query engine as a QueryEngineTool, doesn't that achieve your original goal?
You can probably clean up the class a little bit too -- let me take a stab in a sec
Hmmm... I'm having trouble chaining them into a subquery engine (including the custom one). I'll find my bug and come back on it
class RAGQueryEngine(CustomQueryEngine):
    """RAG Query Engine."""

    service_context: ServiceContext

    def custom_query(self, query_str: str):
        googleQueryDocs = webSearchDocs(query=query_str, maxhits=2)
        googleIndex = VectorStoreIndex.from_documents(googleQueryDocs, service_context=self.service_context)
        return googleIndex.as_query_engine().query(query_str)

query_engine = RAGQueryEngine(service_context=service_context)

tool = QueryEngineTool.from_defaults(query_engine, name="google_search", description="Useful for looking up information on the internet.")

subquestion_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=[tool, <other tools?>], ...)

response = subquestion_engine.query("query str")
That's closer to what I was thinking
Thank you! I'll test it now
I'm close, but there is still a problem. I don't think the problem is with your code, because it seems to be the same bug I had earlier - but something obvious is missing.
I have now changed the use case to look at the Lyft financial docs (from the 10k example), knowing that there is no 2023 data in the PDF, and none in the LLM training set.
So I would therefore expect the Lyft PDF to be scanned for the 2021 content, and a web query to be done for the 2023 content.
class RAGQueryEngine(CustomQueryEngine):
    """RAG Query Engine."""

    service_context: ServiceContext

    def custom_query(self, query_str: str):
        googleQueryDocs = webSearchDocs(query=query_str, maxhits=maxhits)
        googleIndex = VectorStoreIndex.from_documents(googleQueryDocs, service_context=self.service_context)
        return googleIndex.as_query_engine(service_context=self.service_context).query(query_str)
google_query_engine = RAGQueryEngine(service_context=service_context)
query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(
            name="lyft_10k",
            description="Provides information about Lyft financials for year 2020, and 2021",
        ),
    ),
    QueryEngineTool(
        query_engine=google_query_engine,
        metadata=ToolMetadata(
            name="google_search",
            description="Provides information from a google search when there is no concrete answer from existing context",
        ),
    ),
]

s_engine = SubQuestionQueryEngine.from_defaults(
    service_context=service_context,
    query_engine_tools=query_engine_tools
)
qry = "What was the adjusted revenue for lyft, for Q1 in 2023. How does this compare to Q1 in 2021. Where no specific answer exists, provide other context that may allow someone to make their own conclusion. Provide references, source web sites, or page numbers where these exist"
response = s_engine.query(qry)
Error follows:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pydantic/v1/main.py:522, in BaseModel.parse_obj(cls, obj)
521 try:
--> 522 obj = dict(obj)
523 except (TypeError, ValueError) as e:
ValueError: dictionary update sequence element #0 has length 1; 2 is required
The above exception was the direct cause of the following exception:
So it looks like this problem is not related to this specific use-case. I also now have it with a previously working 10k example. Something in my environment may have been corrupted
hmm, are you using an open-source LLM? It seems like the JSON the LLM wrote is not correct
I am using an open-source LLM, but this has worked in the past. I've blitzed my pip and started reinstalling everything. I'm not using the new RAGQueryEngine. Just the standard lyft and uber engines. Each individual query works, but the subquery engine fails
The model is zephyr Beta 7B - certified as working on the LlamaIndex site
I'm trying zephyr alpha now
If I spent all my time using the actual OpenAI API, I'd be broke
Same error on zephyr alpha, and this worked perfectly in the past. Has something changed with the sub-question query engine code?
I did try downgrading to 0.9.30, and 0.9.31, but it didn't make a difference
** certified on initial impression, but not always
nope, nothing changed. But models do not have deterministic output, especially when the temperature is high
I've set Temperature to 0.0 for RAG
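(For reference, the local model is served through LM Studio's OpenAI-compatible endpoint, wired up roughly like this - the model name and base URL are just my local setup, and using OpenAILike rather than another wrapper is incidental:)

from llama_index import ServiceContext
from llama_index.llms import OpenAILike

llm = OpenAILike(
    model="local-model",                  # whatever LM Studio is currently serving
    api_base="http://localhost:1234/v1",  # LM Studio's local endpoint
    api_key="not-needed",
    temperature=0.0,                      # keep RAG output as deterministic as possible
    is_chat_model=True,
)
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")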
I think something else is broken - I'll try to resolve it and come back. I appreciate your help with this
Hmmm.... looks like something has changed in the local LLM. I just retested this against OpenAI and it performed the previous 10k test with zero problems. I'll come back to this when I've worked around it
So.... this could be a case where LlamaIndex is making an assumption about the format, or doing some parsing, that works with OpenAI but is stricter with other models. This is the return from the AI model:
"id": "cmpl-jb446ivlyhs3tet14398p1",
"object": "text_completion",
"created": 1705535543,
"model": "/Users/jon/.cache/lm-studio/models/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/mixtral-8x7b-instruct-v0.1.Q8_0.gguf",
"choices": [
{
"index": 0,
"text": "
json\n{\n "items": [\n {\n "sub_question": "What are the customer segments of Uber",\n "tool_name": "uber_10k"\n },\n {\n "sub_question": "What are the geographies of Uber",\n "tool_name": "uber_10k"\n },\n {\n "sub_question": "What are the customer segments of Lyft",\n "tool_name": "lyft_10k"\n },\n {\n "sub_question": "What are the geographies of Lyft",\n "tool_name": "lyft_10k"\n }\n ]\n}\n
",
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 450,
"completion_tokens": 171,
"total_tokens": 621
}
Now when I dug down into the 'text' key, this is what I see:
{
  "items": [
    {
      "sub_question": "What are the customer segments of Uber",
      "tool_name": "uber_10k"
    },
    {
      "sub_question": "What are the geographies of Uber",
      "tool_name": "uber_10k"
    },
    {
      "sub_question": "What are the customer segments of Lyft",
      "tool_name": "lyft_10k"
    },
    {
      "sub_question": "What are the geographies of Lyft",
      "tool_name": "lyft_10k"
    }
  ]
}
It looks fine to me, so I don't know why it is being rejected. It may be that it is coming back as valid JSON but with a str type rather than a dict type, and forcing a conversion may work. This shouldn't need to happen if pydantic were doing its job, but it seems that the cast of valid JSON from str isn't taking place.
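(What I mean by forcing a conversion - just a rough sketch, not the actual parser code:)

import json

maybe_dict = '{"sub_question": "What are the customer segments of Uber", "tool_name": "uber_10k"}'

# if the parser is being handed a str that contains valid JSON, coercing it to a dict
# first would sidestep pydantic's "expected dict not str" complaint
if isinstance(maybe_dict, str):
    maybe_dict = json.loads(maybe_dict)
# maybe_dict is now a dict, which SubQuestion.parse_obj() will accept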
In the above instance, this was a mixtral model, but the same issue occurs with zephyr and multiple others
And here is the bottom of the error trace to suggest that this is exactly what is taking place:
File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/llama_index/question_gen/llm_generators.py:78, in LLMQuestionGenerator.generate(self, tools, query)
71 prediction = self._llm.predict(
72 prompt=self._prompt,
73 tools_str=tools_str,
...
526 return cls(**obj)
ValidationError: 1 validation error for SubQuestion
__root__
SubQuestion expected dict not str (type=type_error)
My guess is there's something causing an issue... I noticed the text actually starts with json\n...
but that doesn't feel like it would be difficult to remove
I can try running the function myself in a bit
I've had a play... It looks like the 'items' key is causing the issue here:
The JSON payload seems perfectly legal and passes initial parsing. It's when it comes to the SubQuestion parsing that issues arise. With the retained 'items' key it falls over, but if you pass the values contained within the 'items' key, then it passes parsing.
For example:
from typing import Any

from llama_index.output_parsers.base import StructuredOutput
from llama_index.output_parsers.utils import parse_json_markdown
from llama_index.question_gen.types import SubQuestion
from llama_index.types import BaseOutputParser

def parse(output: str) -> Any:
    json_dict = parse_json_markdown(output)
    if not json_dict:
        raise ValueError(f"No valid JSON found in output: {output}")
    items = json_dict['items']  # <=== Note that this will now point to the actual subquestions/tools
    # sub_questions = [SubQuestion.parse_obj(item) for item in json_dict]
    sub_questions = [SubQuestion.parse_obj(item) for item in items]  # <=== now pointing to items, not json_dict
    print(f"sub_questions = {sub_questions}")
    # return StructuredOutput(raw_output=output, parsed_output=sub_questions)
x = "
json\n{\n "items": [\n {\n "sub_question": "What were the top 3 customer segments for Lyft in terms of revenue growth in year 2021?",\n "tool_name": "lyft_10k"\n },\n {\n "sub_question": "Which geographies had the highest revenue growth for Lyft in year 2021?",\n "tool_name": "lyft_10k"\n },\n {\n "sub_question": "What were the top 3 customer segments for Uber in terms of revenue growth in year 2021?",\n "tool_name": "uber_10k"\n },\n {\n "sub_question": "Which geographies had the highest revenue growth for Uber in year 2021?",\n "tool_name": "uber_10k"\n }\n ]\n}\n
This gives the following:
What were the top 3 customer segments for Lyft in terms of revenue growth in year 2021?, ==> lyft_10k
Which geographies had the highest revenue growth for Lyft in year 2021?, ==> lyft_10k
What were the top 3 customer segments for Uber in terms of revenue growth in year 2021?, ==> uber_10k
Which geographies had the highest revenue growth for Uber in year 2021?, ==> uber_10k
sub_questions = [SubQuestion(sub_question='What were the top 3 customer segments for Lyft in terms of revenue growth in year 2021?', tool_name='lyft_10k'), SubQuestion(sub_question='Which geographies had the highest revenue growth for Lyft in year 2021?', tool_name='lyft_10k'), SubQuestion(sub_question='What were the top 3 customer segments for Uber in terms of revenue growth in year 2021?', tool_name='uber_10k'), SubQuestion(sub_question='Which geographies had the highest revenue growth for Uber in year 2021?', tool_name='uber_10k')]
i.e. no error.
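To boil it down, here's a minimal repro of what I think is happening - the original code iterates the parsed dict itself, so parse_obj gets handed the string key "items" rather than each item dict:

from llama_index.question_gen.types import SubQuestion

json_dict = {"items": [{"sub_question": "q", "tool_name": "lyft_10k"}]}

# iterating the dict yields its keys, so the parser ends up calling SubQuestion.parse_obj("items"),
# which is exactly the "expected dict not str" ValidationError above (with the dict() ValueError
# from the first traceback as its cause)

# iterating the list under 'items' hands parse_obj real dicts, and it works:
sub_questions = [SubQuestion.parse_obj(item) for item in json_dict["items"]]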
The JSON payload is obviously an LLM output, but it conforms to a LlamaIndex design, correct? In which case, something that is not being caught when it's run with OpenAI is failing when run with anything else. Is there a routine that is OpenAI-specific (e.g. the JSON payload design, or the parser) that is not going to work for open-source engines?
I think probably, the parse_json_markdown(output) needs to be more robust, or there needs to be another function to get the list of JSON items
In this case, items: [] was not part of the instruction to the LLM - it hallucinated that
So, maybe there is some way to more generically get a list of dicts out from the resulting output
either that, or the prompt to generate the questions should be tweaked slightly for the given LLM you are using?
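Something like this, roughly (untested, and going from memory on the from_defaults signature - CUSTOM_SUB_QUESTION_PROMPT_TMPL stands in for whatever re-worded template you land on):

from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.question_gen.llm_generators import LLMQuestionGenerator

# build a question generator with a tweaked prompt, then hand it to the sub-question engine
question_gen = LLMQuestionGenerator.from_defaults(
    service_context=service_context,
    prompt_template_str=CUSTOM_SUB_QUESTION_PROMPT_TMPL,
)

s_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    question_gen=question_gen,
    service_context=service_context,
)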
Interestingly, I got the same JSON payload structure with multiple models (zephyr, mixtral, hermes, etc.). Some LLMs will not generate a JSON payload at all, but when one does, why would it automatically choose this? Is it an LMChat standard bias? Something in the LlamaIndex prompt (for the subquestions) seems to be promoting this as a common output.
Probably that should be removed? It should just be a list of JSON items, no need for the items prefix
this is in llama_index/question_gen/prompts.py
It's obviously there for 'some' reason. May be worth looking at the git log to see why. Either way, perhaps there's an option (e.g. a manual setting/flag) to allow a non-'items' payload to be chosen
I think it's an arbitrary design decision IMO (trust me, I spend all day in this codebase lol). But then our markdown parsing seems to have forgotten that it was ever there.
We seem to have got there in the end
Worth confirming and removing if that makes sense. Not sure why it hasn't crashed on OpenAI then.
OpenAI uses function calling instead of generating JSON
so it's much safer/more reliable
For the moment, I've done a git clone on llama-index and added the following into the parse() function:
if 'items' in json_dict:
    json_dict = json_dict['items']
I've tested this and it seems to work
so I can do a pip install from a local version. I assume your suggested change will make its way into a future version. There are a lot of open-source LLMs underpinning LlamaIndex these days, with OpenAI seeming to get more expensive as more sophisticated use-cases are tested
If you want to make a PR with this change, that would be great! Otherwise I'll make one in a bit as well
I'm happy to do it - leave it with me
I really appreciate your support with this.
PR created - I did not create a test suite
that's fine! As long as CI passes