Thanks - let me delve into this approach, and I'll get back to you with how I got on.
Thanks @Logan M - I have got this working, but I'm unsure how this is different from a straightforward simple query engine approach, which I can then add to a sub-query approach. Let me clarify:
This is the custom query engine approach:
I'm using a question about Sam Altman's sacking, because this is specifically not contained within the LLM's training data.
The webSearchDocs function simply returns the content as documents that can be indexed.
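(For reference, webSearchDocs is just my own helper, roughly along these lines - search_web here is a stand-in for whichever search backend/scraper you use:)

from llama_index import Document

def webSearchDocs(query: str, maxhits: int = 2) -> list:
    """Run a web search and wrap each hit's page text in a Document for indexing."""
    # search_web is a placeholder: assumed to return (url, page_text) pairs
    hits = search_web(query, num_results=maxhits)
    return [Document(text=text, metadata={"source": url}) for url, text in hits]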
from llama_index.query_engine import CustomQueryEngine
from llama_index.retrievers import BaseRetriever
from llama_index.response_synthesizers import BaseSynthesizer, get_response_synthesizer

class RAGQueryEngine(CustomQueryEngine):
    """RAG Query Engine."""

    retriever: BaseRetriever
    response_synthesizer: BaseSynthesizer

    def custom_query(self, query_str: str):
        nodes = self.retriever.retrieve(query_str)
        response_obj = self.response_synthesizer.synthesize(query_str, nodes)
        return response_obj
qry = "When and why was Sam Altman sacked from the OpenAI board?. Who left with him? Who temporarily replaced him?"
googleQueryDocs = webSearchDocs(query=qry, maxhits=2)
googleIndex = VectorStoreIndex.from_documents(googleQueryDocs, service_context=service_context)
retriever = googleIndex.as_retriever()
synthesizer = get_response_synthesizer(response_mode="compact")
query_engine = RAGQueryEngine(
    retriever=retriever, response_synthesizer=synthesizer
)
response = query_engine.query(qry)
print(response)
If I use a standard query engine, then this is how I would do it:
qry = "When and why was Sam Altman sacked from the OpenAI board?. Who left with him? Who temporarily replaced him?"
googleQueryDocs = webSearchDocs(query=qry, maxhits=2)
googleIndex = VectorStoreIndex.from_documents(googleQueryDocs, service_context=service_context)
query_engine = googleIndex.as_query_engine()
response = query_engine.query(qry)
print(response)
I get back the same response.
The custom query engine gives me more flexibility, but I need to create a retriever and a synthesizer. Am I missing a trick? Would a sub-query engine be better, or does this more sophisticated angle give me more options?
As a general point, I wouldn't have been able to come up with the second approach without having looked at the first. So in that sense, the first one was very useful. Now I'm trying to understand the specific additional benefit of the first approach
See, now with the custom query engine, you have a query engine that can, on the fly, look something up on the internet and respond.
If you use that in a sub-question query engine using that custom query engine as a QueryEngineTool, doesn't that achieve your original goal?
You can probably clean up the class a little bit too -- let me take a stab in a sec
Hmmm... I'm having trouble chaining them into a subquery engine (including the custom one). I'll find my bug and come back on it
class RAGQueryEngine(CustomQueryEngine):
    """RAG Query Engine."""

    service_context: ServiceContext

    def custom_query(self, query_str: str):
        googleQueryDocs = webSearchDocs(query=query_str, maxhits=2)
        googleIndex = VectorStoreIndex.from_documents(googleQueryDocs, service_context=self.service_context)
        return googleIndex.as_query_engine().query(query_str)

query_engine = RAGQueryEngine(service_context=service_context)

tool = QueryEngineTool.from_defaults(query_engine, name="google_search", description="Useful for looking up information on the internet.")

subquestion_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=[tool, <other tools?>], ...)

response = subquestion_engine.query("query str")
That's closer to what I was thinking
Thank you! I'll test it now
I'm close, but there is still a problem. I don't think the problem is with your code, because it seems to be the same bug I had earlier - but something obvious is missing.
I have now changed the use case to look at the Lyft financial docs (from the 10k example), knowing that there is no 2023 data in the PDF, and none in the LLM training set.
So I would therefore expect the Lyft PDF to be scanned for the 2021 content, and a web query to be done for the 2023 content.
class RAGQueryEngine(CustomQueryEngine):
    """RAG Query Engine."""

    service_context: ServiceContext

    def custom_query(self, query_str: str):
        googleQueryDocs = webSearchDocs(query=query_str, maxhits=maxhits)
        googleIndex = VectorStoreIndex.from_documents(googleQueryDocs, service_context=self.service_context)
        return googleIndex.as_query_engine(service_context=self.service_context).query(query_str)
google_query_engine = RAGQueryEngine(service_context=service_context)
query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(
            name="lyft_10k",
            description="Provides information about Lyft financials for year 2020, and 2021",
        ),
    ),
    QueryEngineTool(
        query_engine=google_query_engine,
        metadata=ToolMetadata(
            name="google_search",
            description="Provides information from a google search when there is no concrete answer from existing context",
        ),
    ),
]

s_engine = SubQuestionQueryEngine.from_defaults(
    service_context=service_context,
    query_engine_tools=query_engine_tools
)
qry = "What was the adjusted revenue for lyft, for Q1 in 2023. How does this compare to Q1 in 2021. Where no specific answer exists, provide other context that may allow someone to make their own conclusion. Provide references, source web sites, or page numbers where these exist"
response = s_engine.query(qry)
Error follows:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pydantic/v1/main.py:522, in BaseModel.parse_obj(cls, obj)
521 try:
--> 522 obj = dict(obj)
523 except (TypeError, ValueError) as e:
ValueError: dictionary update sequence element #0 has length 1; 2 is required
The above exception was the direct cause of the following exception:
So it looks like this problem is not related to this specific use-case. I also now have it with a previously working 10k example. Something in my environment may have been corrupted
hmm, are you using an open-source LLM? It seems like the JSON the LLM wrote is not correct
I am using an open-source LLM, but this has worked in the past. I've blitzed my pip and started reinstalling everything. I'm not using the new RAGQueryEngine. Just the standard lyft and uber engines. Each individual query works, but the subquery engine fails
The model is zephyr Beta 7B - certified as working on the LlamaIndex site
I'm trying zephyr alpha now
If I spent all my time using the actual OpenAI API, I'd be broke
Same error on zephyr alpha, and this worked perfectly in the past. Has something changed with the sub-question query engine code?
I did try downgrading to 0.9.30, and 0.9.31, but it didn't make a difference
** certified on initial impression, but not always
nope, nothing changed. But models do not have deterministic output, especially when the temperature is high
I've set Temperature to 0.0 for RAG
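(For reference, the local model is served through LM Studio's OpenAI-compatible endpoint, wired up roughly like this - the model name and base URL are just my local setup, and using OpenAILike rather than another wrapper is incidental:)

from llama_index import ServiceContext
from llama_index.llms import OpenAILike

llm = OpenAILike(
    model="local-model",                  # whatever LM Studio is currently serving
    api_base="http://localhost:1234/v1",  # LM Studio's local endpoint
    api_key="not-needed",
    temperature=0.0,                      # keep RAG output as deterministic as possible
    is_chat_model=True,
)
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")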
I think something else is broken - I'll try to resolve it and come back. I appreciate your help with this
Hmmm.... looks like something has changed in the local LLM. I just retested this against OpenAI and it performed the previous 10k test with zero problems. I'll come back to this when I've worked around it
So.... this could be a case where LlamaIndex is making an assumption about the format, or doing some parsing, that works with OpenAI but is stricter with other models. This is the return from the AI model:
"id": "cmpl-jb446ivlyhs3tet14398p1",
"object": "text_completion",
"created": 1705535543,
"model": "/Users/jon/.cache/lm-studio/models/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/mixtral-8x7b-instruct-v0.1.Q8_0.gguf",
"choices": [
{
"index": 0,
"text": "
json\n{\n "items": [\n {\n "sub_question": "What are the customer segments of Uber",\n "tool_name": "uber_10k"\n },\n {\n "sub_question": "What are the geographies of Uber",\n "tool_name": "uber_10k"\n },\n {\n "sub_question": "What are the customer segments of Lyft",\n "tool_name": "lyft_10k"\n },\n {\n "sub_question": "What are the geographies of Lyft",\n "tool_name": "lyft_10k"\n }\n ]\n}\n
",
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 450,
"completion_tokens": 171,
"total_tokens": 621
}
Now when I dug down into the 'text' key, this is what I see:
{
  "items": [
    {
      "sub_question": "What are the customer segments of Uber",
      "tool_name": "uber_10k"
    },
    {
      "sub_question": "What are the geographies of Uber",
      "tool_name": "uber_10k"
    },
    {
      "sub_question": "What are the customer segments of Lyft",
      "tool_name": "lyft_10k"
    },
    {
      "sub_question": "What are the geographies of Lyft",
      "tool_name": "lyft_10k"
    }
  ]
}
It looks fine to me, so I don't know why it is being rejected. It may be that it is coming back as valid JSON but with a str type rather than a dict type, and forcing a conversion may work. This shouldn't need to happen if pydantic were doing its job, but it seems that the cast of valid JSON from str isn't taking place.
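(What I mean by forcing a conversion - just a rough sketch, not the actual parser code:)

import json

maybe_dict = '{"sub_question": "What are the customer segments of Uber", "tool_name": "uber_10k"}'

# if the parser is being handed a str that contains valid JSON, coercing it to a dict
# first would sidestep pydantic's "expected dict not str" complaint
if isinstance(maybe_dict, str):
    maybe_dict = json.loads(maybe_dict)
# maybe_dict is now a dict, which SubQuestion.parse_obj() will accept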
In the above instance, this was a mixtral model, but the same issue occurs with zephyr and multiple others
And here is the bottom of the error trace to suggest that this is exactly what is taking place:
File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/llama_index/question_gen/llm_generators.py:78, in LLMQuestionGenerator.generate(self, tools, query)
71 prediction = self._llm.predict(
72 prompt=self._prompt,
73 tools_str=tools_str,
...
526 return cls(**obj)
ValidationError: 1 validation error for SubQuestion
__root__
SubQuestion expected dict not str (type=type_error)
My guess is there's something causing an issue... I noticed the text actually starts with json\n...
but that doesn't feel like it would be difficult to remove
I can try running the function myself in a bit
I've had a play... It looks like the 'items' key is causing the issue here:
The JSON payload seems perfectly legal and passes initial parsing. It's when it comes to the SubQuestion parsing that issues arise. With the retained 'items' key it falls over, but if you pass the values contained within the 'items' key, then it passes parsing.
For example:
from typing import Any

from llama_index.output_parsers.base import StructuredOutput
from llama_index.output_parsers.utils import parse_json_markdown
from llama_index.question_gen.types import SubQuestion
from llama_index.types import BaseOutputParser

def parse(output: str) -> Any:
    json_dict = parse_json_markdown(output)
    if not json_dict:
        raise ValueError(f"No valid JSON found in output: {output}")
    items = json_dict['items']  # <=== Note that this will now point to the actual subquestions/tools
    # sub_questions = [SubQuestion.parse_obj(item) for item in json_dict]
    sub_questions = [SubQuestion.parse_obj(item) for item in items]  # <=== now pointing to items, not json_dict
    print(f"sub_questions = {sub_questions}")
    # return StructuredOutput(raw_output=output, parsed_output=sub_questions)
x = "
json\n{\n "items": [\n {\n "sub_question": "What were the top 3 customer segments for Lyft in terms of revenue growth in year 2021?",\n "tool_name": "lyft_10k"\n },\n {\n "sub_question": "Which geographies had the highest revenue growth for Lyft in year 2021?",\n "tool_name": "lyft_10k"\n },\n {\n "sub_question": "What were the top 3 customer segments for Uber in terms of revenue growth in year 2021?",\n "tool_name": "uber_10k"\n },\n {\n "sub_question": "Which geographies had the highest revenue growth for Uber in year 2021?",\n "tool_name": "uber_10k"\n }\n ]\n}\n
This gives the following:
What were the top 3 customer segments for Lyft in terms of revenue growth in year 2021?, ==> lyft_10k
Which geographies had the highest revenue growth for Lyft in year 2021?, ==> lyft_10k
What were the top 3 customer segments for Uber in terms of revenue growth in year 2021?, ==> uber_10k
Which geographies had the highest revenue growth for Uber in year 2021?, ==> uber_10k
sub_questions = [SubQuestion(sub_question='What were the top 3 customer segments for Lyft in terms of revenue growth in year 2021?', tool_name='lyft_10k'), SubQuestion(sub_question='Which geographies had the highest revenue growth for Lyft in year 2021?', tool_name='lyft_10k'), SubQuestion(sub_question='What were the top 3 customer segments for Uber in terms of revenue growth in year 2021?', tool_name='uber_10k'), SubQuestion(sub_question='Which geographies had the highest revenue growth for Uber in year 2021?', tool_name='uber_10k')]
i.e. no error.
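To boil it down, here's a minimal repro of what I think is happening - the original code iterates the parsed dict itself, so parse_obj gets handed the string key "items" rather than each item dict:

from llama_index.question_gen.types import SubQuestion

json_dict = {"items": [{"sub_question": "q", "tool_name": "lyft_10k"}]}

# iterating the dict yields its keys, so the parser ends up calling SubQuestion.parse_obj("items"),
# which is exactly the "expected dict not str" ValidationError above (with the dict() ValueError
# from the first traceback as its cause)

# iterating the list under 'items' hands parse_obj real dicts, and it works:
sub_questions = [SubQuestion.parse_obj(item) for item in json_dict["items"]]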
The JSON payload is obviously an LLM output, but it conforms to a LlamaIndex design, correct? In which case, something that is not being caught when it's run with OpenAI is failing when run with anything else. Is there a routine that is OpenAI-specific (e.g. the JSON payload design, or the parser) that is not going to work for open-source engines?
I think probably, the parse_json_markdown(output) needs to be more robust, or there needs to be another function to get the list of JSON items
In this case, items: [] was not part of the instruction to the LLM - it hallucinated that
So, maybe there is some way to more generically get a list of dicts out from the resulting output
either that, or the prompt to generate the questions should be tweaked slightly for the given LLM you are using?
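Something like this, roughly (untested, and going from memory on the from_defaults signature - CUSTOM_SUB_QUESTION_PROMPT_TMPL stands in for whatever re-worded template you land on):

from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.question_gen.llm_generators import LLMQuestionGenerator

# build a question generator with a tweaked prompt, then hand it to the sub-question engine
question_gen = LLMQuestionGenerator.from_defaults(
    service_context=service_context,
    prompt_template_str=CUSTOM_SUB_QUESTION_PROMPT_TMPL,
)

s_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    question_gen=question_gen,
    service_context=service_context,
)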
Interestingly, I got the same JSON payload structure with multiple models (zephyr, mixtral, hermes, etc.). Some LLMs will not generate a JSON payload at all, but when one does, why would it automatically choose this? Is it an LMChat standard bias? Something in the LlamaIndex prompt (for the subquestions) seems to be promoting this as a common output.
Probably that should be removed? It should just be a list of JSON items, no need for the items prefix
this is in llama_index/question_gen/prompts.py
It's obviously there for 'some' reason. May be worth looking at the git log to see why. Either way, perhaps there's an option (e.g. a manual setting/flag) to allow a non-'items' payload to be chosen
I think it's an arbitrary design decision IMO (trust me, I spend all day in this codebase lol). But then our markdown parsing seems to have forgotten that it was ever there.
We seem to have got there in the end
Worth confirming and removing if that makes sense. Not sure why it hasn't crashed on OpenAI then.
OpenAI uses function calling instead of generating JSON
so it's much safer/more reliable
For the moment, I've done a git clone on llama-index and added the following into the parse() function:
if 'items' in json_dict:
    json_dict = json_dict['items']
I've tested this and it seems to work
so I can do a pip install from a local version. I assume your suggested change will make its way into a future version. There are a lot of open-source LLMs underpinning LlamaIndex these days, with OpenAI seeming to get more expensive as more sophisticated use-cases are tested
If you want to make a PR with this change, that would be great! Otherwise I'll make one in a bit as well
I'm happy to do it - leave it with me
I really appreciate your support with this.
PR created - I did not create a test suite
that's fine! As long as CI passes