Some examples of your class and which LLM you are using would be helpful
By "failing silently" I'm assuming the LLM is just not predicting the proper structure?
Hi, I'm a teammate of @Dankey_Spankey and I can give a little more context to our situation. By failing silently, I mean that structured predict returns an error saying it received 0 tool calls when it attempts to parse our output. After a bit of digging, though, it seems what's actually going on is that the on-the-fly function-calling tool LlamaIndex generates either isn't created or fails during execution and never produces the output that the structured predict method expects. The error itself isn't silent, since it's clear the tool failed, but it doesn't provide an explanation or stack trace of what actually went wrong inside the tool.
I'm not sure I can provide an exact code snippet due to proprietary limitations, but we have nested BaseModels similar to the types in the documentation. The structured prediction works fine when our output class is a single non-nested BaseModel whose fields are built-in Python types. However, even introducing simple nesting (like replacing one of the string fields in the top-level model with a BaseModel that contains only one string field) causes the tool to fail.
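For illustration, the shape is roughly this (a hypothetical mock with made-up field names, not our real schema):
from pydantic import BaseModel, Field

class Detail(BaseModel):
    value: str = Field(description="A single string field")

# works fine: flat model, built-in types only
class ReportFlat(BaseModel):
    title: str = Field(description="Title of the report")
    summary: str = Field(description="One-line summary")

# fails for us: same model, except one string field is swapped for a nested BaseModel
class ReportNested(BaseModel):
    title: str = Field(description="Title of the report")
    detail: Detail = Field(description="A nested single-field model")
We feed these into structured predict the same way the docs show; the flat version comes back fine, the nested one hits the 0-tool-calls error.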
I understand that not being able to see the class isn't an ideal troubleshooting environment so I can write a mock if you need something more visual.
Any approximate mock, plus how you're currently using the pydantic class (along with which LLM class you're using), would be helpful
This problem was challenging for me to solve. I ended up writing a nested, recursive model walker with loop detection, which worked. Consider introspection with Pydantic.
I'd considered that but given that the OpenAI API and LlamaIndex both have documented examples of a case like this "just working" I wanted to check here to see if I missed something or if it's not working as intended before I implement a custom solution.
oh my, using mistral/nemo
That will probably be 100x hard mode. Let me try with Ollama locally and see what's up though
IMO structured outputs + open-source models == bad time (open-source models are not great at instruction following, especially with complex output models)
I know this example isn't real, but I would use a flat model where possible for open-source models, pro tip
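e.g., something along these lines (made-up names, just to show what I mean by flattening):
from pydantic import BaseModel, Field

# nested version: small open-source models often fumble the sub-object
class Address(BaseModel):
    city: str = Field(description="City name")

class PersonNested(BaseModel):
    name: str = Field(description="The person's name")
    address: Address = Field(description="Where they live")

# flattened version: one level of plain fields is much easier for the model to emit
class PersonFlat(BaseModel):
    name: str = Field(description="The person's name")
    address_city: str = Field(description="City where they live")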
Unfortunately our use case is pretty tightly tied to NIMs at the moment. We have the capability to switch to Ollama or drop in other NIMs but they're our primary driver.
Mistral isn't completely necessary, I've just noticed good accuracy with it so if there's a better model for this kind of thing I'd love to hear suggestions
Code-focused LLMs might have a better time (writing JSON is easier for models trained on code, in my experience), e.g. codestral, qwen2.5-coder, etc.
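e.g., to try one quickly through Ollama (assuming you've pulled the tag locally; the tag name here is just an example):
from llama_index.llms.ollama import Ollama

llm = Ollama(model="qwen2.5-coder:latest", request_timeout=120.0)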
I realize what we're trying to do also includes retrieval which is a whole extra dimension but it's unfortunately also a requirement
I really appreciate you taking the time to look at all this, thank you
Yea that example is a lot more straightforward (small input prompt)
In a query engine, there's going to be a lot of text it needs to read, and still remember to output the proper structure
Yea, pretty easy to replicate with Ollama (both llama3.1 and mistral-nemo suck lol)
What is useful is setting up validators to try and force whatever the llm wrote into the structure you expect
Was experimenting with this + the Paul Graham essay:
import json

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.ollama import Ollama
from llama_index.llms.openai import OpenAI
from pydantic import BaseModel, Field, field_validator


class Fact(BaseModel):
    """A fact from the text."""

    fact: str = Field(description="A fact from the text.")
    dates: list[str] | None = Field(description="The dates the fact is relevant to.")
    surrounding_text: str | None = Field(description="The text surrounding the fact.")


class Facts(BaseModel):
    """A list of facts from the text."""

    facts: list[Fact] = Field(description="A list of facts.")

    @field_validator("facts", mode="before")
    def validate_facts(cls, v):
        # coerce whatever the LLM wrote (raw JSON string, list of dicts, etc.) into Fact objects
        if not v:
            raise ValueError("Facts list is empty")
        if isinstance(v, str):
            v = json.loads(v)
        if isinstance(v, list):
            return [Fact(**fact) for fact in v]
        return v


llm = Ollama(model="llama3.1:latest", request_timeout=120.0)
# llm = Ollama(model="mistral-nemo:latest", request_timeout=120.0)
# llm = OpenAI(model="gpt-4o-mini")

documents = SimpleDirectoryReader("./docs/docs/examples/data/paul_graham").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(llm=llm, output_cls=Facts)
response = query_engine.query("What did the author do growing up? Extract the facts.")

print(type(response))
print(response)

# this will fail if the llm failed to write a valid object
print(response.facts)
(the optional fields on Fact are always missing, but at least it doesn't error out)
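One tweak that would probably help there (plain Pydantic v2 behavior, nothing LlamaIndex-specific): give the optional fields an explicit None default so validation still passes when the LLM leaves them out entirely
class Fact(BaseModel):
    """A fact from the text."""

    fact: str = Field(description="A fact from the text.")
    # default=None makes these truly optional: validation succeeds even if the
    # LLM omits the keys instead of writing null
    dates: list[str] | None = Field(default=None, description="The dates the fact is relevant to.")
    surrounding_text: str | None = Field(default=None, description="The text surrounding the fact.")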
Interesting, how do you think codellama-70b would do? I'm somewhat limited even in model choice because I can only select NIMs I can run on-premise, and not all of them are available for that. Did you see the error I was talking about with the 0 tool calls when using mistral with Ollama? Or is that NIM-specific?
For some reason, the failure isn't silent anymore. Here's a stack trace.
llm-1 | Terminating the parser. Please open an issue at
llm-1 | https://github.com/noamgat/lm-format-enforcer/issues with the prefix and CharacterLevelParser parameters
llm-1 | Traceback (most recent call last):
llm-1 | File "/opt/nim/llm/.venv/lib/python3.10/site-packages/lmformatenforcer/tokenenforcer.py", line 96, in _compute_allowed_tokens
llm-1 | self._collect_allowed_tokens(state.parser, self.tokenizer_tree.root, allowed_tokens, shortcut_key)
llm-1 | File "/opt/nim/llm/.venv/lib/python3.10/site-packages/lmformatenforcer/tokenenforcer.py", line 144, in _collect_allowed_tokens
llm-1 | self._collect_allowed_tokens(next_parser, next_tree_node, allowed_tokens, None)
llm-1 | File "/opt/nim/llm/.venv/lib/python3.10/site-packages/lmformatenforcer/tokenenforcer.py", line 142, in _collect_allowed_tokens
llm-1 | next_parser = parser.add_character(character)
llm-1 | File "/opt/nim/llm/.venv/lib/python3.10/site-packages/lmformatenforcer/characterlevelparser.py", line 152, in add_character
llm-1 | updated_parser = parser.add_character(new_character)
llm-1 | File "/opt/nim/llm/.venv/lib/python3.10/site-packages/lmformatenforcer/jsonschemaparser.py", line 74, in add_character
llm-1 | updated_parser.object_stack[receiving_idx] = updated_parser.object_stack[receiving_idx].add_character(new_character)
llm-1 | File "/opt/nim/llm/.venv/lib/python3.10/site-packages/lmformatenforcer/jsonschemaparser.py", line 347, in add_character
llm-1 | self.current_key_parser = get_parser(
llm-1 | File "/opt/nim/llm/.venv/lib/python3.10/site-packages/lmformatenforcer/jsonschemaparser.py", line 187, in get_parser
llm-1 | return get_parser(parsing_state, merged_schema)
llm-1 | File "/opt/nim/llm/.venv/lib/python3.10/site-packages/lmformatenforcer/jsonschemaparser.py", line 221, in get_parser
llm-1 | raise ValueError("No definitions found in schema")
llm-1 | ValueError: No definitions found in schema
llm-1 | WARNING 2024-11-15 21:13:55.219 serving_chat.py:263] A tool call was detected, but the resulting JSON was unparseable. Try increasing `max_tokens`: Unterminated string starting at: line 1 column 94 (char 93)
I didn't see the zero tool calls error, but that might be due to the version you are using?
That's interesting, seems like the tool call was longer than it had room to write -- that's easier to fix, just by setting max_tokens larger (note that the larger this value is, the less room is left for input; it takes away from input space)
I'm using the most recent versions of everything, I think.
llama-index==0.11.23
llama-index-embeddings-nvidia==0.2.5
llama-index-embeddings-ollama==0.3.1
llama-index-vector-stores-chroma==0.3.0
llama-index-llms-nvidia==0.2.6
llama-index-llms-ollama==0.3.0
Taking away from input space means that repacking will take longer and just result in more calls, right? That might be a tradeoff I'm willing to make if the output is actually there
I was just getting pydantic errors as the response text (because the llm was writing an incorrect tool call) -- this is why I wrote a custom validator
Yea exactly, that's the general tradeoff
setting it to 512 or 1024 is probably enough
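For reference, a minimal sketch of where that knob lives, assuming the NVIDIA class from llama-index-llms-nvidia pointed at a self-hosted NIM (model id and URL here are placeholders):
from llama_index.llms.nvidia import NVIDIA

# a larger max_tokens gives the tool-call JSON room to finish,
# at the cost of input context
llm = NVIDIA(
    model="your-nim-model-id",
    base_url="http://localhost:8000/v1",
    max_tokens=1024,
)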
This is the error on client-side:
data-process-1 | File "/usr/local/lib/python3.11/site-packages/llama_index/llms/nvidia/base.py", line 290, in get_tool_calls_from_response
data-process-1 | raise ValueError(
data-process-1 | ValueError: Expected at least one tool call, but got 0 tool calls.
hmm, that might be nvidia specific actually?
It's a tricky option to have a default for. If the LLM didn't make a tool call, that means it just wrote some text, which may or may not be what the user expects
Once all this is nested inside a query engine, it's a little hard to control imo
max_tokens for the NVIDIA class seems to be 1024 by default and I bump it up to 2048 in my configuration
Is a query engine even the right tool for me to be using here then? I just need to perform RAG and get structured output. If there's a way that gives me more control over individual components of that process I'd be happy to try it. I'm just still fairly new to LlamaIndex
Imo, something like this will probably give you the most control. Sadly you can't really force a tool call, but this will at least give you the tools to handle it better
from llama_index.core.program.function_calling import get_function_tool

query = "..."  # your user query

retriever = index.as_retriever(similarity_top_k=2, ...)
nodes = retriever.retrieve(query)
context_str = "\n\n".join([x.text for x in nodes])

tool = get_function_tool(OutputCLS)

# could also use chat_history=chat_history instead, but here we assume a single message
resp = llm.chat_with_tools(
    [tool],
    user_msg=f"Given some context and a user query, do a thing.\nContext: {context_str}\n\nQuery: {query}",
)

tool_calls = llm.get_tool_calls_from_response(resp, error_on_no_tool_call=False)

output_objs = []
for tool_call in tool_calls:
    if tool_call.tool_name == tool.metadata.name:
        output_objs.append(OutputCLS(**tool_call.tool_kwargs))

print(output_objs)
So here you have control over:
- the prompt sent to the LLM
- handling the tool calls
- creating the pydantic objects
If there are no tool calls, then the LLM likely wrote text in resp.message.content
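If you want a fallback for that case, something like this sketch works (reusing the names from above; the JSON-salvage attempt is just an assumption about what the model tends to write):
import json

from pydantic import ValidationError

if not tool_calls:
    # the model skipped the tool call and wrote plain text instead;
    # try to salvage it as JSON before giving up
    raw = resp.message.content or ""
    try:
        output_objs.append(OutputCLS(**json.loads(raw)))
    except (json.JSONDecodeError, ValidationError):
        print("No tool call and unparseable text:", raw)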
This has been extremely helpful, thanks so much. My day's ending so I'll look at those docs on Monday. Really appreciate your help.
happy to help! Hope this unblocks you