
Hello there, I was going through the structured output notebook with my own data and came across an error which I do not understand. My RAG pipeline follows section 2, "Plug into RAG Pipeline", exactly. At query time I hit the following error. Can anyone provide some insight into how it occurred?

this is the notebook https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/structured_outputs/structured_outputs.ipynb

Plain Text
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[20], line 1
----> 1 response_header = header_query_engine.query(new_header_query)

File <hidden_path>\.venv\lib\site-packages\llama_index\core\instrumentation\dispatcher.py:260, in Dispatcher.span.<locals>.wrapper(func, instance, args, kwargs)
    252 self.span_enter(
    253     id_=id_,
    254     bound_args=bound_args,
   (...)
    257     tags=tags,
    258 )
    259 try:
--> 260     result = func(*args, **kwargs)
    261 except BaseException as e:
    262     self.event(SpanDropEvent(span_id=id_, err_str=str(e)))

File <hidden_path>\.venv\lib\site-packages\llama_index\core\base\base_query_engine.py:52, in BaseQueryEngine.query(self, str_or_query_bundle)
     50     if isinstance(str_or_query_bundle, str):
     51         str_or_query_bundle = QueryBundle(str_or_query_bundle)
---> 52     query_result = self._query(str_or_query_bundle)
     53 dispatcher.event(
     54     QueryEndEvent(query=str_or_query_bundle, response=query_result)
     55 )
     56 return query_result
...
--> 302 content = content_template.format(**relevant_kwargs)
    304 message: ChatMessage = message_template.copy()
    305 message.content = content

KeyError: "' Item No"
Did you forget to escape JSON in your template?
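(For anyone else reading: "escape JSON" here means doubling literal braces, since `str.format` treats single braces as placeholders. A minimal, illustrative sketch with made-up template strings:)

```python
# str.format treats single braces as placeholders, so literal JSON written
# into a format-style prompt template must be escaped as "{{" and "}}".
bad_template = 'Return JSON like {"BillTo": "..."} for query: {query}'
good_template = 'Return JSON like {{"BillTo": "..."}} for query: {query}'

try:
    bad_template.format(query="extract the header")
except KeyError as exc:
    error = exc  # format parsed the JSON key "BillTo" as a placeholder name

rendered = good_template.format(query="extract the header")
# rendered: Return JSON like {"BillTo": "..."} for query: extract the header
```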
I didn't create templates. I just created the class `HeaderOutput` and passed it to `llm.as_structured_llm()`, then passed that into the query engine.

Plain Text
class HeaderOutput(BaseModel):
    """Outputs header information containing the following categories: 
    BillTo, BillToAddress, BillToNumber."""
    
    BillTo: str = Field(..., description='This is the name of the customer.')
    BillToAddress: str = Field(..., description='This is the billing address of the customer. If \
cannot be found, extract from the company address of the customer.')
    BillToNumber: str = Field(..., description='This is the Tax Identification Number (TIN) of \
the customer.')
    
struct_llm = llm.as_structured_llm(output_cls=HeaderOutput)

header_query_engine = index.as_query_engine(
    similarity_top_k=top_k,
    node_postprocessors=[reranker],  # note: the keyword is plural
    llm=struct_llm,
    verbose=False
)

new_header_query = """my query"""

response_header = header_query_engine.query(new_header_query)
You mean prompt template?
Does new_header_query have JSON in it?
No, it's basically just a prompt.

Plain Text
new_header_query = """
You are an intelligent structured data extraction assistant.
Your task is to extract header information from the given context. The context
is part of a digitized product invoice document. You must ALWAYS extract all
header information in your response unless the information is not available in
the context. If any information cannot be found in the context,
you need not respond with that information. The context also contains tables;
however, you DO NOT need any information from tables for your extraction. You
must not invent any information or perform any calculations. You must only use
the information provided in the context.

The header information to be extracted are: BillTo, BillToAddress, BillToNumber.

Following are the requirements for the extraction:
- Do not stop extraction until all categories have been extracted.
- For categories not explicitly stated or given as "NA" in the context,
they do NOT require extraction. Omit them from the structured response.
- Do NOT invent your own categories.
- Only use information provided in the context in your response."""
I googled for more examples/solutions and most of them point to creating an LLMTextCompletionProgram, but that isn't mentioned in structured_outputs.ipynb.
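(For reference, the `LLMTextCompletionProgram` approach those examples point to looks roughly like this. Untested sketch: `llm`, `HeaderOutput`, `invoice_text`, and the prompt wording are assumptions carried over from this thread, not from the notebook.)

```python
from llama_index.core.program import LLMTextCompletionProgram

# Sketch only: this bypasses the query engine and asks the LLM to fill the
# Pydantic class directly from whatever text you pass in.
program = LLMTextCompletionProgram.from_defaults(
    output_cls=HeaderOutput,  # the Pydantic model defined above
    prompt_template_str=(
        "Extract the header fields from the following invoice text:\n"
        "{context_str}\n"
    ),
    llm=llm,
)

header = program(context_str=invoice_text)  # returns a HeaderOutput instance
```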
Do I have to use this import? from llama_index.core.bridge.pydantic import BaseModel
Yea I'd recommend that import
Maybe try updating just to be sure?
i updated the package yesterday.

Name: llama-index
Version: 0.10.51

Name: llama-index-core
Version: 0.10.51
We are at 0.10.67 lol
51 is a few weeks old
lol, updated the package to 0.10.67 but it still gives the same error
I'm using from llama_index.core.bridge.pydantic import BaseModel, Field
Not sure what else to say lol I've never run into this issue
seems like to me somehow your prompt template is borked
This is what my data looks like before indexing with MarkdownElementNodeParser
Plain Text
# DOCUMENT TYPE
- Invoice

# LAYOUT TYPE
- Product

# DOCUMENT PROCESSING COUNTRY
- United Kingdom

# DOCUMENT NUMBER
- MK 1759226

# DOCUMENT DATE
- 10/07/24

# CUSTOMER PO NUMBER
- D71280

# SUPPLIER DETAILS
- COMPANY NAME: Essentra Components
- ADDRESS: 2nd Floor Hawthorne House Viking Business Park, Jarrow, Tyne & Wear, NE32 3DP
- TAX IDENTIFICATION NUMBER: GB 243 2909 68

# CUSTOMER DETAILS
- COMPANY NAME: Argus Powerbend
- ADDRESS: Pennywell Industrial Estate, Sunderland, SR4 9EN
- TAX IDENTIFICATION NUMBER: GB 938950575

# DELIVERY/SHIPPING LOCATION NAME
- ARGUS

# DELIVERY/SHIPPING LOCATION ADDRESS
- Pennywell Ind Est, Sunderland, SR4 9EN

# BILLING NAME & ADDRESS
- Argus Powerbend, Pennywell Industrial Estate, Sunderland, SR4 9EN

# CURRENCY CODE
- GBP

# CURRENCY EXCHANGE RATE
- NA

# TABLE OF LINE ITEMS
| Item No. | Customer Item No. | Item Description          | Delivery Note Number | Quantity Ordered | UoM | Quantity Shipped | Unit Price (GBP) | Total Price (GBP) |
|----------|-------------------|---------------------------|----------------------|------------------|-----|------------------|------------------|-------------------|
| 16247    | CAP78             | CP QUICK RLSE UNF1X14     | 1803843508           | 1.920            | TH  | 1.920            | 48.7800          | 93.66             |
|          |                   | CP QUICK RLSE UNF1X14     |                      |                  |     |                  |                  |                   |
|          |                   | Standard Delivery         |                      |                  |     |                  |                  |                   |
Testing your code, it seems to work fine in the basic case
the error being
KeyError: "' Item No" seems related to my data?
I used your prompt, doesn't work for me.
I have no idea what went wrong.
I changed my data and the same error occurred. It always breaks when it comes to the markdown table.
Is there any way to check this?
step through with a debugger?
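(For anyone debugging the same thing, here is a plain-Python sketch of why the markdown table can trigger this exact KeyError. The strings are made up to mirror the data sample above: `MarkdownElementNodeParser` table objects can carry dict-like text, and once that text has been pasted into the prompt, the final `.format()` call parses the brace as a placeholder.)

```python
# Dict-like text with single quotes and a colon, as a serialized table
# row might appear inside a retrieved node (illustrative string).
table_text = "{' Item No.': '16247', ' Item Description': 'CAP78'}"

# By the time content_template.format(**relevant_kwargs) runs, the
# retrieved context is already part of the template string itself.
prompt = "Context information:\n" + table_text + "\n\nQuery: {query_str}"

try:
    prompt.format(query_str="extract the header fields")
except KeyError as exc:
    error = exc  # KeyError: "' Item No" -- same error as in the traceback

# One workaround: double every brace in the node text, then restore the
# real placeholder before formatting.
safe = (
    prompt.replace("{", "{{")
    .replace("}", "}}")
    .replace("{{query_str}}", "{query_str}")
)
rendered = safe.format(query_str="extract the header fields")
```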
thanks for the notebook
This is how I create my index, with both nodes and objects. Different from how you did it, but probably set up correctly.

Plain Text
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser

# instantiate the node parser
node_parser = MarkdownElementNodeParser(
    llm=llm, num_workers=8, show_progress=False)

# actual splitting of the document into nodes
nodes = node_parser.get_nodes_from_documents([document])

# return the nodes in text and tables
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

# create the vector store index
index = VectorStoreIndex(nodes=base_nodes + objects)
This sucks, but I changed to using LangchainOutputParser to increase my chances of getting structured output. Originally I was relying on prompts to get structured output, which fails sometimes.
@Logan M finally got it working. I followed your way of indexing the document (just nodes, without base_nodes or objects). But another error comes up if I try to get the LLM to do more things in my query.

I also set one of the attributes to Optional because sometimes the document simply does not have that information.

Plain Text
from llama_index.core.bridge.pydantic import Field, BaseModel
from typing import Optional

class HeaderOutput(BaseModel):
    """Outputs header information containing the following categories:
    BillTo, BillToAddress, BillToNumber."""

    BillTo: str = Field(..., description='This is the name of the customer.')
    BillToAddress: str = Field(..., description='This is the billing address of the customer. If \
cannot be found, extract from the company address of the customer.')
    BillToNumber: Optional[str] = Field(None, description='This is the Tax Identification Number (TIN) of \
the customer. Leave blank if not found or "NA"')  # default of None (not ...) makes the field truly optional


updated my query to:
Plain Text
new_header_query = """What is the HeaderOutput? Information with NA can be omitted."""
this is the error
Plain Text
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[37], line 3
      1 new_header_query = """What is the HeaderOutput? Information with NA can be omitted."""
----> 3 response_header = header_query_engine.query(new_header_query)

File \[...]\.venv\lib\site-packages\llama_index\core\instrumentation\dispatcher.py:260, in Dispatcher.span.<locals>.wrapper(func, instance, args, kwargs)
    252 self.span_enter(
    253     id_=id_,
    254     bound_args=bound_args,
   (...)
    257     tags=tags,
    258 )
    259 try:
--> 260     result = func(*args, **kwargs)
    261 except BaseException as e:
    262     self.event(SpanDropEvent(span_id=id_, err_str=str(e)))

File \[...]\.venv\lib\site-packages\llama_index\core\base\base_query_engine.py:52, in BaseQueryEngine.query(self, str_or_query_bundle)
     50     if isinstance(str_or_query_bundle, str):
     51         str_or_query_bundle = QueryBundle(str_or_query_bundle)
---> 52     query_result = self._query(str_or_query_bundle)
     53 dispatcher.event(
     54     QueryEndEvent(query=str_or_query_bundle, response=query_result)
     55 )
...
    903         )
    904     else:
    905         return []

ValueError: Expected at least one tool call, but got 0 tool calls.
Is there a way to force no tool calls?
The LLM I'm using is gpt-4-turbo-preview.