LlamaIndex + Pandas

At a glance

The community member is experiencing an issue with LlamaIndex and Pandas, where the queries generated by the system are returning concatenated data from a Pandas DataFrame. The DataFrame appears to have a complex structure, with a MultiIndex, which may be causing the issue. The community member has shared the code for building the agent and engines, as well as the spending_df, and has asked for help in resolving the problem. Other community members have offered to take a look and provide assistance, but there is no explicitly marked answer in the comments.

aandreas

Running into a bit of an annoying issue with LlamaIndex + Pandas

Sometimes the query that GPT / LlamaIndex comes up with queries (e.g.

Plain Text

df.loc[(slice('07/2023','10/2023'), 'restaurants'), 'breakdown']

) that return data from Pandas like this

Plain Text

07/2023  restaurants    {'Evroulla': 179.0, 'Foukou tou Yiakoumi': 60....
08/2023  restaurants    {'Evroulla': 181.0, 'Foukou tou Yiakoumi': 58....
09/2023  restaurants    {'Evroulla': 176.0, 'Foukou tou Yiakoumi': 59....
10/2023  restaurants    {'Evroulla': 48.0, 'Foukou tou Yiakoumi': 15.0...

but the data it returns is concatenated as shown above.

This is what the whole df looks like

Plain Text

                                                             breakdown  total
month   category                                                             
01/2023 Total        {'transport': 307, 'groceries': 499, 'restaura...   7354
        groceries    {'Alphamega Hypermarket': 368.0, 'Athienitis':...    499
        home                   {'Rent': 3987.0, 'Electricity': 1824.0}   5811
        restaurants  {'Evroulla': 178.0, 'Foukou tou Yiakoumi': 59....    447
        shopping         {'ZARA': 191.0, 'Amazon': 52.0, 'Ebay': 47.0}    290
        transport    {'ESSO': 186.0, 'Petrolina': 90.0, 'Panikos Ca...    307
02/2023 Total        {'transport': 307, 'groceries': 522, 'restaura...   7290
        groceries    {'Alphamega Hypermarket': 405.0, 'Athienitis':...    522
        home                   {'Rent': 3891.0, 'Electricity': 1813.0}   5704
        restaurants  {'Evroulla': 181.0, 'Foukou tou Yiakoumi': 61....    444
        shopping         {'ZARA': 207.0, 'Amazon': 52.0, 'Ebay': 54.0}    313
        transport    {'ESSO': 185.0, 'Petrolina': 90.0, 'Panikos Ca...    307

what to do, lol?

9 comments

rravitheja

@andreas are you using PandasQueryEngine?

Can you share the code, data and query in the format you are trying to out?

aandreas

@ravitheja thanks for taking a look at this

System prompt (you can prob skip this)

Plain Text

SYSTEM_PROMPT = "You are AskGeorge, an expert personal financial helper with context about my checking and\
                saving account balances. The date today is 2023-10-08 (8th of October, 2023). If I ask you how\
                much money I have all together, add the latest entries from my current account to my savings account. You are almost always talking\
                about money or percentage changes, the currency is euros, so when you format numbers make \
                sure to add the currency sign and if it's a decimal number make it 2 decimal points. Use chain of thought reasoning to make sure you're answering what the user is actually asking. \
                Be conversational and informative in your responses. For example, when asked 'How much has my savings account \
                grown in the last 3 months', you will respond along the lines of 'Your savings account grew \
                X euros, from Y to Z, which is a A% increase. Well done!'"

building the agent + engines (yes, I'm using PandasQueryEngine) Important parts here are spending_engine and spending_df, that's what is having the issues rn

Plain Text

# build agent
llm = OpenAI(model="gpt-4", temperature=0)
service_context = ServiceContext.from_defaults(llm=llm)


# build balance engine

balance_df = getBalancePandas()

balance_engine = PandasQueryEngine(
    balance_df, 
    service_context=service_context,
    verbose=True,
)

# build spending engine

spending_df = getSpendingPandas()

spending_engine = PandasQueryEngine(
    spending_df,
    service_context=service_context,
    verbose=True
)

query_engine_tools = [
    QueryEngineTool(
        query_engine=balance_engine,
        metadata=ToolMetadata(
            name="balance_pandas_df_tool",
            description="Useful for translating a natural language query into a Pandas query over a dataframe containing the end of day balance of my checkings and savings account for a range of dates. It is extremely important that nothing is put after the Pandas query is outputted. df.columns: Index(['id', 'checkingAccountBalance', 'savingsAccountBalance'], dtype='object') and this is df.index: DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', ...., '2023-10-07', '2023-10-08'], dtype='datetime64[ns]', name='date', length=281, freq=None)"
        ),
    ),
    QueryEngineTool(
        query_engine=spending_engine,
        metadata=ToolMetadata(
            name="spending_pandas_df_tool",
            description="Useful for translating a natural language query into a Pandas query over a dataframe containing my monthly spending breakdown. It is extremely important that nothing is put after the Pandas query is outputted. The two columns are 'breakdown' and 'total'. df['breakdown']['01/2023']['transport'] will give you the breakdown of spending in the transport category for January 2023. df['total']['02/2023]['Total'] will give you the total spending for that month. The pandas dataframe is MultiIndex. If you're querying the Pandas df and extract data using loc, add a .apply(pd.Series) after your query to make things easier to analyze and so that data isn't concatenated"
        ),
    ),
]

Agent

Plain Text

agent = OpenAIAgent.from_tools(
        query_engine_tools, 
        llm=llm, 
        verbose=True,
        system_prompt=SYSTEM_PROMPT
)

def DataAgent(query_text):
    response = agent.chat(query_text)
    return response

aandreas

spending_engine is what is causing the trouble here

aandreas

I'm going to copy the spending_df rn

aandreas

spending_df is here. It's a bit complex (MultiIndex) which might be mucking things up

rravitheja

@andreas can you share data and code for getBalancePandas, getSpendingPandas functions. Can try out and see where it is going wrong.

aandreas

@ravitheja

aandreas

they just return two df's, nothing complicated. The problem is that the query that the agent generates outputs a concatenated version of the DF, because the DF is perhaps too big

aandreas

Thank you again for offering to take a look

Add a reply

Find answers from the community

LlamaIndex + Pandas