Pandas index

bfgmisc

So where does the index layer come in? Maybe it's a dumb question, but I just don't understand what an index is doing here. If the LLM can transform the query into a piece of code, then how does indexing help? I could just provide information about the data frame (like column names) in the prompt itself, right?

7 comments

Logan M

Yeah, that's basically all the pandas index does right now.

It reads the schema of the df, generates pandas code based on your query text, and then tries to execute that code and return the result.
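For anyone following along, the flow described above can be sketched roughly like this. It is not the llama_index implementation: `describe_schema`, `build_prompt`, `fake_llm`, and `run_query` are hypothetical names, the prompt wording is made up, and the LLM call is replaced by a stub that returns a canned expression.

```python
import pandas as pd


def describe_schema(df: pd.DataFrame) -> str:
    """Summarise column names and dtypes, one column per line."""
    return "\n".join(f"{name}: {dtype}" for name, dtype in df.dtypes.items())


def build_prompt(df: pd.DataFrame, query: str) -> str:
    """Combine the schema and the user's question into one prompt (wording is illustrative)."""
    return (
        "You are working with a pandas DataFrame named `df`.\n"
        f"Columns and dtypes:\n{describe_schema(df)}\n\n"
        f"Write a single pandas expression that answers: {query}\n"
    )


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; always returns the same expression."""
    return "df[df['asset_class'] == 'U.S. Equity']['ticker'].tolist()"


def run_query(df: pd.DataFrame, query: str, llm=fake_llm):
    """Ask the (stubbed) LLM for pandas code, then evaluate it against df."""
    code = llm(build_prompt(df, query))
    # Evaluating model-generated code is unsafe on untrusted input;
    # it is done here only to illustrate the flow.
    return eval(code, {"df": df, "pd": pd})


df = pd.DataFrame(
    {
        "ticker": ["AAPL", "MSFT", "BUND"],
        "asset_class": ["U.S. Equity", "U.S. Equity", "Bond"],
    }
)
result = run_query(df, "List the U.S. equity tickers")
```

Note that the prompt only carries column names and dtypes, so the model never sees the actual values stored in each column.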

Logan M

If you have any ideas to make this smarter, it's super open to PRs! 💪

Logan M

It might make more sense too if you read the prompts

https://github.com/jerryjliu/llama_index/blob/main/llama_index/prompts/default_prompts.py#L304

https://github.com/jerryjliu/llama_index/blob/main/llama_index/indices/struct_store/pandas_query.py#L20

L-Cocuy

@bfgmisc Not sure what you mean here (https://discord.com/channels/1059199217496772688/1059200010622873741/1104796706945646593).

Perhaps an example would help?

Indexing, in this case, does not really change very much, except that all (or almost all) of the llama_index APIs are then available to use with a pandas df. Sure, you could provide the LLM with all of that information manually, or by writing the code yourself, but thanks to llama_index you do not need to.

bfgmisc

@L-Cocuy: Thanks! This makes a lot of sense, and the pieces are coming together in my mind now. So when we create a PandasIndex, it's essentially adding information like a SQL schema (I know it's not the right jargon, but the idea is similar), and hence the LLM has the information required to understand the query.

My concern was how the LLM still seems to be guessing what the actual column values would be. For instance, if I ask it to "Select 20 random U.S. equities", it gets the first part right by inferring the column name asset_class, but not the second part: it should produce df[df['asset_class'] == 'U.S. Equity'] but it produces df[df['asset_class'] == 'U.S. equities'].

Now, I don't expect the model to know that it should be U.S. Equity and not U.S. equities. But couldn't we get accurate results if we passed that granular information about column values as well? Does this make sense?
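To make the idea concrete, here is a minimal sketch of what "passing column values" could look like: list the distinct values of low-cardinality columns so they can be appended to the prompt. The helper name `describe_column_values` and the `max_unique` cutoff are assumptions for illustration, not part of llama_index.

```python
import pandas as pd


def describe_column_values(df: pd.DataFrame, max_unique: int = 20) -> str:
    """List distinct values for columns with at most max_unique distinct values."""
    lines = []
    for name in df.columns:
        # drop_duplicates keeps first-occurrence order; tolist yields plain Python values
        values = df[name].dropna().drop_duplicates().tolist()
        if len(values) <= max_unique:
            shown = ", ".join(repr(v) for v in values)
            lines.append(f"{name}: {shown}")
    return "\n".join(lines)


df = pd.DataFrame(
    {
        "ticker": ["AAPL", "BUND", "MSFT"],
        "asset_class": ["U.S. Equity", "Bond", "U.S. Equity"],
    }
)

# With a cutoff of 2, only asset_class qualifies; ticker has 3 distinct values.
hint = describe_column_values(df, max_unique=2)
print(hint)
```

Appending text like this to the prompt would let the model see that the stored value is 'U.S. Equity'. The cutoff keeps high-cardinality columns such as tickers or IDs from flooding the prompt.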

L-Cocuy

@bfgmisc I understand what you mean. There is no real magic here. It seems confusing because llama_index is doing so much automatically under the hood.
In a nutshell, the LLM is being passed a pandas-specific prompt (see image). That way the LLM has some basic information to work with. However, the prompt might not be perfect, and you might want to adjust it to your use case. You can create your own instance of the PandasPrompt and pass it on to the query engine generator.
You can check the default prompt being used by typing query_engine._pandas_prompt.prompt

[Attachment: Screenshot_2023-05-09_at_11.11.59_AM.png]

L-Cocuy

Pro tip: there are some (mostly unwritten) conventions out there on how to name pandas columns. You would probably be best served by making sure your dataframe adheres to those conventions as well as you can. That will probably make the task easier for your LLM.
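As one possible reading of that tip, the sketch below renames columns to lowercase snake_case before handing the dataframe to the index. Treating snake_case as "the" convention is an assumption here, and `to_snake_case` is a hypothetical helper.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a column label such as 'marketCap (USD)' to 'market_cap_usd'."""
    # Replace every run of non-alphanumeric characters with one underscore.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Split camelCase boundaries: a lowercase letter or digit followed by an uppercase letter.
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    return name.strip("_").lower()


raw = pd.DataFrame(
    {
        "Ticker": ["AAPL"],
        "Asset Class": ["U.S. Equity"],
        "marketCap (USD)": [1.0],
    }
)

# DataFrame.rename accepts a function that is applied to each column label.
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```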
