
I have a folder with multiple MS Excel spreadsheets containing offers from different providers. I want to extract the product information (EAN code, description, price, etc.) as JSON and then load it into a SQL table.

Each spreadsheet has a different structure (e.g. different columns) and may include header rows. Moreover, the files are much larger than the GPT-4 context window.

At the moment I successfully process small spreadsheets with the following code (see attachment). Unfortunately, for larger files it raises an error (the number of tokens exceeds the context window). Asyncio doesn't help...

How can I adapt the script to cope with larger files and avoid the context-window limit? In other words: how do I chunk the document and then build a complete JSON from all the chunks?
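Since every row is an independent product record, one option is to split the DataFrame into fixed-size row chunks, extract each chunk separately, and concatenate the resulting JSON arrays. Here is a minimal sketch of that idea (the `extract_chunk` helper, the prompt wording, and the 50-row chunk size are all assumptions, and it uses the OpenAI v1 client rather than whatever is in the attachment):

```python
import json

import pandas as pd
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Extract the products from this spreadsheet fragment as a JSON array "
    "of objects with keys: ean, description, price. Return only JSON."
)

def extract_chunk(rows_as_text: str) -> list[dict]:
    """Send one chunk of rows to the model and parse the JSON reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": rows_as_text},
        ],
    )
    # Assumes the model returns bare JSON; a stray code fence would
    # need to be stripped before parsing.
    return json.loads(response.choices[0].message.content)

def extract_file(path: str, chunk_size: int = 50) -> list[dict]:
    """Split the sheet into row chunks small enough for the context
    window, extract each chunk, and concatenate the results."""
    df = pd.read_excel(path)
    products = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # to_csv on a slice still emits the column header, so every
        # chunk carries the column names with it.
        products.extend(extract_chunk(chunk.to_csv(index=False)))
    return products
```

Because each row is a self-contained product, concatenating the per-chunk arrays yields the complete JSON; if records could span chunk boundaries you would need overlapping chunks instead.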

I have also tried LlamaIndex's PandasExcelReader (see second attachment): it reads a larger file, but doesn't find the correct answers for my products.

Do you have any tips on extracting data from varied, large Excel spreadsheets? I would really appreciate your help πŸ™‚

Andy
Attachments: image.png, image.png
2 comments
My tip would be to figure out some kind of pipeline for transforming this all into SQL before trying to query it

I feel like there must be some manual heuristic way to transform these JSON files into a common SQL schema πŸ€” — something like the sketch below
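A rough sketch of what I mean, assuming the per-file JSON has already been extracted into an `extracted/` folder; the key aliases and the `products` table schema are invented for illustration:

```python
import json
import sqlite3
from pathlib import Path

# Hypothetical aliases mapping each provider's field names
# onto one common schema.
KEY_MAP = {
    "ean": ["ean", "EAN code", "barcode"],
    "description": ["description", "product name", "item"],
    "price": ["price", "unit price", "net price"],
}

def normalize(record: dict) -> dict:
    """Map one provider-specific record onto the common schema."""
    out = {}
    for field, aliases in KEY_MAP.items():
        for alias in aliases:
            if alias in record:
                out[field] = record[alias]
                break
    return out

conn = sqlite3.connect("offers.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS products (ean TEXT, description TEXT, price REAL)"
)
for json_file in Path("extracted").glob("*.json"):
    for record in json.loads(json_file.read_text()):
        row = normalize(record)
        conn.execute(
            "INSERT INTO products VALUES (:ean, :description, :price)",
            {
                "ean": row.get("ean"),
                "description": row.get("description"),
                "price": row.get("price"),
            },
        )
conn.commit()
```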
I did it with a workaround: I used the LLM to identify the first line of real data (the one with the column names), and then pd.read_excel(skiprows=first_line). Works well with GPT-4 and temperature=0 πŸ™‚
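For reference, a minimal sketch of that workaround (the file name, probe size, and prompt are assumptions, and it uses the OpenAI v1 client):

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()

def find_header_row(path: str, probe_rows: int = 20) -> int:
    """Ask the model which row holds the column names, looking only at
    the first few rows so the prompt stays tiny."""
    preview = pd.read_excel(path, header=None, nrows=probe_rows)
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": "Below are the first rows of a spreadsheet. "
                       "Reply with only the 0-based index of the row that "
                       "contains the column names.\n\n"
                       + preview.to_csv(index=False, header=False),
        }],
    )
    return int(response.choices[0].message.content.strip())

first_line = find_header_row("offer.xlsx")
# Skipping everything above the header row lets pandas pick up
# the real column names automatically.
df = pd.read_excel("offer.xlsx", skiprows=first_line)
```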