I have a folder with multiple MS Excel spreadsheets containing offers from different providers. I want to extract the product information (EAN code, description, price, etc.) as JSON, and then insert it into an SQL table.
Each spreadsheet has a different structure (e.g. different columns) and may include some header rows. Moreover, the files are much larger than the GPT-4 context window.
At the moment I can successfully process small spreadsheets with the following code (see attachment). Unfortunately, for larger files it raises an error (the number of tokens exceeds the context window). Asyncio doesn't help...
How can I adapt the script to cope with larger files (and avoid the context-window limit)? In other words: how do I chunk the document and then assemble a complete JSON from all the chunks? Roughly what I have in mind is sketched below.
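This is a minimal sketch of the chunk-and-merge flow I'm imagining, assuming the sheet parses with pandas; the batch size is a placeholder and extract_products() stands in for my existing GPT-4 call:

```python
import pandas as pd

def extract_products(chunk_text: str) -> list[dict]:
    """Placeholder for the GPT-4 call that returns a list of product dicts."""
    raise NotImplementedError

def spreadsheet_to_json(path: str, rows_per_chunk: int = 50) -> list[dict]:
    df = pd.read_excel(path)
    products: list[dict] = []
    # Send the sheet in row batches small enough for the context window,
    # then concatenate the per-chunk results into one product list.
    for start in range(0, len(df), rows_per_chunk):
        chunk = df.iloc[start:start + rows_per_chunk]
        products.extend(extract_products(chunk.to_csv(index=False)))
    return products
```

Is this the right direction, or is there a better way to merge the per-chunk results?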
I have also tried LlamaIndex's PandasExcelReader (see second attachment): it reads a larger file, but doesn't return the correct answers for my products.
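For reference, this is roughly how I call the reader (a minimal sketch; the import path may differ between llama-index versions, and the file path is illustrative):

```python
from pathlib import Path
from llama_index.readers.file import PandasExcelReader

# Reads the workbook via pandas and returns Document objects with the
# rows flattened to text, which I then query as usual.
reader = PandasExcelReader(concat_rows=True)
documents = reader.load_data(Path("offers/provider_a.xlsx"))
print(documents[0].text[:500])
```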
Do you have any tips on data extraction from large, heterogeneous Excel spreadsheets? I would really appreciate your help 🙂
I solved it with a workaround: I use the LLM to identify the first row of real data (the one with the column names), and then call pd.read_excel(skiprows=first_line). Works well with GPT-4 and temperature=0 🙂
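A minimal sketch of that workaround, assuming the openai Python client (>= 1.0); the prompt, file path, and helper name are illustrative:

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def find_header_row(path: str, probe_rows: int = 20) -> int:
    """Ask the model which 0-based row index holds the column names."""
    # header=None keeps the raw rows, so the preview includes any junk
    # above the real header; the CSV index column gives each row its number.
    preview = pd.read_excel(path, header=None, nrows=probe_rows)
    sample = preview.to_csv(header=False)
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Below are the first rows of a spreadsheet, each prefixed "
                "with its 0-based index. Reply with ONLY the index of the "
                "row containing the column names (EAN, description, price, "
                "etc.):\n\n" + sample
            ),
        }],
    )
    return int(resp.choices[0].message.content.strip())

first_line = find_header_row("offers/provider_a.xlsx")
df = pd.read_excel("offers/provider_a.xlsx", skiprows=first_line)
```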