I didn't state this clearly in this thread; I think I explained it better in a newer message I posted, but I can't find it now. Doesn't matter anyway.
What I mean is: I'm extracting data from webpages. The workflow is (rough sketch below):
- Firecrawl scrapes the page into markdown
- feed that into an LLM; the input can already be tens of KB, say 50 KB
- extract the data; the output can be a bit smaller, say 30 KB
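For concreteness, here's a minimal sketch of that pipeline (assuming firecrawl-py and the OpenAI Python client; the model name, prompt, and URL are placeholders, and the exact `scrape_url` signature varies between firecrawl-py versions):

```python
# Sketch only: scrape a page to markdown, then ask an LLM to extract data.
# Assumes FIRECRAWL_API_KEY and OPENAI_API_KEY are set in the environment.
from firecrawl import FirecrawlApp
from openai import OpenAI

firecrawl = FirecrawlApp()
llm = OpenAI()

# Step 1: scrape the page into markdown (input can be ~50 KB).
page = firecrawl.scrape_url("https://example.com/some-page", formats=["markdown"])

# Step 2: one-shot extraction -- this is where the output-token cap bites.
response = llm.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any model with a big context window
    messages=[
        {"role": "system", "content": "Extract the structured data from this page as JSON."},
        {"role": "user", "content": page.markdown},
    ],
)
extracted = response.choices[0].message.content
print(extracted)
```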
So the context window isn't a problem for literally any LLM, even the latest Llama 3.2 3B, but the max output tokens is. Gemini and Claude 3.5 Sonnet are already among the largest, and they still cap output at 8192 tokens. In a chat client like ChatGPT, you have to reply "continue" again and again to get the full output.
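Via the API, the equivalent of that manual "continue" dance is a loop that re-prompts while the model stops on the token limit. A rough sketch with Anthropic's Python SDK (model id is a placeholder; the naive string-join stitching would need work, since the output can get cut mid-JSON):

```python
# Sketch: keep asking the model to continue until it stops on its own,
# instead of hitting the 8192 max-output-token ceiling once and giving up.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_with_continuation(markdown: str, max_rounds: int = 10) -> str:
    messages = [
        {"role": "user", "content": f"Extract all data from this page:\n\n{markdown}"}
    ]
    parts = []
    for _ in range(max_rounds):
        resp = client.messages.create(
            model="claude-3-5-sonnet-latest",  # placeholder model id
            max_tokens=8192,
            messages=messages,
        )
        chunk = resp.content[0].text
        parts.append(chunk)
        if resp.stop_reason != "max_tokens":  # model finished on its own
            break
        # Model was cut off: feed its partial answer back and ask it to continue.
        messages.append({"role": "assistant", "content": chunk})
        messages.append({"role": "user", "content": "continue"})
    return "".join(parts)
```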
What about in LlamaIndex or any other workflow-management framework? How do I handle large output there?
I'll do embedding/reranking later, but I have to extract the data first.