Hi! I want to parse longer articles from multiple sources (email, PDF files, etc.), which sometimes exceed the token limit, so the output gets truncated.
Details: The articles are surrounded by noise such as page information, headers, footers, etc. The noise is not easily removed with e.g. BeautifulSoup or similar tools, because every source is different.
Status: For articles that fit into the token limit it already works fine: only the resulting article with an SEO-friendly headline is produced. This is what should happen.
Issue: Longer texts get truncated. I do not want to summarize or pick only some chunks; I want the whole article back, without a single word changed.
Question: Are there options for a multi-prompt setup where, e.g., each prompt takes some chunks and the outputs are then recombined? As far as I understand, https://python.langchain.com/docs/use_cases/question_answering/how_to/analyze_document just summarizes, which is not what I want. Does anybody have a better idea on how to approach this?
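To make the idea concrete, here is a minimal sketch of what I mean by "multi-prompt": split the noisy text into chunks that fit the context window, ask the model to extract only the verbatim article text from each chunk, and concatenate the results. `call_llm` is a hypothetical placeholder for whatever chat model is in use, and the chunk sizes and prompt wording are illustrative, not tested values:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

EXTRACT_PROMPT = (
    "Below is a fragment of a noisy document (headers, footers, page "
    "numbers mixed in). Return ONLY the article body text contained in "
    "it, verbatim, without changing any words. If the fragment contains "
    "no article text, return an empty string.\n\n{chunk}"
)

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whatever chat model is already set up."""
    raise NotImplementedError

def extract_article(raw_text: str) -> str:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=3000,    # keep each prompt well under the token limit
        chunk_overlap=200,  # overlap so sentences are not cut at chunk borders
    )
    cleaned_parts = []
    for chunk in splitter.split_text(raw_text):
        cleaned_parts.append(call_llm(EXTRACT_PROMPT.format(chunk=chunk)))
    # Recombine the per-chunk extractions; a final prompt could then
    # generate the SEO-friendly headline from the full cleaned article.
    return "\n".join(part.strip() for part in cleaned_parts if part.strip())
```

One open issue with this sketch: because of the chunk overlap, a sentence may appear twice at the seams, so either the overlap has to be deduplicated when recombining or set to zero. The headline prompt would run once at the end over the recombined text, since only the input, not the headline, is too long for one call.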