Analyze Document | 🦜🔗 Langchain

At a glance

The community member wants to parse longer articles from various sources like emails and PDFs, but the output gets truncated when the articles exceed the token limit. The articles contain noise like page information, headers, and footers, which are difficult to remove using tools like BeautifulSoup. The community member's current solution works for articles that fit within the token limit, but for longer texts, they do not want to summarize or pick chunks, but rather output the entire article without any changes. The community member is asking if there are options for a multi-prompt setup where each prompt handles a chunk of the article, and the outputs are then recombined.

In the comments, another community member asks if the problem would still remain if the noise could be properly removed from the articles.

Torben

Hi! I want to parse longer articles from multiple sources such as emails, PDF files, etc. These sometimes exceed the token limit, so the output gets truncated.

Details: The articles are surrounded by noise such as page information, headers, and footers. The noise is not easily removed with e.g. BeautifulSoup or similar tools, because every source is different.
Status: For articles that fit into the token limit it already works fine: only the resulting article, with an SEO-friendly headline, is produced. This is what should happen.
Issue: Longer texts... I do not want to summarize or pick only some chunks; I want the whole article without any words changed.

Question: Are there options for a multi-prompt setup where, e.g., each prompt takes some chunks and the outputs are then recombined? As far as I understand, https://python.langchain.com/docs/use_cases/question_answering/how_to/analyze_document is just summarizing, which is not what I want. Does anybody have a better idea on how to approach this?
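To make the multi-prompt idea concrete, here is a minimal sketch of the chunk-then-recombine pattern in plain Python. It is not a LangChain API: `split_into_chunks`, `process_document`, and `drop_page_markers` are hypothetical names, the character budget is an assumed stand-in for a real token count, and `drop_page_markers` is only a placeholder for the per-chunk model call (which would be prompted to return the article text verbatim with the noise removed).

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars becomes its own chunk.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for para in paragraphs:
        # +2 accounts for the "\n\n" separator used when rejoining.
        added = len(para) + (2 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [para], len(para)
        else:
            current.append(para)
            current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def process_document(
    text: str, clean_chunk: Callable[[str], str], max_chars: int = 2000
) -> str:
    """Clean each chunk independently, then rejoin the results in order."""
    chunks = split_into_chunks(text, max_chars)
    cleaned = [clean_chunk(chunk) for chunk in chunks]
    return "\n\n".join(c for c in cleaned if c.strip())


def drop_page_markers(chunk: str) -> str:
    """Placeholder for the model call (assumption): drops 'Page N' paragraphs."""
    kept = [
        p for p in chunk.split("\n\n")
        if not p.strip().lower().startswith("page ")
    ]
    return "\n\n".join(kept)


if __name__ == "__main__":
    sample = "Page 1\n\nFirst paragraph.\n\nPage 2\n\nSecond paragraph."
    print(process_document(sample, drop_page_markers, max_chars=30))
```

Because each chunk is cleaned on its own, noise that straddles a chunk boundary may survive, and the headline would need a separate final prompt that sees only the start of the cleaned article. Both points are design assumptions of this sketch, not something confirmed in the thread.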

1 comment

DS

Hey, so if you were able to remove all the noise properly, would the articles fit into the context window, or would the problem still remain?
