Updated 3 months ago

Analyze Document | πŸ¦œπŸ”— Langchain

Hi! I want to parse longer articles from multiple sources, like email, PDF files, etc., which sometimes exceed the token limit, so the output gets truncated.

Details: The articles are surrounded by noise, like page information, headers, footers, etc. The noise is not easily removed with e.g. BeautifulSoup or similar tools, because the sources are all different.
Status: For articles that fit into the token limit it already works fine: only the resulting article with an SEO-friendly headline is produced. This is what should happen.
Issue: Longer texts get truncated. I do not want to summarize or pick only some chunks; I want the whole article without any words changed.

Question: Are there options for a multi-prompt setup where e.g. each prompt takes some chunks and the outputs are then recombined? As far as I understand, https://python.langchain.com/docs/use_cases/question_answering/how_to/analyze_document is just summarizing, which is not what I want. Does somebody have a better idea on how to approach this?
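A minimal sketch of the map-then-recombine idea described in the question: split the noisy text into chunks, run each chunk through a prompt that returns only the article text verbatim (dropping headers/footers), then rejoin the cleaned chunks in order. Here `clean_chunk` is a hypothetical stand-in for the real LLM call (in LangChain you would invoke a chat model with an extraction prompt per chunk, and likely use `RecursiveCharacterTextSplitter` instead of the naive splitter below); the noise-detection logic is purely illustrative.

```python
def split_into_chunks(text: str, lines_per_chunk: int = 3) -> list[str]:
    """Naive line-based splitter. In practice LangChain's
    RecursiveCharacterTextSplitter (with chunk_overlap) would split on
    paragraph boundaries so no sentence is cut in half."""
    lines = text.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]

def clean_chunk(chunk: str) -> str:
    """Hypothetical stand-in for the per-chunk LLM call, whose prompt
    would be along the lines of: 'Return only the article text from this
    fragment, word for word; drop page numbers, headers, and footers.'
    Here we just filter lines tagged as noise for demonstration."""
    kept = [l for l in chunk.splitlines() if not l.startswith("[noise]")]
    return "\n".join(kept)

def extract_article(text: str) -> str:
    # Map every chunk through the cleaning prompt, then concatenate the
    # outputs in the original order -- no summarization, no reordering.
    cleaned = (clean_chunk(c) for c in split_into_chunks(text))
    return "\n".join(part for part in cleaned if part)
```

Because each chunk is cleaned independently and the outputs are simply concatenated, the article text itself is never rephrased; the main design concern is choosing chunk boundaries (and some overlap) so that noise spanning a boundary is still recognizable to the model.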
1 comment
Hey, so if you were able to remove all the noise properly, would the articles fit into the context window, or would the problem still remain?