Hello, I have a problem. I have a 15-page PDF file. If I ask "what is article 111 about?", it tells me that this information does not exist in the PDF. But if I write out the content of article 111, it does detect it. Do I have to change the way the nodes are generated?
I'm assuming you are using a vector index? Embeddings don't do a great job of capturing exact words (they capture the "general ideas" of text).
You can use keywords to help with this, though:
Python
index.query(
    "What did the author do after Y Combinator?",
    similarity_top_k=3,
    required_keywords=["Combinator"],
    exclude_keywords=["Italy"],
    response_mode="compact"
)
@Manu Lorenzo I highly recommend that you move towards composable indices: break each of those articles into its own node and use an LLM to generate a summary for it. Each of those high-level article nodes should then be connected to a series of smaller parsed nodes, say one per sentence. That way you can search more accurately: first find the right article(s), then retrieve the sentence-level nodes that can help answer the query.
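Roughly something like this, as an illustration rather than the exact recipe — it assumes a 0.5-era llama_index where GPTSimpleVectorIndex.from_documents, Document, and index.query() are available, and it uses hypothetical split_into_articles() / summarize() helpers you'd write yourself (the composable-graph API is the more native way to wire this up):
Python
# Manual two-step retrieval over per-article indices (a stand-in for composable indices).
# split_into_articles() and summarize() are hypothetical helpers, not llama_index functions.
from llama_index import GPTSimpleVectorIndex, Document

articles = split_into_articles(pdf_text)  # e.g. {"111": "Article 111. ..."}

article_indices = {}
summary_docs = []
for number, text in articles.items():
    # One small index per article, built from that article's text only.
    article_indices[number] = GPTSimpleVectorIndex.from_documents([Document(text)])
    # A short LLM-generated summary that names the article number explicitly.
    summary_docs.append(Document(f"Article {number}: {summarize(text)}"))

# Top-level index over the summaries: used to route the query to the right article.
summary_index = GPTSimpleVectorIndex.from_documents(summary_docs)

which = summary_index.query(
    "Which article number deals with the question below? Answer with just the number.\n"
    "Question: What is article 111 about?",
    similarity_top_k=2,
)
article_number = str(which).strip()

# Then answer the question from the matching article's own nodes.
response = article_indices[article_number].query("What is article 111 about?")
print(response)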
Hello again Logan. How are you? The problem I have is that I'm trying to build an app to chat with PDF files, but with certain files, when I ask it something, it answers that the information does not appear in the document. Like, for example, the "article 111" case we talked about the other day. I have printed the chunks that are generated for indexing and there is a lot of text that does not appear in them; I don't know if that is the problem. The thing is, there is a website, chatPDF.com, which works with the same technology, and its recognition is perfect. This problem has been driving me crazy for a week. What could be the cause? The chunk size? max_chunk_overlap? I don't know if the problem comes from llama_index. Have other people reported this problem?
I suspect other products might be using extractive techniques (i.e. identifying the start/end positions in the text to answer queries), rather than trying to synthesize new sentences/explanations as answers to queries
But regardless, the main solution here (in addition to the other info in this thread) is prompt engineering, I think. You can check out the bottom of the FAQ for some helpful links that show how to customize prompts in llama_index.
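For example, something along these lines for the question-answer prompt — a minimal sketch assuming a 0.5-era llama_index that exposes QuestionAnswerPrompt and the text_qa_template argument on index.query(); the template wording is just an illustration, not the library default:
Python
# Custom QA prompt: nudge the model to quote the nearest passage instead of
# claiming the information is missing. `index` is the vector index built earlier.
from llama_index import QuestionAnswerPrompt

QA_TMPL = (
    "Context information from the PDF is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only this context, answer the question. If the answer is not in the "
    "context, quote the closest matching passage instead of saying the "
    "information does not exist.\n"
    "Question: {query_str}\n"
)

response = index.query(
    "What is article 111 about?",
    text_qa_template=QuestionAnswerPrompt(QA_TMPL),
)
print(response)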