Hello, I have a problem. I have a 15-page PDF file. If I ask "what is article 111 about?", it tells me that this information does not exist in the PDF. But if I write out the content of article 111, it does detect it. Do I have to change the way the nodes are generated?
I'm assuming you are using a vector index? Embeddings don't do a great job of capturing exact words (they capture the "general ideas" of text).
You can use keywords to help with this, though:
Python
index.query(
    "What did the author do after Y Combinator?",
    similarity_top_k=3,
    required_keywords=["Combinator"],
    exclude_keywords=["Italy"],
    response_mode="compact"
)
@Manu Lorenzo I highly recommend that you move towards composable indices: break each of those articles into its own node and use an LLM to generate a summary for it. Each of those high-level article nodes should then be connected to a series of smaller parsed nodes, say one per sentence. That way you can search more accurately: first find the right article(s), then retrieve the sentence-level nodes that can help answer the query.
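Roughly something like this, as an illustration rather than the exact recipe — it assumes a 0.5-era llama_index where GPTSimpleVectorIndex.from_documents, Document, and index.query() are available, and it uses hypothetical split_into_articles() / summarize() helpers you'd write yourself (the composable-graph API is the more native way to wire this up):
Python
# Manual two-step retrieval over per-article indices (a stand-in for composable indices).
# split_into_articles() and summarize() are hypothetical helpers, not llama_index functions.
from llama_index import GPTSimpleVectorIndex, Document

articles = split_into_articles(pdf_text)  # e.g. {"111": "Article 111. ..."}

article_indices = {}
summary_docs = []
for number, text in articles.items():
    # One small index per article, built from that article's text only.
    article_indices[number] = GPTSimpleVectorIndex.from_documents([Document(text)])
    # A short LLM-generated summary that names the article number explicitly.
    summary_docs.append(Document(f"Article {number}: {summarize(text)}"))

# Top-level index over the summaries: used to route the query to the right article.
summary_index = GPTSimpleVectorIndex.from_documents(summary_docs)

which = summary_index.query(
    "Which article number deals with the question below? Answer with just the number.\n"
    "Question: What is article 111 about?",
    similarity_top_k=2,
)
article_number = str(which).strip()

# Then answer the question from the matching article's own nodes.
response = article_indices[article_number].query("What is article 111 about?")
print(response)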
Hello again Logan. How are you? The problem I have is that I'm trying to build an app to chat with PDF files, but with certain files, when I ask it something, it answers that the information does not appear in the document. Like, for example, the "article 111" case we talked about the other day. I have printed the chunks that are generated for indexing and there is a lot of text that does not appear in them; I don't know if that is the problem. The thing is, there is a website, chatPDF.com, which works with the same technology, and its recognition is perfect. This problem has been driving me crazy for a week. What could be the cause? The chunk size? max_chunk_overlap? I don't know if the problem comes from llama_index. Have other people reported this problem?
I suspect other products might be using extractive techniques (i.e. identifying the start/end positions in the text to answer queries), rather than trying to synthesize new sentences/explanations as answers to queries
But regardless, the main solution here (in addition to the other info in this thread) is prompt engineering, I think. You can check out the bottom of the FAQ for some helpful links that show how to customize prompts in llama_index.
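For example, something along these lines for the question-answer prompt — a minimal sketch assuming a 0.5-era llama_index that exposes QuestionAnswerPrompt and the text_qa_template argument on index.query(); the template wording is just an illustration, not the library default:
Python
# Custom QA prompt: nudge the model to quote the nearest passage instead of
# claiming the information is missing. `index` is the vector index built earlier.
from llama_index import QuestionAnswerPrompt

QA_TMPL = (
    "Context information from the PDF is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only this context, answer the question. If the answer is not in the "
    "context, quote the closest matching passage instead of saying the "
    "information does not exist.\n"
    "Question: {query_str}\n"
)

response = index.query(
    "What is article 111 about?",
    text_qa_template=QuestionAnswerPrompt(QA_TMPL),
)
print(response)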