Hey, just wanted to make sure: you want to create one Document object per PDF?
How can I do that using SimpleDirectoryReader?
I checked the docs_reader.py file in the llama_index/readers/file directory.
There, for docx files you create one Document object per docx,
but for PDFs you create one Document object per page.
@WhiteFang_Jr Please advise me how to create one Document object per PDF using SimpleDirectoryReader
But for PDFs, I think that won't work when the PDF is too big, since most LLMs cap total token generation at around 2048 tokens per call.
So keep this in mind, otherwise the LLM call will fail.
If you still want to do it, you'll have to handle it manually on your side: read each page of the PDF, combine them, and create one Document object at the end using the Document class.
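That manual combine step might look something like this (a minimal sketch; the commented llama_index lines follow the legacy SimpleDirectoryReader/Document API, and "report.pdf" is a made-up filename, so verify the imports against your installed version):

```python
def combine_pages(page_texts):
    """Join per-page text into one string so the whole PDF
    becomes a single Document instead of one Document per page."""
    return "\n\n".join(page_texts)

# Sketch of the llama_index side (legacy API, verify against your version):
# from llama_index import SimpleDirectoryReader, Document
# page_docs = SimpleDirectoryReader(input_files=["report.pdf"]).load_data()
# pdf_doc = Document(text=combine_pages([d.text for d in page_docs]))
```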
I'm required to provide answers from PDFs, docx files, etc.
But it's not working well for PDF files. I have used GPTVectorStoreIndex.
How do I get answers from the full PDF file then?
I think chunking them works! I work with PDFs as well; I chunk them to a 512-token size and it works for me.
The chunk setting goes in the service_context, right?
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    chunk_size=512,
)
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
I am querying PDFs of 100 to 1000 pages.
query_engine = index.as_query_engine(response_mode="tree_summarize", streaming=True, text_qa_template=QA_PROMPT,)
I set query engine like this.
Do I need to add similarity_top_k and similarity_cutoff?
similarity_cutoff will remove the document chunks whose cosine-similarity score is lower than the value you set,
whereas similarity_top_k picks the top K most similar document chunks from the total docs.
They help make the response better.
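In plain Python, the two parameters behave roughly like this toy filter (the commented llama_index lines are a sketch against the legacy API; SimilarityPostprocessor's import path may differ in your version, so treat it as an assumption):

```python
def retrieve(scored_chunks, top_k=2, cutoff=0.7):
    """Toy model of retrieval filtering: drop chunks whose similarity
    score is below `cutoff`, then keep only the `top_k` best."""
    kept = [c for c in scored_chunks if c[1] >= cutoff]
    kept.sort(key=lambda c: c[1], reverse=True)
    return kept[:top_k]

# Roughly equivalent llama_index setup (legacy API, verify before use):
# from llama_index.indices.postprocessor import SimilarityPostprocessor
# query_engine = index.as_query_engine(
#     similarity_top_k=2,
#     node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
# )
```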
Yes, this is the default value and it works; if you want to change it, you can do that from the service context.
I changed it to 512, but it's not working well.
Of course I get an answer, but the answer isn't correct or professional.
I ask things like: "What is the purpose of this document?"
I wanted the AI to think and provide an answer.
You will have to check the source nodes to see whether they pick up the relevant source text or not.
Also, the question you asked is basically a summary of the entire doc. You can explore DocumentSummaryIndex for this, as it saves a summary of each chunk.
Otherwise, if you want the bot to answer queries like this, maybe you can add a total summary of the doc yourself into the prompt for the document. Then it may be able to answer queries like these.
Not sure whether it will work or not, but you can try it.
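One way to try the "add the summary yourself" idea is to prepend a hand-written summary to the text you index (a sketch: prepend_summary is a made-up helper, and the commented DocumentSummaryIndex lines use the legacy llama_index API, so check them against your version):

```python
def prepend_summary(summary, body):
    """Prepend a hand-written whole-document summary so broad queries
    like 'What is the purpose of this document?' can match it."""
    return "DOCUMENT SUMMARY: " + summary + "\n\n" + body

# Sketch with DocumentSummaryIndex (legacy llama_index API, verify first):
# from llama_index import DocumentSummaryIndex
# summary_index = DocumentSummaryIndex.from_documents(documents)
# print(summary_index.as_query_engine().query("What is the purpose of this document?"))
```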
I have tried using DocumentSummaryIndex, but if I ask questions like "How are you?" it doesn't work.
It won't work for queries like that because you are not interacting with the LLM directly.
The flow is: query -> find chunks similar to "how are you" in the docs -> LLM call.
That is why you won't get an answer to queries like that.
How do I make things work for everything?
"How are you?", "Summarize this document", etc., using DocumentSummaryIndex.
You can use a query engine router.
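The routing idea, stripped to its core, is just "pick an engine based on the query". A toy selector below; the commented lines sketch llama_index's RouterQueryEngine, whose exact import paths vary by version, so treat them as an assumption:

```python
def route(query):
    """Toy selector: decide which engine should handle the query,
    mimicking what a RouterQueryEngine selector does with an LLM."""
    q = query.lower()
    if "summar" in q:                           # "summarize", "summary", ...
        return "summary_engine"                 # e.g. DocumentSummaryIndex
    if q.startswith(("how are you", "hello")):  # small talk
        return "chat_llm"                       # talk to the LLM directly
    return "vector_engine"                      # normal retrieval + QA

# In llama_index this maps onto RouterQueryEngine + QueryEngineTool
# (legacy API, verify imports against your version):
# from llama_index.query_engine import RouterQueryEngine
# from llama_index.tools import QueryEngineTool
```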