
For PDF documents SimpleDirectoryReader

For PDF documents, SimpleDirectoryReader creates one Document object per page.
Previously, you created one Document object from the full document.
Now I need to create a Document object based on the full document context.

How can I do that?
@Logan M
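For reference, a minimal sketch of the behavior being described, assuming the legacy llama_index import path; ./data is just a placeholder folder:

from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
# For PDFs this prints the total page count, not the number of files
print(len(documents))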
Hey, just wanted to make sure: you want to create one Document object per PDF?
How do I do that using SimpleDirectoryReader?
I have checked the docs_reader.py file in the llama_index/readers/file directory.
There, for DOCX documents, you create one Document object per DOCX file,

but for PDFs, you create Document objects per page.
@WhiteFang_Jr Please advise me on how to create one Document object per PDF using SimpleDirectoryReader.
But for PDFs, I think it will not work in cases where the PDF is too big, as most LLMs have a total token generation capacity of around 2048 tokens at a time.

So you will have to keep this in mind, or the LLM call will fail in this case.
If you still want to do that, you will have to do it manually on your side: read each page of the PDF, combine them, and then create a Document object at the end using the Document class.
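Something like this, as a rough sketch (assuming the legacy llama_index import path and pypdf; the file name is just a placeholder):

from pypdf import PdfReader
from llama_index import Document

# Read every page of the PDF and join the text into one string
pdf = PdfReader("my_file.pdf")  # placeholder path
full_text = "\n".join(page.extract_text() or "" for page in pdf.pages)

# One Document object for the whole PDF
document = Document(text=full_text)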
I need to provide answers of any kind from PDF, DOCX, etc.

But it is not working well for PDF files. I have used GPTVectorStoreIndex.
How do I get answers from the full PDF file then?
I think chunking them to the default size works! I work with PDFs as well; I chunk them to a 512-token size and it works for me.
The chunk setting goes in the service_context, right?
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    chunk_size=512,
)

index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
Before, I used 1024.
Would that be a problem?
I am querying PDFs of 100~1000 pages.
query_engine = index.as_query_engine(response_mode="tree_summarize", streaming=True, text_qa_template=QA_PROMPT)

I set up the query engine like this.
Do I need to add similarity_top_k and similarity_cutoff?
@WhiteFang_Jr
similarity_cutoff will remove the document chunks whose cosine-similarity score is lower than the value you set.

Whereas similarity_top_k picks the top K most similar document chunks from the total docs.

They help in making the response better.
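For example, a rough sketch with the legacy llama_index API (the values 5 and 0.7 are just illustrative):

from llama_index.indices.postprocessor import SimilarityPostprocessor

query_engine = index.as_query_engine(
    similarity_top_k=5,  # retrieve only the 5 most similar chunks
    node_postprocessors=[
        # drop retrieved chunks scoring below 0.7 cosine similarity
        SimilarityPostprocessor(similarity_cutoff=0.7)
    ],
)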
Yes, this is the default value and it works. Otherwise, if you want to change it, you can do that from the service context.
I have changed it to 512, but it is not working well.
Of course, I get an answer, but the answer is not correct and not professional.
I ask something like this: "What is the purpose of this document?"

I wanted the AI to think and provide an answer,
but it could not.
You will have to check the source nodes to see whether they are picking up relevant source text or not.

Also, the question you asked is basically a summary of the entire doc. You can explore DocumentSummaryIndex for this, as it saves a summary of each chunk.

Otherwise, if you want the bot to answer queries like this, maybe you can add the total summary of the doc yourself into the prompt for the document. Then it may be able to answer queries like these.
Not sure whether it will work or not, but you can try it.
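A minimal sketch of building one, assuming the legacy llama_index API (tree_summarize mirrors the response mode you already use):

from llama_index import DocumentSummaryIndex, get_response_synthesizer

# Summaries are generated at build time by the response synthesizer
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")
doc_summary_index = DocumentSummaryIndex.from_documents(
    documents,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
)
query_engine = doc_summary_index.as_query_engine()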
DocumentSummaryIndex

I have tried using DocumentSummaryIndex before, but if I ask questions like "How are you?" it does not work;
it was throwing an error.
It will not work for queries like this, as you are not interacting with the LLM directly.

Query -> finding chunks similar to "how are you" in the docs -> LLM call

That is why you won't get an answer to queries like that.
How do I make things work for everything,
"How are you?", "Summarize this document", etc., using DocumentSummaryIndex?
You can use a query engine router.
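Roughly like this, assuming the legacy llama_index router API (the tool descriptions are placeholders you should tune):

from llama_index.query_engine import RouterQueryEngine
from llama_index.selectors.llm_selectors import LLMSingleSelector
from llama_index.tools import QueryEngineTool

# One tool per engine; an LLM selector routes each query by these descriptions
summary_tool = QueryEngineTool.from_defaults(
    query_engine=doc_summary_index.as_query_engine(),
    description="Useful for summarization questions about the document.",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    description="Useful for specific factual questions from the document.",
)

router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[summary_tool, vector_tool],
)
response = router_engine.query("Summarize this document")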