
For PDF documents SimpleDirectoryReader

For PDF documents, SimpleDirectoryReader creates one Document object per page.
Previously, you created one Document object from the full document.
Now I need to create a Document object based on the full document context.

How can I do that?
@Logan M
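For reference, a minimal sketch of the behavior being described, assuming the legacy llama_index import path; ./data is just a placeholder folder:

from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
# For PDFs this prints the total page count, not the number of files
print(len(documents))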
Hey, just wanted to make sure: you want to create one Document object per PDF?
How do I do that using SimpleDirectoryReader?
I have checked the docs_reader.py file in the llama_index/readers/file directory.
There, for DOCX documents, you create one Document object per DOCX file,

but for PDFs, you create Document objects per page.
@WhiteFang_Jr Please advise me on how to create one Document object per PDF using SimpleDirectoryReader.
But for PDFs, I think it will not work in cases where the PDF is too big, as most LLMs have a total token generation capacity of around 2048 tokens at a time.

So you will have to keep this in mind, or the LLM call will fail in this case.
If you still want to do that, you will have to do it manually on your side: read each page of the PDF, combine them, and then create a Document object at the end using the Document class.
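Something like this, as a rough sketch (assuming the legacy llama_index import path and pypdf; the file name is just a placeholder):

from pypdf import PdfReader
from llama_index import Document

# Read every page of the PDF and join the text into one string
pdf = PdfReader("my_file.pdf")  # placeholder path
full_text = "\n".join(page.extract_text() or "" for page in pdf.pages)

# One Document object for the whole PDF
document = Document(text=full_text)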
I need to provide answers of any kind from PDF, DOCX, etc.

But it is not working well for PDF files. I have used GPTVectorStoreIndex.
How do I get answers from the full PDF file then?
I think chunking them to the default size works! I work with PDFs as well; I chunk them to a 512-token size and it works for me.
The chunk setting goes in the service_context, right?
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    chunk_size=512,
)

index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
Before, I used 1024.
Would that be a problem?
I am querying PDFs of 100~1000 pages.
query_engine = index.as_query_engine(response_mode="tree_summarize", streaming=True, text_qa_template=QA_PROMPT)

I set up the query engine like this.
Do I need to add similarity_top_k and similarity_cutoff?
@WhiteFang_Jr
similarity_cutoff will remove the document chunks whose cosine-similarity score is lower than the value you set.

Whereas similarity_top_k picks the top K most similar document chunks from the total docs.

They help in making the response better.
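For example, a rough sketch with the legacy llama_index API (the values 5 and 0.7 are just illustrative):

from llama_index.indices.postprocessor import SimilarityPostprocessor

query_engine = index.as_query_engine(
    similarity_top_k=5,  # retrieve only the 5 most similar chunks
    node_postprocessors=[
        # drop retrieved chunks scoring below 0.7 cosine similarity
        SimilarityPostprocessor(similarity_cutoff=0.7)
    ],
)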
Yes, this is the default value and it works. Otherwise, if you want to change it, you can do that from the service context.
I have changed it to 512, but it is not working well.
Of course, I get an answer, but the answer is not correct and not professional.
I ask something like this: "What is the purpose of this document?"

I wanted the AI to think and provide an answer,
but it could not.
You will have to check the source nodes to see whether they are picking up relevant source text or not.

Also, the question you asked is basically a summary of the entire doc. You can explore DocumentSummaryIndex for this, as it saves a summary of each chunk.

Otherwise, if you want the bot to answer queries like this, maybe you can add the total summary of the doc yourself into the prompt for the document. Then it may be able to answer queries like these.
Not sure whether it will work or not, but you can try it.
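A minimal sketch of building one, assuming the legacy llama_index API (tree_summarize mirrors the response mode you already use):

from llama_index import DocumentSummaryIndex, get_response_synthesizer

# Summaries are generated at build time by the response synthesizer
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")
doc_summary_index = DocumentSummaryIndex.from_documents(
    documents,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
)
query_engine = doc_summary_index.as_query_engine()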
DocumentSummaryIndex

I have tried using DocumentSummaryIndex before, but if I ask questions like "How are you?" it does not work;
it was throwing an error.
It will not work for queries like this, as you are not interacting with the LLM directly.

Query -> finding chunks similar to "how are you" in the docs -> LLM call

That is why you won't get an answer to queries like that.
How do I make things work for everything,
"How are you?", "Summarize this document", etc., using DocumentSummaryIndex?
You can use a query engine router.
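Roughly like this, assuming the legacy llama_index router API (the tool descriptions are placeholders you should tune):

from llama_index.query_engine import RouterQueryEngine
from llama_index.selectors.llm_selectors import LLMSingleSelector
from llama_index.tools import QueryEngineTool

# One tool per engine; an LLM selector routes each query by these descriptions
summary_tool = QueryEngineTool.from_defaults(
    query_engine=doc_summary_index.as_query_engine(),
    description="Useful for summarization questions about the document.",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    description="Useful for specific factual questions from the document.",
)

router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[summary_tool, vector_tool],
)
response = router_engine.query("Summarize this document")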