Hello Mates

At a glance

The community member is asking about PDF vectorization and how to handle multiple PDF files in a folder. They want to know if it's possible to directly vectorize PDFs instead of extracting the text first, and how to update the vector store when adding new PDF files to the folder.

In the comments, another community member responds that it's not possible to create embeddings without having the text, and suggests using the filename_as_id option in the simple directory reader to refresh the index when adding new documents. However, they note that this feature currently doesn't work with vector database integrations, only the default vector database.

The original community member thanks the other for the information.

Ash_

Hello Mates,
Can you tell me something about PDF vectorization?

Is it possible to vectorize a PDF directly, instead of pulling out the text first and then doing the vectorization?

2nd: If I have a folder /data that contains a PDF that needs to be vectorized, and I then add a 2nd file to the same folder, do I need to repeat the process for both files, or can the 2nd file be added to the current vector store that is saved locally on disk?

Or is it possible to create multiple vector stores and then use them all together?

Looking for some knowledge, thanks

3 comments

Ash_

@Logan M will you please give your quick thoughts on this?

Logan M

It's not possible to create the embeddings without having the text

If you use the filename_as_id option in SimpleDirectoryReader, you can call index.refresh_ref_docs(documents) to refresh the index.

Basically, it uses the doc ID as a static identifier to check whether the document is already inserted or needs to be updated in the index.

(Note: currently this doesn't work with vector DB integrations, just the default vector DB.)
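
To make the doc-ID check described above concrete, here is a minimal standard-library sketch. It is not the library's implementation: `ToyIndex`, its `refresh` method, and the hash comparison are hypothetical stand-ins, assuming a refresh works by comparing stored content against incoming content under a stable ID such as the file name.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyIndex:
    """Hypothetical stand-in for an index: one content hash per doc ID."""

    doc_hashes: Dict[str, str] = field(default_factory=dict)

    def refresh(self, documents: Dict[str, str]) -> List[bool]:
        """Insert new docs, update changed ones, skip unchanged ones.

        `documents` maps a stable doc ID (here: the file name) to its
        extracted text. Returns one flag per document, True if that
        document was inserted or updated.
        """
        changed = []
        for doc_id, text in documents.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self.doc_hashes.get(doc_id) == digest:
                changed.append(False)  # same ID, same content: skip
            else:
                # A real index would embed (or re-embed) the text here.
                self.doc_hashes[doc_id] = digest
                changed.append(True)
        return changed


index = ToyIndex()

# First run: the folder holds a single file.
first_run = index.refresh({"a.pdf": "text of the first file"})

# Second run: a new file was added; the first one is unchanged.
second_run = index.refresh({
    "a.pdf": "text of the first file",
    "b.pdf": "text of the second file",
})
```

In this sketch only `b.pdf` is processed on the second run, which is why a stable ID per file matters: without it, every run would look like a fresh set of documents.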

Ash_

Great, thanks for the information.
