----------------------
LlamaIndex provides a few options for extracting text from PDF files, including PDFReader and PyMuPDFReader. Which one is faster depends on the specific characteristics of your PDF files and your system configuration.
Here's an example of how you can use PDFReader to extract text from a PDF:
from pathlib import Path

from llama_index.core import Document
from llama_index.readers.file import PDFReader

# Load the PDF; the reader returns a list of Document objects
loader = PDFReader()
docs0 = loader.load_data(file=Path("./data/llama2.pdf"))

# Merge the extracted text into a single Document
doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
(Source: LlamaIndex Examples)
And here's how you can use PyMuPDFReader:
from pathlib import Path

from llama_index.core import Document
from llama_index.readers.file import PyMuPDFReader

# Load the PDF; the reader returns a list of Document objects
loader = PyMuPDFReader()
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))

# Merge the extracted text into a single Document
doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
(Source: LlamaIndex Examples)
You can try both and see which one works best for your use case.
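If you want to compare them on your own files, a rough timing sketch along these lines may help. The helper names time_call and compare_readers are hypothetical (they are not part of LlamaIndex), and the sketch assumes both readers and their PDF backends are installed. It only measures wall-clock time, so treat the numbers as indicative rather than a rigorous benchmark.

```python
import time
from pathlib import Path


def time_call(fn, repeats=3):
    """Return the best wall-clock time (seconds) over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def compare_readers(pdf_path, repeats=3):
    """Time both readers on one PDF (hypothetical helper, not a LlamaIndex API)."""
    # Imported here so time_call stays usable without llama_index installed.
    from llama_index.readers.file import PDFReader, PyMuPDFReader

    pdf_path = Path(pdf_path)
    pdf_reader = PDFReader()
    pymupdf_reader = PyMuPDFReader()
    return {
        "PDFReader": time_call(
            lambda: pdf_reader.load_data(file=pdf_path), repeats
        ),
        "PyMuPDFReader": time_call(
            lambda: pymupdf_reader.load(file_path=pdf_path), repeats
        ),
    }


# Example usage, with the same sample path as above:
# for name, seconds in compare_readers("./data/llama2.pdf").items():
#     print(f"{name}: {seconds:.3f} s")
```

Taking the best of several runs reduces noise from disk caching and other processes. It is also worth checking the extracted text itself, since the faster reader is not necessarily the one that gives the cleanest output for your PDFs.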
----------------------