I want to do pre-processing for uploaded pdf files. I u...

At a glance

The community member is looking for a way to preprocess uploaded PDF files: convert them to markdown and remove headers and footers, all in a multithreaded setting. The community members suggest the PreprocessReader from LlamaIndex, which handles various document types including PDFs, but whose thread safety is not documented. They also mention the PDFMarkerReader from LlamaIndex, which the community member reports is not thread safe either. Finally, they suggest the UnstructuredReader from LlamaIndex, which can handle PDFs, but its thread safety is also not explicitly mentioned.

ihshan :)

@kapa.ai I want to pre-process uploaded PDF files. I used PyMuPDF4LLM, but it is not thread safe. I am using that library because it removes headers and footers and turns the documents into markdown, on which I can then do some pre-processing. Do you have any suggestions for turning uploaded PDF files into markdown so that I can remove headers and footers while using multithreading?

kapa.ai

Please stand by @ihshan :) as I search the knowledge base. I normally take about 20 seconds or so. :wave:

kapa.ai

----------------------
Based on the information provided in the knowledge sources, you might want to consider using the PreprocessReader from LlamaIndex. This reader uses the Preprocess API service, which is capable of splitting any kind of document into optimal chunks of text for use in language model tasks. It takes into account sections, paragraphs, lists, images, data tables, text tables, and slides, and follows the content semantics for long texts. It supports PDFs, Microsoft Office documents, OpenOffice documents, HTML content, and plain text.

Here is how you can install the Python Preprocess library if it is not already present:

# Install Preprocess Python SDK package
# $ pip install pypreprocess


However, it's important to note that the PreprocessReader requires an API key for initialization and it's not explicitly mentioned whether it's multithreading safe or not.

# Excerpt from the reader's source; the rest of __init__ is not shown.
from llama_index.core.readers.base import BaseReader


class PreprocessReader(BaseReader):
    def __init__(self, api_key: str, *args, **kwargs):
        # An API key is mandatory: the reader talks to the hosted Preprocess service.
        if api_key is None or api_key == "":
            raise ValueError("Please provide an api key to be used while doing the auth with the system.")

        try:
            from pypreprocess import Preprocess
        except ImportError:
            raise ImportError("`pypreprocess` package not found, please run `pip install pypreprocess`")

        _info = {}
        self._preprocess = Preprocess(api_key)
        self._filepath = None
        self._process_id = None

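Since the thread safety of this reader is not documented, one conservative option is to serialize calls to it behind a lock while the rest of the upload handling stays multithreaded. The following is only a minimal sketch of that idea and is not part of LlamaIndex: `SerializedLoader`, `fake_load`, and `convert_all` are hypothetical names, and `fake_load` merely stands in for a real reader call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


class SerializedLoader:
    """Wrap a loader callable so that only one thread runs it at a time."""

    def __init__(self, load_fn: Callable[[str], str]) -> None:
        self._load_fn = load_fn
        self._lock = threading.Lock()

    def __call__(self, path: str) -> str:
        # Only the wrapped call is serialized; other work in the
        # calling threads can still overlap.
        with self._lock:
            return self._load_fn(path)


def fake_load(path: str) -> str:
    # Hypothetical stand-in for a reader whose thread safety is unknown.
    return f"# {path}"


def convert_all(
    paths: List[str], load_fn: Callable[[str], str], workers: int = 4
) -> List[str]:
    safe_load = SerializedLoader(load_fn)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input paths.
        return list(pool.map(safe_load, paths))
```

A lock like this only prevents overlapping calls. It does not speed up the conversion itself, and it would not help with a library that requires every call to come from one particular thread.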

kapa.ai

If you need to process PDFs specifically and convert them to markdown format, you might also consider using the PDFMarkerReader from LlamaIndex. However, it's not explicitly mentioned whether it's multithreading safe or not, and it doesn't specifically mention the removal of headers and footers.

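# Excerpt from the reader's source; the body of load_data is not shown.
from pathlib import Path
from typing import Any, Dict, List, Optional

from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document


class PDFMarkerReader(BaseReader):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)

    def load_data(
        self,
        file: Path,
        max_pages: int = None,
        langs: List[str] = None,
        batch_multiplier: int = 2,
        start_page: int = None,
        extra_info: Optional[Dict] = None,
    ) -> List[Document]:
        ...  # body omitted in this excerpt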


Please refer to the official documentation for more details on how to use these readers.
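Because none of these readers is documented as removing headers and footers, that step may have to run on the markdown itself. Here is a rough heuristic sketch, assuming you already have one markdown string per page (an assumption; whether a given reader produces per-page output is not covered above). `strip_repeated_edges` and `min_share` are hypothetical names: the function drops the first or last non-blank line of a page when the same line, with digits masked, recurs on most pages.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple


def _key(line: str) -> str:
    # Mask digits so "Page 1" and "Page 2" count as the same footer.
    return re.sub(r"\d+", "#", line.strip())


def _edges(lines: List[str]) -> Optional[Tuple[int, int]]:
    # Indexes of the first and last non-blank lines, or None for an empty page.
    filled = [i for i, ln in enumerate(lines) if ln.strip()]
    return (filled[0], filled[-1]) if filled else None


def strip_repeated_edges(pages: List[str], min_share: float = 0.6) -> List[str]:
    """Remove first/last lines that recur on at least `min_share` of pages."""
    threshold = max(2, round(min_share * len(pages)))
    split_pages = [page.splitlines() for page in pages]
    firsts: Counter = Counter()
    lasts: Counter = Counter()
    for lines in split_pages:
        edges = _edges(lines)
        if edges:
            firsts[_key(lines[edges[0]])] += 1
            lasts[_key(lines[edges[1]])] += 1
    headers = {k for k, n in firsts.items() if n >= threshold}
    footers = {k for k, n in lasts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        edges = _edges(lines)
        first, last = edges if edges else (-1, -1)
        kept = [
            ln for i, ln in enumerate(lines)
            if not (i == first and _key(ln) in headers)
            and not (i == last and _key(ln) in footers)
        ]
        cleaned.append("\n".join(kept).strip())
    return cleaned
```

The function is pure Python and touches no shared state, so it can run in any worker. It is only a heuristic: multi-line headers, or a real heading that happens to repeat at the top of most pages, would need extra handling.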
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------

ihshan :)

@kapa.ai PDFMarkerReader is not thread safe, since it uses PyMuPDF, which does not support multithreading. How about UnstructuredReader? How do I get it?

kapa.ai

Let me check the knowledge base for you. One moment! :books:

kapa.ai

----------------------
The UnstructuredReader from LlamaIndex is a general unstructured text reader for a variety of files. It can be used to load data from different types of files including PDFs and emails. However, it's not explicitly mentioned whether it's multithreading safe or not.

Here is how you can initialize and use the UnstructuredReader:

from pathlib import Path

from llama_index.readers.file import UnstructuredReader

# Initialize the UnstructuredReader
loader = UnstructuredReader()

# Load data from a file (replace the placeholder with a real path)
documents = loader.load_data(file=Path("your_file_path"))


Please note that the UnstructuredReader requires the unstructured package. If it's not installed, you can install it using pip install -U unstructured.

For more details on how to use the UnstructuredReader, please refer to the official documentation.
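If the thread safety of a reader stays unclear, process-based parallelism is one way around the question, because each worker process loads its own copy of the library and shares no state with the others. The sketch below only illustrates that pattern under stated assumptions: `convert_one` and `convert_many` are hypothetical names, and `convert_one` is a placeholder where a real call to your chosen PDF-to-markdown library would go.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Dict, List, Type


def convert_one(path: str) -> str:
    # Hypothetical placeholder: a real version would open the PDF at
    # `path` with the chosen library and return its markdown text.
    return f"# converted {path}"


def convert_many(
    paths: List[str],
    convert: Callable[[str], str] = convert_one,
    workers: int = 2,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> Dict[str, str]:
    # With the default process pool, each worker is a separate process,
    # so no library state is shared between concurrent conversions.
    with executor_cls(max_workers=workers) as pool:
        # map() yields results in the same order as `paths`.
        return dict(zip(paths, pool.map(convert, paths)))
```

With the default process pool, `convert` must be a module-level function so it can be pickled, and the call to `convert_many` should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes. Workers should receive file paths rather than open document objects, so that each process opens its own file.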
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
