Based on the information provided in the knowledge sources, you might want to consider using the `PreprocessReader` from LlamaIndex. This reader uses the Preprocess API service, which splits any kind of document into chunks of text optimized for language-model tasks. It takes sections, paragraphs, lists, images, data tables, text tables, and slides into account, and follows the content semantics of long texts. It supports PDFs, Microsoft Office documents, OpenOffice documents, HTML content, and plain text.
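As a sketch of typical usage (the import path and the `load_data` call reflect recent LlamaIndex versions and may differ in yours, and `PREPROCESS_API_KEY` is a hypothetical environment variable name):

```python
import os

def load_chunks(filepath: str):
    """Split a document into semantically coherent chunks via the Preprocess API."""
    # Assumed import path; older LlamaIndex versions may expose the reader elsewhere.
    from llama_index.readers.preprocess import PreprocessReader

    reader = PreprocessReader(api_key=os.environ["PREPROCESS_API_KEY"])
    # Each returned Document corresponds to one chunk produced by Preprocess.
    return reader.load_data(filepath=filepath)
```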
Here is how to install the Preprocess Python SDK if it is not already present:

```bash
pip install pypreprocess
```
However, it's important to note that the `PreprocessReader` requires an API key at initialization, and the knowledge sources do not state whether it is safe to use from multiple threads:
```python
class PreprocessReader(BaseReader):
    def __init__(self, api_key: str, *args, **kwargs):
        if api_key is None or api_key == "":
            raise ValueError(
                "Please provide an api key to be used while doing the auth with the system."
            )
        try:
            from pypreprocess import Preprocess
        except ImportError:
            raise ImportError(
                "`pypreprocess` package not found, please run `pip install pypreprocess`"
            )

        _info = {}
        self._preprocess = Preprocess(api_key)
        self._filepath = None
        self._process_id = None
```
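Because thread-safety is not documented, a cautious option is to serialize access to a shared reader with a lock. The wrapper below is a sketch of that idea, not part of LlamaIndex:

```python
import threading

class LockedReader:
    """Serialize calls to a reader whose thread-safety is undocumented."""

    def __init__(self, reader):
        self._reader = reader
        self._lock = threading.Lock()

    def load_data(self, *args, **kwargs):
        # Only one thread at a time reaches the wrapped reader.
        with self._lock:
            return self._reader.load_data(*args, **kwargs)
```

Alternatively, give each thread its own `PreprocessReader` instance and avoid sharing state altogether.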