how to vectorize the documents (pdf, html) that include...

At a glance

The post asks how to vectorize documents (PDF, HTML) that include images, text, and tables for a Retrieval-Augmented Generation (RAG) system. The community members provide the following steps:

1. Load the documents using the SimpleDirectoryReader class from LlamaIndex.

2. Index the documents using the VectorStoreIndex class from LlamaIndex.

3. Create a query engine using the as_query_engine method of the VectorStoreIndex class.

4. Query the RAG system using the query method of the query engine.

The community members note that this process does not explicitly handle images or tables in the documents, and that additional tools or methods may be needed to extract text from these elements.

The community members also discuss the handling of images and nested points in HTML documents when using LlamaIndex and LlamaParse, but there is no explicitly marked answer to the original question.

payload

How do I vectorize documents (PDF, HTML) that include images, text, and tables for RAG?

14 comments

kapa.ai

Please stand by @payload as I search the knowledge base. I normally take about 20 seconds or so. :wave:

kapa.ai

----------------------
To vectorize documents (including PDFs, HTMLs, etc.) that contain images, text, tables, etc. for a RAG (Retrieval-Augmented Generation) system using LlamaIndex, you can follow these steps:

Load the documents: Use the SimpleDirectoryReader class from llama_index.core to load the documents from a directory. This class reads all the files in a directory and returns them as a list of Document objects.

Plain Text

from llama_index.core import SimpleDirectoryReader
loader = SimpleDirectoryReader(input_dir="./data")
documents = loader.load_data()

Index the documents: Use the VectorStoreIndex class from llama_index.core to index the documents. This class takes a list of Document objects and creates an index that can be used to retrieve documents based on their content.

Plain Text

from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)

Create a query engine: Use the as_query_engine method of the VectorStoreIndex class to create a query engine. The query engine retrieves the indexed content most relevant to a query and passes it to the LLM to generate an answer.

Plain Text

# mistral_llm is assumed to be an LLM instance configured earlier (not shown in this thread)
rag = index.as_query_engine(llm=mistral_llm)

Query the RAG system: Use the query method of the query engine to retrieve relevant content and generate a response to a query.

Plain Text

# query is the user's question as a string
response = rag.query(query)
print(response)

Please note that this process does not explicitly handle images or tables in the documents. If you need to extract text from images, you might need an OCR (Optical Character Recognition) tool or a multimodal model such as Claude 3, as mentioned in the LlamaIndex documentation. However, this is not covered in the provided knowledge sources.

kapa.ai

For tables, if they are represented as text in the documents, they will be handled just like any other text. If they are images or some other format that is not directly readable as text, you will need to use a suitable method to extract the text from them, which is also not covered in the provided knowledge sources.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
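
For the HTML case, one possible way to make a table "directly readable as text" before indexing is to flatten it into delimited rows. The following is a minimal sketch using only Python's standard library; the class name TableTextExtractor, the helper table_to_text, and the sample markup are hypothetical and are not part of LlamaIndex. It does not handle nested tables or tables embedded as images.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Hypothetical helper: collects each <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse internal whitespace so each cell is a single line
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_text(html):
    """Return table rows as lines of ' | '-separated cell text."""
    parser = TableTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(" | ".join(row) for row in parser.rows)


# Made-up sample markup for illustration only
sample_html = """
<table>
  <tr><th>Model</th><th>Context</th></tr>
  <tr><td>A</td><td>8k</td></tr>
</table>
"""
table_text = table_to_text(sample_html)
print(table_text)
```

The resulting string could then be wrapped in a Document and indexed like any other text; whether row-per-line text retrieves well depends on the table and on the embedding model.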

payload

If a PDF has both images and text, will SimpleDirectoryReader in LlamaIndex handle that automatically? Is it the same for HTML files? @kapa.ai

kapa.ai

Let me check the knowledge base for you. One moment! :books:

kapa.ai

----------------------
The SimpleDirectoryReader in LlamaIndex is designed to load data from local files into LlamaIndex. It supports a variety of file types, including .pdf and .html. However, the documentation does not explicitly mention how it handles images within these files.

By default, SimpleDirectoryReader will try to read any files it finds, treating them all as text. In addition to plain text, it explicitly supports the following file types, which are automatically detected based on file extension:

.csv - comma-separated values
.docx - Microsoft Word
.epub - EPUB ebook format
.hwp - Hangul Word Processor
.ipynb - Jupyter Notebook
.jpeg, .jpg - JPEG image
.mbox - MBOX email archive
.md - Markdown
.mp3, .mp4 - audio and video
.pdf - Portable Document Format
.png - Portable Network Graphics
.ppt, .pptm, .pptx - Microsoft PowerPoint

For JSON files, it is recommended to use the JSON Loader.

However, it's important to note that while SimpleDirectoryReader can read these file types, it does not explicitly mention the ability to extract or handle images or other non-text content within these files. If your PDF or HTML files contain images, you may need to use additional tools or methods to extract the text from these images, which is not covered in the provided knowledge sources.

Here is the source for your reference.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
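
Before choosing an extraction tool for HTML files, it can help to check which images a page references and whether they carry any alt text. The following is a rough sketch using only Python's standard library; ImageRefCollector, find_images, and the sample markup are hypothetical names for illustration, not LlamaIndex APIs. It only lists image references and does not perform OCR.

```python
from html.parser import HTMLParser


class ImageRefCollector(HTMLParser):
    """Hypothetical helper: records the src and alt text of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img/> tags are also routed here by HTMLParser
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append(
                {"src": attr_map.get("src") or "", "alt": attr_map.get("alt") or ""}
            )


def find_images(html):
    """Return a list of {'src', 'alt'} dicts for the images in an HTML string."""
    collector = ImageRefCollector()
    collector.feed(html)
    collector.close()
    return collector.images


# Made-up sample markup for illustration only
sample_page = (
    '<p>Setup guide</p>'
    '<img src="diagram.png" alt="Architecture diagram">'
    '<img src="logo.png">'
)
images = find_images(sample_page)

# Images without alt text contribute nothing to a text-only index
missing_alt = [img["src"] for img in images if not img["alt"]]
print(images)
print(missing_alt)
```

Images that end up in missing_alt are the ones that would need a separate step, such as OCR or a multimodal model, for their content to be retrievable.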

payload

@Logan M @WhiteFang_Jr
I have documents in Notion that I exported to HTML or PDF. The docs contain images as well as text. I tried using LlamaParse, but it is not parsing the images for some reason.

Plain Text

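import os
from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse

parser = LlamaParse(
    api_key=os.environ["LLAMA_CLOUD_API_KEY"], result_type="markdown", verbose=True,
    gpt4o_mode=True, gpt4o_api_key=os.getenv("OPENAI_API_KEY"),
)

# Route both PDF and HTML files through LlamaParse
file_extractor = {".pdf": parser, ".html": parser}

# files: list of exported PDF/HTML paths, defined elsewhere
documents = SimpleDirectoryReader(
    input_files=files, file_extractor=file_extractor, recursive=True
).load_data()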

What is the correct way to do it?

WhiteFang_Jr

Have you checked with both ways to parse your docs: