Updated 2 months ago

How to vectorize documents (PDF, HTML) that include images, text, and tables for RAG

14 comments
Please stand by @payload as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
To vectorize documents (including PDFs, HTMLs, etc.) that contain images, text, tables, etc. for a RAG (Retrieval-Augmented Generation) system using LlamaIndex, you can follow these steps:

  1. Load the documents: Use the SimpleDirectoryReader class from llama_index.core to load the documents from a directory. This class reads all the files in a directory and returns them as a list of Document objects.
Plain Text
from llama_index.core import SimpleDirectoryReader
loader = SimpleDirectoryReader(input_dir="./data")
documents = loader.load_data()


  2. Index the documents: Use the VectorStoreIndex class from llama_index.core to index the documents. This class takes a list of Document objects and creates an index that can be used to retrieve documents based on their content.
Plain Text
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)


  3. Create a query engine: Use the as_query_engine method of the VectorStoreIndex class to create a query engine. The query engine retrieves the indexed content most relevant to a query and passes it to the LLM to generate an answer.
Plain Text
# mistral_llm is assumed to be an LLM instance created beforehand (not shown here)
rag = index.as_query_engine(llm=mistral_llm)


  4. Query the RAG system: Use the query method of the query engine to get an answer grounded in the retrieved content.
Plain Text
# query is the question string, defined beforehand
response = rag.query(query)
print(response)


Please note that this process does not explicitly handle images or tables in the documents. If you need to extract text from images, you might need an OCR (Optical Character Recognition) tool or a multimodal model such as Claude 3, as mentioned in the LlamaIndex documentation. However, this is not covered in the provided knowledge sources.
For tables, if they are represented as text in the documents, they will be handled just like any other text. If they are images or some other format that is not directly readable as text, you will need to use a suitable method to extract the text from them, which is also not covered in the provided knowledge sources.
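As a rough illustration of turning HTML tables into text before indexing, here is a minimal sketch that uses only Python's standard library. It is not part of LlamaIndex; the class name `TableAndImageExtractor`, the helper `table_to_markdown`, and the sample HTML are hypothetical. It renders each table as Markdown so it can be embedded like any other text, and it records image references so they can be handed to a separate OCR or captioning step. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableAndImageExtractor(HTMLParser):
    """Collect tables as rows of cell text and images as (src, alt) pairs."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows; each row is a list of cell strings
        self.images = []   # (src, alt) for every <img> encountered
        self._row = None   # row currently being filled, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "img":
            self.images.append((attrs.get("src") or "", attrs.get("alt") or ""))

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell becomes one clean string.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_markdown(rows):
    """Render rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    padded = [r + [""] * (width - len(r)) for r in rows]
    lines = ["| " + " | ".join(padded[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in padded[1:]]
    return "\n".join(lines)


# Hypothetical sample document for illustration.
sample_html = """
<p>Quarterly numbers:</p>
<table>
  <tr><th>Quarter</th><th>Revenue</th></tr>
  <tr><td>Q1</td><td>10</td></tr>
  <tr><td>Q2</td><td>12</td></tr>
</table>
<img src="chart.png" alt="Revenue chart">
"""

extractor = TableAndImageExtractor()
extractor.feed(sample_html)
markdown_table = table_to_markdown(extractor.tables[0])
print(markdown_table)
print(extractor.images)
```

The Markdown string could then be wrapped in a Document and indexed along with the rest of the text, while the collected image paths go to whichever image-to-text tool you choose.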
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
If a PDF has both images and text, will SimpleDirectoryReader in LlamaIndex handle that automatically? Is it the same for HTML files? @kapa.ai
Let me check the knowledge base for you. One moment! :books:
----------------------
The SimpleDirectoryReader in LlamaIndex is designed to load data from local files into LlamaIndex. It supports a variety of file types, including .pdf. Note that .html is not among the extensions listed below, so HTML files would be read as plain text by default. The documentation does not explicitly mention how the reader handles images within these files.

By default, SimpleDirectoryReader will try to read any files it finds, treating them all as text. It supports the following file types, which are automatically detected based on file extension:

  • .csv - comma-separated values
  • .docx - Microsoft Word
  • .epub - EPUB ebook format
  • .hwp - Hangul Word Processor
  • .ipynb - Jupyter Notebook
  • .jpeg, .jpg - JPEG image
  • .mbox - MBOX email archive
  • .md - Markdown
  • .mp3, .mp4 - audio and video
  • .pdf - Portable Document Format
  • .png - Portable Network Graphics
  • .ppt, .pptm, .pptx - Microsoft PowerPoint
For JSON files, it is recommended to use the JSON Loader.

However, it's important to note that while SimpleDirectoryReader can read these file types, it does not explicitly mention the ability to extract or handle images or other non-text content within these files. If your PDF or HTML files contain images, you may need to use additional tools or methods to extract the text from these images, which is not covered in the provided knowledge sources.
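To see which of your files fall outside the listed extensions, a small standard-library check can help. This is only a sketch: the extension set is copied from the list above and may not match the readers shipped with your installed LlamaIndex version, and the function name `group_by_support` and the file names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Extensions copied from the list quoted above; the set shipped with your
# installed LlamaIndex version may differ, so treat this as an assumption.
LISTED_EXTENSIONS = {
    ".csv", ".docx", ".epub", ".hwp", ".ipynb", ".jpeg", ".jpg", ".mbox",
    ".md", ".mp3", ".mp4", ".pdf", ".png", ".ppt", ".pptm", ".pptx",
}


def group_by_support(paths):
    """Split paths into those with a listed extension and those without."""
    groups = defaultdict(list)
    for p in map(Path, paths):
        # Compare case-insensitively so ".PDF" and ".pdf" are treated alike.
        listed = p.suffix.lower() in LISTED_EXTENSIONS
        groups["listed" if listed else "plain_text_fallback"].append(p.name)
    return dict(groups)


# Hypothetical file names for illustration.
result = group_by_support(
    ["notes/export.html", "notes/export.PDF", "img/diagram.png", "data.json"]
)
print(result)
```

Files that land in the fallback group are the ones worth routing through a custom file_extractor or a separate pre-processing step.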
----------------------
@Logan M @WhiteFang_Jr
I have documents in Notion, which I exported to HTML or PDF. The docs contain images as well as text. I tried using LlamaParse, but it is not parsing the images for some reason.

Plain Text
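import os
from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse

# gpt4o_mode asks LlamaParse to use GPT-4o with the supplied OpenAI key
parser = LlamaParse(
    api_key=os.environ["LLAMA_CLOUD_API_KEY"], result_type="markdown", verbose=True,
    gpt4o_mode=True, gpt4o_api_key=os.getenv("OPENAI_API_KEY"),
)

file_extractor = {".pdf": parser, ".html": parser}

# files is the list of exported Notion file paths (defined elsewhere);
# recursive only matters with input_dir, so it has no effect here
documents = SimpleDirectoryReader(
    input_files=files, file_extractor=file_extractor, recursive=True
).load_data()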

What is the correct way to do it?
Have you tried parsing your docs both ways?
  • With GPT-4o in LlamaParse
  • Without GPT-4o
When I convert the HTML to PDF it works, but not with the HTML directly.
It is OCRing the image. Is it possible to generate a description instead?
With GPT-4o mode the OCR is not working.
It just gives a Markdown image!
Another issue: nested points from the HTML are not extracted properly.

Example: the document has
  • point 1
    • sub point 1
      • sub point 2
but it comes out as
  • point 1
  • sub point 1
  • sub point 2
in both modes.
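One possible workaround for the flattened nesting, sketched here with only Python's standard library, is to convert the HTML lists to indented Markdown yourself before indexing. This is not LlamaParse behavior; the class name `NestedListToMarkdown` and the sample HTML are hypothetical. Text that follows a nested list inside the same item is dropped in this simplified version.

```python
from html.parser import HTMLParser


class NestedListToMarkdown(HTMLParser):
    """Convert nested <ul>/<ol> lists to Markdown bullets indented by depth."""

    def __init__(self, indent="  "):
        super().__init__()
        self.indent = indent
        self.depth = 0     # how many lists we are currently inside
        self.lines = []    # finished Markdown lines
        self._text = None  # text fragments of the open <li>, or None

    def _flush(self):
        # Emit the pending item text, if any, at the current depth.
        if self._text is not None:
            text = " ".join("".join(self._text).split())
            if text:
                self.lines.append(self.indent * (self.depth - 1) + "- " + text)
            self._text = None

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._flush()  # a nested list ends the parent item's own text
            self.depth += 1
        elif tag == "li":
            self._flush()
            self._text = []

    def handle_endtag(self, tag):
        if tag == "li":
            self._flush()
        elif tag in ("ul", "ol"):
            self._flush()
            self.depth -= 1

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)


# Hypothetical HTML mirroring the example in the message above.
sample_html = """
<ul>
  <li>point 1
    <ul>
      <li>sub point 1
        <ul><li>sub point 2</li></ul>
      </li>
    </ul>
  </li>
</ul>
"""

converter = NestedListToMarkdown()
converter.feed(sample_html)
markdown_lines = converter.lines
print("\n".join(markdown_lines))
```

Each nesting level adds one indent step, so the hierarchy survives in the text that gets chunked and embedded.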