hello there, i have a question about

hello there, i have a question about enhancing data extraction quality from scanned documents beyond a typical rag + reranker pipeline. i am currently using LlamaParse to convert tabular data into markdown tables, then indexing them. There are cases where the tables are not converted properly (e.g. the table fonts are too small, the document wasn't scanned properly, etc.), making the markdown tables unusable. Since I am using gpt-4o in my pipeline, my questions are:
  • Can I also extract the table as an image and put it into my pipeline? That way, if the markdown table is unusable, gpt-4o can also look at the image for data extraction.
  • Do I also have to manage how I chunk the markdown table and image in sequence if I have more than one table?
10 comments
  1. yes! our multimodal tutorial should give you a sense of how to use our multimodal mode to extract both text and images (see the sketch after this list): https://github.com/run-llama/llama_parse/blob/main/examples/multimodal/multimodal_rag_slide_deck.ipynb
  2. i'm not entirely sure what you mean. is this about making sure the tables are in order before you feed them to the LLM?
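For reference, here is a minimal sketch of the extraction flow that notebook uses, assuming the get_json_result / get_images helpers from llama_parse; the file name, download directory, and the JSON keys used below are assumptions based on the tutorial and may differ across LlamaParse versions.

Python
from llama_parse import LlamaParse
from llama_index.core.schema import TextNode

# Parse the PDF to markdown and download the per-page screenshots.
# "invoice.pdf" and "data_images" are placeholders.
parser = LlamaParse(result_type="markdown")
json_results = parser.get_json_result("invoice.pdf")
image_dicts = parser.get_images(json_results, download_path="data_images")

# Map page number -> screenshot path (dict keys are assumptions based on the tutorial).
image_by_page = {img["page_number"]: img["path"] for img in image_dicts}

# Build one text node per page, attaching the screenshot path as metadata so the
# page image can be loaded later if the markdown table turns out to be unusable.
text_nodes = [
    TextNode(
        text=page["md"],
        metadata={
            "page_num": page["page"],
            "image_path": image_by_page.get(page["page"]),
        },
    )
    for page in json_results[0]["pages"]
]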
yes. i also realised that during the retrieval and reranking process, the table's data rows are not returned in order (e.g. rows from page x are retrieved before rows from page 1 onwards). i use the following code to check. could it be the way i am chunking my data?

Python
from llama_index.core.response.notebook_utils import display_source_node

# print each retrieved chunk in the order the retriever returned it
for i, n in enumerate(response_table.source_nodes):
    print(f'Source node {i+1}')
    display_source_node(n, source_length=20000)
hey @jerryjliu0 i was reading the notebook, but i don't understand why the paths of the images are being indexed instead of the images themselves. for my use case, i am reading documents directly from Azure Blob Storage and then using LlamaParse to parse them. I can save the document from blob storage as a PIL object, but what I am confused about is how I can index it together with the parsed result from LlamaParse?
@galvangjx re: the first point, that's just because by definition retrieval returns chunks ordered by embedding similarity instead of by page number. are you saying you'd want the chunks ordered by page number?
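If page order is wanted for display or for building the prompt, a post-retrieval sort is usually enough. A minimal sketch, reusing the response_table variable from the snippet above; it assumes each node's metadata carries a page number, and the key name ("page_num" here) depends on how the nodes were built.

Python
# Sort the retrieved chunks back into document order before displaying them
# or building the prompt. The metadata key "page_num" is an assumption.
ordered = sorted(
    response_table.source_nodes,
    key=lambda n: int(n.node.metadata.get("page_num", 0)),
)
for i, n in enumerate(ordered):
    print(f"Source node {i + 1} (page {n.node.metadata.get('page_num')})")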
re: the second point, llamaparse lets you download the page screenshots along with the text. in the notebook, we embed the text chunks, but attach a link to the page screenshots through the metadata of the text chunks. if you want, you can store the page screenshots in blob storage instead of the local file system; you would just need the URL to blob storage.

then, before calling the LLM at query time, you get the image paths attached to each text node and use them to load the original images
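A rough sketch of that query-time step, assuming the text nodes carry an image_path metadata entry (as in the earlier sketch) and using the OpenAIMultiModal wrapper from llama-index; the prompt and variable names are placeholders.

Python
from llama_index.core.schema import ImageDocument
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

def load_images_for_nodes(retrieved_nodes):
    """Turn the screenshot paths attached to retrieved text nodes into ImageDocuments."""
    image_docs = []
    for n in retrieved_nodes:
        path = n.node.metadata.get("image_path")  # metadata key is an assumption
        if path:
            image_docs.append(ImageDocument(image_path=path))
    return image_docs

# Example: feed both the retrieved text and the matching page screenshots to gpt-4o.
mm_llm = OpenAIMultiModal(model="gpt-4o")
image_docs = load_images_for_nodes(response_table.source_nodes)
context = "\n\n".join(n.node.get_content() for n in response_table.source_nodes)
answer = mm_llm.complete(
    prompt=f"Context:\n{context}\n\nExtract the table data.",  # placeholder prompt
    image_documents=image_docs,
)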
in a way, yes, i would like to chunk by page number. but that doesn't matter, because i preprocess the llama parse results by condensing them and retaining only key information (e.g. basic invoice data and just the tables). what i was concerned about is that if i chunk tables from different pages, i want to make sure i am attaching the correct image of each table as metadata
while using AzStorageBlobReader with llamaparse as the file_extractor, i wasn't able to download the image screenshots and came across a '<bytes/buffer>' error. i reported it in the llamacloud chat though.
sorry, i'm still not getting how i can give the query engine the image during query time 😅. all i can see is that just the links to each page screenshot are given while creating a custom multimodal query engine. unless you mean the tool created in the later part does the actual loading of the screenshots?

this tutorial shows the images being embedded together with the text into a vector db, and it makes more sense to me - https://docs.llamaindex.ai/en/stable/examples/multi_modal/gpt4v_multi_modal_retrieval/#build-multi-modal-index-and-vector-store-to-index-both-text-and-images
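For reference, the index-building step from that tutorial looks roughly like the sketch below, adapted to a mix of LlamaParse text Documents and ImageDocuments; the Qdrant path and collection names are placeholders, and text_documents / image_documents are assumed to exist already.

Python
import qdrant_client
from llama_index.core import StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Separate collections for text and image embeddings (names are placeholders).
client = qdrant_client.QdrantClient(path="qdrant_mm_db")
text_store = QdrantVectorStore(client=client, collection_name="text_collection")
image_store = QdrantVectorStore(client=client, collection_name="image_collection")
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# text_documents: markdown Documents from LlamaParse; image_documents:
# ImageDocuments loaded from blob storage.
index = MultiModalVectorStoreIndex.from_documents(
    text_documents + image_documents,
    storage_context=storage_context,
)
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)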
hey @jerryjliu0 i've managed to set up a multimodal rag pipeline following the tutorial linked in my previous message. now i'm trying to put everything together using my own data. i have a list of ImageDocument objects, which are the images i loaded from azure blob storage, and also a Document, which is the markdown text result from llamaparse.

when building the MultiModalVectorStoreIndex, my loaded images (from the ImageDocument list) are empty. can you help me understand why that is?

ValueError: Cannot build index from nodes with no content. Please ensure all nodes have content.
The difference is: the metadata of the ImageDocument loaded from azure blob storage is empty, but the metadata of the ImageDocument loaded from the local file system is populated. might this be a bug?
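One thing worth trying while debugging this: build the ImageDocument yourself from the bytes downloaded from Azure Blob Storage, so the image content and metadata are populated before indexing. A rough sketch; the connection string, container, and blob names are placeholders, and it assumes ImageDocument accepts a base64-encoded image via its image field.

Python
import base64
from azure.storage.blob import BlobServiceClient
from llama_index.core.schema import ImageDocument

# Placeholders: connection string, container, and blob names are examples only.
blob_service = BlobServiceClient.from_connection_string("<connection-string>")
blob = blob_service.get_blob_client(container="invoices", blob="scan_page_1.png")
image_bytes = blob.download_blob().readall()

# Populate the image content and metadata explicitly so the node is not empty.
image_doc = ImageDocument(
    image=base64.b64encode(image_bytes).decode("utf-8"),
    image_mimetype="image/png",
    metadata={"file_name": "scan_page_1.png", "container": "invoices"},
)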