Hi team I am building a customized

At a glance

The community member is building a customized chatbot based on user-uploaded PDF files and wants to find the geographic location (axis or lines) of the text in the PDF files, to use as the response source when using the chat_engine. The comments suggest that the community member extract the location themselves and add it to the metadata of each node, where it can be read back through source_nodes. The community member also asks which tool SimpleDirectoryReader uses to extract text from PDF files, how to refine the text nodes returned by response.get_formatted_sources(), and whether there are other suggestions for implementing the desired functionality.

Jo

Hi team, I am building a customized chatbot based on user-uploaded PDF files. I'm wondering how I can find the geographic location within a PDF file (narrowed down to the axis or lines) of the response source when I use chat_engine.

5 comments

bmax

Welcome! Is the geographic location in the text of the PDF or somewhere else? You can add it to the metadata of each node and then use source_nodes to get the metadata.

Jo

Hi bmax, is the geographic location already extracted and stored somewhere else, or should I extract it myself? And how can I add it to the metadata?

bmax

@Jo I am really not familiar with what you mean by geographic location, and I'd love to learn more. Are you saying this is specific to your PDF, or do all PDFs have geo info embedded in them?

bmax

Anyway, you'd probably have to extract it yourself and then add it to the metadata when adding documents, like this: https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/documents_and_nodes/usage_documents.html
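
A minimal sketch of that idea, assuming the text has already been extracted page by page: split each page into lines and keep the file name, page number, and line number next to each piece of text. The helper name build_line_records and the metadata keys are hypothetical, not part of LlamaIndex; each record could then be wrapped in a LlamaIndex Document with that dictionary as its metadata.

```python
from typing import Dict, List


def build_line_records(file_name: str, pages: List[str]) -> List[Dict]:
    """Turn page texts into one record per non-empty line, with location metadata."""
    records = []
    for page_number, page_text in enumerate(pages, start=1):
        for line_number, line in enumerate(page_text.splitlines(), start=1):
            if not line.strip():
                continue  # skip blank lines but keep the original line numbering
            records.append(
                {
                    "text": line.strip(),
                    # Hypothetical metadata keys, chosen for illustration only.
                    "metadata": {
                        "file_name": file_name,
                        "page": page_number,
                        "line": line_number,
                    },
                }
            )
    return records


# Made-up page texts standing in for whatever a PDF extractor returns.
pages = ["First line.\n\nSecond line.", "Only line on page two."]
records = build_line_records("file_name.pdf", pages)
for record in records:
    print(record["metadata"], record["text"])
```

Keeping one record per line means the metadata that comes back with a source node already points at a line rather than a whole page.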

Jo

The purpose of this function is to pinpoint the origin of answers and refine it to sentences or paragraphs, as opposed to entire pages. (I noticed that response.get_formatted_sources() provides an entire page of text.)

By geographic location I mean the position of the text on the page. For example, using AWS Textract to extract text from a PDF file produces a JSON file that contains information about the text's position. A portion of the JSON file is shown below:
...{
  "BlockType": "LINE",
  "Confidence": 99.56,
  "Text": "This is a sentence.",
  "TextType": "PRINTED",
  "Geometry": {
    "BoundingBox": {
      "Width": 0.12,
      "Height": 0.016,
      "Left": 0.19,
      "Top": 0.055
    },
    "Polygon": [
      {"X": 0.19, "Y": 0.055},
      {"X": 0.31, "Y": 0.055},
      {"X": 0.31, "Y": 0.071},
      {"X": 0.19, "Y": 0.071}
    ]
  }
}...
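
For illustration, a response shaped like the excerpt above could be reduced to one record per line, keeping the bounding box next to the text. This is only a sketch under assumptions: the sample repeats the fields shown in the excerpt, wrapped in a Blocks list, plus one made-up WORD block to show the filtering; the function name line_records and the metadata keys are hypothetical.

```python
import json
from typing import Dict, List

# Sample data: the LINE block mirrors the excerpt; the WORD block is invented
# for illustration so that the BlockType filter has something to skip.
sample = json.loads("""
{"Blocks": [
  {"BlockType": "LINE", "Confidence": 99.56, "Text": "This is a sentence.",
   "TextType": "PRINTED",
   "Geometry": {"BoundingBox":
     {"Width": 0.12, "Height": 0.016, "Left": 0.19, "Top": 0.055}}},
  {"BlockType": "WORD", "Text": "This",
   "Geometry": {"BoundingBox":
     {"Width": 0.03, "Height": 0.016, "Left": 0.19, "Top": 0.055}}}
]}
""")


def line_records(response: Dict) -> List[Dict]:
    """Keep only LINE blocks and pair each line's text with its bounding box."""
    records = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        records.append(
            {
                "text": block["Text"],
                # Hypothetical metadata keys; values are page-relative ratios.
                "metadata": {
                    "left": box["Left"],
                    "top": box["Top"],
                    "width": box["Width"],
                    "height": box["Height"],
                },
            }
        )
    return records


textract_records = line_records(sample)
print(textract_records)
```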
I believe that when SimpleDirectoryReader extracts text from a PDF file, it also retains similar information somewhere. I suspect this because when I use query_engine to ask about the geographic location of the original text in the PDF, it gives a response like "answer + The original text can be found on page 9, line 14 of the PDF file named 'file_name.pdf.'" However, this prompt works only occasionally, even when I repeat the same question with the same prompt. Additionally, this prompt doesn't work at all with chat_engine, which is the one I would prefer to use.
So, my questions are as follows:

What tool does SimpleDirectoryReader use to extract text from a PDF file, and where does it store information similar to the JSON file I presented?
Is there a way to refine the text nodes when I call response.get_formatted_sources(), in order to obtain more precise text nodes?
Do you have any other suggestions for implementing the functionality I want?
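
One possible approach to the second question, sketched with the standard library only: once a page-sized source text comes back, split it into sentences and pick the one that shares the most words with the answer. The naive sentence splitter and the word-overlap score are simplifying assumptions, not something LlamaIndex is known to do, and the function names are hypothetical.

```python
import re
from typing import List, Set, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def words(text: str) -> Set[str]:
    """Lower-cased alphanumeric tokens, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_sentence(answer: str, source_text: str) -> Tuple[int, str]:
    """Return (index, sentence) for the sentence sharing the most words with the answer."""
    sentences = split_sentences(source_text)
    if not sentences:
        raise ValueError("source_text contains no sentences")
    answer_words = words(answer)
    scores = [len(answer_words & words(sentence)) for sentence in sentences]
    best = max(range(len(sentences)), key=scores.__getitem__)
    return best, sentences[best]


# Made-up page text and answer, for illustration only.
page_text = (
    "The invoice was issued in March. "
    "Payment is due within 30 days. "
    "Late fees apply after that."
)
answer = "Payment is due within 30 days of the invoice."
index, sentence = best_sentence(answer, page_text)
print(index, sentence)
```

Word overlap is deliberately crude; it only narrows a page down to a candidate sentence, and ties go to the earliest sentence.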
