Updated 2 months ago

Hi team, I am building a customized chatbot based on user-uploaded PDF files. How can I find the geographic location inside a PDF file (narrowed down to coordinates or lines) of the response source when I use chat_engine?
5 comments
Welcome! Is the geographic location in the text of the PDF or somewhere else? You can add it to the metadata of each node and then use source_nodes to get the metadata.
Hi bmax, is the geographic location already extracted and stored somewhere else, or should I extract it myself? And how can I add it to the metadata?
@Jo I am really not familiar with what you mean by geographic location; I'd love to learn more. Are you saying this is specific to your PDF, or do all PDFs have geo info embedded in them?
Anyway, you'd probably have to extract it yourself and then add it to the metadata when adding documents, like this: https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/documents_and_nodes/usage_documents.html
The purpose of this function is to pinpoint the origin of answers and refine it to sentences or paragraphs, as opposed to entire pages. (I noticed that response.get_formatted_sources() provides an entire page of text.)

By geographic location I mean position information like what AWS Textract returns when it extracts text from a PDF file: a JSON file that contains the position of each piece of text. A portion of such a JSON file is shown below:
...{
  "BlockType": "LINE",
  "Confidence": 99.56,
  "Text": "This is a sentence.",
  "TextType": "PRINTED",
  "Geometry": {
    "BoundingBox": {
      "Width": 0.12,
      "Height": 0.016,
      "Left": 0.19,
      "Top": 0.055
    },
    "Polygon": [
      {"X": 0.19, "Y": 0.055},
      {"X": 0.31, "Y": 0.055},
      {"X": 0.31, "Y": 0.071},
      {"X": 0.19, "Y": 0.071}
    ]
  }
}...
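To make the metadata idea from earlier in the thread concrete, here is a minimal sketch of turning Textract-style LINE blocks into text-plus-metadata records. It assumes the full response wraps the blocks in a top-level "Blocks" list and that each block may carry a "Page" number; neither appears in the excerpt above, and the function name line_records is hypothetical. It uses only the standard library and does not call LlamaIndex.

```python
# Sample shaped like the excerpt above. The "Blocks" wrapper and the "Page"
# field are assumptions about the full response, not taken from the excerpt.
SAMPLE_RESPONSE = {
    "Blocks": [
        {
            "BlockType": "LINE",
            "Confidence": 99.56,
            "Text": "This is a sentence.",
            "Page": 1,
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.12,
                    "Height": 0.016,
                    "Left": 0.19,
                    "Top": 0.055,
                }
            },
        },
        # Non-LINE blocks are skipped by line_records().
        {"BlockType": "WORD", "Text": "This", "Page": 1},
    ]
}


def line_records(response):
    """Return one {"text", "metadata"} record per LINE block."""
    records = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        records.append(
            {
                "text": block["Text"],
                "metadata": {
                    "page": block.get("Page"),  # None if the field is absent
                    "left": box["Left"],
                    "top": box["Top"],
                    "width": box["Width"],
                    "height": box["Height"],
                },
            }
        )
    return records


records = line_records(SAMPLE_RESPONSE)
print(records[0]["text"], records[0]["metadata"])
```

Each record's metadata dict could then be attached to the matching document or node when it is added to the index, so that it comes back through source_nodes.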
I believe that when SimpleDirectoryReader extracts text from a PDF file, it also retains similar information somewhere. This is evident when I use QUERY_ENGINE to inquire about the geographical location of the original text from the PDF. It provides a response like "answer + The original text can be found on page 9, line 14 of the PDF file named 'file_name.pdf.'" However, this prompt is only occasionally effective, even when I input the same question with the same prompt. Additionally, this prompt doesn't work when I use CHAT_ENGINE, although I prefer using chat_engine.
So, my questions are as follows:
  1. What tool does SimpleDirectoryReader employ to extract text from a PDF file, and where is it storing similar information to the JSON file I presented?
  2. Is there a way to refine the text nodes when I call response.get_formatted_sources() in order to obtain more precise text nodes?
  3. Do you have any other suggestions for implementing the functionality I want?
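For question 2, one possible workaround is to post-process the page-sized source text and pick the sentence that most resembles the answer. The sketch below is a generic string-similarity heuristic built on Python's difflib, not a LlamaIndex feature; the helper names split_sentences and best_sentence are hypothetical, and the sentence splitter is deliberately naive.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def best_sentence(page_text, answer):
    """Return (score, index, sentence) for the sentence most similar to answer.

    Returns (0.0, -1, "") when no sentence scores above zero.
    """
    best = (0.0, -1, "")
    for index, sentence in enumerate(split_sentences(page_text)):
        score = SequenceMatcher(None, sentence.lower(), answer.lower()).ratio()
        if score > best[0]:
            best = (score, index, sentence)
    return best


# Hypothetical page text standing in for the text of one source node.
page_text = (
    "The sky is blue. Cats sleep most of the day. "
    "Water boils at 100 degrees Celsius."
)
score, index, sentence = best_sentence(page_text, "Water boils at 100 degrees Celsius.")
print(index, sentence)
```

The page text here would be the text of each entry in response.source_nodes. Character-level similarity is crude, so a paraphrased answer may match the wrong sentence; splitting documents into smaller nodes before indexing is likely the more robust route.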