The purpose of this functionality is to pinpoint where an answer comes from and narrow the source down to a sentence or paragraph rather than an entire page. (I noticed that response.get_formatted_sources() returns a whole page of text.)
By "geographic location" I mean the position of the text on the page. For example, when AWS Textract extracts text from a PDF file, it produces a JSON file that records the position of each piece of text. A portion of such a JSON file is shown below:
...{
  "BlockType": "LINE",
  "Confidence": 99.56,
  "Text": "This is a sentence.",
  "TextType": "PRINTED",
  "Geometry": {
    "BoundingBox": {
      "Width": 0.12,
      "Height": 0.016,
      "Left": 0.19,
      "Top": 0.055
    },
    "Polygon": [
      {"X": 0.19, "Y": 0.055},
      {"X": 0.31, "Y": 0.055},
      {"X": 0.31, "Y": 0.071},
      {"X": 0.19, "Y": 0.071}
    ]
  }
}...
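To make the geometry concrete: Textract reports BoundingBox values as ratios of the page width and height, so they have to be scaled by the page size before they can be used to highlight text. Below is a minimal sketch of that conversion. The helper name `bbox_to_pixels` and the 1000 x 1000 page size are assumptions made purely for illustration.

```python
from typing import Dict, Tuple


def bbox_to_pixels(
    bbox: Dict[str, float], page_width: int, page_height: int
) -> Tuple[int, int, int, int]:
    """Convert a normalized Textract-style BoundingBox to pixel coordinates.

    Returns (left, top, right, bottom), rounded to the nearest pixel.
    """
    left = bbox["Left"] * page_width
    top = bbox["Top"] * page_height
    right = (bbox["Left"] + bbox["Width"]) * page_width
    bottom = (bbox["Top"] + bbox["Height"]) * page_height
    return round(left), round(top), round(right), round(bottom)


# The LINE block from the excerpt above, reduced to the fields used here.
block = {
    "BlockType": "LINE",
    "Text": "This is a sentence.",
    "Geometry": {
        "BoundingBox": {"Width": 0.12, "Height": 0.016, "Left": 0.19, "Top": 0.055}
    },
}

# Hypothetical page size in pixels; a real page size depends on the rendering.
pixel_box = bbox_to_pixels(block["Geometry"]["BoundingBox"], 1000, 1000)
print(block["Text"], pixel_box)
```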
I believe that when SimpleDirectoryReader extracts text from a PDF file, it retains similar position information somewhere. This seems evident because, when I ask QUERY_ENGINE for the geographical location of the original text in the PDF, it sometimes responds with something like "answer + The original text can be found on page 9, line 14 of the PDF file named 'file_name.pdf.'" However, the prompt works only occasionally, even when I ask the same question with the same prompt. It also does not work with CHAT_ENGINE, which is the engine I would prefer to use.
So, my questions are as follows:
- What tool does SimpleDirectoryReader use to extract text from a PDF file, and where does it store information similar to the JSON shown above?
- Is there a way to refine the text nodes returned by response.get_formatted_sources() so that the sources are more precise?
- Do you have any other suggestions for implementing the functionality I want?
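
To illustrate the kind of refinement the second question asks about, here is a minimal sketch that narrows a page of source text down to the single sentence that best matches an answer. It assumes the page text is already available as a plain string and uses simple word overlap (Jaccard similarity) with a naive regex sentence splitter. It does not use any LlamaIndex API, and the function names are hypothetical.

```python
import re
from typing import List, Set, Tuple


def split_sentences(text: str) -> List[str]:
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def best_sentence(answer: str, page_text: str) -> Tuple[int, str]:
    """Return (1-based sentence index, sentence) with the highest word overlap.

    Returns (0, "") if the page text contains no sentences.
    """
    answer_tokens = tokenize(answer)
    best: Tuple[int, str] = (0, "")
    best_score = -1.0
    for index, sentence in enumerate(split_sentences(page_text), start=1):
        sentence_tokens = tokenize(sentence)
        union = answer_tokens | sentence_tokens
        # Jaccard similarity: shared words divided by all distinct words.
        score = len(answer_tokens & sentence_tokens) / len(union) if union else 0.0
        if score > best_score:
            best_score = score
            best = (index, sentence)
    return best


# Made-up page text and answer, for illustration only.
page_text = (
    "The sky is blue. Cats sleep most of the day. "
    "Paris is the capital of France."
)
answer = "The capital of France is Paris."
print(best_sentence(answer, page_text))
```

Word overlap is only a stand-in here; embedding similarity or smaller chunks at indexing time would likely be more robust, but the idea of mapping an answer back to a sentence rather than a page is the same.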