Updated 4 months ago

Are there any approaches to also extract

At a glance

Community members discuss how to extract and index hyperlinks found in PDFs when using LlamaIndex with the SimpleDirectoryReader. One member suggests using a regex to pull the links out of the extracted text. Another explains that the SimpleDirectoryReader and the PDF data loaders on LlamaHub ignore non-text elements, so link targets never reach the extracted text. To work around this, that member wrote a custom PDFReader with PyMuPDF (fitz) that walks each page, reads its links, and inserts each one into the page as text of the form [link text]: URI before extracting the page text.

Are there any approaches to also extract & index Hyperlinks found in PDFs using LlamaIndex with the SimpleDirectoryReader?
2 comments
I think probably a regex to pull it out of the text is the best approach lol
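A minimal sketch of the regex approach, assuming the URLs appear verbatim in the extracted text (a link whose visible text is something like "click here" would not be found this way). The pattern and the function name `extract_urls` are illustrative and not part of LlamaIndex.

```python
import re

# Illustrative pattern: an http(s) URL up to the next whitespace or closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>)\]]+")


def extract_urls(text: str) -> list[str]:
    """Return the unique URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        url = match.group().rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = "See https://example.com/docs and (https://example.org/a?b=1)."
print(extract_urls(sample))
```

The returned list could be stored in a document's metadata or appended to its text before indexing; which of the two works better depends on how the links should be retrieved later.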
The problem is that the SimpleDirectoryReader and all PDF data loaders on LlamaHub ignore non-text elements. To resolve this issue, I had to implement my own PDFReader as follows:
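```python
import logging

import fitz  # PyMuPDF
from llama_index.core import Document  # older releases: from llama_index import Document

logger = logging.getLogger(__name__)


def load_data(self, file, extra_info=None):
    """Method of the custom PDFReader: return the PDF text with link targets inlined."""
    doc = fitz.open(file)
    text = ""
    pad = 15  # points added around each link rectangle to capture its anchor text
    for page in doc:
        links = page.get_links()
        logger.debug(f"Links: {links}")
        # Crawl all links on the page and insert them as text + hyperlink at the correct position
        for link in links:
            if "uri" not in link:
                continue  # internal links (e.g. jumps to another page) carry no URI
            link_rect = link["from"]
            link_text = page.get_textbox(link_rect + (-pad, -pad, pad, pad))
            annotation_and_link = f"[{link_text}]: {link['uri']}"
            # insert_text expects an (x, y) point: left edge and bottom of the link rectangle
            page.insert_text((link_rect.x0, link_rect.y1), annotation_and_link)
            logger.debug(f"hyperlink found: {annotation_and_link}")
        text += page.get_text()
    doc.close()
    return [Document(text=text, extra_info=extra_info)]
```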
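Once the reader has written its `[link text]: URI` annotations into the text, they can be pulled back out, for example to store the links separately from the text. The following is only a sketch: the helper name `parse_link_annotations` is hypothetical, and it assumes each annotation survives text extraction in that exact form and that the link text contains no `]`.

```python
import re

# Matches "[link text]: URI"; the label may span several lines.
ANNOTATION_PATTERN = re.compile(r"\[([^\]]*)\]: (\S+)")


def parse_link_annotations(text: str) -> list[tuple[str, str]]:
    """Return (link_text, uri) pairs for every annotation found in text."""
    return [
        (" ".join(label.split()), uri)  # collapse newlines and repeated spaces in the label
        for label, uri in ANNOTATION_PATTERN.findall(text)
    ]


sample = "Intro [LlamaHub\nloaders]: https://example.com/hub more text"
print(parse_link_annotations(sample))
```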