
Updated 3 months ago

Are there any approaches to also extract & index Hyperlinks found in PDFs using LlamaIndex with the SimpleDirectoryReader?
2 comments
I think a regex to pull URLs out of the extracted text is probably the simplest approach lol
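For links that are written out verbatim in the PDF's text layer, a minimal sketch of that regex approach could look like this (the pattern and function name are mine, and the pattern is deliberately naive; it won't catch hyperlink annotations, whose URIs live outside the text):

```python
import re

# Naive URL pattern: matches http(s) URLs up to whitespace or common
# closing punctuation. Only finds links present in the extracted text,
# not PDF link annotations.
URL_RE = re.compile(r"""https?://[^\s)\]>"']+""")

def extract_urls(text: str) -> list[str]:
    """Return all URL-looking substrings found in text."""
    return URL_RE.findall(text)
```

Run it over the text that `SimpleDirectoryReader` already gives you, e.g. `extract_urls(document.text)`.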
The problem is that the SimpleDirectoryReader and all PDF data loaders on LlamaHub ignore non-text elements. To work around this, I had to implement my own PDFReader, like this:
Python
import logging

import fitz  # PyMuPDF
from llama_index import Document  # in newer versions: llama_index.core

logger = logging.getLogger(__name__)


# Method of a custom PDFReader class.
def load_data(self, file, extra_info=None):
    doc = fitz.open(file)
    text = ""
    for page in doc:
        links = page.get_links()
        logger.debug(f"Links: {links}")
        # Crawl all links on the page and insert them as text + hyperlink
        # at the correct position.
        for link in links:
            uri = link.get("uri")
            if uri is None:
                # Skip internal links (e.g. GOTO destinations) with no URI.
                continue

            # Grab the anchor text by padding the link rectangle slightly.
            x = 15
            link_text = page.get_textbox(link["from"] + (-x, -x, x, x))

            link_rect = link["from"]
            annotation_and_link = f"[{link_text}]: {uri}"

            # Write the annotation at the bottom-left corner of the link
            # rectangle, point (x0, y1).
            page.insert_text(
                (link_rect[0], link_rect[3]),
                annotation_and_link,
            )

            logger.debug(f"hyperlink found: {annotation_and_link}")

        text += page.get_text()

    doc.close()
    return [
        Document(
            text=text,
            extra_info=extra_info,
        )
    ]
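To wire a custom reader like this into `SimpleDirectoryReader`, you can pass it via `file_extractor` so it handles all `.pdf` files. A sketch, assuming the `load_data` method above lives in a class called `HyperlinkPDFReader` (the class name is whatever you gave your reader, and the import path varies by LlamaIndex version):

```python
from llama_index import SimpleDirectoryReader  # newer versions: llama_index.core

reader = SimpleDirectoryReader(
    input_dir="./pdfs",
    # Route all .pdf files through the custom reader instead of the default one.
    file_extractor={".pdf": HyperlinkPDFReader()},
)
documents = reader.load_data()
```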