I use `SimpleWebPageReader` to scrape

I use `SimpleWebPageReader` to scrape this page: https://www.defichainwiki.com/docs/auto/App_Sync_Boost.

Then I index it with `GPTListIndex(documents)`, which throws: `A single term is larger than the allowed chunk size. Term size: 7103, Chunk size: 3714`.

I think it must be because of the image, which the scraper loads as a base64 string (message.txt).

How can I work around this issue?
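
A quick way to confirm the base64 suspicion is to print each document's longest whitespace-delimited token. This is a minimal sketch, assuming the `documents` list returned by `SimpleWebPageReader`:

Python
# Print the longest whitespace-delimited token per document; a value far
# above the chunk size (3714) points at an inlined base64 image.
for i, document in enumerate(documents):
    longest = max(document.text.split(), key=len, default="")
    print(f"doc {i}: longest token is {len(longest)} chars")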
Thanks @James Moriarty for the guidance. Now I just preprocess the data like so:

Python
import re
from llama_index import SimpleWebPageReader

urls = ["https://www.defichainwiki.com/docs/auto/App_Sync_Boost"]
documents = SimpleWebPageReader(html_to_text=True).load_data(urls)

# Delete any run of 3714+ non-whitespace characters (the allowed chunk
# size) flanked by non-whitespace, e.g. an inlined base64 image.
for document in documents:
    document.text = re.sub(r"(?<=\S)\S{3714,}(?=\S)", "", document.text)
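
With the oversized runs stripped, `GPTListIndex(documents)` should build without the chunk-size error. Note the regex's design: the lookarounds only assert a non-whitespace character on each side, so an over-long blob loses its middle and keeps a boundary character at each end rather than tripping the chunker.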
Check out browsertrix-crawler. I am using it to crawl websites and output extracted text.

TL;DR:

Plain Text
$ docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://docs.website.com --text --collection test --scope-type domain --allowHashUrls
The parsed data can then be loaded like so:

Python
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Load the text pages browsertrix-crawler wrote for the "test" collection.
docs = SimpleDirectoryReader("./crawls/collections/test/pages").load_data()
website_index = GPTSimpleVectorIndex(docs, llm_predictor=davinci)
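
The `davinci` predictor is never defined in the thread. Below is a minimal sketch of one possible setup, assuming the old `LLMPredictor` API wrapping an OpenAI completion model, followed by an example query (the question string is hypothetical):

Python
from langchain.llms import OpenAI
from llama_index import LLMPredictor

# Assumed definition of `davinci` (not shown in the original comment).
davinci = LLMPredictor(llm=OpenAI(model_name="text-davinci-003"))

# Example query against the index built above (hypothetical question).
response = website_index.query("How do I enable App Sync Boost?")
print(response)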
@yourbuddyconner, thanks for this tip. This is an awesome crawler.