I use `SimpleWebPageReader` to scrape

I use `SimpleWebPageReader` to scrape this page: https://www.defichainwiki.com/docs/auto/App_Sync_Boost.

Then I index it with `GPTListIndex(documents)`, which throws: `A single term is larger than the allowed chunk size. Term size: 7103, Chunk size: 3714`.

I think it must be because of the image, which the scraper loads as a base64 string (message.txt).

How can I work around this issue?
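
A quick way to confirm the base64 suspicion is to print each document's longest whitespace-delimited token. This is a minimal sketch, assuming the `documents` list returned by `SimpleWebPageReader`:

Python
# Print the longest whitespace-delimited token per document; a value far
# above the chunk size (3714) points at an inlined base64 image.
for i, document in enumerate(documents):
    longest = max(document.text.split(), key=len, default="")
    print(f"doc {i}: longest token is {len(longest)} chars")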
Thanks @James Moriarty for the guidance. Now I just preprocess the data like so:

Python
import re
from llama_index import SimpleWebPageReader

urls = ["https://www.defichainwiki.com/docs/auto/App_Sync_Boost"]
documents = SimpleWebPageReader(html_to_text=True).load_data(urls)

# Delete any run of 3714+ non-whitespace characters (the allowed chunk
# size) flanked by non-whitespace, e.g. an inlined base64 image.
for document in documents:
    document.text = re.sub(r"(?<=\S)\S{3714,}(?=\S)", "", document.text)
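
With the oversized runs stripped, `GPTListIndex(documents)` should build without the chunk-size error. Note the regex's design: the lookarounds only assert a non-whitespace character on each side, so an over-long blob loses its middle and keeps a boundary character at each end rather than tripping the chunker.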
Check out browsertrix-crawler. I am using it to crawl websites and output extracted text.

TL;DR:

Plain Text
$ docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://docs.website.com --text --collection test --scope-type domain --allowHashUrls
The parsed data can then be loaded like so:

Python
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Load the text pages browsertrix-crawler wrote for the "test" collection.
docs = SimpleDirectoryReader("./crawls/collections/test/pages").load_data()
website_index = GPTSimpleVectorIndex(docs, llm_predictor=davinci)
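
The `davinci` predictor is never defined in the thread. Below is a minimal sketch of one possible setup, assuming the old `LLMPredictor` API wrapping an OpenAI completion model, followed by an example query (the question string is hypothetical):

Python
from langchain.llms import OpenAI
from llama_index import LLMPredictor

# Assumed definition of `davinci` (not shown in the original comment).
davinci = LLMPredictor(llm=OpenAI(model_name="text-davinci-003"))

# Example query against the index built above (hypothetical question).
response = website_index.query("How do I enable App Sync Boost?")
print(response)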
@yourbuddyconner, thanks for this tip. This is an awesome crawler.