SimpleWebPageReader
to scrape this page https://www.defichainwiki.com/docs/auto/App_Sync_Boost
GPTListIndex(documents)
which throws: A single term is larger than the allowed chunk size. Term size: 7103. Chunk size: 3714.
import re

documents = SimpleWebPageReader(html_to_text=True).load_data(urls)
# Strip any whitespace-delimited term of 3714 or more characters before indexing.
for document in documents:
    document.text = re.sub(r'\S{3714,}', '', document.text)
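The same workaround can be tried on a plain string before touching the index. This is a minimal sketch using only the standard library; the helper names `longest_term` and `drop_long_terms` are hypothetical and not part of any library. It measures terms in characters, which is only a rough proxy if the splitter counts term size in tokens, so the threshold may need to be set lower than the reported chunk size.

```python
import re


def longest_term(text: str) -> str:
    """Return the longest whitespace-delimited term in text ('' if there is none)."""
    return max(text.split(), key=len, default="")


def drop_long_terms(text: str, max_chars: int) -> str:
    """Remove every whitespace-delimited term longer than max_chars characters."""
    # \S{n,} matches a run of at least n non-whitespace characters.
    pattern = re.compile(r"\S{" + str(max_chars + 1) + r",}")
    return pattern.sub("", text)


if __name__ == "__main__":
    # Hypothetical sample: one oversized term surrounded by normal words.
    sample = "short words " + "x" * 50 + " more words"
    cleaned = drop_long_terms(sample, max_chars=20)
    print(len(longest_term(sample)), len(longest_term(cleaned)))
```

Checking `len(longest_term(document.text))` for each document first shows whether any term is actually close to the limit, and which page it comes from, before anything is deleted.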
$ docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://docs.website.com --text --collection test --scope-type domain --allowHashUrls
docs = SimpleDirectoryReader("./crawls/collections/test/pages").load_data()
website_index = GPTSimpleVectorIndex(docs, llm_predictor=davinci)