Is there a way to extend the LlamaIndex document loader or vector index builder so that it does not add duplicate files, i.e. filters them out at the document-loading or vector-building step? Is there any example code for that? Reason: web scrapers often load the same page multiple times, so content gets duplicated.
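One lightweight approach (a sketch, not a built-in loader hook) is to drop exact duplicates by hashing each document's text before the index is built. The function below only assumes the documents expose a `.text` attribute, which LlamaIndex `Document` objects do:

```python
import hashlib

def dedupe_documents(documents):
    """Return a new list with exact-duplicate documents removed.

    `documents` is any iterable of objects with a `.text` attribute;
    the first occurrence of each distinct text is kept, later copies
    are skipped.
    """
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier page
        seen.add(digest)
        unique.append(doc)
    return unique
```

You would pass the filtered list to the index builder instead of the raw loader output. Note this only catches byte-identical pages; near-duplicates (same content, slightly different template) would need fuzzy matching or LlamaIndex's ingestion-pipeline/docstore machinery, which is worth checking in the docs.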
I have my prototype set up for a product and I want to start scaling it up in Google Cloud. Are there any tutorials/notebooks/anything to help me? My current setup is simple:
I'd like to parallelize everything here using Google Cloud Functions or some alternative. I'd also like to start using a vector database (any recommendations on GCP?). Building the VectorStoreIndex sometimes takes hours and gets timed out by Cloud Functions, since I process all the documents in a single thread; that is the main reason I want to parallelize it (currently my prototype stores everything on disk). Also, I have one vector index per company, and I have lots of companies, so the vector database must handle that use case (a query engine must never be able to see another company's data).
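To sketch the fan-out idea: split the corpus into batches and hand each batch to its own worker, so no single invocation has to process every document. Locally that is a thread or process per batch; on GCP the same shape becomes one Cloud Function (or Cloud Run job) invocation per batch, e.g. triggered via Pub/Sub. `index_batch` below is a placeholder for the real embed-and-upsert work, and the names and defaults are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def batched(items, batch_size):
    """Split a document list into fixed-size batches, one per worker/invocation."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def index_batch(batch):
    # Placeholder: in the real system this would embed the batch and
    # upsert it into the company's own vector-DB collection. Here it
    # just reports how many documents it handled.
    return len(batch)

def parallel_index(documents, batch_size=100, workers=4):
    """Run the batches concurrently and return the total document count."""
    batches = batched(documents, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(index_batch, batches))
```

For the per-company isolation requirement, a common pattern is one collection/namespace per company in the vector database, so each query engine is physically scoped to its own company's data rather than relying on a metadata filter.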
I just updated too and now I cannot even run the code from the readme. pip freeze shows llama-index==0.10.4. The readme code that doesn't run is this simple import: "from llama_index.core import StorageContext, load_index_from_storage"
It leads to this error: ImportError: cannot import name 'StorageContext' from 'llama_index.core' (unknown location)
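That import path is correct for v0.10, so the usual culprit is a stale pre-0.10 install shadowing the new namespaced packages. A clean reinstall in the same environment (or, better, a fresh virtualenv) typically resolves it; the commands below are a suggested sequence, not a guaranteed fix:

```shell
# remove any leftover old-layout packages, then reinstall fresh
pip uninstall -y llama-index llama-index-core
pip install --upgrade --no-cache-dir --force-reinstall llama-index
# sanity check the failing import
python -c "from llama_index.core import StorageContext, load_index_from_storage; print('ok')"
```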