Andre Tättar
Offline, last seen 3 months ago
Joined September 25, 2024
Is there a way to extend the LlamaIndex document loader or vector builder so that it does not add duplicate files, i.e. filters them out at the document-loading or vector-building step? Are there any code examples for that?
Reason: web scrapers often load the same page multiple times, so content can end up duplicated.
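
One way to filter at the loading step is to hash each document's text and keep only the first copy before building the index. A minimal sketch, assuming llama-index 0.10+ import paths; the deduplicate helper and the ./scraped_pages path are made up for illustration:

import hashlib

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

def deduplicate(documents):
    # Keep only the first document for each distinct text content
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

documents = SimpleDirectoryReader("./scraped_pages").load_data()
vector_index = VectorStoreIndex.from_documents(deduplicate(documents))

Note that exact hashing only catches byte-identical pages; near-duplicates (the same article scraped with different timestamps, say) would need fuzzier matching.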
65 comments
Is there any way to find out how many documents I have in a vector index, plus some basic information: size, the embedding model used, dimensionality, etc.?
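
For the default in-memory setup, a rough sketch: attribute names assume llama-index 0.10+, a vector_index built via from_documents, and that Settings.embed_model is the model you indexed with (an external vector store may leave the local docstore empty):

from llama_index.core import Settings

# Source documents vs. chunked nodes held in the local docstore
print("documents:", len(vector_index.ref_doc_info))
print("nodes:", len(vector_index.docstore.docs))

# Embedding model and its output dimensionality, probed with a dummy string
embed_model = Settings.embed_model
print("model:", embed_model.model_name)
print("dimensions:", len(embed_model.get_text_embedding("probe")))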
5 comments
I have my prototype set up for a product, and I want to start scaling it up in Google Cloud. Are there any tutorials/notebooks/anything to help me? My current setup is simple:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load every file under the temp dir, skipping hidden files
loader = SimpleDirectoryReader(self.LOCAL_TEMP_DIR, recursive=True, exclude_hidden=True)
documents = loader.load_data()
# Chunk, embed, and index everything in one single-threaded pass
vector_index = VectorStoreIndex.from_documents(documents)

I'd like to make everything here parallel by using Google Cloud Functions or some alternative. I'd also like to start using a vector database (any recommendations on GCP?).
VectorStoreIndex sometimes takes hours and gets timed out by Cloud Functions, since I process all the documents in one thread. That is the main reason I want to parallelize it (currently my prototype stores everything on disk). Also, I have one vector index per company, and I have lots of companies, so the setup must handle that use case (a query engine must never see another company's data).
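
One sketch of parallelizing the parsing step with the core ingestion pipeline, which a Cloud Functions fan-out could wrap per batch of files; assumes llama-index 0.10+, with ./company_docs and num_workers=4 as placeholder values:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./company_docs", recursive=True).load_data()

# Split documents into nodes across several worker processes
pipeline = IngestionPipeline(transformations=[SentenceSplitter()])
nodes = pipeline.run(documents=documents, num_workers=4)

# Build the index from the pre-parsed nodes; keeping one index
# (or one vector store collection) per company isolates tenants,
# so a query engine can never see another company's data
vector_index = VectorStoreIndex(nodes)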
17 comments
I just updated too, and I cannot even run the code from the README. pip freeze -> llama-index==0.10.4
The code from the README that doesn't run is this simple import: "from llama_index.core import StorageContext, load_index_from_storage"

It leads to this error: ImportError: cannot import name 'StorageContext' from 'llama_index.core' (unknown location)
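
The "(unknown location)" in that ImportError usually means leftovers from a pre-0.10 install are shadowing the new namespace packages. Assuming that is the cause here, the v0.10 migration guidance amounts to a clean reinstall (or starting from a fresh virtual environment):

pip uninstall llama-index
pip install --upgrade --no-cache-dir --force-reinstall llama-index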
5 comments