Hi all, is there a way to bulk insert using a VectorStoreIndex using an OpenSearchVectorStore vector_store?
it's already in bulk
Or I guess maybe you mean after the index is created
You can also do index.insert_nodes(nodes)
Hmm does insert_nodes perform the inserts sequentially? My use-case is that I would like to insert multiple documents (potentially thousands) to an AWS OpenSearch instance as a nightly job that runs within a Lambda.

I want this to work as quickly as possible. I see that there is this function: https://github.com/run-llama/llama_index/blob/main/llama_index/vector_stores/opensearch.py#L61-L103

I don't see the native bulk insert capabilities of opensearch being used anywhere
I suppose I could use this bulk functionality, but it would require me to generate embeddings first?
No, it will calculate embeddings for you (the embeddings calculation also has a batch size)

If you follow from the top level here, it will get down to using that bulk insert function in opensearch
https://github.com/run-llama/llama_index/blob/06127ec09966e8df2fcd4f03a1b53ec566b4a43d/llama_index/indices/vector_store/base.py#L251
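The idea described above (group nodes into batches, then issue one bulk call per batch instead of one request per node) can be sketched in plain Python. This is a toy illustration, not the library's actual code path; `bulk_insert` here is a hypothetical stand-in for whatever client call actually performs the bulk write:

```python
from typing import Callable, List, Sequence


def insert_in_batches(
    items: Sequence[str],
    bulk_insert: Callable[[List[str]], None],
    batch_size: int = 100,
) -> int:
    """Send items to a bulk-insert callable in fixed-size batches.

    Returns the number of bulk calls made, so you can see how
    batching cuts down round trips compared to per-item inserts.
    """
    calls = 0
    for start in range(0, len(items), batch_size):
        bulk_insert(list(items[start:start + batch_size]))
        calls += 1
    return calls
```

With 250 items and a batch size of 100, this issues 3 bulk calls rather than 250 individual inserts, which is the behavior you want for a nightly Lambda job.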
Oh, I guess I'm slightly confused here. I'm currently generating a list of Document objects using download_loader("PDFReader") and using index.insert to insert each Document. How can I generate the node objects to then use insert_nodes?
```python
from llama_index.node_parser import SimpleNodeParser

node_parser = SimpleNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(documents)
```
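Under the hood, the parsing step above splits each document's text into chunks. A toy stand-in for that behavior (the real SimpleNodeParser also tracks metadata and node relationships, which this sketch omits) might look like:

```python
def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 20) -> list:
    """Toy stand-in for node parsing: split text into (optionally
    overlapping) fixed-size chunks, one chunk per would-be node."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        # Step forward by chunk_size minus the overlap between chunks
        start += chunk_size - overlap
    return chunks
```

Once you have the nodes from the parser, you can hand the whole list to `index.insert_nodes(nodes)` as mentioned earlier, instead of calling `index.insert` once per document.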
I think this is where the .add function is called which invokes the bulk insert: https://github.com/run-llama/llama_index/blob/main/llama_index/indices/vector_store/base.py#L187

How can the max_chunk_bytes kwarg be passed along for the bulk insert call?
Is there a way to add this? Would be useful to allow larger sizes for bulk inserts. I suppose for now I can monkey patch the function to adjust that param
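A minimal sketch of the monkey-patch idea, using a dummy stand-in for the helpers module (in the real library the helper is `opensearchpy.helpers.bulk`, which accepts a `max_chunk_bytes` keyword; the names and default below are illustrative):

```python
import functools
import types

# Hypothetical stand-in for the opensearch helpers module.
helpers = types.SimpleNamespace()


def _bulk(client, actions, max_chunk_bytes=100 * 1024 * 1024, **kwargs):
    """Dummy bulk helper that just reports which chunk-size limit it saw."""
    return {"inserted": len(list(actions)), "max_chunk_bytes": max_chunk_bytes}


helpers.bulk = _bulk

# The monkey patch: rebind the helper to a partial that pins a larger
# limit, so every call made through helpers.bulk picks it up without
# touching the calling code.
helpers.bulk = functools.partial(helpers.bulk, max_chunk_bytes=500 * 1024 * 1024)
```

This is the workaround shape until the kwarg can be passed through properly (e.g. as an init attribute, per the PR below).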
Yeah, feel free to submit a PR too; it could be an init attribute.
For visibility, here is a PR addressing this: https://github.com/run-llama/llama_index/pull/8082