Hi all, is there a way to bulk insert using a VectorStoreIndex using an OpenSearchVectorStore vector_store?
it's already in bulk
Or I guess maybe you mean after the index is created
You can also do index.insert_nodes(nodes)
Hmm does insert_nodes perform the inserts sequentially? My use-case is that I would like to insert multiple documents (potentially thousands) to an AWS OpenSearch instance as a nightly job that runs within a Lambda.

I want this to work as quickly as possible. I see that there is this function: https://github.com/run-llama/llama_index/blob/main/llama_index/vector_stores/opensearch.py#L61-L103

I don't see the native bulk insert capabilities of opensearch being used anywhere
I suppose I could use this bulk functionality, but it would require me to generate embeddings first?
No, it will calculate embeddings for you (the embeddings calculation also has a batch size)

If you follow from the top level here, it will get down to using that bulk insert function in opensearch
https://github.com/run-llama/llama_index/blob/06127ec09966e8df2fcd4f03a1b53ec566b4a43d/llama_index/indices/vector_store/base.py#L251
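The idea described above (group nodes into batches, then issue one bulk call per batch instead of one request per node) can be sketched in plain Python. This is a toy illustration, not the library's actual code path; `bulk_insert` here is a hypothetical stand-in for whatever client call actually performs the bulk write:

```python
from typing import Callable, List, Sequence


def insert_in_batches(
    items: Sequence[str],
    bulk_insert: Callable[[List[str]], None],
    batch_size: int = 100,
) -> int:
    """Send items to a bulk-insert callable in fixed-size batches.

    Returns the number of bulk calls made, so you can see how
    batching cuts down round trips compared to per-item inserts.
    """
    calls = 0
    for start in range(0, len(items), batch_size):
        bulk_insert(list(items[start:start + batch_size]))
        calls += 1
    return calls
```

With 250 items and a batch size of 100, this issues 3 bulk calls rather than 250 individual inserts, which is the behavior you want for a nightly Lambda job.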
Oh, I guess I'm slightly confused here. I'm currently generating a list of Document objects using download_loader("PDFReader") and using index.insert to insert each Document. How can I generate the node objects to then use insert_nodes?
```python
from llama_index.node_parser import SimpleNodeParser

node_parser = SimpleNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(documents)
```
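Under the hood, the parsing step above splits each document's text into chunks. A toy stand-in for that behavior (the real SimpleNodeParser also tracks metadata and node relationships, which this sketch omits) might look like:

```python
def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 20) -> list:
    """Toy stand-in for node parsing: split text into (optionally
    overlapping) fixed-size chunks, one chunk per would-be node."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        # Step forward by chunk_size minus the overlap between chunks
        start += chunk_size - overlap
    return chunks
```

Once you have the nodes from the parser, you can hand the whole list to `index.insert_nodes(nodes)` as mentioned earlier, instead of calling `index.insert` once per document.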
I think this is where the .add function is called which invokes the bulk insert: https://github.com/run-llama/llama_index/blob/main/llama_index/indices/vector_store/base.py#L187

How can the max_chunk_bytes kwarg be passed along for the bulk insert call?
Is there a way to add this? Would be useful to allow larger sizes for bulk inserts. I suppose for now I can monkey patch the function to adjust that param
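A minimal sketch of the monkey-patch idea, using a dummy stand-in for the helpers module (in the real library the helper is `opensearchpy.helpers.bulk`, which accepts a `max_chunk_bytes` keyword; the names and default below are illustrative):

```python
import functools
import types

# Hypothetical stand-in for the opensearch helpers module.
helpers = types.SimpleNamespace()


def _bulk(client, actions, max_chunk_bytes=100 * 1024 * 1024, **kwargs):
    """Dummy bulk helper that just reports which chunk-size limit it saw."""
    return {"inserted": len(list(actions)), "max_chunk_bytes": max_chunk_bytes}


helpers.bulk = _bulk

# The monkey patch: rebind the helper to a partial that pins a larger
# limit, so every call made through helpers.bulk picks it up without
# touching the calling code.
helpers.bulk = functools.partial(helpers.bulk, max_chunk_bytes=500 * 1024 * 1024)
```

This is the workaround shape until the kwarg can be passed through properly (e.g. as an init attribute, per the PR below).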
Yeah, feel free to submit a PR too; it could be an init attribute.
For visibility, here is a PR addressing this: https://github.com/run-llama/llama_index/pull/8082