
Updated last year

how to store simpledirectoryreader indexes in aws s3 storage

At a glance

The post asks how to store SimpleDirectoryReader indexes in AWS S3 storage. The comments provide a detailed answer, which includes the following steps:

1. Install the LlamaIndex package using pip.

2. Import the necessary modules and set up logging.

3. Load the environment variables and set up the S3 filesystem.

4. Use SimpleDirectoryReader to load the documents.

5. Create a VectorStoreIndex from the documents and save it to the S3 storage.

6. Load the index from the S3 storage using the StorageContext.

The community members also provide links to the SimpleIndexOnS3.ipynb notebook and the LlamaIndex documentation as additional resources.

@kapa.ai how to store simpledirectoryreader indexes in aws s3 storage
3 comments
Please stand by @dhruva as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
To store indexes created using SimpleDirectoryReader in AWS S3 storage, you can follow these steps:

  1. Install the LlamaIndex package if you haven't already. You can do this using pip:
Plain Text
!pip install llama-index

  2. Import the necessary modules and set up logging:
Plain Text
import logging
import sys
import dotenv
import s3fs
import os
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    load_index_from_storage,
    StorageContext,
)

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

  3. Load your environment variables and set up your S3 filesystem:
Plain Text
dotenv.load_dotenv("../../../.env")

AWS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET = os.environ["AWS_SECRET_ACCESS_KEY"]
R2_ACCOUNT_ID = os.environ["R2_ACCOUNT_ID"]

assert AWS_KEY is not None and AWS_KEY != ""

s3 = s3fs.S3FileSystem(
    key=AWS_KEY,
    secret=AWS_SECRET,
    endpoint_url=f"https://{R2_ACCOUNT_ID}.r2.cloudflarestorage.com",
    s3_additional_kwargs={"ACL": "public-read"},
)
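Note that the snippet above points at Cloudflare R2 via a custom endpoint_url. For plain AWS S3, which is what the original question asks about, the endpoint can simply be omitted; a minimal sketch (the credentials are read from the same environment variables as above):

```python
import os

import s3fs

# For AWS S3 itself, no custom endpoint_url is needed; the default
# endpoint resolves to AWS. s3fs can also pick these credentials up
# from the environment automatically if key/secret are not passed.
s3 = s3fs.S3FileSystem(
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
)
```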

  4. Use SimpleDirectoryReader to load your documents:
Plain Text
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

  5. Create an index from your documents:
Plain Text
index = VectorStoreIndex.from_documents(documents, fs=s3)

  6. Save the index to your S3 storage:
Plain Text
index.set_index_id("vector_index")
index.storage_context.persist("llama-index/storage_demo", fs=s3)

  7. You can then load the index from your S3 storage using the following code:
Plain Text
sc = StorageContext.from_defaults(
    persist_dir="llama-index/storage_demo", fs=s3
)
index2 = load_index_from_storage(sc, index_id="vector_index")

Please replace "./data/paul_graham/" with the path to your directory of documents, and "llama-index/storage_demo" with your S3 bucket name and key prefix.
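Once reloaded, the index can be queried like any other; a short sketch (this assumes an LLM API key, e.g. OPENAI_API_KEY, is configured, and the question string is illustrative):

```python
# Query the index that was reloaded from S3. This makes an LLM call,
# so an API key must be available in the environment.
query_engine = index2.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
```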
Sources:
  • SimpleIndexOnS3.ipynb
  • LlamaIndex Documentation
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------