Index error

At a glance

A community member is trying to load markdown content from GitHub into a Chroma vector store, but is encountering an IndexError: list index out of range error when querying the index. They provide their code for loading the content and querying the index, and the community members try to help troubleshoot the issue.

The community members suggest trying different approaches, such as creating the Chroma collection first before loading the content, or using a Chroma reader instead of the direct index creation. Eventually, the community member is able to get the code working by passing a base document to the index creation.

The discussion then shifts to the changes in the llama-index library, where the GPTChromaIndex class has been deprecated in favor of the more general VectorStoreIndex. The community members provide guidance on how to set up the vector store and index using the new API.

There is no explicitly marked answer, but the community members provide helpful suggestions and explanations to assist the original poster in resolving their issue.

Hey all, I would love some help (also thanks to the folks helping me so far). So I switched to Chroma on AWS, and I created a script that loads markdown content from GitHub into Chroma, but I am wondering if I am doing it right, because when I try to query it I get IndexError: list index out of range:

My code to query it is:

import chromadb
from chromadb.config import Settings

chroma_client = chromadb.Client(Settings(
    chroma_api_impl="rest",
    chroma_server_host="some ip address",
    chroma_server_http_port=8000,
))

print(chroma_client)
collection = chroma_client.get_collection(name="some collection")
print(collection)
index = GPTChromaIndex.from_documents([], chroma_collection=collection)
response = index.query("What is python?")
print(response)

And my script to load the content from github is something like this:

for item in folder:
    # If item is a file, and its type is markdown, get its contents
    if item.type == "file" and item.name.endswith(".md"):
        markdown_content = item.decoded_content.decode('utf-8')

        # Add the file's content to the list
        new_document = Document(text=markdown_content, doc_id=item.name)
        markdown_files.append(new_document)

Then I load it as an index here:

index = GPTChromaIndex.from_documents(markdown_files, chroma_collection=chroma_collection)

Anybody know what I am doing wrong? My backup plan is to sync stuff to s3 but that seems sort of weird.
What line of code is causing the error? Do you have the full stack trace handy?
OK, update: weirdly enough, when I create the collection and then load it into the index, it works, aka:

markdown_files = get_markdown_files(repo_owner, repo_name, folder_path, access_token)
print(markdown_files)

chroma_client = chromadb.Client(Settings(
    chroma_api_impl="rest",
    chroma_server_host="some ip",
    chroma_server_http_port=8000,
))

chroma_collection = chroma_client.create_collection(name="py_two")

index = GPTChromaIndex.from_documents(markdown_files, chroma_collection=chroma_collection)
response = index.query("What is python?")
print(response)
But when I try to load it from another file (just trying to load the docs from the index), I get the error:

@Logan M

openai.api_key = os.getenv('OPENAI_API_KEY')

chroma_client = chromadb.Client(Settings(
    chroma_api_impl="rest",
    chroma_server_host="some ip",
    chroma_server_http_port=8000,
))

print(chroma_client)
collection = chroma_client.get_collection(name="py_two")
print(collection)
index = GPTChromaIndex.from_documents([], chroma_collection=collection)
response = index.query("What is python?")
print(response)
@Logan M at this point maybe I can use the chroma reader, but I am a newbie and I don't know how to programmatically create the query vector that appears in this code snippet:

from gpt_index.readers.chroma import ChromaReader
from gpt_index.indices import GPTListIndex

# The chroma reader loads data from a persisted Chroma collection.
# This requires a collection name and a persist directory.
reader = ChromaReader(
    collection_name="chroma_collection",
    persist_directory="examples/data_connectors/chroma_collection"
)

query_vector = [n1, n2, n3, ...]

documents = reader.load_data(collection_name="demo", query_vector=query_vector, limit=5)
index = GPTListIndex.from_documents(documents)

response = index.query("<query_text>")
display(Markdown(f"<b>{response}</b>"))
I avoided the readers bc I don't know what a query vector is, and I don't know how to generate one
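For reference: a query vector is just the embedding of the query text, produced by the same embedding model the collection was built with. A minimal sketch with the old-style openai client (the model name is an assumption; it has to match whatever model the collection was indexed with):

import openai

# Embed the query text with the same model the collection was built with
response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="What is python?",
)
query_vector = response["data"][0]["embedding"]  # a plain list of floats

That list of floats is what goes in place of [n1, n2, n3, ...] in the reader snippet above.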
I think we should still be able to insert stuff from your markdown files like your original code was trying to do. What line of code is the original error coming from? There should be a big stack trace printed in the terminal πŸ™
@Logan M
Traceback (most recent call last):
  File "my_path/codebases/newBaseChatFolder/sagemaker.py", line 20, in <module>
    index = GPTChromaIndex.from_documents([], chroma_collection=collection)
  File "my_path/codebases/newBaseChatFolder/my_project_env/lib/python3.10/site-packages/llama_index/indices/base.py", line 100, in from_documents
    return cls(
  File "my_path/codebases/newBaseChatFolder/my_project_env/lib/python3.10/site-packages/llama_index/indices/vector_store/vector_indices.py", line 371, in __init__
    super().__init__(
  File "my_path/codebases/newBaseChatFolder/my_project_env/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 54, in __init__
    super().__init__(
  File "my_path/codebases/newBaseChatFolder/my_project_env/lib/python3.10/site-packages/llama_index/indices/base.py", line 69, in __init__
    index_struct = self.build_index_from_nodes(nodes)
  File "my_path/codebases/newBaseChatFolder/my_project_env/lib/python3.10/site-packages/llama_index/token_counter/token_counter.py", line 78, in wrapped_llm_predict
    f_return_val = f(_self, *args, **kwargs)
  File "my_path/codebases/newBaseChatFolder/my_project_env/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 217, in build_index_from_nodes
    return self._build_index_from_nodes(nodes)
  File "my_path/codebases/newBaseChatFolder/my_project_env/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 206, in _build_index_from_nodes
    self._add_nodes_to_index(index_struct, nodes)
  File "my_path/codebases/newBaseChatFolder/my_project_env/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 183, in _add_nodes_to_index
    new_ids = self._vector_store.add(embedding_results)
  File "my_path/codebases/newBaseChatFolder/my_project_env/lib/python3.10/site-packages/llama_index/vector_stores/chroma.py", line 78, in add
    self._collection.add(
  File "my_path/codebases/newBaseChatFolder/my_project_env/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 82, in add
    ids = validate_ids(maybe_cast_one_to_many(ids))
  File "my_path/codebases/newBaseChatFolder/my_project_env/lib/python3.10/site-packages/chromadb/api/types.py", line 71, in maybe_cast_one_to_many
    if isinstance(target[0], (int, float)):
IndexError: list index out of range
Try index = GPTChromaIndex([], chroma_collection=collection) instead πŸ€”
No such luck 😦
I changed it to index = GPTChromaIndex([], chroma_collection=collection)
but got the same error
I guess in theory I could create a flask server that, on boot, deletes the collection and then loads from GitHub -> then creates the collection again, but I am thinking a reader might make sense
got it to work
ok so weirdly enough i had to pass a base document
let me try but for reference
print(chroma_client)
collection = chroma_client.get_collection(name="py_two")
print(collection)
index = GPTChromaIndex.from_documents([Document(text="python")], chroma_collection=collection)
response = index.query("What is python?")
print(response)
When you use the chroma client, are you connecting to an existing collection or making a new one? (Glad there's no more error at least lol)
existing collection
but this is with open ai
I just realized i have to try a custom llm..
But hopefully this helps other folks
So what I am doing here is writing a script to load content from GitHub (could be run from CI/CD). It then pushes to my AWS Chroma -> then from a server or a different python file it will load from that chroma index -> and then you query
Thanks for your help @Logan M, super duper appreciate it
with a custom LLM
Wooo nice πŸ’ͺπŸ’ͺ
@Logan M It looks like GPTChromaIndex is no longer a thing? Is there an easy way to accomplish the above in the new codebase?
Yea this example is super outdated. All vector indexes are consolidated into VectorStoreIndex now, and you change the underlying vector db by modifying the vector store in the storage context.

To connect to an existing vector db you built with llama-index, you can set up the vector_store object and do index = VectorStoreIndex.from_vector_store(vector_store)
perfecto, thanks logan
do i need to run the chroma server to init the vector store, or can i read it from disk (where its persisted)?
since the ChromaReader looks to require the server running
uhhh tbh I'm not sure exactly with chroma.

But here's my guess based on the chroma/llama-index docs

Constructing
Plain Text
import chromadb
from chromadb.config import Settings
from llama_index import VectorStoreIndex, StorageContext
from llama_index.vector_stores import ChromaVectorStore

chroma_client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="/path/to/persist/directory"  # Optional, defaults to .chromadb/ in the current directory
))
chroma_collection = chroma_client.create_collection("quickstart")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)


Loading
Plain Text
chroma_client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="/path/to/persist/directory"  # Optional, defaults to .chromadb/ in the current directory
))
chroma_collection = chroma_client.get_collection("quickstart")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

index = VectorStoreIndex.from_vector_store(vector_store)
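One more note: querying changed too. In the newer API you go through a query engine instead of calling index.query directly. A rough sketch, assuming a recent llama-index:

query_engine = index.as_query_engine()
response = query_engine.query("What is python?")
print(response)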
This is great, will test tonight - thank you so much!! β™₯️
I think the only place I'm struggling is how to re-reference the documents/docstore objects.

That's not possible though, right, without saving that reference off manually? Think we discussed this in the context of document -> index -> nodes before. πŸ™‚
I realized there seem to be two different approaches with llama index for documents. When you build a document summary index, it seems to chunk a single PDF into many doc ids, whereas loading just as a vector store index with typical text chunking + overlap seems to be 1 doc id per file/pdf file.
A vector index should also be creating many nodes per input files too actually

I know it's confusing, but both documents and nodes have a doc_id attribute, that is just that documents/nodes unique ID (confusing naming I know, working on making this better)

If you weren't using chroma, you could do index.ref_doc_info, to see each ingested document id and the doc_id's of the nodes it created

But for vector store integrations, this hasn't been implemented yet, since it's a lot more difficult in those cases due to a bunch of under-the-hood reasons
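For a default (non-integration) vector index, that lookup is roughly:

# Sketch: ref_doc_info maps each ingested document's id to info
# about the nodes that were created from it
for ref_doc_id, info in index.ref_doc_info.items():
    print(ref_doc_id)     # id of the original input document
    print(info.node_ids)  # ids of the nodes chunked from it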
I actually think I'm going to kill chroma from this entire project - its adding more confusion than its worth.
I liked having that traceability with ref_doc_info.
But that makes sense - so a doc_id can be a node or a document.
so really i'm not 'losing' anything in the persist process aside from the structure i used to create the doc_id nodes/docs within the index, right?
Found this in the discord search, in case anyone else is looking at this and needs clarity.

The three files docstore.json, index_store.json, and vector_store.json are used to persist the storage context in LlamaIndex. Here's what each file stores:

docstore.json: This file stores the Document Store, which primarily contains Node objects. Each node will be assigned an ID. The Document Store is based on the BaseDocumentStore class and its subclasses (source (https://gpt-index.readthedocs.io/en/latest/reference/storage/docstore.html)).

index_store.json: This file stores the Index Store, which is based on the BaseIndexStore class. The Index Store is responsible for managing indices (source (https://gpt-index.readthedocs.io/en/latest/reference/storage/index_store.html)).

vector_store.json: This file stores the Vector Store, which contains the embedding vectors of ingested document chunks (and sometimes the document chunks as well). The Vector Store is based on the VectorStore class and its subclasses (source (https://gpt-index.readthedocs.io/en/latest/reference/storage/vector_store.html)).
Is this a fair updated explanation of each?

Sure. Here are the explanations of the storage files and their uses in LlamaIndex:

  • docstore.json: This file stores the Document Store, which primarily contains Node objects. Each node will be assigned an ID. The Document Store is based on the BaseDocumentStore class and its subclasses. The Document Store is responsible for storing and retrieving Node objects. Node objects represent the ingested documents. The Document Store can be used to query the ingested documents by ID, text, or other properties.
  • index_store.json: This file stores the Index Store, which is based on the BaseIndexStore class. The Index Store is responsible for managing indices. An index is a data structure that allows for efficient retrieval of documents based on a set of criteria. The Index Store can be used to create, update, and delete indices. The Index Store can also be used to query the indices for documents that match a set of criteria.
  • vector_store.json: This file stores the Vector Store, which contains the embedding vectors of ingested document chunks (and sometimes the document chunks as well). The Vector Store is based on the VectorStore class and its subclasses. The Vector Store is responsible for storing and retrieving embedding vectors. Embedding vectors are a type of vector representation of words or phrases that can be used to represent the meaning of words and phrases. The Vector Store can be used to query the embedding vectors for words or phrases.
These three storage files are used to store the data ingested by LlamaIndex. The Document Store stores the ingested documents, the Index Store stores the indices, and the Vector Store stores the embedding vectors.
Yea thats pretty much correct! The only thing I would add is that the docstore is also keeping track of the id's for the original input documents and which nodes map to which document
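For anyone following along, here is a minimal sketch of how those three files get written and read back with the default local storage (the persist_dir path is an assumption):

from llama_index import StorageContext, load_index_from_storage

# Writing: creates docstore.json, index_store.json, and vector_store.json
index.storage_context.persist(persist_dir="./storage")

# Reading them back later
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)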
And thats how the source/citations work right?
mmm not quite haha. The extra_info field of each input document actually gets inherited to each node created from that document, and normally here is where you would store the filename, or other useful info
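e.g. something like this (the metadata key here is just an example):

from llama_index import Document

doc = Document(
    text=markdown_content,
    doc_id=item.name,
    extra_info={"filename": item.name},  # inherited by every node chunked from this doc
)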
But the actual text itself that its pulling out as a citation...
sorry, which citation are you talking about here? The new citation query engine? Something else?
Old one, the get formatted sources function if I remember
Ohhhh yea, that function just looks at response.source_nodes and nicely formats them. But it only includes the text and node doc_id when you call that.

response.source_nodes itself has more info if you manually inspect the list
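e.g. a quick way to poke at it, roughly:

# response.source_nodes is a list of NodeWithScore objects
for source in response.source_nodes:
    print(source.node.get_text())  # the cited text chunk
    print(source.score)            # similarity score, if present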