How to use the latest ObjectIndex for working with a vector store (PGVector) instead of the default VectorStoreIndex

How can I use the latest ObjectIndex for working with a vector store (PGVector) instead of the default VectorStoreIndex, so I can retrieve data from an index already created in my PGVector table?

Plain Text
self.vector_store = PGVectorStore.from_params(
    database=DATABASE_NAME,
    host=DATABASE_HOST,
    password=DATABASE_PASSWORD,
    port=DATABASE_PORT,
    user=DATABASE_USER,
    table_name=self.collection_table_name)

self.obj_index = ObjectIndex.from_objects(
    self.table_schema_objs,
    self.table_node_mapping,
    VectorStoreIndex)

self.db_query_engine = SQLTableRetrieverQueryEngine(
    self.sql_database,
    self.obj_index.as_retriever(similarity_top_k=3))

So far I have my code like this, but I'm not able to use my vector_store inside ObjectIndex.from_objects(). How should I approach this? My use case: I'm creating embeddings for an entire database (100+ tables) and storing them in a pgvector table. Now I just want to use that table to retrieve the top 3 results. Any help here would be greatly appreciated. I'm following this example: https://gpt-index.readthedocs.io/en/latest/examples/index_structs/struct_indices/SQLIndexDemo-ManyTables.html
I thiiiiiink it's possible, after looking at the source code

Since you have the vector_store set up, you just need to pass it into a storage context and pass that in

Plain Text
from llama_index import StorageContext

vector_store = ...
storage_context = StorageContext.from_defaults(vector_store=vector_store)

obj_index = ObjectIndex.from_objects(
    self.table_schema_objs,
    self.table_node_mapping,
    VectorStoreIndex,
    storage_context=storage_context,
)
Let me try that and see
Hi @Logan M, I tried that, but every time I run db_query_engine.query(question), it generates the embeddings again instead of just searching the table I named in PGVectorStore(). Here is my full code:

Plain Text
self.llm = ChatOpenAI(temperature=temperature, model=model_name)
self.engine = create_engine(db_uri)
self.sql_database = SQLDatabase(self.engine)

self.metadata_obj = MetaData()
self.metadata_obj.reflect(bind=self.engine)
all_table_names = list(self.metadata_obj.tables.keys())

self.table_node_mapping = SQLTableNodeMapping(self.sql_database)
self.table_schema_objs = []
for table_name in all_table_names:
    self.table_schema_objs.append(SQLTableSchema(table_name=table_name))
logger.info("sql_database created: %s", self.sql_database)

self.collection_table_name = f"user_{user_id}_dbconn_{connection_id}"
logger.info("collection_table_name created: %s", self.collection_table_name)

self.vector_store = PGVectorStore.from_params(
    database=DATABASE_NAME,
    host=DATABASE_HOST,
    password=DATABASE_PASSWORD,
    port=DATABASE_PORT,
    user=DATABASE_USER,
    table_name=self.collection_table_name)

self.storage_context = StorageContext.from_defaults(vector_store=self.vector_store)

# self.vector_store_index = VectorStoreIndex.from_vector_store(vector_store=self.vector_store)

self.obj_index = ObjectIndex.from_objects(
    self.table_schema_objs,
    self.table_node_mapping,
    VectorStoreIndex,
    storage_context=self.storage_context)

self.db_query_engine = SQLTableRetrieverQueryEngine(
    self.sql_database,
    self.obj_index.as_retriever(similarity_top_k=3),
)
My objective here is just to retrieve the top 3 results from the already-created vector collection table in my PGVectorStore for a specific query. One thing I'm confused about: do I have to create and pass the variables self.table_schema_objs and self.table_node_mapping to ObjectIndex.from_objects()? It's taking a lot of time for a database with 300+ tables. Can you please provide any guidance here?
I think they are currently needed to create/load the index, yes. But normally in an app you would only do this once at startup and keep it all in memory
ohhh well i see one issue actually
Once you've initially created the object index, you'll need to avoid using from_objects() to re-create it

Instead, you can instantiate it directly

Plain Text
index = VectorStoreIndex.from_vector_store(vector_store)
object_index = ObjectIndex(index, table_node_mapping)
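Putting the two phases together, here's a minimal sketch of that pattern, assuming the 0.6.x-era API discussed in this thread; the helper names build_object_index / load_object_index are illustrative, not part of llama_index:

Plain Text
from llama_index import StorageContext, VectorStoreIndex
from llama_index.objects import ObjectIndex, SQLTableNodeMapping, SQLTableSchema

def build_object_index(sql_database, table_names, vector_store):
    # First run only: embeds every table schema and persists it to pgvector.
    table_node_mapping = SQLTableNodeMapping(sql_database)
    table_schema_objs = [SQLTableSchema(table_name=t) for t in table_names]
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    return ObjectIndex.from_objects(
        table_schema_objs,
        table_node_mapping,
        VectorStoreIndex,
        storage_context=storage_context,
    )

def load_object_index(sql_database, vector_store):
    # Later runs: reconnect to the existing pgvector table, no re-embedding.
    table_node_mapping = SQLTableNodeMapping(sql_database)
    index = VectorStoreIndex.from_vector_store(vector_store)
    return ObjectIndex(index, table_node_mapping)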
Hi @Logan M, I made the changes exactly as you mentioned above to query from the existing pgvector collection table, but now I'm getting an error. Here is my code:

Plain Text
DATABASE_NAME = os.environ.get('DATABASE_NAME')
DATABASE_USER = os.environ.get('DATABASE_USER')
DATABASE_HOST = os.environ.get('DATABASE_HOST')
DATABASE_PORT = os.environ.get('DATABASE_PORT')
DATABASE_PASSWORD = os.environ.get('DATABASE_PASSWORD')
DATABASE_URL2 = 'mysql+pymysql://magento-ai-user2:23RolsAj3c8R2@68.183.84.131/magento_ai_db1'

engine = create_engine(DATABASE_URL2)
sql_database = SQLDatabase(engine)

user_id = 5
connection_id = 59
collection_table_name = f"user_{user_id}_dbconn_{connection_id}"

table_node_mapping1 = SQLTableNodeMapping(sql_database)

vector_store = PGVectorStore.from_params(
    database=DATABASE_NAME,
    host=DATABASE_HOST,
    password=DATABASE_PASSWORD,
    port=DATABASE_PORT,
    user=DATABASE_USER,
    table_name=collection_table_name)

vector_store_index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
object_index = ObjectIndex(vector_store_index, table_node_mapping1)

# query_engine = index.as_query_engine()
query_engine = SQLTableRetrieverQueryEngine(
    sql_database,
    object_index.as_retriever(similarity_top_k=3),
)

response = query_engine.query("Give me all the tables in the database")
And this is the error I get, failing on node.node_info["name"]:

Plain Text
KeyError                                  Traceback (most recent call last)
Cell In[322], line 30
     23 # query_engine = index.as_query_engine()
     25 query_engine = SQLTableRetrieverQueryEngine(
     26     sql_database,
     27     object_index.as_retriever(similarity_top_k=3),
     28 )
---> 30 response = query_engine.query("Give me all the tables in the database")

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/llama_index/indices/query/base.py:23, in BaseQueryEngine.query(self, str_or_query_bundle)
     21 if isinstance(str_or_query_bundle, str):
     22     str_or_query_bundle = QueryBundle(str_or_query_bundle)
---> 23 response = self._query(str_or_query_bundle)
     24 return response

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/llama_index/token_counter/token_counter.py:78, in llm_token_counter.<locals>.wrap.<locals>.wrapped_llm_predict(_self, *args, **kwargs)
     76 def wrapped_llm_predict(_self: Any, *args: Any, **kwargs: Any) -> Any:
     77     with wrapper_logic(_self):
---> 78         f_return_val = f(_self, *args, **kwargs)
     80     return f_return_val

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/llama_index/indices/struct_store/sql_query.py:260, in BaseSQLTableQueryEngine._query(self, query_bundle)
    257 @llm_token_counter("query")
    258 def _query(self, query_bundle: QueryBundle) -> Response:
    ...
     61 return SQLTableSchema(
---> 62     table_name=node.node_info["name"], context_str=node.node_info["context"]
     63 )

KeyError: 'name'
@Logan M Can you please help me here?
So, the old version of llama-index did not store the node-info stuff in the vector db, which is why it's failing with a key error on node info

Looking at the code, the latest version of llama_index should remedy this, if you upgrade and create the db again πŸ™

Apologies for the trouble here, but we are in uncharted territory haha
Okay, let me upgrade llama_index and try again
@Logan M I tried with the latest version, v0.6.35. Now I'm facing this error when initiating the PGVectorStore with the table:

Plain Text
InvalidRequestError                       Traceback (most recent call last)
Cell In[7], line 25
     22 for table_name in all_table_names:
     23     table_schema_objs.append(SQLTableSchema(table_name=table_name))
---> 25 vector_store = PGVectorStore.from_params(
     26     database=DATABASE_NAME,
     27     host=DATABASE_HOST,
     28     password=DATABASE_PASSWORD,
     29     port=DATABASE_PORT,
     30     user=DATABASE_USER,
     31     table_name=collection_table_name)
     33 storage_context = StorageContext.from_defaults(vector_store=vector_store)
     35 obj_index = ObjectIndex.from_objects(
     36     table_schema_objs,
     37     table_node_mapping,
     38     storage_context=storage_context
     39 )

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/llama_index/vector_stores/postgres.py:80, in PGVectorStore.from_params(cls, host, port, database, user, password, table_name)
     78 """Return connection string from database parameters."""
     79 conn_str = f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"
---> 80 return cls(connection_string=conn_str, table_name=table_name)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/llama_index/vector_stores/postgres.py:60, in PGVectorStore.__init__(self, connection_string, table_name)
...
   1512 _undefer_column_name(
   1513     k, self.column_copies.get(value, value)  # type: ignore
   1514 )

InvalidRequestError: Attribute name 'metadata' is reserved when using the Declarative API.
ha, right, just merged/fixed that a few hours ago πŸ˜‰ Didn't make it to a pypi release yet

do pip install --upgrade git+https://github.com/jerryjliu/llama_index
Logan, even after pip3-upgrading from the git repo above, I don't see any version increase. It's still "0.6.35". Is it supposed to be like that? And I'm still facing the error: InvalidRequestError: Attribute name 'metadata' is reserved when using the Declarative API
The version will technically be the same, but it should have installed the latest version from source πŸ€”

It's no longer using the metadata name -- now it is metadata_ πŸ€” https://github.com/jerryjliu/llama_index/blob/a28402bbd928a0fed4b912490950d9f4af65fde3/llama_index/vector_stores/postgres.py#L25C8-L25C33

Not sure what happened during the install, maybe uninstall and re-install from git?
If you can't get it working, we can monkey patch the file manually lol
Monkey patch guide:

pip show llama-index will show where llama-index is installed

Plain Text
(llama-index) loganm@gamingpc:~/llama_index_proper/llama_index/examples/docstore$ pip show llama-index
Name: llama-index
Version: 0.6.35
Summary: Interface between LLMs and your data
Home-page: https://github.com/jerryjliu/llama_index
Author: Jerry Liu
Author-email: 
License: MIT
Location: /home/loganm/miniconda3/envs/llama-index/lib/python3.11/site-packages
Requires: beautifulsoup4, dataclasses-json, fsspec, langchain, numpy, openai, pandas, sqlalchemy, tenacity, tiktoken, typing-extensions, typing-inspect, urllib3
Required-by: llama-agi, llama-hub


From the location there, we can edit the source file directly. For me, the file path is /home/loganm/miniconda3/envs/llama-index/lib/python3.11/site-packages/llama_index/vector_stores/postgres.py

In that file, you'll want to change:

line 25 to say metadata_ = Column(JSON)

line 115 to say metadata_=node_to_metadata_dict(

line 146 to say node = metadata_dict_to_node(item.metadata_)

line 153 to say metadata=item.metadata_,
I just realized the line 115 was not fixed in my PR, oof
doing that now
Okay, should I wait until you fix the issue with line 115, or can I go ahead and follow the steps you mentioned above?
I just merged a fix, maybe try the pip install upgrade from git again πŸ™‚
if that doesn't work, then yea at least inspect the file to see if it has those changes... maybe there's something else missing if it still doesn't work πŸ€”
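If you want to check programmatically instead of opening the file, here's a quick sketch using only the standard library -- it prints where the installed module actually lives, and whether the renamed metadata_ column is in the source on disk:

Plain Text
import inspect
import llama_index.vector_stores.postgres as pg

# Where pip actually installed the module
print(inspect.getsourcefile(pg))

# True if the source on disk already uses the renamed column
print("metadata_" in inspect.getsource(pg))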
Okay, let me reinstall and see
Hi @Logan M, I tried it. The metadata issue is resolved, but now there's a different issue with the ObjectIndex creation. Here is the error I'm running into:

Plain Text
TypeError                                 Traceback (most recent call last)
Cell In[13], line 35
     25 vector_store = PGVectorStore.from_params(
     26     database=DATABASE_NAME,
     27     host=DATABASE_HOST,
    (...)
     30     user=DATABASE_USER,
     31     table_name=collection_table_name)
     33 storage_context = StorageContext.from_defaults(vector_store=vector_store)
---> 35 obj_index = ObjectIndex.from_objects(
     36     table_schema_objs,
     37     table_node_mapping,
     38     storage_context=storage_context
     39 )
     41 query_engine = SQLTableRetrieverQueryEngine(
     42     sql_database,
     43     obj_index.as_retriever(similarity_top_k=1),
     44 )
     46 # # llm = OpenAI(temperature=0)
     47 # db_chain = SQLDatabaseChain(llm=llm, database=sql_database, return_intermediate_steps=True, verbose=True, top_k=200)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/llama_index/objects/base.py:51, in ObjectIndex.from_objects(cls, objects, object_mapping, index_cls, **index_kwargs)
     49 object_mapping = SimpleObjectNodeMapping.from_objects(objects)
     50 nodes = object_mapping.to_nodes(objects)
...
   2135     "%r is an invalid keyword argument for %s" % (k, cls_.__name__)
   2136 )
   2137 setattr(self, k, kwargs[k])

TypeError: 'id_' is an invalid keyword argument for Datauser_default_user_dbconn_default_connection

Any idea why?
Here, user_default_user_dbconn_default_connection is the collection table name I gave to the PGVectorStore.
Ugh, imma have to spin up postgres locally and figure this out. Too much back and forth lol

Give me a sec
The traceback is cut off, but let's see if I can reproduce
I know, I'm sorry for the trouble. Did you get a chance to reproduce the issue, @Logan M?
Took way too long to get postgres setup lol and also got distracted

But now I've managed to test, figure out the bug, and am about to commit a fix πŸ™‚
aaaand just merged it, phew
Perfect, let me reinstall and try it again.
@Logan M Thanks a ton for your quick fixes. Appreciate it very much πŸ™‚
I have two quick questions:
1. Is there a way to add custom prompts (a few-shot examples) to the given example SQL Index guides?
2. The extra_info option is not working to extract the sql_query from the response of the SQL index (response = query_engine.query("Give me all the tables in the database"); response.extra_info). Any idea why?
One way to get the sql_query from the response is this, as it's available in metadata: response.metadata['sql_query']
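For example (a minimal sketch, reusing the query_engine from above):

Plain Text
response = query_engine.query("Give me all the tables in the database")
print(response.response)               # the natural-language answer
print(response.metadata["sql_query"])  # the SQL that was generated and run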
Yes it is response.metadata now actually. Is there some stale documentation somewhere?
And you want to add few shot examples to the prompt when generating text to sql?
Also, @Logan M, I think there is an issue with PGVectorStore: as_retriever(similarity_top_k=...) is no longer working with the object_index. Here is the error:

Plain Text
File "/app/langchain_ai.py", line 70, in __init__
    self.object_index.as_retriever(similarity_top_k=3),
File "/usr/local/lib/python3.10/site-packages/llama_index/objects/base.py", line 61, in as_retriever
    retriever=self._index.as_retriever(**kwargs),
AttributeError: 'PGVectorStore' object has no attribute 'as_retriever'
I think there were some changes to PGVectorStore an hour ago that might be causing this. Here is how I'm using it:

Plain Text
self.vector_store = PGVectorStore.from_params(
    database=DATABASE_NAME,
    host=DATABASE_HOST,
    password=DATABASE_PASSWORD,
    port=DATABASE_PORT,
    user=DATABASE_USER,
    table_name=self.collection_table_name)

self.table_node_mapping = SQLTableNodeMapping(self.sql_database)
self.object_index = ObjectIndex(self.vector_store, self.table_node_mapping)

self.db_query_engine = SQLTableRetrieverQueryEngine(
    self.sql_database,
    self.object_index.as_retriever(similarity_top_k=3),
)

How can I fix this?
You have an error in your usage
You need to create a vector index with the vector store, and pass the index into the object index
Right now you are passing in the bare vector store, hence the error
I think there's an example in this thread actually lol
I will find it
That message πŸ‘
Hi @Logan M, yes, you are right. I was able to solve the problem. And do you have any insight on customizing the prompt with a few-shot example when generating text to SQL?
yea, you can pass in the template here: sql_index.as_query_engine(text_to_sql_prompt=my_prompt)

The default prompt is here for guidance
https://github.com/jerryjliu/llama_index/blob/b7697c006499f7e48e3164152092c07ba0327aa9/llama_index/prompts/default_prompts.py#L171-L198
What is the sql_index here? Can you point out where I should add text_to_sql_prompt=my_prompt in my code below?

Plain Text
engine = create_engine(db_uri)
sql_database = SQLDatabase(engine)

collection_table_name = f"user_{user_id}_dbconn_{connection_id}"
table_node_mapping1 = SQLTableNodeMapping(sql_database)

vector_store = PGVectorStore.from_params(
    database=DATABASE_NAME,
    host=DATABASE_HOST,
    password=DATABASE_PASSWORD,
    port=DATABASE_PORT,
    user=DATABASE_USER,
    table_name=collection_table_name)

vector_store_index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
object_index = ObjectIndex(vector_store_index, table_node_mapping1)

query_engine = SQLTableRetrieverQueryEngine(
    sql_database,
    object_index.as_retriever(similarity_top_k=3),
)

response = query_engine.query("Give me all the tables in the database")
Plain Text
query_engine = SQLTableRetrieverQueryEngine(
                sql_database,
                object_index.as_retriever(similarity_top_k=3),
                text_to_sql_prompt=my_prompt
            )
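And for building my_prompt itself, here's a rough sketch of a few-shot template. It assumes the same {dialect}, {schema}, and {query_str} variables as the default prompt linked above, and the example question/SQL pair is a placeholder you'd swap for ones matching your own schema. Depending on your exact release, the class may be Prompt from llama_index or TextToSQLPrompt from llama_index.prompts.prompts:

Plain Text
from llama_index import Prompt

# Hypothetical few-shot text-to-SQL template; mirrors the default prompt's
# {dialect}, {schema}, and {query_str} placeholders.
my_prompt = Prompt(
    "Given an input question, write a syntactically correct {dialect} query.\n"
    "Only use tables and columns listed in the schema below.\n\n"
    "{schema}\n\n"
    "Example:\n"
    "Question: How many orders were placed in the last 7 days?\n"
    "SQLQuery: SELECT COUNT(*) FROM orders WHERE created_at >= NOW() - INTERVAL '7 days';\n\n"
    "Question: {query_str}\n"
    "SQLQuery: "
)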
Super. Thanks!!
Also, I'm very sorry for asking all these primitive questions. I'm having a hard time figuring out a way to quickly create table_node_mapping = SQLTableNodeMapping(sql_database) for large databases. The use case I'm working on should allow creating multiple database connections and switching between them to ask questions in plain English. Currently, if the database has 300+ tables, it takes a very long time to get an answer, because the table node mapping creation itself takes a lot of time. Do you have any ideas on how to approach this problem?
@Logan M Would adding multi-threading/parallel processing help here? Or do you have any workaround in mind?