Can anyone offer guidance in efficiently querying massive SQL databases?

At a glance
Can anyone offer guidance in efficiently querying massive SQL databases? Is a viable option to create an index over the entire schema of the database or to instead create indices over individual or groups of tables? I can't find any good tutorials on LlamaIndex for SQL. Any help would be appreciated
11 comments
I'm not really a SQL expert, but how massive is massive?

If it's 1000s of tables, you'll definitely need a SQL index over a few tables at a time (the prompt to the LLM contains the schema of each table)
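
Not from this thread, but as a rough sketch of the "SQL index over a few tables at a time" idea with LlamaIndex (import paths assume a recent llama_index release; the engine URL and table names are placeholders):

Python
# Sketch only: expose just a few tables so their schemas fit in the LLM prompt.
from sqlalchemy import create_engine
from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine

engine = create_engine("sqlite:///example.db")  # swap in your own connection URL

# Only these tables are reflected, and only their schemas are sent to the LLM
sql_database = SQLDatabase(engine, include_tables=["orders", "customers"])

query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["orders", "customers"],
)
response = query_engine.query("How many orders did each customer place last month?")
print(response)
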
Hey @Logan M , thank you so much for getting back to me!
Let's say 100s to 1000s of tables.

My thinking was to create indices over the schemas of groups of tables, then, given a user question, query those indices to decide which datasets and tables to target, and finally use a LangChain agent to execute queries against those tables.

Would you be able to suggest a general framework or approach to querying large structured databases? I really can't find any tutorials on this.
Yea I don't think I've seen anyone (publicly) tackle a problem with this many tables lol

With 100s to 1000s of tables, yea, that's never going to fit into a single LLM prompt for now

I would (if possible) create a vector index (or keyword index) of every table (using either the table schema, or a table summary/description, or both) and then use that index with the user query as input to narrow down the search to a handful of tables
Once you have a handful of candidate tables, you can narrow the SQL index query down to only look at those tables
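
That two-stage approach maps roughly onto LlamaIndex's object index over table schemas plus a table-retriever query engine. A hedged sketch, assuming a recent llama_index version and made-up table descriptions:

Python
# Sketch only: retrieve a handful of relevant tables first, then run text-to-SQL.
from llama_index.core import SQLDatabase, VectorStoreIndex
from llama_index.core.objects import ObjectIndex, SQLTableNodeMapping, SQLTableSchema
from llama_index.core.indices.struct_store.sql_query import SQLTableRetrieverQueryEngine

sql_database = SQLDatabase(engine)  # engine built as in the snippet above

# One schema object per table, optionally enriched with a summary/description
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
    SQLTableSchema(table_name=name, context_str=f"Holds data about {name}")
    for name in sql_database.get_usable_table_names()
]

# Vector index over the table schemas/descriptions
obj_index = ObjectIndex.from_objects(
    table_schema_objs,
    table_node_mapping,
    VectorStoreIndex,
)

# At query time, pull only the top-k candidate tables into the SQL prompt
query_engine = SQLTableRetrieverQueryEngine(
    sql_database,
    obj_index.as_retriever(similarity_top_k=3),
)
response = query_engine.query("Which product category had the highest revenue?")
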
Thank you that's really helpful.
An additional challenge I'm facing, which I'd be curious to hear if you have any insight into, is the following:
I am using BigQuery, and if I make an SQLAlchemy connection to a specific dataset as follows, the connection is established quickly with no problem:
Python
sqlalchemy_url = f"bigquery://{self.project}/{self.database}?credentials_path={self.service_account_file}"


But the LangChain SQL agent only seems to look at the tables in that specific dataset, and I have many datasets. Even when I explicitly tell the LLM the name of a different dataset I want it to query, it keeps looking only at the dataset in the connection URL.

I've also tried to leave the dataset out of the URL like this:
Python
sqlalchemy_url = f"bigquery://{self.project}?credentials_path={self.service_account_file}"

But just making the connection seems to take 12 minutes, I think because it is now fetching metadata for every table across the whole project.

This has me so stumped dude xD Thanks for your help, no pressure to keep assisting though, you're a legend already!
Oooo that's an interesting problem. Tbh I've never used BigQuery, so I'm not entirely sure what the right solution is here πŸ€”

Maybe it helps, but in the SQLDatabase constructor you can specify include_tables, i.e. SQLDatabase(engine, include_tables=['one', ...]), although I'm not sure whether that will improve the 12-minute loading time or not πŸ€”
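
For what it's worth, a speculative sketch of combining that with a dataset-scoped connection: the schema argument (the BigQuery dataset) plus include_tables should limit what gets reflected, though whether it actually cures the 12-minute hit likely depends on the sqlalchemy-bigquery version. All names below are placeholders:

Python
# Speculative sketch: scope reflection to one dataset and a few tables.
from sqlalchemy import create_engine
from llama_index.core import SQLDatabase

project = "my-gcp-project"                    # placeholder
service_account_file = "/path/to/creds.json"  # placeholder

engine = create_engine(f"bigquery://{project}?credentials_path={service_account_file}")

sql_database = SQLDatabase(
    engine,
    schema="analytics",                     # the BigQuery dataset to target
    include_tables=["events", "sessions"],  # only reflect/expose these tables
)
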
Hey man, sorry for the late reply; I've just been working on some stuff for my Master's research.

I appreciate that, thank you! Perhaps there is a way I could cache the metadata collected? Will also definitely investigate some ways I could cleverly include specific tables. πŸ˜„
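
On the caching idea, one possible direction (purely a sketch, and version-dependent) is to reflect only the tables you care about into a SQLAlchemy MetaData object up front, then pass that pre-built metadata to SQLDatabase; the metadata argument exists on the constructor, but whether it really skips the slow re-reflection is worth verifying:

Python
# Speculative sketch of pre-reflecting (and effectively caching) table metadata.
from sqlalchemy import MetaData, create_engine
from llama_index.core import SQLDatabase

project = "my-gcp-project"                    # placeholder
service_account_file = "/path/to/creds.json"  # placeholder

engine = create_engine(f"bigquery://{project}?credentials_path={service_account_file}")

# Reflect only the tables of interest, once, up front
metadata = MetaData()
metadata.reflect(bind=engine, schema="analytics", only=["events", "sessions"])

# Hand the pre-reflected metadata to SQLDatabase so it (hopefully) skips a full crawl
sql_database = SQLDatabase(
    engine,
    schema="analytics",
    metadata=metadata,
    include_tables=["events", "sessions"],
)
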
I think SQLAlchemy needs to be ditched at some point. Not sure how tightly coupled the db readers are with the LangChain db readers these days, but I had Snowflake working via the native Python connector a while back (couldn't get Snowpark working because of conda stuff). Think I overwrote that code in a git pull though
But the performance difference between SQLAlchemy and the native Python connector, at least for Snowflake, is drastic
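
For reference, the native-connector path that comparison refers to looks roughly like this (connection parameters are placeholders, not anything from this thread):

Python
# Bare-bones query through Snowflake's native Python connector.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",       # placeholder connection details
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM my_table")
    print(cur.fetchone())
finally:
    conn.close()
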
If they're not tightly coupled, I can try to rebuild Snowflake's again
Once Snowpark Python 3.9 is out, it should be much, much easier to work with (right now 3.8 is GA, 3.9-3.10 are in preview): https://docs.snowflake.com/release-notes/2023-06#support-for-python-3-10-in-snowpark-udfs-udtfs-and-stored-procedures-preview