index.json file only numbers, not real data

At a glance

hey guys 😄 i have managed to use the DatabaseReader loader and create an index.json file from my SQL db data. but the content of the json is just numbers and guid. see example file.

i have used the implementation in this demo: https://github.com/jerryjliu/gpt_index/blob/main/examples/data_connectors/DatabaseReaderDemo.ipynb

when i print the "texts" object, the real data is there. but how do I convert this real data to the index.json file?

23 comments

llars

anyone know?

jjerryjliu0

@lars what index are you using?

llars

same as in the demo, GPTSimpleVectorIndex

jjerryjliu0

I see. Could you share your code on how you're using DatabaseReader and the vector index?

llars

Initialize DatabaseReader object with the following parameters:

db = DatabaseReader(
engine = engine,
scheme = "mssql", # Database Scheme
host = "host", # Database Host
port = "1433", # Database Port
user = "user", # Database User
password = "password, # Database Password
dbname = "dbname", # Database Name
)

query = f"""
select TOP(1) *
from Races
"""

with engine.connect() as conn:
result = conn.execute(query)
rows = result.fetchall()

Convert the rows to a list of dictionaries

data = [dict(row) for row in rows]
print("__data: __", data)
print("\n")

Please refer to llama_index.langchain_helpers.sql_wrapper

SQLDatabase.run_sql method

texts = db.sql_database.run_sql(command = query)
print("__Texts: __", texts)

documents = db.loaddata(query = query)print("__documents created__")

Documents must return a list of Document objects

print(documents)

index = GPTSimpleVectorIndex(documents)
print("__index created__")

Save newly created Index to disk

index.save_to_disk('./indexes/takethelead_races.json')

print("index saved to disk")

llars

here you go. I have confirmed that I have connected with the db, when i print the ‘texts’ objects, it prints the real data I want in the json file

jjerryjliu0

if you try printing the contents of the documents, are the texts there?

llars

__data: __ [{'Id': '9C2F3087-31B7-4460-AB98-08DB12008EE8', 'RaceCode': 'FO_20221130_1', 'StartTime': datetime.datetime(2022, 11, 30, 17, 5), 'RaceNumber': 1, 'RaceName': 'BREDDEPREMIERINGSLØP', 'Distance': 2040, 'StartMethod': 'Volte', 'TrackName': 'Forus Travbane', 'OverflyddMap': None, 'Prize': None, 'Leader': None}]

__Texts: __ ("[('9C2F3087-31B7-4460-AB98-08DB12008EE8', 'FO_20221130_1', datetime.datetime(2022, 11, 30, 17, 5), 1, 'BREDDEPREMIERINGSLØP', 2040, 'Volte', 'Forus Travbane', None, None, None)]", {'result': [('9C2F3087-31B7-4460-AB98-08DB12008EE8', 'FO_202211301', datetime.datetime(2022, 11, 30, 17, 5), 1, 'BREDDEPREMIERINGSLØP', 2040, 'Volte', 'Forus Travbane', None, None, None)]})__documents created__

__documents type__: <class 'list'>

__document__: [Document(text='9C2F3087-31B7-4460-AB98-08DB12008EE8', doc_id='e13a08ab-8583-4eea-8fad-1f960ddb1953', embedding=None, extra_info=None)]

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens

[build_index_from_documents] Total LLM token usage: 0 tokens

INFO:root:> [build_index_from_documents] Total embedding token usage: 23 tokens

[build_index_from_documents] Total embedding token usage: 23 tokens

__index created__

___index saved to disk__

llars

this is the output of print(documents):

[Document(text='9C2F3087-31B7-4460-AB98-08DB12008EE8', doc_id='e13a08ab-8583-4eea-8fad-1f960ddb1953', embedding=None, extra_info=None)]

i also pasted the other outputs in the message above. so the answer is no, the only this that is in documents is the Id GUID in the actual data: 9C2F3087-31B7-4460-AB98-08DB12008EE8

jjerryjliu0

i see...this may be an issue with the loader

jjerryjliu0

the document text should reflect the actual content of what you want to retrieve

llars

okey. in the meantime, i could try to manually convert the texts object to a list of Document objects. also, is there a way to support you in all this? really appriciate all your work 😊 should i open an issue on GitHub with this inofrmation?

jjerryjliu0

np @lars ! thanks for the offer.

The only thing here would be to submit a PR helping to fix the DatabaseReader but only if you want to / have time 🙂

llars

@jerryjliu0 alright, i can try, but I’m pretty fresh in Python. Either way it doesn’t seem like there is something wrong with the loader since I manage to console the real data. And the docs for GPTSimpleVectorIndex says that the embedding numbers should be in the index.json file. So I think the issue is that only the Id from the real data happens to get saved to the index.json file. I can try to debug why that happens

jjerryjliu0

@lars since your print(documents) showed that the Document only contained a hash instead of the actual SQL row content, seems like something to look into re: DatabaseReader

jjerryjliu0

if i have time i'll take a look today

llars

@jerryjliu0 hey again! you were right! i fixed the code 😄 now the nodes_dict in the json file look like this:

"nodes_dict": {
"1543447146048749966": {
"text": "9C2F3087-31B7-4460-AB98-08DB12008EE8, FO_20221130_1, 2022-11-30 17:05:00, 1, BREDDEPREMIERINGSLØP, 2040, Volte, Forus Travbane, None, None, None",
"doc_id": "e4d1550e-ee44-4a3a-8515-dd91d956cc8b",
"embedding": null,
"doc_hash": "18d445990197ab438c194ea203e2c2655f74842791eb6d8fb27e01a2ddefe530",
"extra_info": null,
"index": 0,
"child_indices": [],
"ref_doc_id": "7ff39415-3c58-4064-907b-9f476691e765",
"node_info": {
"start": 0,
"end": 144
}
},
"8320616532129171895": {
"text": "BA07B479-175D-4CBC-AB99-08DB12008EE8, FO_20221130_2, 2022-11-30 17:25:00, 2, BREDDEPREMIERINGSLØP, 2040, Volte, Forus Travbane, None, None, None",
"doc_id": "0087ea5f-2f11-43c1-85fa-983334e9d2cd",
"embedding": null,
"doc_hash": "d09e555835daa1577b2448a227ff7ac452d69901f1ad6dde8ecae6a8cd65ce76",
"extra_info": null,
"index": 0,
"child_indices": [],
"ref_doc_id": "85c82a4f-5615-40c9-9974-02847d67ba91",
"node_info": {
"start": 0,
"end": 144
}
},

as you see, the "text" now consist of all the values of the columns in a db table row, not just the first column in the row.

llars

@jerryjliu0 i changed line 94 in readers/database.py from

this: documents.append(Document(item[0]))

to this:
document_values = ", ".join(str(i) for i in item)
documents.append(Document(document_values))

then all the values will be displayed comma separated.

i did not have permission to create a branch in your project, so I do not know how to make a PR without permission. If I got permission (or another way) I would gladly make a PR if comma separated is preferable 😄

jjerryjliu0

amazing, thanks for helping to suggest the fix! will put it out soon 🙂

jjerryjliu0

https://github.com/jerryjliu/llama_index/pull/728

llars

fantastic! thanks for quick response 🙂

llars

@jerryjliu0 when will this be released?

jjerryjliu0

i think it's already released! 😉 we haven't pubbed 0.4.28 yet though

Add a reply