Find answers from the community

Updated 2 years ago

index.json file only numbers, not real data

At a glance
hey guys 😄 i have managed to use the DatabaseReader loader and create an index.json file from my SQL db data. but the content of the json is just numbers and guid. see example file.

i have used the implementation in this demo: https://github.com/jerryjliu/gpt_index/blob/main/examples/data_connectors/DatabaseReaderDemo.ipynb

when i print the "texts" object, the real data is there. but how do I convert this real data to the index.json file?
l
j
23 comments
anyone know?
@lars what index are you using?
same as in the demo, GPTSimpleVectorIndex
I see. Could you share your code on how you're using DatabaseReader and the vector index?

Initialize DatabaseReader object with the following parameters:

db = DatabaseReader(
engine = engine,
scheme = "mssql", # Database Scheme
host = "host", # Database Host
port = "1433", # Database Port
user = "user", # Database User
password = "password, # Database Password
dbname = "dbname", # Database Name
)

query = f"""
select TOP(1) *
from Races
"""

with engine.connect() as conn:
result = conn.execute(query)
rows = result.fetchall()

Convert the rows to a list of dictionaries

data = [dict(row) for row in rows]
print("__data: __", data)
print("\n")


Please refer to llama_index.langchain_helpers.sql_wrapper

SQLDatabase.run_sql method

texts = db.sql_database.run_sql(command = query)
print("__Texts: __", texts)

documents = db.loaddata(query = query)print("__documents created__")

Documents must return a list of Document objects

print(documents)

index = GPTSimpleVectorIndex(documents)
print("__index created__")

Save newly created Index to disk

index.save_to_disk('./indexes/takethelead_races.json')

print("index saved to disk")
here you go. I have confirmed that I have connected with the db, when i print the ‘texts’ objects, it prints the real data I want in the json file
if you try printing the contents of the documents, are the texts there?
__data: __ [{'Id': '9C2F3087-31B7-4460-AB98-08DB12008EE8', 'RaceCode': 'FO_20221130_1', 'StartTime': datetime.datetime(2022, 11, 30, 17, 5), 'RaceNumber': 1, 'RaceName': 'BREDDEPREMIERINGSLØP', 'Distance': 2040, 'StartMethod': 'Volte', 'TrackName': 'Forus Travbane', 'OverflyddMap': None, 'Prize': None, 'Leader': None}]


__Texts: __ ("[('9C2F3087-31B7-4460-AB98-08DB12008EE8', 'FO_20221130_1', datetime.datetime(2022, 11, 30, 17, 5), 1, 'BREDDEPREMIERINGSLØP', 2040, 'Volte', 'Forus Travbane', None, None, None)]", {'result': [('9C2F3087-31B7-4460-AB98-08DB12008EE8', 'FO_202211301', datetime.datetime(2022, 11, 30, 17, 5), 1, 'BREDDEPREMIERINGSLØP', 2040, 'Volte', 'Forus Travbane', None, None, None)]})__documents created__

__documents type__: <class 'list'>


__document__: [Document(text='9C2F3087-31B7-4460-AB98-08DB12008EE8', doc_id='e13a08ab-8583-4eea-8fad-1f960ddb1953', embedding=None, extra_info=None)]


INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
[build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 23 tokens
[build_index_from_documents] Total embedding token usage: 23 tokens
__index created__

___index saved to disk__
this is the output of print(documents):

[Document(text='9C2F3087-31B7-4460-AB98-08DB12008EE8', doc_id='e13a08ab-8583-4eea-8fad-1f960ddb1953', embedding=None, extra_info=None)]

i also pasted the other outputs in the message above. so the answer is no, the only this that is in documents is the Id GUID in the actual data: 9C2F3087-31B7-4460-AB98-08DB12008EE8
i see...this may be an issue with the loader
the document text should reflect the actual content of what you want to retrieve
okey. in the meantime, i could try to manually convert the texts object to a list of Document objects. also, is there a way to support you in all this? really appriciate all your work 😊 should i open an issue on GitHub with this inofrmation?
np @lars ! thanks for the offer.

The only thing here would be to submit a PR helping to fix the DatabaseReader but only if you want to / have time 🙂
@jerryjliu0 alright, i can try, but I’m pretty fresh in Python. Either way it doesn’t seem like there is something wrong with the loader since I manage to console the real data. And the docs for GPTSimpleVectorIndex says that the embedding numbers should be in the index.json file. So I think the issue is that only the Id from the real data happens to get saved to the index.json file. I can try to debug why that happens
@lars since your print(documents) showed that the Document only contained a hash instead of the actual SQL row content, seems like something to look into re: DatabaseReader
if i have time i'll take a look today
@jerryjliu0 hey again! you were right! i fixed the code 😄 now the nodes_dict in the json file look like this:

"nodes_dict": {
"1543447146048749966": {
"text": "9C2F3087-31B7-4460-AB98-08DB12008EE8, FO_20221130_1, 2022-11-30 17:05:00, 1, BREDDEPREMIERINGSLØP, 2040, Volte, Forus Travbane, None, None, None",
"doc_id": "e4d1550e-ee44-4a3a-8515-dd91d956cc8b",
"embedding": null,
"doc_hash": "18d445990197ab438c194ea203e2c2655f74842791eb6d8fb27e01a2ddefe530",
"extra_info": null,
"index": 0,
"child_indices": [],
"ref_doc_id": "7ff39415-3c58-4064-907b-9f476691e765",
"node_info": {
"start": 0,
"end": 144
}
},
"8320616532129171895": {
"text": "BA07B479-175D-4CBC-AB99-08DB12008EE8, FO_20221130_2, 2022-11-30 17:25:00, 2, BREDDEPREMIERINGSLØP, 2040, Volte, Forus Travbane, None, None, None",
"doc_id": "0087ea5f-2f11-43c1-85fa-983334e9d2cd",
"embedding": null,
"doc_hash": "d09e555835daa1577b2448a227ff7ac452d69901f1ad6dde8ecae6a8cd65ce76",
"extra_info": null,
"index": 0,
"child_indices": [],
"ref_doc_id": "85c82a4f-5615-40c9-9974-02847d67ba91",
"node_info": {
"start": 0,
"end": 144
}
},

as you see, the "text" now consist of all the values of the columns in a db table row, not just the first column in the row.
@jerryjliu0 i changed line 94 in readers/database.py from

this: documents.append(Document(item[0]))

to this:
document_values = ", ".join(str(i) for i in item)
documents.append(Document(document_values))

then all the values will be displayed comma separated.

i did not have permission to create a branch in your project, so I do not know how to make a PR without permission. If I got permission (or another way) I would gladly make a PR if comma separated is preferable 😄
amazing, thanks for helping to suggest the fix! will put it out soon 🙂
fantastic! thanks for quick response 🙂
@jerryjliu0 when will this be released?
i think it's already released! 😉 we haven't pubbed 0.4.28 yet though
Add a reply
Sign up and join the conversation on Discord