Large indexing

Hello everyone, I need some input to understand how feasible a personal project I wanted to start is. I have 10 years' worth of personal journaling and want to index it and query over it. The data is plain text (though I can structure it somehow) in a couple of files (one per journal category). I tried a TreeIndex, but due to a coding mistake after generating it I didn't manage to save it and lost it; it cost me around €14 in API tokens (~250 chunks of data). I also noticed that querying it was really expensive... does the cost of a query scale up rather quickly with the size of the index?
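For what it's worth, that kind of loss is avoidable by persisting the index right after the build. A minimal sketch against the gpt_index API of that era (save_to_disk / load_from_disk); the folder and file names are illustrative, and constructor names vary by version:

```python
from llama_index import GPTTreeIndex, SimpleDirectoryReader

# build once, then persist immediately so a later bug can't throw away the tokens spent
documents = SimpleDirectoryReader("journals").load_data()
index = GPTTreeIndex(documents)
index.save_to_disk("journal_tree_index.json")

# on subsequent runs, reload instead of paying to rebuild
index = GPTTreeIndex.load_from_disk("journal_tree_index.json")
```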
Yea, the tree index will be pretty expensive to construct.

To keep costs a little lower, I would recommend a vector index for each section/category, and then wrapping those with a list or keyword index (a sketch below)
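A rough sketch of that layout, assuming the gpt_index composability API of the time (GPTSimpleVectorIndex, GPTListIndex, ComposableGraph); the category folders and summaries are made up, and the exact ComposableGraph call differs between versions:

```python
from llama_index import GPTSimpleVectorIndex, GPTListIndex, SimpleDirectoryReader
from llama_index.composability import ComposableGraph

# one cheap-to-build vector index per journal category
work_index = GPTSimpleVectorIndex(SimpleDirectoryReader("journals/work").load_data())
life_index = GPTSimpleVectorIndex(SimpleDirectoryReader("journals/life").load_data())

# a list index on top routes a query across the category indices
graph = ComposableGraph.from_indices(
    GPTListIndex,
    [work_index, life_index],
    index_summaries=["Work journal entries", "Personal life journal entries"],
)
response = graph.query("What was on my mind in 2018?")
```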
It was still REALLY expensive. I did a simple vector index over just the last couple of years of journal (around 50 chunks), and it came to around €1 per query plus €5 to build the index. Is this normal pricing, or is it optimizable?
It really depends on the size of the text πŸ€”

Constructing a vector index should be pretty cheap. It's something like $0.0004/1k tokens (1k tokens is about 600 words or so I think)

Query cost is dependent on the top_k and how big each chunk is. I'm pretty surprised a vector index query was $1, unless you had top_k set to a really big number maybe?
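Concretely, the prompt sent to the LLM contains the top_k retrieved chunks, so shrinking that number shrinks the bill. A one-line sketch (parameter name per the old query API; the question and index are made up):

```python
# only the single most similar chunk goes into the prompt, keeping the query cheap
response = index.query("How did the move go last spring?", similarity_top_k=1)
```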
There are also logs that print out the LLM and embedding token usage.

We also have the ability to estimate the token usage before actually running it too https://gpt-index.readthedocs.io/en/latest/how_to/cost_analysis.html
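The pattern from that page, roughly: swap in MockLLMPredictor so no paid API calls happen, then read off the predicted token count (a sketch; the max_tokens value is arbitrary):

```python
from llama_index import GPTTreeIndex, MockLLMPredictor, SimpleDirectoryReader

llm_predictor = MockLLMPredictor(max_tokens=256)  # simulates the LLM, costs nothing
documents = SimpleDirectoryReader("journals").load_data()
index = GPTTreeIndex(documents, llm_predictor=llm_predictor)
print(llm_predictor.last_token_usage)  # estimated LLM tokens the real build would use
```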
maybe it was the way the data was structured? I basically picked up from where I left off with the Paul Graham example and didn't add parameters... perhaps only mode='tree_summarize', but I'm not sure. Also, I read it was $0.02/1k tokens (which more or less matches what I paid), so maybe I should switch the model?
yep, I got pretty aware of cost analysis now πŸ˜„
the MockLLMPredictor is now my best friend πŸ˜…
Hahaha yea!

Yea the LLM will be more expensive than the embedding model.

Currently, chatgpt works ok-ish and is 10x cheaper than davinci
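Switching models in that era meant wrapping a LangChain chat model in an LLMPredictor and handing it to the index. A sketch, assuming gpt-3.5-turbo (the ChatGPT model, then $0.002/1k tokens vs $0.02/1k for text-davinci-003):

```python
from langchain.chat_models import ChatOpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor, SimpleDirectoryReader

# ~10x cheaper per token than the default davinci model
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
documents = SimpleDirectoryReader("journals").load_data()
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor)
```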
Ohhhh maybe remove the tree summarize unless you are trying to make a summary
ok, so I should look into the docs on how to switch the model and won't be tree summarizing
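In practice that just means dropping the mode argument so the query falls back to the cheaper default response mode (a sketch):

```python
# default mode answers from the retrieved chunks directly;
# mode="tree_summarize" instead builds a summary tree over them, costing extra LLM calls
response = index.query("When did I start running regularly?")
```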
thanks, you've been really kind and helpful πŸ™
:dotsHARDSTYLE: