Large indexing

Hello everyone, I need some input to understand how feasible a personal project I wanted to start is. I have 10 years' worth of personal journaling and want to index it and query over it. The data is plain text (though I can structure it somehow) in a couple of files (one per journal category). I tried a TreeIndex, but due to a coding mistake after generating it I didn't manage to save it and lost it; it cost me around €14 in API tokens (~250 chunks of data). I also noticed that querying it was really expensive... does the cost of a query scale up rather quickly with the size of the index?
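For what it's worth, that kind of loss is avoidable by persisting the index right after the build. A minimal sketch against the gpt_index API of that era (save_to_disk / load_from_disk); the folder and file names are illustrative, and constructor names vary by version:

```python
from llama_index import GPTTreeIndex, SimpleDirectoryReader

# build once, then persist immediately so a later bug can't throw away the tokens spent
documents = SimpleDirectoryReader("journals").load_data()
index = GPTTreeIndex(documents)
index.save_to_disk("journal_tree_index.json")

# on subsequent runs, reload instead of paying to rebuild
index = GPTTreeIndex.load_from_disk("journal_tree_index.json")
```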
Yea, the tree index will be pretty expensive to construct.

To keep costs a little lower, I would recommend a vector index for each section/category, and then wrapping those with a list or keyword index (a sketch below)
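A rough sketch of that layout, assuming the gpt_index composability API of the time (GPTSimpleVectorIndex, GPTListIndex, ComposableGraph); the category folders and summaries are made up, and the exact ComposableGraph call differs between versions:

```python
from llama_index import GPTSimpleVectorIndex, GPTListIndex, SimpleDirectoryReader
from llama_index.composability import ComposableGraph

# one cheap-to-build vector index per journal category
work_index = GPTSimpleVectorIndex(SimpleDirectoryReader("journals/work").load_data())
life_index = GPTSimpleVectorIndex(SimpleDirectoryReader("journals/life").load_data())

# a list index on top routes a query across the category indices
graph = ComposableGraph.from_indices(
    GPTListIndex,
    [work_index, life_index],
    index_summaries=["Work journal entries", "Personal life journal entries"],
)
response = graph.query("What was on my mind in 2018?")
```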
It was still REALLY expensive. I did a simple vector index over just the last couple of years of journal (around 50 chunks), and it came to around €1 per query plus €5 to build the index. Is this normal pricing, or is it optimizable?
It really depends on the size of the text πŸ€”

Constructing a vector index should be pretty cheap. It's something like $0.0004/1k tokens (1k tokens is about 600 words or so I think)

Query cost is dependent on the top_k and how big each chunk is. I'm pretty surprised a vector index query was $1, unless you had top_k set to a really big number maybe?
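Concretely, the prompt sent to the LLM contains the top_k retrieved chunks, so shrinking that number shrinks the bill. A one-line sketch (parameter name per the old query API; the question and index are made up):

```python
# only the single most similar chunk goes into the prompt, keeping the query cheap
response = index.query("How did the move go last spring?", similarity_top_k=1)
```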
There are also logs that print out the LLM and embedding token usage.

We also have the ability to estimate the token usage before actually running it too https://gpt-index.readthedocs.io/en/latest/how_to/cost_analysis.html
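The pattern from that page, roughly: swap in MockLLMPredictor so no paid API calls happen, then read off the predicted token count (a sketch; the max_tokens value is arbitrary):

```python
from llama_index import GPTTreeIndex, MockLLMPredictor, SimpleDirectoryReader

llm_predictor = MockLLMPredictor(max_tokens=256)  # simulates the LLM, costs nothing
documents = SimpleDirectoryReader("journals").load_data()
index = GPTTreeIndex(documents, llm_predictor=llm_predictor)
print(llm_predictor.last_token_usage)  # estimated LLM tokens the real build would use
```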
maybe it was the way the data was structured? I basically picked up from where I left off with the Paul Graham example and didn't add parameters... perhaps only mode='tree_summarize', but I'm not sure. Also, I read it was $0.02/1k tokens (which more or less matches what I paid), so maybe I should switch the model?
yep, I got pretty aware of cost analysis now πŸ˜„
the MockLLMPredictor is now my best friend πŸ˜…
Hahaha yea!

Yea the LLM will be more expensive than the embedding model.

Currently, chatgpt works ok-ish and is 10x cheaper than davinci
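Switching models in that era meant wrapping a LangChain chat model in an LLMPredictor and handing it to the index. A sketch, assuming gpt-3.5-turbo (the ChatGPT model, then $0.002/1k tokens vs $0.02/1k for text-davinci-003):

```python
from langchain.chat_models import ChatOpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor, SimpleDirectoryReader

# ~10x cheaper per token than the default davinci model
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
documents = SimpleDirectoryReader("journals").load_data()
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor)
```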
Ohhhh maybe remove the tree summarize unless you are trying to make a summary
ok, so I should look into the docs on how to switch the model and won't be tree summarizing
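In practice that just means dropping the mode argument so the query falls back to the cheaper default response mode (a sketch):

```python
# default mode answers from the retrieved chunks directly;
# mode="tree_summarize" instead builds a summary tree over them, costing extra LLM calls
response = index.query("When did I start running regularly?")
```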
thanks, you've been really kind and helpful πŸ™
:dotsHARDSTYLE: