Good reads

First timer here. I love this project; I've been using the GPT APIs for a couple of years now and I'm trying to wrap my head around some of the less documented ways of using LlamaIndex.

I'm working on a project where I'm creating a vector index from the list of books in a Goodreads collection. My ultimate goal is to be able to build queries against that index like "What are all the Stephen King mysteries I've read?"

I've downloaded my Goodreads data and created a single document containing a newline-separated list of all the books and their metadata (see below).

I've created an index by using SimpleDirectoryReader, LLMPredictor (with ChatOpenAI), PromptHelper and GPTVectorStoreIndex.

And then querying it.
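Roughly, the setup looks like this (simplified; the directory path, model name, and PromptHelper numbers are placeholders, and the exact imports and arguments depend on which LlamaIndex version you're on):

```python
from llama_index import (
    SimpleDirectoryReader,
    LLMPredictor,
    PromptHelper,
    GPTVectorStoreIndex,
    ServiceContext,
)
from langchain.chat_models import ChatOpenAI

# Load the newline-separated book list (text files in ./books)
documents = SimpleDirectoryReader("./books").load_data()

# Wrap the chat model and prompt settings in a service context
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
prompt_helper = PromptHelper(max_input_size=4096, num_output=256, max_chunk_overlap=20)
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, prompt_helper=prompt_helper
)

# Build the vector index and query it
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
response = query_engine.query("What are all the Stephen King mysteries I've read?")
print(response)
```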

What I'm finding is that there are a lot of books (800 to 900 or so) and the index doesn't seem to account for all of them when I ask. Even a simple query (e.g. list all of my Stephen King books) will return only 2 or 3, not the total number (20). I've also tried breaking up the documents, so instead of one document I store every book as a unique file. That doesn't make a difference. The only way I can get it to work is by reducing the number of books in the dataset by a significant amount. Then it works roughly as I was expecting.

Is there a different index type I should be using? A totally different approach? Any help, directional or otherwise, would be super amazing. I love this project and I just want to understand some of the nuances better.

```
My Book Collection:

title: It
author: stephen king
genres: horror
released: 1986

...
```
So, the default top k is 2. What this means is that after your document is chunked and embedded (default chunk size is 1024), only the top 2 chunks are returned for a query.

So, a few options. You can increase the top k, e.g. `index.as_query_engine(similarity_top_k=5)`.
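For example (reusing the index from the original post; 20 is just an illustrative value, large enough to cover the ~20 Stephen King books):

```python
# Retrieve more chunks per query instead of the default 2
query_engine = index.as_query_engine(similarity_top_k=20)
response = query_engine.query("List all of my Stephen King books")
print(response)
```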

You could also use a keyword index to ensure all the relevant chunks are always retrieved.
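Something along these lines (a rough sketch; GPTSimpleKeywordTableIndex is one of the keyword index classes, and the exact class name and arguments vary by version):

```python
from llama_index import GPTSimpleKeywordTableIndex

# Keyword-based retrieval matches query keywords against per-chunk keywords
# instead of taking only the top-k chunks by embedding similarity
keyword_index = GPTSimpleKeywordTableIndex.from_documents(
    documents, service_context=service_context
)
response = keyword_index.as_query_engine().query("List all of my Stephen King books")
print(response)
```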

The most promising option, I think, is a newer feature. Some of your questions seem very SQL-oriented. You could create a small database (maybe using SQLite) and use our new feature that combines text2sql and semantic search.

Notebook: https://gpt-index.readthedocs.io/en/latest/examples/query_engine/SQLAutoVectorQueryEngine.html

Youtube: https://youtu.be/ZIvcVJGtCrY
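To give a feel for the SQL side, here is a plain-Python sketch (the table layout and column names are just made up to match the sample data above; the notebook shows how to hook a database like this into the combined text2sql + vector query engine):

```python
import sqlite3

# Load the Goodreads export into a small SQLite table
conn = sqlite3.connect("books.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS books (title TEXT, author TEXT, genres TEXT, released INTEGER)"
)
conn.execute(
    "INSERT INTO books VALUES (?, ?, ?, ?)",
    ("It", "Stephen King", "horror", 1986),
)
conn.commit()

# A question like "all the Stephen King mysteries I've read" then maps to plain SQL
rows = conn.execute(
    "SELECT title FROM books WHERE author LIKE '%king%' AND genres LIKE '%mystery%'"
).fetchall()
print(rows)
```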
Awesome, thanks so much for the info. Yeah, it is a little SQL-y. Excited to try these approaches out, and thanks for the explanation!