Updated 3 months ago

Vector search

I'm trying to index a taxonomy using GPTVectorStoreIndex. I have built a bunch of documents containing only one line each, like:

Doc1: The name of group 00.01 is appliances.
Doc2: The name of group 00.03 is housing.
Doc3: The name of group 02.02 is animals.

From this I would like to be able to ask: What is the name of group 00.01?

My issue is that for some numbers it works, and for others GPT just answers that it can't find anything. It seems kind of random. Is this not supposed to work?

I have also tried putting everything in one document with one line per group. But again the results are a bit random, hit or miss.
8 comments
The vector index uses semantic search so it will return the most semantically similar results
It doesn't perform well with that sort of data/queries
Thanks - I would say it should work with semantic search. Wouldn't the number mean something distinguishable also in an embedding?
And I find it a bit odd that it works in maybe more than half of the instances
Numbers don't really carry a semantic meaning
My understanding about word embeddings is that they work with characters like:

"Currently, most NLP models treat numbers in text in the same way as other tokens—they embed them as distributed vectors"
Right but they are embedded in context. Teemu is right, they won't capture these specific details well, only the general semantics

If you need to capture keywords, you can use hybrid search with BM25 (or the hybrid search offered by a vector db if you are using one)

https://gpt-index.readthedocs.io/en/stable/examples/retrievers/bm25_retriever.html#advanced-hybrid-retriever-re-ranking

(I use a rather old re-ranker in that example; today I would use BAAI/bge-reranker-base)
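
To illustrate why keyword scoring helps with IDs like these, here is a minimal pure-Python sketch of Okapi BM25 over the three example lines from the question. It is not the LlamaIndex BM25 retriever from the link above; the helper names (`tokenize`, `bm25_scores`, `retrieve`) are made up for illustration, and `k1=1.5`, `b=0.75` are just commonly used defaults. The point is that the exact token "00.01" only appears in one document, so it dominates the score.

```python
import math
import re
from collections import Counter

# Toy corpus from the question: one short document per taxonomy group.
DOCS = [
    "The name of group 00.01 is appliances.",
    "The name of group 00.03 is housing.",
    "The name of group 02.02 is animals.",
]


def tokenize(text):
    # Hypothetical tokenizer: keep dotted IDs such as "00.01" as one token
    # so they can only match exactly; everything else is split into words.
    return re.findall(r"\d+(?:\.\d+)+|\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms (like a specific group ID) get a high IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, docs=DOCS):
    # Return the single best-scoring document for the query.
    scores = bm25_scores(query, docs)
    best = max(range(len(docs)), key=scores.__getitem__)
    return docs[best]


print(retrieve("What is the name of group 00.01?"))
```

In a hybrid setup, scores like these would be combined with (or re-ranked alongside) the vector search results rather than used on their own.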
Well, thanks for your input. It doesn't match my results, though: I'm able, to a certain extent, to make numbers/IDs work with a few hundred elements. Only when I scale up to thousands does it break down.

I'm using GPTVectorStoreIndex in conjunction with SentenceSplitter and SimpleNodeParser.