The community member is struggling to filter nodes by strings, as the metadata search examples do not include string search. They have tried various metadata approaches without success. The community member wonders if there is a way to first use keyword search to narrow down the available nodes, and then perform vector search. The comments suggest using a vector database that supports text search, or a hybrid method like BM25. Some community members share their experiences with Postgres and BM25, and mention that tagging everything with standard metadata is a potential solution, though not ideal. There is no explicitly marked answer.
So it seems there is no way to filter nodes by strings, because none of the metadata search examples include string search (e.g. your metadata is an email subject and you want to find one string in the subject). Is that correct? I have tried every metadata approach and have been unsuccessful.
Is there some sort of approach where we can first use keyword search (e.g. a PO number, a string, etc) to narrow down the available nodes and THEN do the vector search? Am I just stupid?
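For what it's worth, the "keyword first, then vector search" idea can be sketched in plain Python. This is only an illustration, not LlamaIndex's API: the node dicts, the `subject`/`text`/`embedding` fields, the toy 2-d embeddings, and the `keyword_then_vector` helper are all made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_then_vector(nodes, keyword, query_vec, top_k=2):
    """Hypothetical helper: substring filter first, then rank survivors by similarity."""
    needle = keyword.lower()
    # Step 1: keep only nodes whose subject or text contains the keyword.
    candidates = [
        n for n in nodes
        if needle in n["subject"].lower() or needle in n["text"].lower()
    ]
    # Step 2: rank the remaining nodes by cosine similarity to the query vector.
    ranked = sorted(
        candidates,
        key=lambda n: cosine(n["embedding"], query_vec),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data; real embeddings would come from an embedding model.
nodes = [
    {"id": "a", "subject": "PO 12345 shipped", "text": "Your order has shipped.", "embedding": [1.0, 0.0]},
    {"id": "b", "subject": "Invoice for PO 12345", "text": "Payment is due.", "embedding": [0.0, 1.0]},
    {"id": "c", "subject": "Lunch on Friday", "text": "Pizza?", "embedding": [1.0, 0.1]},
]

hits = keyword_then_vector(nodes, "PO 12345", query_vec=[0.9, 0.1])
print([n["id"] for n in hits])
```

Node "c" is closest to the query vector but never gets scored, because it fails the keyword filter. In a real setup the filter step would ideally run inside the database rather than in Python.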
I switched to Postgres because I saw someone say that the docs are persisted when inserted into Postgres. I set store_nodes_override=True, and I still can't retrieve them from Postgres.
@Logan M I actually have the same question and am also using Postgres (pgvector). It works GREAT for semantic search. However, I also want to combine this with a keyword search. I see that the vector table contains a "text" column with the raw document contents. I'd imagine there's a way to customize the retriever to also do a keyword search over this column? (ideally run keyword and semantic search separately and then combine the ranked results somehow)
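One common way to combine two separately ranked result lists is reciprocal rank fusion (RRF). A minimal sketch, assuming each search returns an ordered list of document ids; the ids and the `reciprocal_rank_fusion` helper are hypothetical, and `k=60` is just the conventional default for RRF:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a keyword search and a semantic search.
keyword_hits = ["doc3", "doc1", "doc7"]
semantic_hits = ["doc1", "doc5", "doc3"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

RRF only looks at rank positions, so the keyword scores and the vector similarities don't need to be on the same scale. Documents that show up in both lists float to the top.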
I tried BM25 but I don't understand it. It needs nodes, but I can't get the nodes back out of Weaviate. It takes 3 hours to process all of my docs, so I can't wait 3 hours to regenerate the nodes.
A standard set of fields to filter with is a very good approach imo
Yeah, BM25 is a static encoding. If any docs are added, the entire thing needs to be recomputed.
It's using the rank-bm25 library under the hood, which doesn't provide a way to save/load it. I've been meaning to update it to a faster library that supports saving/loading.
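To illustrate why BM25 has to be recomputed when docs are added, here is a toy scorer written from scratch. It is not rank-bm25 or the LlamaIndex retriever: `TinyBM25`, its `to_json`/`from_json` methods, and the sample docs are all hypothetical. The IDF weights and average document length depend on the whole corpus, so the same doc scores differently after one more doc is added.

```python
import json
import math
from collections import Counter


class TinyBM25:
    """Toy BM25 scorer, for illustration only."""

    def __init__(self, tokenized, k1=1.5, b=0.75):
        self.k1, self.b, self.tokenized = k1, b, tokenized
        n = len(tokenized)
        # Corpus-wide statistics: these change whenever a doc is added.
        self.avgdl = sum(len(toks) for toks in tokenized) / n
        df = Counter(term for toks in tokenized for term in set(toks))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, index):
        toks = self.tokenized[index]
        tf = Counter(toks)
        total = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            norm = tf[term] + self.k1 * (1 - self.b + self.b * len(toks) / self.avgdl)
            total += self.idf[term] * tf[term] * (self.k1 + 1) / norm
        return total

    def to_json(self):
        # Persist the tokenized corpus so documents need not be re-processed.
        return json.dumps({"k1": self.k1, "b": self.b, "tokenized": self.tokenized})

    @classmethod
    def from_json(cls, payload):
        state = json.loads(payload)
        return cls(state["tokenized"], k1=state["k1"], b=state["b"])


docs = ["invoice for po 12345", "lunch on friday"]
small = TinyBM25([d.split() for d in docs])
large = TinyBM25([d.split() for d in docs + ["invoice for po 67890"]])

before = small.score("invoice", 0)  # score of doc 0 in the 2-doc corpus
after = large.score("invoice", 0)   # same doc, after one more doc was added
print(before, after)

restored = TinyBM25.from_json(large.to_json())
```

Here "invoice" becomes less rare once a second invoice doc is added, so doc 0 scores lower than before. Recomputing the statistics from saved tokens is cheap compared to re-embedding, which is why saving/loading the tokenized corpus helps.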