
Updated 2 months ago


Hello, I'm working with Weaviate for the first time and wondering if anyone has insight. I'm not seeing any embeddings in the stored document.

Here is the basis of the code
Plain Text
prompt_helper = PromptHelper(
    context_window=context_window,
    num_output=output_tokens,
    chunk_overlap_ratio=0.1,
    chunk_size_limit=chunk_size)
node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)
service_context= ServiceContext.from_defaults(prompt_helper=prompt_helper, node_parser=node_parser)
vector_store = WeaviateVectorStore(weaviate_client=client, index_name="Minutes")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    llm=OpenAI(temperature=0.0, model="gpt-4", max_tokens=output_tokens),
    service_context=service_context,
    storage_context=storage_context)

A document in weaviate (doesn't have embedding)
Plain Text
          "_node_content": "{\"id_\": \"b1d58d70-4a66-4236-b3f1-be6b2c80c592\", \"embedding\": null, \"metadata\": {\"name\": \"City Council\", \"uuid\": \"5D95AE16-32E0-4256-A9A9-1F9D311ABF33\", \"date\": \"9/5/2023\"}, \"excluded_embed_metadata_keys\": [], \"excluded_llm_metadata_keys\": [], \"relationships\": {\"1\": {\"node_id\": \"ee9091d2-736d-4c0d-91eb-7bd5d17db2b7\", \"node_type\": null, \"metadata\": {\"name\": \"City Council\", \"uuid\": \"5D95AE16-32E0-4256-A9A9-1F9D311ABF33\", \"date\": \"9/5/2023\"}, \"hash\": \"01b122ab1f744ceca5694e65ddf8cbb7bdc03561c2af3cc47b0439763b1ddb21\"}}, \"hash\": \"01b122ab1f744ceca5694e65ddf8cbb7bdc03561c2af3cc47b0439763b1ddb21\", \"text\": \"\", \"start_char_idx\": null, \"end_char_idx\": null, \"text_template\": \"{metadata_str}\\n\\n{content}\", \"metadata_template\": \"{key}: {value}\", \"metadata_seperator\": \"\\n\"}",
34 comments
I'm 99.99% sure the embedding is just hidden

I imagine if you query the index, it will work fine (and the source nodes should have embeddings attached to them as well)
there is an embedding field in the doc I got from weaviate above, it's just null, so that's why I got worried again.
but with debug on
I do see a bunch of embeddings being made
I also see NodeWithScore, but I'm not sure where in the code that connects to the weaviate query
Plain Text
query_engine = index.as_query_engine()
response = query_engine.query("test")
print(response.source_nodes[0].node.embedding)


That's all you need to do -- should work fine 👍
I've used weaviate tons -- you are doing everything correctly
Do you use cloud hosted?
okay, soooo... help me explain this:
Query is this:
List any discussions containing sewer in these agenda minutes.
response:
Plain Text
There is no discussion containing the word "sewer" in these agenda minutes.


graphql:
Plain Text
{
  Get {
    Minutes(bm25:{query:"Sewer", properties:["text"]}) {}
  }
}

this returns documents (3 of them, that seemingly have same text)
(sewer lowercase does not work)
So the graphql is querying with bm25 (fancier keyword search)

The default query method is just using embeddings to retrieve the top k nodes (the default is 2)
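To make the "top k" part concrete, here is a minimal sketch in plain Python. The chunk names and similarity scores are made up for illustration, and `top_k` is a hypothetical helper, not a llama-index function. It only shows how a relevant chunk that ranks third gets dropped when k is 2.

```python
import heapq

def top_k(scores, k=2):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Made-up similarity scores for five chunks (illustrative values only).
scores = {
    "chunk-a": 0.81,
    "chunk-b": 0.79,
    "chunk-sewer": 0.78,
    "chunk-d": 0.60,
    "chunk-e": 0.42,
}

print(top_k(scores, k=2))  # "chunk-sewer" ranks third, so it is cut off
print(top_k(scores, k=3))  # a larger k lets it through
```

In llama-index the equivalent knob should be `similarity_top_k`, e.g. `index.as_query_engine(similarity_top_k=5)`, so raising it is a cheap first experiment.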

Most likely, the query embeddings didn't find anything helpful. If you check the source nodes on the response, you can see what it retrieved to help debug

Plain Text
response = query_engine.query("...")
for source_node in response.source_nodes:
  print(source_node.node.text)
Weaviate also has a hybrid mode that combines embeddings with bm25
how does vector do querying w/ embedding?
I mean if bm25 can find Sewer, embedding vector should be able to find sewer right?
thats why i got sus of embedding haha
So here's a quick technical explanation

Embeddings are a way to capture semantics from text. In llama-index, the default embedding model is text-embedding-ada-002 from openai, which takes a piece of text and turns it into a list of 1536 numbers

Then, at query time, the query text is also embedded. That list of numbers is compared to the existing embeddings in your index, using some similarity function like cosine similarity (which is essentially a dot product of the two vectors after scaling each to length 1)
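As a rough illustration of that comparison step, here is cosine similarity written out in plain Python. The three-number vectors are toy stand-ins for real 1536-number embeddings, and the variable names are made up for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
chunk_close = [2.0, 0.0, 2.0]  # same direction as the query, just longer
chunk_far = [0.0, 1.0, 0.0]    # points in an unrelated direction

print(cosine_similarity(query, chunk_close))  # close to 1: very similar
print(cosine_similarity(query, chunk_far))    # 0: nothing in common
```

Note that only the direction of the vectors matters here, not their length, which is why `chunk_close` scores as a near-perfect match.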

Embeddings aren't perfect -- they capture the general intent of text, but are less helpful for exact keywords. If you run the code I pasted above, you'll see the text that the index thinks is most relevant (I HIGHLY recommend printing that to help debug for now -- it can help you track down if your documents need better parsing or not too). The response might be either the LLM being dumb, or the retrieved text not being helpful

Contrast that with BM25, which basically just figures out the main keywords in a query, and then fetches nodes with the same keywords. This skips the intent aspect.
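For comparison, here is a toy BM25 scorer in plain Python. It is a sketch under simplifying assumptions: it lowercases and splits on whitespace, uses commonly quoted defaults for `k1` and `b`, and the sample documents are invented. Weaviate's own tokenization and settings may differ.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Number of documents that contain the term at all.
            df = sum(1 for other in tokenized if term in other)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = counts[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Invented sample documents, only for illustration.
docs = [
    "council approved the sewer line repair budget",
    "council discussed the park renovation schedule",
    "public comment on library opening hours",
]
print(bm25_scores("sewer", docs))  # only the first document scores above zero
```

A document that never contains the keyword scores exactly zero, no matter how related its topic is, which is the opposite failure mode from embeddings.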

The hybrid approach I linked above basically fuses these two approaches
Sorry for the wall of text lol but hopefully that explains a few things
Will catch up soon. Thank you so much.
that was perfect, what is 1536, why so exact
sewer is not in any of the source nodes. I tried two different text splitters
Plain Text
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
          separators=["\n\n", "\n"],
            chunk_size=chunk_size,
            chunk_overlap=100)

and
Plain Text
text_splitter = SentenceSplitter(chunk_size=chunk_size)
hmm, is it totally unreasonable that bm25 would find the 1 document with sewer while plain embeddings would not?
tried hybrid mode with alpha=0.0 and still a no go.
Tbh not 100% sure what alpha is, maybe play with that lol
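On what alpha does: as far as I know, Weaviate's convention is that alpha=0 means pure keyword (bm25) search and alpha=1 means pure vector search, with values in between blending the two. The sketch below is a simplified weighted sum with made-up, already-normalized scores. Weaviate's real fusion algorithm is more involved, and `hybrid_score` and `rank` are hypothetical helpers.

```python
def hybrid_score(vector_score, keyword_score, alpha):
    """Blend two normalized scores: alpha=1 is vector-only, alpha=0 is keyword-only."""
    return alpha * vector_score + (1 - alpha) * keyword_score

# Made-up normalized scores as (vector_score, keyword_score) per chunk.
chunks = {
    "chunk-sewer": (0.2, 1.0),    # weak semantic match, exact keyword hit
    "chunk-general": (0.9, 0.0),  # strong semantic match, no keyword hit
}

def rank(alpha):
    """Return chunk ids ordered from best to worst for a given alpha."""
    return sorted(chunks, key=lambda c: hybrid_score(*chunks[c], alpha), reverse=True)

print(rank(alpha=0.0))  # keyword-only: "chunk-sewer" first
print(rank(alpha=1.0))  # vector-only: "chunk-general" first
```

If that convention holds for your setup, sweeping alpha between 0 and 1 and printing the source nodes each time is a quick way to see which side is finding the sewer chunk.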
Nah that sounds about right. Your query was very short, and the only word that really matters is sewer. Could try playing with the query, being a bit more descriptive (what else besides sewer could be mentioned?)
1536 is just the output size that model was built with, I guess. Every model is different -- others might be 768, 1024, etc. Under the hood, it's just the model architecture that is mapping text into X number of dimensions
Tbh it might be worth trying a different embedding model too and re-building the index (if you switch embedding models, the entire index needs to use the same model)

BAAI's bge models have worked well for me personally, if you want to try running something locally

Plain Text
service_context = ServiceContext.from_defaults(..., embed_model="local:BAAI/bge-small-en-v1.5")
index = VectorStoreIndex.from_documents(..., service_context=service_context)
wish I could be more descriptive in my own head haha
I think I'm starting to be able to read python code.
Is there no metadata filter for NotEqual?
what's easiest way to implement Operator: 'NotEqual' Into weaviate vector store?
maybe extend WeaviateVectorStore and override def query?
that might be the easiest way 😅
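A possible starting point for that override, sketched in plain Python: Weaviate's `where` filters are nested dicts with a `path`, an `operator`, and a typed value, and `NotEqual` is one of the supported operators. The helper names below are hypothetical, and `valueText` assumes the property is a text field (other types use keys such as `valueInt`). How the dict gets wired into an overridden `query` depends on your llama-index and Weaviate client versions, so treat this as an assumption to check, not the library's API.

```python
def not_equal_filter(key, value):
    """Build a Weaviate-style `where` dict that excludes objects whose property equals value."""
    return {"path": [key], "operator": "NotEqual", "valueText": value}

def combine_and(*filters):
    """Combine several `where` dicts so that all of them must match."""
    return {"operator": "And", "operands": list(filters)}

# Example: everything that is not from the "City Council" body
# and not dated 9/5/2023 (metadata keys taken from the node shown earlier).
where = combine_and(
    not_equal_filter("name", "City Council"),
    not_equal_filter("date", "9/5/2023"),
)
print(where)
```

With the v3 Python client, a dict like this is what you would hand to the query builder's `.with_where(...)` inside the overridden method.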