
Updated 6 months ago

example.py

Can someone help me out? I'm trying to import some OCR output into Weaviate and then run queries on it. However, after getting the data in, every query comes back as Empty Response. Minimized gist here: https://gist.github.com/cam-narzt/27b15754de1ae07fa1f49589fc30e616. Thanks in advance to anyone who takes a look.
(maybe the force-reinstall is important here)
I use containers so it was a fresh clean install for both tests.
ok i got it to run with pydantic.v1 but i'm still getting Empty Response as the response.
Ok, that tells me then there is another issue
probably you aren't actually retrieving any nodes maybe?

For example, I would try

Plain Text
print(index.as_retriever().retrieve("Query"))


And see if it returns anything
I tried running your code (minus weaviate, and minus ollama, because open-source is kind of 💩 with structured outputs) and it worked fine 👀

I used pydantic.v1, and also had to add a docstring to the output class
hmm you're right no nodes are returned, i guess that means I need to tweak my query?
All of the nodes are bits of text extracted from an image using OCR, and I'd like the index to treat them all as part of one invoice (you probably gathered that). I have tried several increasingly broad queries to get any nodes returned, and so far nothing.
I think that might also mean some kind of connection issue somehow with weaviate 🤔 it should always return a top-k no matter the query
hmm it seems to be talking to weaviate, at least I see some logs from it, not much though since weaviate isn't very verbose afaict
Plain Text
weaviate-db       | {"level":"info","msg":"Created shard i_01db429e_00e5_40bc_9aeb_ef0416c327cd_aAgIUCPzZOQG in 856.575µs","time":"2024-06-13T23:32:34Z"}
weaviate-db       | {"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-06-13T23:32:34Z","took":58339}
and about 26 logs from ollama saying POST "/api/embeddings"
Yea that would be ollama embedding your stuff 🤔
I forget if weaviate has a dashboard or UI for indexes

Does the github gist still represent the code you are running?
it wasn't too far off, i've updated it with the changes that i've kept around because they at least provide more info about what's going on
So I talked to the weaviate folks and they helped me to see that the nodes I pass to the VectorStoreIndex constructor are not added to weaviate. Any idea why? There definitely are nodes being passed in, I printed them to be sure.
I honestly have no idea 👀 I'm not sure how used you are to hacking around python packages, but I would honestly add a few prints/breakpoints inside llama-index to debug if it was me lol
VectorStoreIndex(nodes = nodes, storage_context = storage_context) will 100% call vector_store.add() at some point
I can try writing a google colab that works, but it will look very similar to your gist haha
I was looking through the code and saw this in a comment:

VectorStoreIndex only stores nodes in document store if vector store does not store text
can you expand on what that means?
It stores the node in the vector store, as extra metadata. This just simplifies storage, since all you need in most cases is the vector store
Also btw, an example here I just ran of using weaviate on my cluster, seems to work fine here
https://colab.research.google.com/drive/1ihil1FYdtWC2aGg_FOw7aVUMv5LnoeGm?usp=sharing
maybe you can spot something different compared to what you are doing
nope other than my using ollama embedding class instead of the huggingface embedding class
i'll try using the huggingface class and see if that changes anything
Yea I wouldve used ollama in the colab, but not easy to spin up on there. If that fixes it, would be a little spooky 😅
well, that didn't change anything either, still nothing added to weaviate
makes no sense, the code is pretty much identical
We can take this one step further to debug
Plain Text
vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="LlamaIndex"
)

node = TextNode(text="test")
node.embedding = embed_model.get_text_embedding(node.text)

vector_store.add([node])

from llama_index.core.vector_stores import VectorStoreQuery
query = VectorStoreQuery(
  query_embedding=embed_model.get_query_embedding("test"),
  similarity_top_k=1,
)
result = vector_store.query(query)
print(len(result.nodes))
This is about as low level as we can get, without actually using the weaviate client itself
Ok, i'll try that out in a sec. I've made a docker environment that repros the problem so that if you can run docker you can play with it like i see it
it's in the gist, just put all the files in one dir and run docker compose up
yeah the low level manual query construction and node addition didn't result in any difference. =/
so i guess when vector_store.add([node]) is called either directly or indirectly, somehow weaviate isn't actually adding the node
i'll ask the weaviate folks what's up with that
Well, at least we've narrowed down the issue!
That's wild that it isn't inserting for you 😳 I'm really curious what the issue is
It must be some silent failure
I found a way to increase weaviate's log verbosity, and it looks like when I call vector_store.add weaviate is only receiving some get queries for the schema:
Plain Text
api-1             | adding node: Node ID: fc31a805-e412-41fe-8db4-6e8700286881
api-1             | Text: text one here
weaviate-db       | {"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-06-21T18:04:47Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/v1/schema","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
weaviate-db       | {"level":"debug","msg":"server.query","time":"2024-06-21T18:04:47Z","type":2}
weaviate-db       | {"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-06-21T18:04:47Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/v1/nodes","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
weaviate-db       | {"level":"debug","msg":"server.query","time":"2024-06-21T18:04:47Z","type":1}
api-1             | added node
so it looks as though no data is POSTed or PUT
which sounds like the nodes are not being sent over
I see it is called as a batch in the second link, perhaps the batch is not executed?
self._client.batch.dynamic() sounds like it probably is intended to run the batch in a destructor when it goes out of scope, maybe it doesn't happen?
yeah dynamic requires more configuration than is done in the code you linked (might be done elsewhere but it seems unlikely) https://github.com/weaviate/weaviate-python-client/blob/eab4389f7289b8c705c505241dd84f44ce5929ac/weaviate/batch/crud_batch.py#L252-L263
(this still doesn't explain why it works for me lol, but will take a look)
Yea tbh I have no idea whats up, but the fact is a fresh install on google colab works for me.

What version of weaviate do you have installed, out of curiosity? pip freeze | grep weaviate should get the helpful versions
the python module is weaviate-client 4.6.5
the weaviate server version is 1.25.4
if you bump the weaviate log level to info or debug what does your weaviate log when you run vector_store.add?
Theres a ton of logs, but I see this after calling .add()

Plain Text
DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'GET']>
DEBUG:httpcore.http11:send_request_headers.complete
DEBUG:httpcore.http11:send_request_body.started request=<Request [b'GET']>
DEBUG:httpcore.http11:send_request_body.complete
DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'GET']>
DEBUG:httpcore.http11:receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Server', b'nginx/1.24.0'), (b'Date', b'Sat, 22 Jun 2024 20:27:41 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Access-Control-Allow-Headers', b'Content-Type, Authorization, Batch, X-Openai-Api-Key, X-Openai-Organization, X-Openai-Baseurl, X-Anyscale-Baseurl, X-Anyscale-Api-Key, X-Cohere-Api-Key, X-Cohere-Baseurl, X-Huggingface-Api-Key, X-Azure-Api-Key, X-Google-Api-Key, X-Google-Vertex-Api-Key, X-Google-Studio-Api-Key, X-Palm-Api-Key, X-Jinaai-Api-Key, X-Aws-Access-Key, X-Aws-Secret-Key, X-Voyageai-Baseurl, X-Voyageai-Api-Key, X-Mistral-Baseurl, X-Mistral-Api-Key, X-OctoAI-Api-Key'), (b'Access-Control-Allow-Methods', b'*'), (b'Access-Control-Allow-Origin', b'*'), (b'Vary', b'Origin'), (b'Via', b'1.1 google'), (b'Alt-Svc', b'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000')])
INFO:httpx:HTTP Request: GET https://test-cluster-skukh8ss.weaviate.network/v1/schema "HTTP/1.1 200 OK"
DEBUG:httpcore.http11:receive_response_body.started request=<Request [b'GET']>
DEBUG:httpcore.http11:receive_response_body.complete
DEBUG:httpcore.http11:response_closed.started
DEBUG:httpcore.http11:response_closed.complete
DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'GET']>
DEBUG:asyncio:Using selector: KqueueSelector
DEBUG:httpcore.http11:send_request_headers.complete
DEBUG:httpcore.http11:send_request_body.started request=<Request [b'GET']>
DEBUG:httpcore.http11:send_request_body.complete
DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'GET']>
DEBUG:httpx:load_ssl_context verify=True cert=None trust_env=True http2=False
DEBUG:httpx:load_verify_locations cafile='/Users/loganmarkewich/Library/Caches/pypoetry/virtualenvs/agentfile-MFMS50kK-py3.11/lib/python3.11/site-packages/certifi/cacert.pem'
DEBUG:grpc._cython.cygrpc:Using AsyncIOEngine.POLLER as I/O engine
DEBUG:httpcore.http11:receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Server', b'nginx/1.24.0'), (b'Date', b'Sat, 22 Jun 2024 20:27:41 GMT'), (b'Content-Type', b'application/json'), (b'Content-Length', b'156'), (b'Access-Control-Allow-Headers', b'Content-Type, Authorization, Batch, X-Openai-Api-Key, X-Openai-Organization, X-Openai-Baseurl, X-Anyscale-Baseurl, X-Anyscale-Api-Key, X-Cohere-Api-Key, X-Cohere-Baseurl, X-Huggingface-Api-Key, X-Azure-Api-Key, X-Google-Api-Key, X-Google-Vertex-Api-Key, X-Google-Studio-Api-Key, X-Palm-Api-Key, X-Jinaai-Api-Key, X-Aws-Access-Key, X-Aws-Secret-Key, X-Voyageai-Baseurl, X-Voyageai-Api-Key, X-Mistral-Baseurl, X-Mistral-Api-Key, X-OctoAI-Api-Key'), (b'Access-Control-Allow-Methods', b'*'), (b'Access-Control-Allow-Origin', b'*'), (b'Vary', b'Origin'), (b'Via', b'1.1 google'), (b'Alt-Svc', b'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000')])
INFO:httpx:HTTP Request: GET https://test-cluster-skukh8ss.weaviate.network/v1/nodes "HTTP/1.1 200 OK"
DEBUG:httpcore.http11:receive_response_body.started request=<Request [b'GET']>
DEBUG:httpcore.http11:receive_response_body.complete
DEBUG:httpcore.http11:response_closed.started
DEBUG:httpcore.http11:response_closed.complete
Interesting so it only does GETs for you too, yet the nodes are added.
Yea I can call retriever after and it works

I think its using grpc to insert the nodes?
Ah, perhaps. Maybe my grpc is broken…
I’ll investigate. Thanks
it seems like the grpc server starts and listens on the same set of IPs as the rest endpoint (weaviate annoyingly prints only the ipv6 addr but it should be listening on ipv4 and ipv6)
Plain Text
{"action":"graphql_rebuild",   "level":"debug","msg":"successfully rebuild graphql schema", "time":"2024-06-23T16:51:40Z"}
{"action":"grpc_startup",      "level":"info", "msg":"grpc server listening at [::]:50051", "time":"2024-06-23T16:51:40Z"}
{"address":"172.18.0.3:8300",  "level":"info", "msg":"current Leader",                      "time":"2024-06-23T16:51:40Z"}
{"action":"restapi_management","level":"info", "msg":"Serving weaviate at http://[::]:8080","time":"2024-06-23T16:51:40Z"}


and i have the ports listed correctly for the client in app.py as you can see in the gist
aha! try adding a metadata field that holds an array of arrays of numbers, like my coords one, i think it can't handle it
If i inline as much code as possible and don't batch I get ValueError: Invalid query, got errors: creating primitive value for coordinates: proto: invalid type: map[string]interface {}
or rather that's after switching from a List[List[float]] to a more structured object with fields but still, I'm pretty sure that's my problem
how does one use non primitive types in the metadata? just json encode everything?
Yea I would json.dumps the value into a string -- that is a super sneaky error 😅
strangely the error is continuing after json.dumpsing the coords into a string, it's very confusing
the type of coordinates should now be string since that's what I'm assigning to it
but the error still says map[string]interface {}
maybe a caching issue in docker, i'll wipe everything and build from scratch
nope the error persists after completely blowing away everything and building from scratch
here's the full error and traceback
I've updated the gist to my current code
come to think of it, map[string]interface {} is a go type
I'm sure their server (and even client) code is using go 🤔 Lemme try and replicate this myself by adding similar metadata
if instead of serializing to json i serialize to a custom format it doesn't error anymore
there must be some opportunistic json parsing happening in their code, since it should come out as a string even if it round trips the entire metadata through json
woah, very spooky
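For anyone hitting the same thing, here is a minimal sketch of the "custom format" idea mentioned above. The thread does not say which format was actually used, so the `x,y;x,y` encoding and the function names `encode_coords` / `decode_coords` are assumptions made up for illustration. The only point is that the stored string does not look like JSON, so nothing can opportunistically parse it back into a nested structure.

```python
from typing import List


def encode_coords(coords: List[List[float]]) -> str:
    """Encode [[x, y], ...] as 'x,y;x,y'.

    Hypothetical format: chosen only because it is not valid JSON,
    so it should stay a plain string on the way into the vector store.
    """
    return ";".join(",".join(repr(float(v)) for v in pair) for pair in coords)


def decode_coords(text: str) -> List[List[float]]:
    """Reverse encode_coords; an empty string decodes to an empty list."""
    if not text:
        return []
    return [[float(v) for v in pair.split(",")] for pair in text.split(";")]


# The metadata value is now a flat string instead of a list of lists.
metadata = {"coordinates": encode_coords([[1, 2.5], [3, 4]])}
```

After retrieval, `decode_coords(node.metadata["coordinates"])` would give the nested list back, assuming the metadata round-trips unchanged.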
would it be a good idea to file a bug report that using the batch processing in the weaviate client suppresses some errors? I didn't find out that the metadata had to be json serializable until I bypassed the batch processing by inlining a bunch of code
they say so in the docs, but i'm guessing that's not wanted behaviour for llama index
i've already filed a bug with weaviate that they parse the metadata fields as json but then can't handle the output of the parse
We could create a ticket, but I'm not sure what the solution is 🤔 Just.. not using batch?
I'm not sure exactly, the docs mention being able to check for failed objects, perhaps this info should be checked and an exception raised if something failed? I don't know if the actual python error would be included. https://weaviate.io/developers/weaviate/client-libraries/python#batch-imports
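A rough sketch of what "check for failed objects and raise" could look like. This is not what llama-index does today; `BatchInsertError` and `raise_if_failed` are hypothetical names. It assumes each failure entry exposes a `message` attribute, which is how I understand the v4 Python client's error objects, and it falls back to `str()` if that assumption is wrong.

```python
from typing import Iterable, Optional


class BatchInsertError(RuntimeError):
    """Raised when a batch import reports failed objects (hypothetical helper)."""


def raise_if_failed(failed_objects: Optional[Iterable[object]]) -> None:
    """Turn silently collected batch failures into a Python exception."""
    failures = list(failed_objects or [])
    if not failures:
        return
    # Assumption: failure entries carry a `message`; otherwise use str(entry).
    messages = [str(getattr(f, "message", f)) for f in failures]
    raise BatchInsertError(
        f"{len(failures)} object(s) failed to import: " + "; ".join(messages[:5])
    )
```

With the v4 client, the batch-import docs linked above describe a `failed_objects` list that can be read once the batch context manager has exited, so something like `raise_if_failed(client.batch.failed_objects)` right after the `with` block is where this would plug in.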
Hi! I think I have the same issue. I was using llama_index==0.10.38 with weaviate-client==4.6.5 and llama-index-vector-stores-weaviate==1.0.0. Then I updated to llama_index 0.10.50 and now the retriever gives an empty list. Here is the most important part of my code:
Plain Text
vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="Test"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_store_index = VectorStoreIndex.from_documents(
    documents=[llama_index_doc],
    storage_context=storage_context,
    embed_model=embed_model,
    transformations=[node_parser],
    show_progress=True
)
retriever = vector_store_index.as_retriever(similarity_top_k=10)
nodes = retriever.retrieve("neural")
len(nodes)
I tried downgrading llama_index, but the retriever still returns an empty list of nodes.
I’d recommend temporarily inlining code from llama index so that you can avoid the batch processing. You’ll probably find some python errors are being silenced. At least that was why mine was failing silently to add nodes to the index.
Thanks for your response. Can you please tell me how you add more verbosity in the weaviate logs?
It is interesting that for me this code works. When I add a node (or a list of nodes) using vector_store.add([node]) and then call the retriever then it fetches the most similar nodes. But if I make a vector store index (using index = VectorStoreIndex.from_vector_store(vector_store=vector_store, embed_model=embed_model)) and insert a document (index.insert(llama_index_doc)) and then finally call the retriever then it "sees" only the nodes I added before and not the nodes from the document insertion.
are you sure? Remember that the default top-k is only 2, so the top two nodes might not be from the document you inserted
I added an env var to my weaviate container LOG_LEVEL=info that set the log level. However the logs were useless even then because they didn’t log the grpc stuff which is what I needed to know about. I think Logan changed the python log level as his logs look like they come from the client code.
Yea for my logs, I set the root python logger to debug 👀
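For reference, setting the root Python logger to debug can be done with the standard library alone. This is a minimal sketch; `force=True` (Python 3.8+) is my addition so the setting still applies if some other library already configured logging.

```python
import logging

# Send everything at DEBUG and above to stderr. Child loggers such as
# httpx, httpcore and grpc inherit this level unless they set their own.
logging.basicConfig(level=logging.DEBUG, force=True)
```

Run this before creating the Weaviate client so the client-side HTTP and gRPC activity shows up, as in the log excerpt earlier in the thread.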
Hello! I finally managed to find what goes wrong in my case! So, when I inserted the node with this custom code for adding a node and then performed a query with the VectorStoreQuery object, the node was retrieved. But when I repeated the process with nodes extracted from a document in which I had added some metadata, it did not work. So I tried again by defining a Document without metadata and then using a node_parser to get the nodes from the document. In this case, the retriever was able to retrieve these nodes! Even when I performed vector_store.insert(document). But the document must not contain any metadata. If it contains metadata, then the data is not inserted in the Weaviate collection (I used this method to read the collection's contents: https://weaviate.io/developers/weaviate/manage-data/read-all-objects). Well, that does not solve my problem as I need metadata for my documents, but I think that at least it is useful for fixing the issue. I can share code with you to try it. Do we need to open an issue for that? If yes, in which repository? Weaviate or LlamaIndex?
Here is a notebook to showcase this issue. I am sending also the html from the notebook if you just want to see the outputs.
Hi @Logan M, sorry to insist, but could I ask you to take a look if and when you have time? I think what happens with the metadata and the Document is interesting. I thought it might be a different bug from the one described by @CamJN, because the code for adding nodes was working for me. But in the end I think it is the same issue as this one (https://github.com/weaviate/weaviate/issues/5202)
sorry @ter_ilias, I'm not sure what the exact issue is, but it looks like it's mostly weaviate's handling causing the issue? And so the bug needs to be fixed there? (I am far from a weaviate expert)
@Logan M I had another idea about what to do about the silenced errors. What about having a debug-flag environment variable that when set skips using the batching? So if something is silently failing in the python you can set the environment variable and get the error.
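To make the debug-flag idea concrete, here is a small sketch. Everything in it is hypothetical: the variable name `WEAVIATE_DEBUG_NO_BATCH`, the function `add_objects`, and the two callables stand in for whatever the integration would actually use. It only illustrates the control flow of falling back to one-at-a-time inserts so that errors surface immediately.

```python
import os
from typing import Any, Callable, Mapping, Sequence

# Hypothetical flag name; no such variable exists in llama-index or weaviate.
NO_BATCH_FLAG = "WEAVIATE_DEBUG_NO_BATCH"


def add_objects(
    objects: Sequence[Any],
    insert_one: Callable[[Any], None],
    insert_batch: Callable[[Sequence[Any]], None],
    env: Mapping[str, str] = os.environ,
) -> None:
    """Insert via the batch path unless the debug flag is set."""
    if env.get(NO_BATCH_FLAG, "").lower() in {"1", "true", "yes"}:
        # Slow path: each insert can raise on its own, so nothing is hidden.
        for obj in objects:
            insert_one(obj)
    else:
        insert_batch(objects)
```

The trade-off is speed: the unbatched path is only meant for debugging a failing import, not for production loads.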
Thank you very much for your answer! I will open an issue at LlamaIndex and mention it in the Weaviate issue that @CamJN has opened so that the Weaviate team can review it. Thank you!
Hi! I talked with Duda from Weaviate Community Tech Support, and he told me the cause of my issue. The metadata names in Weaviate must not contain space characters. One of my metadata names had a space between the words, which is why the insertion did not complete. If you want, you can review the explanation here: https://github.com/run-llama/llama_index/issues/14504. @CamJN, do you think that is related to the issue you opened in Weaviate? (https://github.com/weaviate/weaviate/issues/5202) Maybe I am wrong, as your problem was not related to the metadata, right?
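A small sketch of how metadata names could be cleaned up before building documents. The thread only confirms that spaces are a problem; the stricter pattern used here (letters, digits and underscores, not starting with a digit) is my assumption about Weaviate property names and should be checked against the Weaviate schema docs. `sanitize_key` and `sanitize_metadata_keys` are hypothetical helper names.

```python
import re
from typing import Any, Dict


def sanitize_key(key: str) -> str:
    """Replace spaces and other unsupported characters with underscores."""
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", key.strip())
    # Assumption: a name may not be empty or start with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def sanitize_metadata_keys(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of metadata with safe key names, refusing collisions."""
    out: Dict[str, Any] = {}
    for key, value in metadata.items():
        new_key = sanitize_key(key)
        if new_key in out:
            raise ValueError(f"metadata keys collide after sanitizing: {new_key!r}")
        out[new_key] = value
    return out
```

For example, `sanitize_metadata_keys({"file name": "invoice.pdf"})` returns `{"file_name": "invoice.pdf"}`, which could then be passed as the document's metadata.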
My problem was related to the metadata, but it was about fields being json strings not about the fields names
ok, do you think I should add a new comment in your issue saying that the problem in my comment is unrelated?
Might as well