all three are the same, we just store all three to make our backend logic a little more simple lol
Depending on the vector store backend you use, sometimes not all information from the node can be properly stored in the vector store. Although the latest version of llama index really improves this process a lot
Which version? I'm on 0.6.33 and looks like I'm only getting one relationship stored on both Weaviate and Qdrant
try the latest pip install --upgrade llama-index
There was a giant refactor I did under the hood for the node/document objects
they are all pydantic objects now, makes serializing much easier
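e.g. a quick sketch of the round-trip (placeholder values, just to show the pydantic methods):
from llama_index.schema import TextNode

# nodes are pydantic models now, so serializing is a single call
node = TextNode(text="hello world", metadata={"user": "alice"})
payload = node.json()                   # serialize to a JSON string
restored = TextNode.parse_raw(payload)  # round-trip back into a TextNode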
Btw looks like you might've broken things for anyone importing from llama_index.data_structs
Oh I see what you did, it's all in schema.py now
Love the templates on text and metadata
@Logan M Oh and excluded_embed_metadata_keys and excluded_llm_metadata_keys solve a problem I was working on this morning
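A rough sketch of what those keys do with my kind of data (key names here are just illustrative):
from llama_index.schema import MetadataMode, TextNode

node = TextNode(
    text="...",
    metadata={"user": "alice", "wallet_id": "w-123", "time": "1677017733"},
    excluded_embed_metadata_keys=["wallet_id", "time"],
    excluded_llm_metadata_keys=["wallet_id"],
)
# what the embedding model sees: user only
print(node.get_content(metadata_mode=MetadataMode.EMBED))
# what the LLM sees: user and time
print(node.get_content(metadata_mode=MetadataMode.LLM))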
Thanks for noticing! I hope it's useful, because it was a ton of work to change these
Yeah I had been overriding the Node and using that for indexing to keep all the necessary but search-irrelevant metadata from messing up the vectors:
from typing import Optional
from llama_index.data_structs.node import Node  # pre-refactor import path (0.6.x)

class WalletNode(Node):
    # only these metadata keys get rendered into the text that gets embedded
    index_extra_info_fields = ["user"]

    @property
    def extra_info_str(self) -> Optional[str]:
        """Extra info string."""
        if self.extra_info is None:
            return None
        return "\n".join([f"{k}: {str(v)}" for k, v in self.extra_info.items() if k in self.index_extra_info_fields])
@Logan M Hate to be the bearer of bad news...
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[11], line 70
68 matching_data = await self._retrieve_matching(data)
69 updated_data = await self._filter_to_updated(data, matching_data)
---> 70 await self._store(updated_data)
Cell In[11], line 216, in SlackIndexer._store(self, data)
213
215 logging.debug("Adding documents")
--> 216 index = VectorStoreIndex.from_documents(
217 nodes, storage_context=storage_context, service_context=service_context
218 )
File ~/superwallet/env/lib/python3.11/site-packages/llama_index/indices/base.py:93, in BaseIndex.from_documents(cls, documents, storage_context, service_context, **kwargs)
91 with service_context.callback_manager.as_trace("index_construction"):
92 for doc in documents:
---> 93 docstore.set_document_hash(doc.get_doc_id(), doc.hash)
94 nodes = service_context.node_parser.get_nodes_from_documents(documents)
96 return cls(
97 nodes=nodes,
98 storage_context=storage_context,
99 service_context=service_context,
100 **kwargs,
101 )
AttributeError: 'TextNode' object has no attribute 'get_doc_id'
Nodes are not documents
Try this instead
index = VectorStoreIndex(nodes, storage_context=storage_context, service_context=service_context)
Hmm, now I'm seeing this. Looking into fixing things from my side, but figured you might want to know.
Here's the data from qdrant btw:
{
"id": "00031ce4-d863-4da2-b842-132e59982433",
"payload":
{
"user": "...",
"wallet_id": "...",
"time": "1677017733",
"hash": -5698642479903243303,
"_node_content": "{
'id_': '00031ce4-d863-4da2-b842-132e59982433',
'embedding': null,
'metadata':
{
'user': '...',
'wallet_id': '...',
'time': '1677017733',
'hash': -5698642479903243303
},
'excluded_embed_metadata_keys':
[
'hash',
'wallet_id',
'time'
],
'excluded_llm_metadata_keys':
[
'hash',
'wallet_id'
],
'relationships':
{
'3': 'c9afb101-c784-56f4-abb0-cbb14940d795',
'2': '909f84a0-163f-5da8-8055-c5dd955d22d3'
},
'hash': '4a2a50237ce0e424d21e7876675532043f7c32f6edb5d3aef013fa457b9444ae',
'text': '...',
'start_char_idx': null,
'end_char_idx': null,
'text_template': '{metadata_str}\\\\n\\\\n{content}',
'metadata_template': '{key}: {value}',
'metadata_seperator': '\\\\n'
}",
"document_id": "None",
"doc_id": "None",
"ref_doc_id": "None"
},
"vector": null
}
hmm I tested qdrant too 🤔
Although it looks like it's hitting the legacy fallback
Did you create the qdrant index with the new version? or the previous version?
Recreated it from scratch with the new version
hmmm, give me a sec, will spin up my example locally
Oh also, while I get this running, how did you create the nodes?
Specifically the relationships too it looks like
hmm, yeah locally it's working for me
I load documents, pass them into from_documents, and the responses have all the relationships (using qdrant of course)
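roughly this flow (a sketch, with a placeholder data dir and collection name):
import qdrant_client
from llama_index import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores import QdrantVectorStore

# in-memory qdrant just for testing
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)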
Had to construct the nodes manually because the data in the documents has to be linearized:
node = TextNode(
    text=elem["text"],
    doc_id=doc_id,
    metadata={"user": elem["sender"], "wallet_id": wallet_id, "hash": elem["hash"]},
    excluded_llm_metadata_keys=["hash", "wallet_id"],
    excluded_embed_metadata_keys=["hash", "source", "wallet_id", "time"],
)
if next_node is not None:
    node.relationships[NodeRelationship.NEXT] = hash_to_uuid(next_elem["hash"])
if previous_node is not None:
    node.relationships[NodeRelationship.PREVIOUS] = hash_to_uuid(previous_elem["hash"])
nodes.append(node)
the relationships structure changed slightly
instead of pointing to a single string ID
you need a RelatedNodeInfo object
from llama_index.schema import RelatedNodeInfo
node.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(node_id=hash_to_uuid(next_elem["hash"]))
You can also add some other info as well
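e.g. (the extra metadata here is just an example; RelatedNodeInfo also has node_type and hash fields):
from llama_index.schema import NodeRelationship, RelatedNodeInfo

node.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(
    node_id=hash_to_uuid(next_elem["hash"]),
    metadata={"wallet_id": wallet_id},  # optional, travels with the relationship
)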
let me double check i added this to the docs lol
Thanks for bringing this up by the way, better to catch the pain points early on haha
Sure, and while we're at it, looks like if I don't manually supply the doc_id during node creation I'll often have situations like this, where the id of the qdrant element is set but the doc_id isn't:
{
"id": "e2076259-baa5-4867-a0d7-ee25e55d6254",
"payload":
{
"user": "...",
"wallet_id": "...",
"time": "1683243354.121519",
"_node_content":
{
"id_": "e2076259-baa5-4867-a0d7-ee25e55d6254",
"embedding": null,
"metadata":
{
"user": "...",
"wallet_id": "...",
},
"excluded_embed_metadata_keys":
[
"wallet_id",
"time",
],
"excluded_llm_metadata_keys":
[
"wallet_id"
],
"relationships":
{
"3":
{
"node_id": "5fb09ad9-bb4d-4758-9b07-4b75cd93babd",
"node_type": null,
"metadata":
{},
"hash": null
},
"2":
{
"node_id": "d897b202-13ec-4d9e-9a53-46fbd99a436c",
"node_type": null,
"metadata":
{},
"hash": null
}
},
"hash": "eac7e0b8286fa77cd3ad89a790b2e7e1a9d1eff8f42c7a63c4a11697638b80b1",
"text": "24. Omega: \\u03a9\\u03c9",
"start_char_idx": null,
"end_char_idx": null,
"text_template": "{metadata_str}\\n\\n{content}",
"metadata_template": "{key}: {value}",
"metadata_seperator": "\\n"
},
"document_id": "None",
"doc_id": "None",
"ref_doc_id": "None"
}
}
hmm, you mean the IDs at the bottom right?
Those actually come from node.ref_doc_id, so if you didn't set up the SOURCE relationship, they will be None
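i.e. something like this at node-creation time (doc_id here is whatever source document ID you want ref_doc_id to report):
from llama_index.schema import NodeRelationship, RelatedNodeInfo

# point the node back at its source document so ref_doc_id gets populated
node.relationships[NodeRelationship.SOURCE] = RelatedNodeInfo(node_id=doc_id)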
And then I get errors like this when fetching adjacent nodes:
ValueError: doc_id e20a6cd9-e06d-4121-a9af-213851208748 not found.
These two things seem unrelated. Let me check that ValueError
Did you hit this error when querying the docstore? By default, the docstore is not used when using most vector store integrations
Yup, that node postprocessor is slightly incompatible with a vector store integration, at least with default settings
since it's using the docstore directly
You can remedy this slightly when creating the index. But using/managing the docstore and index store will get a little complicated
Since I'm assuming you are using a vector store to avoid writing/loading from disk
Which is fair, we do have a mongodb integration for the index store and docstore.... but like I said, it's getting complicated
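If you did want to try it, the rough shape is something like this (placeholder URI, reusing the vector_store / service_context / nodes from your snippets):
from llama_index import StorageContext, VectorStoreIndex
from llama_index.storage.docstore import MongoDocumentStore

# keep a docstore next to the vector store so docstore-based postprocessors can find nodes
docstore = MongoDocumentStore.from_uri(uri="mongodb://localhost:27017")
docstore.add_documents(nodes)

storage_context = StorageContext.from_defaults(docstore=docstore, vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context, service_context=service_context)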
I might just rewrite the postprocessor
Unless there's another way around it you can think of
Btw, the index_struct doesn't seem to be very large. What's it actually storing? I was thinking through different ways of keeping track of all the data until I looked at it and realized how small it is
The index struct is basically just keeping track of the node_ids it has access to, that's about it lol
the actual data/text is in the docstore.
Unless you use a vector store integration, then everything gets tossed in there and the other two aren't used
The beauty of open source ❤️
Yeah this is why I fell in love with this industry. The more amazing things people build, the more there is for the whole world to use to make more amazing things
Is it possible to do a filter-only search with a vector_store? No embeddings. Would make it much easier for the postprocessor
Hmmm I'm not sure. Right now using the APIs in llama-index, I don't think so. But maybe it's possible using a lower-level API (like the vector store client directly)
@Logan M I'm basically trying to download the relevant portions of the vector db to create a temporary local docstore to use for the requests, by fetching only by filter. I can't dodge around llama-index because they need to be retrieved as llama-index Node objects. Any ideas?
Implement a custom retriever?
Already doing that lol. The big missing piece is how to get Nodes as the output
When nodes are stored in the vector store, most vector stores will store the entire node as a serialized json (as I'm sure you've noticed)
You could use that JSON to re-create the nodes right?
We have this function under the hood for this
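A sketch of the idea, just leaning on the node's own pydantic serialization (assumes the payload shape from your qdrant dumps, where _node_content is the serialized node JSON string):
from llama_index.schema import TextNode

def payload_to_node(payload: dict) -> TextNode:
    # rebuild a TextNode from the JSON blob the vector store kept in its payload
    return TextNode.parse_raw(payload["_node_content"])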
Oh perfect that's the piece I was looking for
Curious to see what you build here btw
Most people don't realize that you can customize this much of llama-index, so we are planning to both advertise this lower level stuff more, as well as look for ways to make it nicer for developers
What are you guys thinking of doing with this longer-term?
Just adjusting the "brand" positioning I suppose
For example, our current black-box pre-made pipelines are great for getting started quickly. But for 99% of applications, you are going to have to customize various things
So by promoting lower level features (hey here's an LLM, here's a prompt, here's a retriever, etc.) and showing how you can customize and combine these, hopefully more people get some use out of it and build cooler stuff with it. And at the same time, hopefully we make these pieces easy to use, customize, and stitch together
Plus, I think a lot of people's workflows also start with playing around with low-level stuff first anyways
Hey @Logan M I've been overriding the get_metadata_str method on my nodes because of these 7 lines:
return self.metadata_seperator.join(
    [
        self.metadata_template.format(key=key, value=str(value))
        for key, value in self.metadata.items()
        if key in usable_metadata_keys
    ]
)
This forces all metadata elements to be rendered identically, which is often not ideal, as many of our most effective context templates look something like "user {x} recorded transaction {y} at time {z}". I don't need anything, but food for thought.
The only downside is it requires the overridden template string to have knowledge of the metadata inside the nodes
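For reference, the kind of override I mean, as a sketch: the class name and template are made up and assume those exact keys exist in every node's metadata:
from llama_index.schema import MetadataMode, TextNode

class WalletTextNode(TextNode):
    def get_metadata_str(self, mode: MetadataMode = MetadataMode.ALL) -> str:
        # hypothetical: render metadata as a sentence instead of "{key}: {value}" lines
        return "user {user} recorded transaction {wallet_id} at time {time}".format(
            **self.metadata
        )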