What's the difference between `doc_id`, `ref_doc_id`, and `document_id`?

What's the difference between doc_id, ref_doc_id, and document_id? It looks like all vectors are stored with all three, but the difference isn't documented. Additionally, I'm only ever seeing one relationship (id="1") stored for every node, despite next and previous being set on the node object.
all three are the same, we just store all three to make our backend logic a little more simple lol

Depending on the vector store backend you use, sometimes not all information from the node can be properly stored in the vector store. Although the latest version of llama index really improves this process a lot
Which version? I'm on 0.6.33 and looks like I'm only getting one relationship stored on both Weaviate and Qdrant
try the latest: pip install --upgrade llama-index
There was a giant refactor I did under the hood for the node/document objects
they are all pydantic objects now, makes serializing much easier
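For example (a minimal sketch, assuming a 0.7.x-era install where nodes are pydantic v1 models):

Plain Text
from llama_index.schema import TextNode

node = TextNode(text="hello world", metadata={"user": "abc"})

# pydantic v1 gives every node a json()/parse_raw() round trip
as_json = node.json()
restored = TextNode.parse_raw(as_json)
assert restored.text == node.text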
Btw looks like you might've broken things for anyone importing from llama_index.data_structs
Oh I see what you did, it's all in schema.py now
Love the templates on text and metadata
@Logan M Oh and excluded_embed_metadata_keys and excluded_llm_metadata_keys solve a problem I was working on this morning
Thanks for noticing! I hope it's useful, because it was a ton of work to change these 😆
Yeah I had been overriding the Node and using that for indexing to keep all the necessary but search-irrelevant metadata from messing up the vectors:

Plain Text
from typing import Optional

from llama_index.data_structs.node import Node  # pre-0.7 import path


class WalletNode(Node):
    # only these extra_info keys should make it into the embedded text
    index_extra_info_fields = ["user"]

    @property
    def extra_info_str(self) -> Optional[str]:
        """Extra info string."""
        if self.extra_info is None:
            return None

        return "\n".join(
            f"{k}: {v}" for k, v in self.extra_info.items() if k in self.index_extra_info_fields
        )
@Logan M Hate to be the bearer of bad news...

Plain Text
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[11], line 70
     68 matching_data = await self._retrieve_matching(data)
     69 updated_data = await self._filter_to_updated(data, matching_data)
---> 70 await self._store(updated_data)

Cell In[11], line 216, in SlackIndexer._store(self, data)
    213 
    215 logging.debug("Adding documents")
--> 216 index = VectorStoreIndex.from_documents(
    217     nodes, storage_context=storage_context, service_context=service_context
    218 )

File ~/superwallet/env/lib/python3.11/site-packages/llama_index/indices/base.py:93, in BaseIndex.from_documents(cls, documents, storage_context, service_context, **kwargs)
     91 with service_context.callback_manager.as_trace("index_construction"):
     92     for doc in documents:
---> 93         docstore.set_document_hash(doc.get_doc_id(), doc.hash)
     94     nodes = service_context.node_parser.get_nodes_from_documents(documents)
     96     return cls(
     97         nodes=nodes,
     98         storage_context=storage_context,
     99         service_context=service_context,
    100         **kwargs,
    101     )

AttributeError: 'TextNode' object has no attribute 'get_doc_id'
Nodes are not documents 😉

Try this instead

Plain Text
index = VectorStoreIndex(nodes, storage_context=storage_context, service_context=service_context)
Hmm, now I'm seeing this. Looking into fixing things from my side, but figured you might want to know.

Here's the data from qdrant btw:
Plain Text
{
    "id": "00031ce4-d863-4da2-b842-132e59982433",
    "payload":
    {
        "user": "...",
        "wallet_id": "...",
        "time": "1677017733",
        "hash": -5698642479903243303,
        "_node_content": "{
            'id_': '00031ce4-d863-4da2-b842-132e59982433',
            'embedding': null,
            'metadata':
            {
                'user': '...',
                'wallet_id': '...',
                'time': '1677017733',
                'hash': -5698642479903243303
            },
            'excluded_embed_metadata_keys':
            [
                'hash',
                'wallet_id',
                'time'
            ],
            'excluded_llm_metadata_keys':
            [
                'hash',
                'wallet_id'
            ],
            'relationships':
            {
                '3': 'c9afb101-c784-56f4-abb0-cbb14940d795',
                '2': '909f84a0-163f-5da8-8055-c5dd955d22d3'
            },
            'hash': '4a2a50237ce0e424d21e7876675532043f7c32f6edb5d3aef013fa457b9444ae',
            'text': '...',
            'start_char_idx': null,
            'end_char_idx': null,
            'text_template': '{metadata_str}\\\\n\\\\n{content}',
            'metadata_template': '{key}: {value}',
            'metadata_seperator': '\\\\n'
        }",
        "document_id": "None",
        "doc_id": "None",
        "ref_doc_id": "None"
    },
    "vector": null
}
hmm I tested qdrant too 🤔 😅 Although it looks like it's hitting the legacy fallback
Did you create the qdrant index with the new version? or the previous version?
Recreated it from scratch with the new version
hmmm, give me a sec, will spin up my example locally
Oh also, while I get this running, how did you create the nodes?
Specifically the relationships too it looks like
hmm, yea locally it's working for me

I load documents, pass them into from_documents, and the responses have all the relationships (using qdrant of course)
Had to construct the nodes manually because the data in the documents has to be linearized:

Plain Text
node = TextNode(
    text=elem["text"],
    doc_id=doc_id,
    metadata={"user": elem["sender"], "wallet_id": wallet_id, "hash": elem["hash"]},
    excluded_llm_metadata_keys=["hash", "wallet_id"],
    excluded_embed_metadata_keys=["hash", "source", "wallet_id", "time"],
)

if next_node is not None:
    node.relationships[NodeRelationship.NEXT] = hash_to_uuid(next_elem["hash"])
if previous_node is not None:
    node.relationships[NodeRelationship.PREVIOUS] = hash_to_uuid(previous_elem["hash"])

nodes.append(node)
ah there it is
the relationships structure changed slightly
instead of pointing to a single string ID
you need a RelatedNodeInfo object

Plain Text
from llama_index.schema import RelatedNodeInfo

node.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(node_id=hash_to_uuid(next_elem["hash"]))


You can also add some other info as well
[Attachment: image.png]
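For example, reusing the helpers from your snippet above (the extra fields are optional; this is just a sketch):

Plain Text
from llama_index.schema import NodeRelationship, RelatedNodeInfo

node.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(
    node_id=hash_to_uuid(next_elem["hash"]),
    metadata={"hash": next_elem["hash"]},  # optional extra context about the related node
)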
let me double check I added this to the docs lol
crap, I didn't haha
will patch that
Thanks for bringing this up by the way, better to catch the pain points early on haha
Sure, and while we're at it, looks like if I don't manually supply the doc_id during node creation I'll often have situations like this, where the id of the qdrant element is set but the doc_id isn't:

Plain Text
{
    "id": "e2076259-baa5-4867-a0d7-ee25e55d6254",
    "payload":
    {
        "user": "...",
        "wallet_id": "...",
        "time": "1683243354.121519",
        "_node_content":
        {
            "id_": "e2076259-baa5-4867-a0d7-ee25e55d6254",
            "embedding": null,
            "metadata":
            {
                "user": "...",
                "wallet_id": "...",
            },
            "excluded_embed_metadata_keys":
            [
                "wallet_id",
                "time",
            ],
            "excluded_llm_metadata_keys":
            [
                "wallet_id"
            ],
            "relationships":
            {
                "3":
                {
                    "node_id": "5fb09ad9-bb4d-4758-9b07-4b75cd93babd",
                    "node_type": null,
                    "metadata":
                    {},
                    "hash": null
                },
                "2":
                {
                    "node_id": "d897b202-13ec-4d9e-9a53-46fbd99a436c",
                    "node_type": null,
                    "metadata":
                    {},
                    "hash": null
                }
            },
            "hash": "eac7e0b8286fa77cd3ad89a790b2e7e1a9d1eff8f42c7a63c4a11697638b80b1",
            "text": "24. Omega: \\u03a9\\u03c9",
            "start_char_idx": null,
            "end_char_idx": null,
            "text_template": "{metadata_str}\\n\\n{content}",
            "metadata_template": "{key}: {value}",
            "metadata_seperator": "\\n"
        },
        "document_id": "None",
        "doc_id": "None",
        "ref_doc_id": "None"
    }
}
hmm, you mean the id's at the bottom right?
Those actually come from node.ref_doc_id, so if you didn't set up the SOURCE relationship, they will be none
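If you want those populated, here's a sketch of setting the SOURCE relationship (doc_id is whatever parent document id you're tracking, like in your earlier snippet):

Plain Text
from llama_index.schema import NodeRelationship, RelatedNodeInfo

# node.ref_doc_id is derived from the SOURCE relationship
node.relationships[NodeRelationship.SOURCE] = RelatedNodeInfo(node_id=doc_id)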
And then I get errors like this when fetching adjacent nodes:
Plain Text
ValueError: doc_id e20a6cd9-e06d-4121-a9af-213851208748 not found.
These two things seem unrelated. Let me check that value error
Did you hit this error when querying the docstore? By default, the docstore is not used when using most vector store integrations
Yup, that node postprocessor is slightly incompatible with a vector store integration, at least with default settings
since it's using the docstore directly
You can remedy this slightly when creating the index. But using/managing the docstore and index store will get a little complicated
Since I'm assuming you are using a vector store to avoid writing/loading from disk
Which is fair, we do have a MongoDB integration for the index store and docstore... but like I said, it's getting complicated 😅
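One possible remedy, sketched on the assumption that you keep a SimpleDocumentStore alongside the Qdrant vector store so node lookups have somewhere to go:

Plain Text
from llama_index import StorageContext, VectorStoreIndex
from llama_index.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()
docstore.add_documents(nodes)  # keep the full nodes so adjacent-node lookups work

storage_context = StorageContext.from_defaults(
    docstore=docstore,
    vector_store=vector_store,  # your existing QdrantVectorStore
)
index = VectorStoreIndex(nodes, storage_context=storage_context)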
I might just rewrite the postprocessor
Unless there's another way around it you can think of
Btw, the index_struct doesn't seem to be very large. What's it actually storing? I was thinking through different ways of keeping track of all the data until I looked at it and realized how small it is
The index struct is basically just keeping track of node_ids that it has access to, that's about it lol
the actual data/text is in the docstore.

Unless you use a vector store integration, then everything gets tossed in there and the other two aren't used
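You can see that for yourself (assuming a VectorStoreIndex):

Plain Text
# the struct itself is tiny: it maps vector-store ids to node ids, no text
print(index.index_struct.nodes_dict)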
The beauty of open source ❤️
Yeah this is why I fell in love with this industry. The more amazing things people build, the more there is for the whole world to use to make more amazing things
Is it possible to do a filter-only search with a vector_store? No embeddings. Would make it much easier for the postprocessor
Hmmm I'm not sure. Right now using the API's in llama-index, I don't think so. But maybe it's possible using a lower-level api (like the vector store client directly)
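Something like this might work going straight at the client (collection name and payload field are assumptions based on your payloads above):

Plain Text
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(url="http://localhost:6333")  # assumed local endpoint

# scroll fetches points by payload filter only, no query vector needed
points, _next_offset = client.scroll(
    collection_name="wallets",  # hypothetical collection name
    scroll_filter=models.Filter(
        must=[models.FieldCondition(key="user", match=models.MatchValue(value="some-user"))]
    ),
    with_payload=True,
    with_vectors=False,
    limit=100,
)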
@Logan M I'm basically trying to download the relevant portions of the vector db to create a temporary local docstore to use for the requests, by fetching only by filter. I can't dodge around llama-index because they need to be retrieved as llama-index Node objects. Any ideas?
Implement a custom retriever? 😅
You can see here, you can actually customize both the retriever and response synthesizer separately

https://gpt-index.readthedocs.io/en/latest/guides/primer/usage_pattern.html#low-level-api
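The shape from that docs page, roughly (a sketch: _retrieve just needs to return NodeWithScore objects, and fetch_nodes_by_filter is a hypothetical callable you'd supply):

Plain Text
from typing import List

from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever
from llama_index.schema import NodeWithScore


class FilterOnlyRetriever(BaseRetriever):
    """Hypothetical retriever that fetches nodes by metadata filter instead of embeddings."""

    def __init__(self, fetch_nodes_by_filter):
        self._fetch_nodes_by_filter = fetch_nodes_by_filter

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        nodes = self._fetch_nodes_by_filter(query_bundle.query_str)
        return [NodeWithScore(node=node, score=1.0) for node in nodes]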
Already doing that lol. The big missing piece is how to get Nodes as the output
When nodes are stored in the vector store, most vector stores will store the entire node as a serialized json (as I'm sure you've noticed)

You could use that JSON to re-create the nodes right?
We have this function under the hood for this
[Attachment: image.png]
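For reference, I believe the helper shown is metadata_dict_to_node from llama_index.vector_stores.utils (treat the exact import path as an assumption for your version); roughly:

Plain Text
from llama_index.vector_stores.utils import metadata_dict_to_node

# `point` is one record returned by the qdrant client;
# its payload carries the serialized node under "_node_content"
node = metadata_dict_to_node(point.payload)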
Oh perfect that's the piece I was looking for
Curious to see what you build here btw

Most people don't realize that you can customize this much of llama-index, so we are planning to both advertise this lower level stuff more, as well as look for ways to make it nicer for developers 🙂
What are you guys thinking of doing with this longer-term
Just adjusting the "brand" positioning I suppose

For example, our current black-box pre-made pipelines are great for getting started quickly. But for 99% of applications, you are going to have to customize various things

So by promoting lower level features (hey here's an LLM, here's a prompt, here's a retriever, etc.) and showing how you can customize and combine these, hopefully more people get some use out of it and build cooler stuff with it. And at the same time, hopefully we make these pieces easy to use, customize, and stitch together

Plus, I think a lot of people's workflows also start with playing around with low-level stuff first anyway
Hey @Logan M I've been overriding the get_metadata_str method on my nodes because of these 7 lines:

Plain Text
return self.metadata_seperator.join(
    [
        self.metadata_template.format(key=key, value=str(value))
        for key, value in self.metadata.items()
        if key in usable_metadata_keys
    ]
)


This forces all metadata elements to be rendered identically, which is often not ideal, as many of our most effective context templates look something like "user {x} recorded transaction {y} at time {z}". I don't need anything changed, but food for thought.
The only downside is it requires the overridden template string to have knowledge of the metadata inside the nodes
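Roughly what my override looks like, as a sketch using the field names from the payloads above (not the exact class):

Plain Text
from llama_index.schema import MetadataMode, TextNode


class TransactionNode(TextNode):
    """Hypothetical node that renders metadata as a sentence instead of key/value lines."""

    def get_metadata_str(self, mode: MetadataMode = MetadataMode.ALL) -> str:
        m = self.metadata
        return f"user {m.get('user')} recorded transaction {m.get('wallet_id')} at time {m.get('time')}"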