What's the difference between `doc_id`, `ref_doc_id`, and `document_id`?

What's the difference between doc_id, ref_doc_id, and document_id? It looks like all vectors are stored with all three, but the difference isn't documented. Additionally, I'm only ever seeing one relationship (id="1") stored for every node, despite next and previous being set on the node object.
all three are the same, we just store all three to make our backend logic a little more simple lol

Depending on the vector store backend you use, sometimes not all information from the node can be properly stored in the vector store. Although the latest version of llama index really improves this process a lot
Which version? I'm on 0.6.33 and looks like I'm only getting one relationship stored on both Weaviate and Qdrant
try the latest: pip install --upgrade llama-index
There was a giant refactor I did under the hood for the node/document objects
they are all pydantic objects now, makes serializing much easier
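For example (a minimal sketch, assuming a 0.7.x-era install where nodes are pydantic v1 models):

Plain Text
from llama_index.schema import TextNode

node = TextNode(text="hello world", metadata={"user": "abc"})

# pydantic v1 gives every node a json()/parse_raw() round trip
as_json = node.json()
restored = TextNode.parse_raw(as_json)
assert restored.text == node.text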
Btw looks like you might've broken things for anyone importing from llama_index.data_structs
Oh I see what you did, it's all in schema.py now
Love the templates on text and metadata
@Logan M Oh and excluded_embed_metadata_keys and excluded_llm_metadata_keys solve a problem I was working on this morning
Thanks for noticing! I hope it's useful, because it was a ton of work to change these 😆
Yeah I had been overriding the Node and using that for indexing to keep all the necessary but search-irrelevant metadata from messing up the vectors:

Plain Text
from typing import Optional

from llama_index.data_structs.node import Node  # pre-0.7 import path


class WalletNode(Node):
    # only these extra_info keys should make it into the embedded text
    index_extra_info_fields = ["user"]

    @property
    def extra_info_str(self) -> Optional[str]:
        """Extra info string."""
        if self.extra_info is None:
            return None

        return "\n".join(
            f"{k}: {v}" for k, v in self.extra_info.items() if k in self.index_extra_info_fields
        )
@Logan M Hate to be the bearer of bad news...

Plain Text
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[11], line 70
     68 matching_data = await self._retrieve_matching(data)
     69 updated_data = await self._filter_to_updated(data, matching_data)
---> 70 await self._store(updated_data)

Cell In[11], line 216, in SlackIndexer._store(self, data)
    213 
    215 logging.debug("Adding documents")
--> 216 index = VectorStoreIndex.from_documents(
    217     nodes, storage_context=storage_context, service_context=service_context
    218 )

File ~/superwallet/env/lib/python3.11/site-packages/llama_index/indices/base.py:93, in BaseIndex.from_documents(cls, documents, storage_context, service_context, **kwargs)
     91 with service_context.callback_manager.as_trace("index_construction"):
     92     for doc in documents:
---> 93         docstore.set_document_hash(doc.get_doc_id(), doc.hash)
     94     nodes = service_context.node_parser.get_nodes_from_documents(documents)
     96     return cls(
     97         nodes=nodes,
     98         storage_context=storage_context,
     99         service_context=service_context,
    100         **kwargs,
    101     )

AttributeError: 'TextNode' object has no attribute 'get_doc_id'
Nodes are not documents 😉

Try this instead

Plain Text
index = VectorStoreIndex(nodes, storage_context=storage_context, service_context=service_context)
Hmm, now I'm seeing this. Looking into fixing things from my side, but figured you might want to know.

Here's the data from qdrant btw:
Plain Text
{
    "id": "00031ce4-d863-4da2-b842-132e59982433",
    "payload":
    {
        "user": "...",
        "wallet_id": "...",
        "time": "1677017733",
        "hash": -5698642479903243303,
        "_node_content": "{
            'id_': '00031ce4-d863-4da2-b842-132e59982433',
            'embedding': null,
            'metadata':
            {
                'user': '...',
                'wallet_id': '...',
                'time': '1677017733',
                'hash': -5698642479903243303
            },
            'excluded_embed_metadata_keys':
            [
                'hash',
                'wallet_id',
                'time'
            ],
            'excluded_llm_metadata_keys':
            [
                'hash',
                'wallet_id'
            ],
            'relationships':
            {
                '3': 'c9afb101-c784-56f4-abb0-cbb14940d795',
                '2': '909f84a0-163f-5da8-8055-c5dd955d22d3'
            },
            'hash': '4a2a50237ce0e424d21e7876675532043f7c32f6edb5d3aef013fa457b9444ae',
            'text': '...',
            'start_char_idx': null,
            'end_char_idx': null,
            'text_template': '{metadata_str}\\\\n\\\\n{content}',
            'metadata_template': '{key}: {value}',
            'metadata_seperator': '\\\\n'
        }",
        "document_id": "None",
        "doc_id": "None",
        "ref_doc_id": "None"
    },
    "vector": null
}
hmm I tested qdrant too 🤔 😅 Although it looks like it's hitting the legacy fallback
Did you create the qdrant index with the new version? or the previous version?
Recreated it from scratch with the new version
hmmm, give me a sec, will spin up my example locally
Oh also, while I get this running, how did you create the nodes?
Specifically the relationships too it looks like
hmm, yea locally it's working for me

I load documents, pass them into from_documents, and the responses have all the relationships (using qdrant of course)
Had to construct the nodes manually because the data in the documents has to be linearized:

Plain Text
node = TextNode(
    text=elem["text"],
    doc_id=doc_id,
    metadata={"user": elem["sender"], "wallet_id": wallet_id, "hash": elem["hash"]},
    excluded_llm_metadata_keys=["hash", "wallet_id"],
    excluded_embed_metadata_keys=["hash", "source", "wallet_id", "time"],
)

if next_node is not None:
    node.relationships[NodeRelationship.NEXT] = hash_to_uuid(next_elem["hash"])
if previous_node is not None:
    node.relationships[NodeRelationship.PREVIOUS] = hash_to_uuid(previous_elem["hash"])

nodes.append(node)
ah there it is
the relationships structure changed slightly
instead of pointing to a single string ID
you need a RelatedNodeInfo object

Plain Text
from llama_index.schema import RelatedNodeInfo

node.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(node_id=hash_to_uuid(next_elem["hash"]))


You can also add some other info as well
[Attachment: image.png]
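For example, reusing the helpers from your snippet above (the extra fields are optional; this is just a sketch):

Plain Text
from llama_index.schema import NodeRelationship, RelatedNodeInfo

node.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(
    node_id=hash_to_uuid(next_elem["hash"]),
    metadata={"hash": next_elem["hash"]},  # optional extra context about the related node
)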
let me double check I added this to the docs lol
crap, I didn't haha
will patch that
Thanks for bringing this up by the way, better to catch the pain points early on haha
Sure, and while we're at it, looks like if I don't manually supply the doc_id during node creation I'll often have situations like this, where the id of the qdrant element is set but the doc_id isn't:

Plain Text
{
    "id": "e2076259-baa5-4867-a0d7-ee25e55d6254",
    "payload":
    {
        "user": "...",
        "wallet_id": "...",
        "time": "1683243354.121519",
        "_node_content":
        {
            "id_": "e2076259-baa5-4867-a0d7-ee25e55d6254",
            "embedding": null,
            "metadata":
            {
                "user": "...",
                "wallet_id": "...",
            },
            "excluded_embed_metadata_keys":
            [
                "wallet_id",
                "time",
            ],
            "excluded_llm_metadata_keys":
            [
                "wallet_id"
            ],
            "relationships":
            {
                "3":
                {
                    "node_id": "5fb09ad9-bb4d-4758-9b07-4b75cd93babd",
                    "node_type": null,
                    "metadata":
                    {},
                    "hash": null
                },
                "2":
                {
                    "node_id": "d897b202-13ec-4d9e-9a53-46fbd99a436c",
                    "node_type": null,
                    "metadata":
                    {},
                    "hash": null
                }
            },
            "hash": "eac7e0b8286fa77cd3ad89a790b2e7e1a9d1eff8f42c7a63c4a11697638b80b1",
            "text": "24. Omega: \\u03a9\\u03c9",
            "start_char_idx": null,
            "end_char_idx": null,
            "text_template": "{metadata_str}\\n\\n{content}",
            "metadata_template": "{key}: {value}",
            "metadata_seperator": "\\n"
        },
        "document_id": "None",
        "doc_id": "None",
        "ref_doc_id": "None"
    }
}
hmm, you mean the id's at the bottom right?
Those actually come from node.ref_doc_id, so if you didn't set up the SOURCE relationship, they will be none
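If you want those populated, here's a sketch of setting the SOURCE relationship (doc_id is whatever parent document id you're tracking, like in your earlier snippet):

Plain Text
from llama_index.schema import NodeRelationship, RelatedNodeInfo

# node.ref_doc_id is derived from the SOURCE relationship
node.relationships[NodeRelationship.SOURCE] = RelatedNodeInfo(node_id=doc_id)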
And then I get errors like this when fetching adjacent nodes:
Plain Text
ValueError: doc_id e20a6cd9-e06d-4121-a9af-213851208748 not found.
These two things seem unrelated. Let me check that value error
Did you hit this error when querying the docstore? By default, the docstore is not used when using most vector store integrations
Yup, that node postprocessor is slightly incompatible with a vector store integration, at least with default settings
since it's using the docstore directly
You can remedy this slightly when creating the index. But using/managing the docstore and index store will get a little complicated
Since I'm assuming you are using a vector store to avoid writing/loading from disk
Which is fair, we do have a MongoDB integration for the index store and docstore... but like I said, it's getting complicated 😅
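One possible remedy, sketched on the assumption that you keep a SimpleDocumentStore alongside the Qdrant vector store so node lookups have somewhere to go:

Plain Text
from llama_index import StorageContext, VectorStoreIndex
from llama_index.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()
docstore.add_documents(nodes)  # keep the full nodes so adjacent-node lookups work

storage_context = StorageContext.from_defaults(
    docstore=docstore,
    vector_store=vector_store,  # your existing QdrantVectorStore
)
index = VectorStoreIndex(nodes, storage_context=storage_context)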
I might just rewrite the postprocessor
Unless there's another way around it you can think of
Btw, the index_struct doesn't seem to be very large. What's it actually storing? I was thinking through different ways of keeping track of all the data until I looked at it and realized how small it is
The index struct is basically just keeping track of node_ids that it has access to, that's about it lol
the actual data/text is in the docstore.

Unless you use a vector store integration, then everything gets tossed in there and the other two aren't used
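You can see that for yourself (assuming a VectorStoreIndex):

Plain Text
# the struct itself is tiny: it maps vector-store ids to node ids, no text
print(index.index_struct.nodes_dict)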
The beauty of open source ❤️
Yeah this is why I fell in love with this industry. The more amazing things people build, the more there is for the whole world to use to make more amazing things
Is it possible to do a filter-only search with a vector_store? No embeddings. Would make it much easier for the postprocessor
Hmmm I'm not sure. Right now using the API's in llama-index, I don't think so. But maybe it's possible using a lower-level api (like the vector store client directly)
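Something like this might work going straight at the client (collection name and payload field are assumptions based on your payloads above):

Plain Text
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(url="http://localhost:6333")  # assumed local endpoint

# scroll fetches points by payload filter only, no query vector needed
points, _next_offset = client.scroll(
    collection_name="wallets",  # hypothetical collection name
    scroll_filter=models.Filter(
        must=[models.FieldCondition(key="user", match=models.MatchValue(value="some-user"))]
    ),
    with_payload=True,
    with_vectors=False,
    limit=100,
)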
@Logan M I'm basically trying to download the relevant portions of the vector db to create a temporary local docstore to use for the requests, by fetching only by filter. I can't dodge around llama-index because they need to be retrieved as llama-index Node objects. Any ideas?
Implement a custom retriever? 😅
You can see here, you can actually customize both the retriever and response synthesizer separately

https://gpt-index.readthedocs.io/en/latest/guides/primer/usage_pattern.html#low-level-api
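The shape from that docs page, roughly (a sketch: _retrieve just needs to return NodeWithScore objects, and fetch_nodes_by_filter is a hypothetical callable you'd supply):

Plain Text
from typing import List

from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever
from llama_index.schema import NodeWithScore


class FilterOnlyRetriever(BaseRetriever):
    """Hypothetical retriever that fetches nodes by metadata filter instead of embeddings."""

    def __init__(self, fetch_nodes_by_filter):
        self._fetch_nodes_by_filter = fetch_nodes_by_filter

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        nodes = self._fetch_nodes_by_filter(query_bundle.query_str)
        return [NodeWithScore(node=node, score=1.0) for node in nodes]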
Already doing that lol. The big missing piece is how to get Nodes as the output
When nodes are stored in the vector store, most vector stores will store the entire node as a serialized json (as I'm sure you've noticed)

You could use that JSON to re-create the nodes right?
We have this function under the hood for this
[Attachment: image.png]
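For reference, I believe the helper shown is metadata_dict_to_node from llama_index.vector_stores.utils (treat the exact import path as an assumption for your version); roughly:

Plain Text
from llama_index.vector_stores.utils import metadata_dict_to_node

# `point` is one record returned by the qdrant client;
# its payload carries the serialized node under "_node_content"
node = metadata_dict_to_node(point.payload)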
Oh perfect that's the piece I was looking for
Curious to see what you build here btw

Most people don't realize that you can customize this much of llama-index, so we are planning to both advertise this lower level stuff more, as well as look for ways to make it nicer for developers 🙂
What are you guys thinking of doing with this longer-term
Just adjusting the "brand" positioning I suppose

For example, our current black-box pre-made pipelines are great for getting started quickly. But for 99% of applications, you are going to have to customize various things

So by promoting lower level features (hey here's an LLM, here's a prompt, here's a retriever, etc.) and showing how you can customize and combine these, hopefully more people get some use out of it and build cooler stuff with it. And at the same time, hopefully we make these pieces easy to use, customize, and stitch together

Plus, I think a lot of people's workflows also start with playing around with low-level stuff first anyway
Hey @Logan M I've been overriding the get_metadata_str method on my nodes because of these 7 lines:

Plain Text
return self.metadata_seperator.join(
    [
        self.metadata_template.format(key=key, value=str(value))
        for key, value in self.metadata.items()
        if key in usable_metadata_keys
    ]
)


This forces all metadata elements to be rendered identically, which is often not ideal, as many of our most effective context templates look something like "user {x} recorded transaction {y} at time {z}". I don't need anything changed, but food for thought.
The only downside is it requires the overridden template string to have knowledge of the metadata inside the nodes
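Roughly what my override looks like, as a sketch using the field names from the payloads above (not the exact class):

Plain Text
from llama_index.schema import MetadataMode, TextNode


class TransactionNode(TextNode):
    """Hypothetical node that renders metadata as a sentence instead of key/value lines."""

    def get_metadata_str(self, mode: MetadataMode = MetadataMode.ALL) -> str:
        m = self.metadata
        return f"user {m.get('user')} recorded transaction {m.get('wallet_id')} at time {m.get('time')}"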