
@Logan M Big Docs are continuing to plague me with issues. When I create a DocumentSummaryIndex, this line grabs the first node's metadata, and that ends up exceeding pinecone's limits. Shouldn't this also add the exclude llm/embed field lists? I did try to add that, but the embed exclusion filter seems to happen somewhere else...
one sec let me grab the line from the main repo
lol oh pinecone
that's the silliest limit
(Have you considered not using pinecone?)

But for real, I think the solution here is not including so much metadata in your documents/nodes (just guessing, without seeing the full error or full context)
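For example, a minimal sketch of keeping a bulky key out of what gets embedded (the file_path/raw_header keys are placeholders, not anything from your setup):
Plain Text
from llama_index.schema import Document

doc = Document(
    text="...",
    # hypothetical bulky metadata
    metadata={"file_path": "/data/big_doc.pdf", "raw_header": "..."},
    # keep the bulky key out of what the embed model and LLM see
    excluded_embed_metadata_keys=["raw_header"],
    excluded_llm_metadata_keys=["raw_header"],
)
Note though that these exclusion lists only control the text the embed model/LLM sees; the metadata itself still gets persisted to the vector store.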
Ah, I see, this line of code isn't inheriting the exclude fields, yea. But that won't solve the pinecone issue, because the metadata will still get inserted into pinecone
will need to do something like this
Plain Text
# carry the exclusion lists over from the first source node
first_node = doc_id_to_nodes.get(doc_id, [TextNode()])[0]
excluded_embed_metadata_keys = first_node.excluded_embed_metadata_keys
excluded_llm_metadata_keys = first_node.excluded_llm_metadata_keys
summary_node_dict[doc_id] = TextNode(
    text=summary_response.response,
    relationships={NodeRelationship.SOURCE: RelatedNodeInfo(node_id=doc_id)},
    metadata=metadata,
    excluded_embed_metadata_keys=excluded_embed_metadata_keys,
    excluded_llm_metadata_keys=excluded_llm_metadata_keys,
)
and then this
Plain Text
# drop excluded keys before the node goes into the vector store
# (pop with a default, so falsy values are removed too)
for k in node_with_embedding.excluded_embed_metadata_keys:
    node_with_embedding.metadata.pop(k, None)
summary_nodes_with_embedding.append(node_with_embedding)
ha. we may need to consider using something other than pinecone for this
the other bit that works against this is the _node_content field, which duplicates everything in the metadata, essentially doubling things up
This could be solved by removing the metadata from node_content and adding it back when querying. (Not every DB would support a flow like this, but pinecone might)
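A rough sketch of that, using the node_to_metadata_dict helper from llama_index.vector_stores.utils (assuming the vector store serializes nodes this way; the node here is made up):
Plain Text
import json

from llama_index.schema import TextNode
from llama_index.vector_stores.utils import node_to_metadata_dict

node = TextNode(text="summary text", metadata={"title": "Big Doc"})

# serialize the node the way the vector stores do, then empty out the
# duplicate metadata copy that lives inside the _node_content JSON blob,
# so each key is stored only once (at the top level) in pinecone
entry = node_to_metadata_dict(node)
content = json.loads(entry["_node_content"])
content["metadata"] = {}
entry["_node_content"] = json.dumps(content)

# at query time, the flat top-level keys would get copied back into the
# node after _node_content is deserialized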
one thought along those lines: we don't store any metadata, just the node reference, as the metadata that gets uploaded to pinecone, and then when the query fetches the nodes from pinecone, we pull the full node in from doc storage?
essentially copy the node sans metadata
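Something like this, maybe (pinecone_index stands in for a real pinecone.Index handle, and the embedding is assumed to be populated already):
Plain Text
from llama_index.schema import TextNode
from llama_index.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()

node = TextNode(text="summary text", metadata={"title": "Big Doc"})
docstore.add_documents([node])

# upsert only the id and vector; the only metadata is the node id itself
# pinecone_index.upsert(vectors=[(node.node_id, node.embedding, {"node_id": node.node_id})])

# at query time, hydrate matches from the docstore instead of pinecone metadata:
# matches = pinecone_index.query(vector=query_embedding, top_k=5)["matches"]
# nodes = [docstore.get_node(m["id"]) for m in matches]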
but yeah, at that point maybe it is just simpler to go with qdrant or pg_vector
yea that would be a nice option, but would require the user to have a second dedicated storage layer (the docstore). Maybe not a bad thing, but annoying lol
we are doing this on the backend, and already use redis and postgres, so doing the docstore isn't an issue
Where do you think that option could go? The KV store base?
Or pinecone specific
Basically, vector stores have this attribute "stores_text" -- if it's true, outside classes know to shove all the node info into the vector store. If it's false, outside classes assume there is some docstore that has the actual content, and the vector store only has the ID.

So, if this field was configurable for a vector store (say pinecone), and also implemented in the vector store to handle both cases, then it would work
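Something in this shape, purely illustrative (not the actual PineconeVectorStore signature):
Plain Text
from llama_index.vector_stores.utils import node_to_metadata_dict

class ConfigurablePineconeStore:
    """Sketch: stores_text as a constructor option instead of a class constant."""

    def __init__(self, pinecone_index, stores_text: bool = True):
        self.stores_text = stores_text
        self._index = pinecone_index

    def add(self, nodes):
        entries = []
        for node in nodes:
            if self.stores_text:
                # current behaviour: full node content rides along as metadata
                metadata = node_to_metadata_dict(node)
            else:
                # docstore-backed mode: only the id travels to pinecone
                metadata = {"node_id": node.node_id}
            entries.append((node.node_id, node.embedding, metadata))
        self._index.upsert(vectors=entries)
        return [node.node_id for node in nodes]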
k, I've seen that, so you are suggesting "stores_metadata" as something that would work similarly
would we still need to store the node id as the metadata, or could we look the node up in the docstore if we had the vector id?
Well, stores_text controls whether the entire node_content field gets inserted. So it's more than just text; we can rely on the docstore having all the node content
do you want a PR that honors the original nodes' embed key exclusions in the summary node?
that would be appreciated!
I am presuming that you'll want this change for more than just the document summary index?
not seeing any other index that uses TextNode