
@Logan M Big Docs are continuing to plague me with issues. When I create a DocumentSummaryIndex, this line grabs the first node's metadata, and that ends up exceeding pinecone's limits. Shouldn't this also add the exclude llm/embed field lists? I did try to add that, but the embed exclusion filter seems to happen somewhere else...
one sec let me grab the line from the main repo
lol oh pinecone
that's the silliest limit
(Have you considered not using pinecone?)

But for real, I think the solution here is not including so much metadata in your documents/nodes (just guessing, without seeing the full error or full context)
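For example, a minimal sketch of keeping a bulky key out of what gets embedded (the file_path/raw_header keys are placeholders, not anything from your setup):
Plain Text
from llama_index.schema import Document

doc = Document(
    text="...",
    # hypothetical bulky metadata
    metadata={"file_path": "/data/big_doc.pdf", "raw_header": "..."},
    # keep the bulky key out of what the embed model and LLM see
    excluded_embed_metadata_keys=["raw_header"],
    excluded_llm_metadata_keys=["raw_header"],
)
Note though that these exclusion lists only control the text the embed model/LLM sees; the metadata itself still gets persisted to the vector store.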
Ah, I see, this line of code isn't inheriting the exclude fields, yea. But that won't solve the pinecone issue, because the metadata will still get inserted into pinecone
will need to do something like this
Plain Text
# carry the exclusion lists over from the first source node
first_node = doc_id_to_nodes.get(doc_id, [TextNode()])[0]
excluded_embed_metadata_keys = first_node.excluded_embed_metadata_keys
excluded_llm_metadata_keys = first_node.excluded_llm_metadata_keys
summary_node_dict[doc_id] = TextNode(
    text=summary_response.response,
    relationships={NodeRelationship.SOURCE: RelatedNodeInfo(node_id=doc_id)},
    metadata=metadata,
    excluded_embed_metadata_keys=excluded_embed_metadata_keys,
    excluded_llm_metadata_keys=excluded_llm_metadata_keys,
)
and then this
Plain Text
# drop excluded keys before the node goes into the vector store
# (pop with a default, so falsy values are removed too)
for k in node_with_embedding.excluded_embed_metadata_keys:
    node_with_embedding.metadata.pop(k, None)
summary_nodes_with_embedding.append(node_with_embedding)
ha. we may need to consider using something other than pinecone for this
the other bit that works against this is the _node_content field, which duplicates everything in the metadata, essentially doubling things up
This could be solved by removing the metadata from node_content and adding it back when querying. (Not every DB would support a flow like this, but pinecone might)
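A rough sketch of that, using the node_to_metadata_dict helper from llama_index.vector_stores.utils (assuming the vector store serializes nodes this way; the node here is made up):
Plain Text
import json

from llama_index.schema import TextNode
from llama_index.vector_stores.utils import node_to_metadata_dict

node = TextNode(text="summary text", metadata={"title": "Big Doc"})

# serialize the node the way the vector stores do, then empty out the
# duplicate metadata copy that lives inside the _node_content JSON blob,
# so each key is stored only once (at the top level) in pinecone
entry = node_to_metadata_dict(node)
content = json.loads(entry["_node_content"])
content["metadata"] = {}
entry["_node_content"] = json.dumps(content)

# at query time, the flat top-level keys would get copied back into the
# node after _node_content is deserialized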
one thought along those lines: we don't store any metadata, just the node reference, as the metadata that gets uploaded to pinecone, and then when the query fetches the nodes from pinecone, we pull the full node in from doc storage?
essentially copy the node sans metadata
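Something like this, maybe (pinecone_index stands in for a real pinecone.Index handle, and the embedding is assumed to be populated already):
Plain Text
from llama_index.schema import TextNode
from llama_index.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()

node = TextNode(text="summary text", metadata={"title": "Big Doc"})
docstore.add_documents([node])

# upsert only the id and vector; the only metadata is the node id itself
# pinecone_index.upsert(vectors=[(node.node_id, node.embedding, {"node_id": node.node_id})])

# at query time, hydrate matches from the docstore instead of pinecone metadata:
# matches = pinecone_index.query(vector=query_embedding, top_k=5)["matches"]
# nodes = [docstore.get_node(m["id"]) for m in matches]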
but yeah, at that point maybe it is just simpler to go with qdrant or pg_vector
yea that would be a nice option, but would require the user to have a second dedicated storage layer (the docstore). Maybe not a bad thing, but annoying lol
we are doing this on the backend, and already use redis and postgres, so doing the docstore isn't an issue
Where do you think that option could go? The KV store base?
Or pinecone specific
Basically, vector stores have this attribute "stores_text" -- if it's true, outside classes know to shove all the node info into the vector store. If it's false, outside classes assume there is some docstore that has the actual content, and the vector store only has the ID.

So, if this field was configurable for a vector store (say pinecone), and also implemented in the vector store to handle both cases, then it would work
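Something in this shape, purely illustrative (not the actual PineconeVectorStore signature):
Plain Text
from llama_index.vector_stores.utils import node_to_metadata_dict

class ConfigurablePineconeStore:
    """Sketch: stores_text as a constructor option instead of a class constant."""

    def __init__(self, pinecone_index, stores_text: bool = True):
        self.stores_text = stores_text
        self._index = pinecone_index

    def add(self, nodes):
        entries = []
        for node in nodes:
            if self.stores_text:
                # current behaviour: full node content rides along as metadata
                metadata = node_to_metadata_dict(node)
            else:
                # docstore-backed mode: only the id travels to pinecone
                metadata = {"node_id": node.node_id}
            entries.append((node.node_id, node.embedding, metadata))
        self._index.upsert(vectors=entries)
        return [node.node_id for node in nodes]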
k, I've seen that, so you are suggesting "stores_metadata" as something that would work similarly
would we still need to store the node id as the metadata, or could we look the node up in the docstore if we had the vector id?
Well, stores_text controls whether the entire node_content field gets inserted. So it's more than just text; we can rely on the docstore having all the node content
do you want a PR that honors the original nodes' embed key exclusions in the summary node?
that would be appreciated!
I am presuming that you'll want this change for more than just the document summary index?
not seeing any other index that uses TextNode