is it possible to pull out the nodes

Is it possible to pull out the nodes from a Chroma vector store that were created when we use either .from_documents() or .refresh_ref_docs(), for use in another type of index? I want to save on computing embeddings and just insert the nodes + embeddings from the first index store into another, for example when creating a basic SummaryIndex or a TreeIndex. Is it better to compute embeddings manually and store them separately ahead of time, then build the various indices? cheers!
One thing I'm not sure about: when we use a method like .refresh_ref_docs() and we've used the SentenceWindowNodeParser, what part of each TextNode is embedded? Is it just the "text" attribute, or is it the text plus the metadata not filtered by "excluded_embed_metadata_keys"? I can see how to extract embeddings, metadata, and documents from Chroma; I'm just not entirely clear what my next steps would be. Are people storing their document embeddings in a separate collection? Thanks!!
the content that is embedded is node.get_content(metadata_mode="embed")

I'm not sure on the chroma question though, would probably have to read the chroma docs
Thanks @Logan M. Do you compute and store node embeddings separately when you want to index documents in more than one index, or do you just eat the cost and re-embed for every index? I have 3,500 documents and it comes out to about 4.5 million OpenAI tokens, so I'm curious what LlamaIndex's perspective is on indexing documents across multiple indices. Thanks kindly for your thoughts.
4.5 million tokens is $0.45 for embeddings 👀 ($0.0001 / 1K tokens)
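The arithmetic checks out; a quick sanity check using the rate and token count from the messages above:

```python
# Cost of embedding 4.5M tokens at the $0.0001-per-1K-tokens rate quoted above
tokens = 4_500_000
rate_per_1k = 0.0001
cost = tokens / 1_000 * rate_per_1k
print(f"${cost:.2f}")  # $0.45
```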

I would usually just recompute the embeddings. If you really wanted, you could compute the embeddings manually and attach them to each node, and then spread your nodes into indexes as needed
Sorry, one follow-up question re: nodes[0].get_content(metadata_mode='embed'). When I inspect the nodes created from my node_parser with that code, I see all the metadata keys listed out despite setting excluded_embed_metadata_keys. For example, my excluded_embed_metadata_keys is ['id', 'tags', 'keywords', 'cve_fixes', 'cve_mentions', 'added_to_summary_index', 'added_to_vector_store']. I set those keys on the Documents I'm parsing and the nodes have those keys specified. Am I doing something wrong (i.e., should get_content(metadata_mode="embed") respect excluded_embed_metadata_keys), or am I misunderstanding when excluded_embed_metadata_keys gets applied?
oh whoops, the enum is auto
Plain Text
>>> from llama_index import Document
>>> from llama_index.schema import MetadataMode
>>> Doucment(text='test', metadata={'1': '2'}, excluded_embed_metadata_keys=['1'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'Doucment' is not defined. Did you mean: 'Document'?
>>> doc = Document(text='test', metadata={'1': '2'}, excluded_embed_metadata_keys=['1'])
>>> doc.get_content(metadata_mode=MetadataMode.EMBED)
'test'
>>> doc.get_content(metadata_mode="embed")
'1: 2\n\ntest'
>>> doc.get_content(metadata_mode='2')
'test'
>>> 
rip ☠️
gotta use .get_content(metadata_mode=MetadataMode.EMBED)
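The gotcha reproduces with nothing but the stdlib. Assuming `MetadataMode` in llama_index.schema is a `(str, Enum)` mixin whose members come from `auto()` (which matches the transcript above, where `'2'` behaved like EMBED), the member value is the string "2", not "embed", so the bare string never matches:

```python
# Why metadata_mode="embed" silently fails: a stdlib mirror of what the
# "the enum is auto" comment above implies about MetadataMode's definition.
from enum import Enum, auto

class MetadataMode(str, Enum):
    ALL = auto()
    EMBED = auto()
    LLM = auto()
    NONE = auto()

# auto() yields 1, 2, 3, 4; the str mixin stores them as "1", "2", "3", "4"
print(MetadataMode.EMBED.value)       # 2 (i.e. the string "2")
print(MetadataMode.EMBED == "embed")  # False: the bare string matches nothing
print(MetadataMode.EMBED == "2")      # True, matching the '2' in the REPL above
```

So passing the string "embed" compares unequal to every member, and get_content falls back to including the metadata, which is exactly the behavior seen in the transcript.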