
DeepLake

I am running into an issue creating documents out of an array of articles.

I am attempting to load a set of docs, create LlamaIndex documents out of them, create embeddings, and then add them to a DeepLake dataset via DeepLakeVectorStore. It seems that

    vector_store = DeepLakeVectorStore(dataset_path=dataset_path, ingestion_batch_size=1024).add(nodes)

expects a different node structure than what I currently have.

The error:
    AttributeError: 'Node' object has no attribute 'node'


code:
    if not medium_input or medium_input == '':
        print("The string is empty.")
    else:
        print("The string is not empty.")
        print(medium_input)

        publication = medium.publication(publication_slug=medium_input)

        medium_articles = medium.publication(publication_id=str(publication._id)).get_articles_between(
            _from=datetime.now(), _to=datetime.now() - timedelta(days=70)
        )

        docs = []
        for article in medium_articles:
            # wrap each article's content in a LlamaIndex Document,
            # carrying url/published_at/title along as extra_info
            document = Document(article.content)
            document.extra_info = {key: article.info[key] for key in ['url', 'published_at', 'title']}
            docs.append(document)

    parser = SimpleNodeParser()
    nodes = parser.get_nodes_from_documents(docs)
    print('nodes', nodes)

    dataset_path = f"hub://tali/{deeplake_datasets}"
    vector_store = DeepLakeVectorStore(dataset_path=dataset_path, ingestion_batch_size=1024).add(nodes)
14 comments
Yeah, it looks like the add() function expects NodeWithEmbedding objects

I think a better approach here is creating your index and using the insert_nodes() function
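For context, NodeWithEmbedding is just a thin wrapper pairing a node with its embedding vector, which is why add() fails on a bare Node (no .node attribute). A rough sketch, assuming the 0.6.x-era import paths:

    # Sketch only: a NodeWithEmbedding carries the Node plus a precomputed embedding.
    from llama_index.data_structs.node import Node
    from llama_index.vector_stores.types import NodeWithEmbedding

    node = Node(text="hello world")
    pair = NodeWithEmbedding(node=node, embedding=[0.0] * 1536)  # e.g. a 1536-dim ada-002 vector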
@Logan M As far as I can see,

    index.insert_nodes(nodes)

is associated with an index. How does this apply to getting data into DeepLake?
Or how can I create a NodeWithEmbedding?
A vector store is used by an index. It holds the embeddings for the text (and sometimes the text itself too, but DeepLake doesn't store the text in this case).

Calling add() on the vector store directly expects the nodes to already have embeddings generated. Using an index automates this process.

Let me get an example.

If you want to not use an index, you'll need to generate your own embeddings before inserting the nodes
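Something like this, roughly (a sketch assuming the same-era OpenAIEmbedding and NodeWithEmbedding interfaces; adapt the import paths to your version):

    # Sketch: compute embeddings yourself, wrap each node, then call add() directly.
    from llama_index.embeddings.openai import OpenAIEmbedding
    from llama_index.vector_stores.types import NodeWithEmbedding

    embed_model = OpenAIEmbedding()

    embedding_results = [
        NodeWithEmbedding(
            node=node,
            embedding=embed_model.get_text_embedding(node.get_text()),
        )
        for node in nodes
    ]

    vector_store = DeepLakeVectorStore(dataset_path=dataset_path, ingestion_batch_size=1024)
    vector_store.add(embedding_results)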
I'm okay with creating an index. So I create an index based on my new data, then what? How can I go from an index to NodeWithEmbedding?
    nodes = parser.get_nodes_from_documents(docs)

    dataset_path = f"hub://tali/{deeplake_datasets}"

    vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)  # this can also be False

    from llama_index import StorageContext, GPTVectorStoreIndex

    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    index = GPTVectorStoreIndex(nodes, storage_context=storage_context)

    # query the index
    query_engine = index.as_query_engine()
    response = query_engine.query("hello world")

    # save the index, since DeepLake doesn't store text
    index.storage_context.persist(persist_dir="./my_index")

    # load the index later
    from llama_index import load_index_from_storage
    storage_context = StorageContext.from_defaults(vector_store=vector_store, persist_dir="./my_index")
    index = load_index_from_storage(storage_context)

    index = GPTVectorStoreIndex(nodes, storage_context=storage_context)

This line will insert all your nodes.
If you have more loads to insert later, you can use index.insert_nodes(nodes).
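For example, to append a later batch (a sketch, reusing the parser and index from above; new_docs is a hypothetical second load):

    # Sketch: parse the next batch of documents and append them to the existing index.
    new_nodes = parser.get_nodes_from_documents(new_docs)
    index.insert_nodes(new_nodes)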
Let me try this out.
@Logan M Looks like this works if I set overwrite to False; it just adds to the current dataset. It's kinda hacky and not obvious in my opinion, and I hope add() is modified in the future to work with LI docs with embeddings.

Also, FYI, the text did make it into the db.

Thank you so much for the help! πŸ™
I think if you follow the docs, it's a pretty common pattern to index in this way.

All the vector stores that llama index works with follow a very similar pattern πŸ’ͺ
Glad it worked! πŸ’ͺ
Oh yeah, creating the index is not the issue here. Rather, what I was pointing out is that adding data to an existing DeepLake dataset should be more streamlined, versus creating an index and toggling overwrite=False to append.
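For reference, the append flow ends up looking roughly like this (a sketch, reusing the dataset_path and persist_dir names from the example above):

    # Sketch: reattach to the existing DeepLake dataset without overwriting, then append.
    vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)
    storage_context = StorageContext.from_defaults(vector_store=vector_store, persist_dir="./my_index")
    index = load_index_from_storage(storage_context)
    index.insert_nodes(new_nodes)  # new_nodes: the batch to append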
Ahh I see I see, makes sense!