
DeepLake

I am running into an issue creating documents out of an array of articles.

I am attempting to load a set of docs, create LlamaIndex documents out of them, create embeddings, and then add them to a DeepLake dataset via DeepLakeVectorStore. It seems that

    vector_store = DeepLakeVectorStore(dataset_path=dataset_path, ingestion_batch_size=1024).add(nodes)

expects a different node structure than what I currently have.

The error:
    AttributeError: 'Node' object has no attribute 'node'


code:
    if not medium_input or medium_input == '':
        print("The string is empty.")
    else:
        print("The string is not empty.")
        print(medium_input)

        publication = medium.publication(publication_slug=medium_input)

        medium_articles = medium.publication(publication_id=str(publication._id)).get_articles_between(
            _from=datetime.now(), _to=datetime.now() - timedelta(days=70)
        )

        docs = []
        for article in medium_articles:
            # wrap each article's content in a LlamaIndex Document,
            # carrying url/published_at/title along as extra_info
            document = Document(article.content)
            document.extra_info = {key: article.info[key] for key in ['url', 'published_at', 'title']}
            docs.append(document)

    parser = SimpleNodeParser()
    nodes = parser.get_nodes_from_documents(docs)
    print('nodes', nodes)

    dataset_path = f"hub://tali/{deeplake_datasets}"
    vector_store = DeepLakeVectorStore(dataset_path=dataset_path, ingestion_batch_size=1024).add(nodes)
14 comments
Yeah, it looks like the add() function expects NodeWithEmbedding objects

I think a better approach here is creating your index and using the insert_nodes() function
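For context, NodeWithEmbedding is just a thin wrapper pairing a node with its embedding vector, which is why add() fails on a bare Node (no .node attribute). A rough sketch, assuming the 0.6.x-era import paths:

    # Sketch only: a NodeWithEmbedding carries the Node plus a precomputed embedding.
    from llama_index.data_structs.node import Node
    from llama_index.vector_stores.types import NodeWithEmbedding

    node = Node(text="hello world")
    pair = NodeWithEmbedding(node=node, embedding=[0.0] * 1536)  # e.g. a 1536-dim ada-002 vector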
@Logan M As far as I can see,

    index.insert_nodes(nodes)

is associated with an index. How does this apply to getting data into DeepLake?
Or how can I create a NodeWithEmbedding?
A vector store is used by an index. It holds the embeddings for the text (and sometimes the text itself too, but DeepLake doesn't store the text in this case).

Calling add() on the vector store directly expects the nodes to already have embeddings generated. Using an index automates this process.

Let me get an example.

If you want to not use an index, you'll need to generate your own embeddings before inserting the nodes
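Something like this, roughly (a sketch assuming the same-era OpenAIEmbedding and NodeWithEmbedding interfaces; adapt the import paths to your version):

    # Sketch: compute embeddings yourself, wrap each node, then call add() directly.
    from llama_index.embeddings.openai import OpenAIEmbedding
    from llama_index.vector_stores.types import NodeWithEmbedding

    embed_model = OpenAIEmbedding()

    embedding_results = [
        NodeWithEmbedding(
            node=node,
            embedding=embed_model.get_text_embedding(node.get_text()),
        )
        for node in nodes
    ]

    vector_store = DeepLakeVectorStore(dataset_path=dataset_path, ingestion_batch_size=1024)
    vector_store.add(embedding_results)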
I'm okay with creating an index. So I create an index based on my new data, then what? How can I go from an index to NodeWithEmbedding?
    nodes = parser.get_nodes_from_documents(docs)

    dataset_path = f"hub://tali/{deeplake_datasets}"

    vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)  # this can also be False

    from llama_index import StorageContext, GPTVectorStoreIndex

    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    index = GPTVectorStoreIndex(nodes, storage_context=storage_context)

    # query the index
    query_engine = index.as_query_engine()
    response = query_engine.query("hello world")

    # save the index, since DeepLake doesn't store text
    index.storage_context.persist(persist_dir="./my_index")

    # load the index later
    from llama_index import load_index_from_storage
    storage_context = StorageContext.from_defaults(vector_store=vector_store, persist_dir="./my_index")
    index = load_index_from_storage(storage_context)

    index = GPTVectorStoreIndex(nodes, storage_context=storage_context)

This line will insert all your nodes.
If you have more loads to insert later, you can use index.insert_nodes(nodes).
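For example, to append a later batch (a sketch, reusing the parser and index from above; new_docs is a hypothetical second load):

    # Sketch: parse the next batch of documents and append them to the existing index.
    new_nodes = parser.get_nodes_from_documents(new_docs)
    index.insert_nodes(new_nodes)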
Let me try this out.
@Logan M Looks like this works if I set overwrite to False; it just adds to the current dataset. It's kinda hacky and not obvious in my opinion, and I hope add() is modified in the future to work with LI docs with embeddings.

Also, FYI, the text did make it into the db.

Thank you so much for the help! πŸ™
I think if you follow the docs, it's a pretty common pattern to index in this way.

All the vector stores that llama index works with follow a very similar pattern πŸ’ͺ
Glad it worked! πŸ’ͺ
Oh yeah, creating the index is not the issue here. Rather, what I was pointing out is that adding data to an existing DeepLake dataset should be more streamlined, versus creating an index and toggling overwrite=False to append.
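For reference, the append flow ends up looking roughly like this (a sketch, reusing the dataset_path and persist_dir names from the example above):

    # Sketch: reattach to the existing DeepLake dataset without overwriting, then append.
    vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)
    storage_context = StorageContext.from_defaults(vector_store=vector_store, persist_dir="./my_index")
    index = load_index_from_storage(storage_context)
    index.insert_nodes(new_nodes)  # new_nodes: the batch to append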
Ahh I see I see, makes sense!