Using llama hub dataloaders with llama-index

At a glance

The original post suggests that Document objects produced by the llama hub loaders should work directly in llama index. In the comments, a community member loads documents with langchain.document_loaders instead and hits an AttributeError, because those objects have no get_doc_id() method. Another community member points out that the loaders came from langchain rather than llama hub, and shows how to load documents in the llama index format with llama_index.download_loader and UnstructuredReader. Whether a script or function exists to convert between the two formats is left open.

Logan M

Pretty sure the document objects from the llama hub loaders will work in llama index actually 🤔

5 comments

bradcohn

It looks like they don't generate certain attributes like get_doc_id. Maybe I'm doing something wrong, here's my code snippet.

Plain Text

import os

from langchain.document_loaders import UnstructuredFileLoader, BSHTMLLoader, UnstructuredMarkdownLoader
from llama_index import GPTWeaviateIndex

# data_directory and client (a Weaviate client) are assumed to be defined earlier
documents = []

for filename in os.listdir(data_directory):
    file_path = os.path.join(data_directory, filename)
    if filename.endswith(".md"):
        loader = UnstructuredMarkdownLoader(file_path)
    elif filename.endswith(".html"):
        loader = BSHTMLLoader(file_path)
    elif filename.endswith(".txt"):
        loader = UnstructuredFileLoader(file_path)
    else:
        continue  # skip other file types so an undefined or stale loader is never used
    documents.extend(loader.load())

index = GPTWeaviateIndex.from_documents(documents, weaviate_client=client)

returns the error:

Plain Text

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[108], line 1
----> 1 index = GPTWeaviateIndex.from_documents(documents, weaviate_client=client)

File ~/projects/GPeaT/backend/venv/lib/python3.8/site-packages/llama_index/indices/base.py:101, in BaseGPTIndex.from_documents(cls, documents, docstore, service_context, **kwargs)
     98 docstore = docstore or get_default_docstore()
    100 for doc in documents:
--> 101     docstore.set_document_hash(doc.get_doc_id(), doc.get_doc_hash())
    103 nodes = service_context.node_parser.get_nodes_from_documents(documents)
    105 return cls(
    106     nodes=nodes,
    107     docstore=docstore,
    108     service_context=service_context,
    109     **kwargs,
    110 )

AttributeError: 'Document' object has no attribute 'get_doc_id'

Logan M

Ohhh you are using the loader from langchain, not llama hub
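
A quick way to confirm which library produced the objects is to look at each document's defining module and whether it exposes get_doc_id, the method the traceback complains about. This is only a minimal sketch: summarize_document_types is a hypothetical helper name, not part of llama index or langchain.

```python
from collections import Counter
from typing import Any, Iterable


def summarize_document_types(documents: Iterable[Any]) -> Counter:
    """Count documents by (defining module, class name, has get_doc_id)."""
    return Counter(
        (type(doc).__module__, type(doc).__name__, hasattr(doc, "get_doc_id"))
        for doc in documents
    )
```

Running summarize_document_types(documents) on the list built above should show a module name starting with langchain and False in the last position if the objects came from the langchain loaders.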

Logan M

This will get the documents in llama index format. Not sure if there's a script or function to convert between the two yet

Plain Text

from pathlib import Path
from llama_index import download_loader

UnstructuredReader = download_loader("UnstructuredReader")

loader = UnstructuredReader()
documents = loader.load_data(file=Path('./10k_filing.html'))
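
If the documents have already been loaded with langchain, a manual conversion might look roughly like the sketch below. It is not a confirmed llama index API: convert_documents and StubLangChainDocument are hypothetical names, and it assumes the target document class accepts text and extra_info keyword arguments, which may differ between llama index versions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List


@dataclass
class StubLangChainDocument:
    """Stand-in for a langchain document: text in page_content, plus metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def convert_documents(
    lc_docs: Iterable[Any], make_document: Callable[..., Any]
) -> List[Any]:
    """Rebuild each langchain-style document using make_document.

    make_document is assumed to accept text= and extra_info= keywords.
    """
    converted = []
    for doc in lc_docs:
        # Copy the metadata so the new document does not share state with the old one
        metadata = dict(getattr(doc, "metadata", None) or {})
        converted.append(make_document(text=doc.page_content, extra_info=metadata))
    return converted
```

With llama index installed, make_document would presumably be the library's Document class, for example convert_documents(documents, Document), provided its constructor matches the assumption above.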

bradcohn

Ahhh my bad okay
