
Updated 5 months ago

I've been trying to read PDFs, DOCX,

At a glance
The community member is trying to read PDF and DOCX files from a directory with the llama_index library but is hitting a ValidationError. Suggestions include trying a fresh virtual environment and using the UnstructuredReader from the llama_index.readers.file module to handle the PDFs; this in turn raises an ImportError because the unstructured dependency is missing and must be installed separately. The discussion notes that unstructured's paid SaaS API may be more capable than the open-source version. The proposed solution is to convert the raw unstructured output into Document objects that llama_index can process and index, attaching chapter and subchapter information in the metadata to preserve the document structure.
I've been trying to read PDFs, DOCX files, and directories, but am still getting this error:
Plain Text
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("/Users/home/Library/Mobile Documents/com~apple~CloudDocs/Academia/Legal Research").load_data()
index = VectorStoreIndex.from_documents(documents)

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
/Users/home/Downloads/OpenAI_Finetuning_Distill_GPT_4_to_GPT_3_5_(v2).ipynb Cell 13 line 1
----> 1 from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
      3 documents = SimpleDirectoryReader("/Users/home/Library/Mobile Documents/com~apple~CloudDocs/Academia/Legal Research").load_data()
      4 index = VectorStoreIndex.from_documents(documents)

File ~/.ooba/text-generation-ui/installer_files/env/lib/python3.10/site-packages/llama_index/__init__.py:21
     17 from llama_index.embeddings import OpenAIEmbedding
     19 # indices
     20 # loading
---> 21 from llama_index.indices import (
     22     ComposableGraph,
     23     DocumentSummaryIndex,
     24     GPTDocumentSummaryIndex,
     25     GPTKeywordTableIndex,
     26     GPTKnowledgeGraphIndex,
     27     GPTListIndex,
     28     GPTRAKEKeywordTableIndex,
     29     GPTSimpleKeywordTableIndex,
     30     GPTTreeIndex,
     31     GPTVectorStoreIndex,
     32     KeywordTableIndex,
     33     KnowledgeGraphIndex,
     34     ListIndex,
...
File ~/.ooba/text-generation-ui/installer_files/env/lib/python3.10/site-packages/pydantic/main.py:341, in pydantic.main.BaseModel.__init__()

ValidationError: 1 validation error for DataSource

I've installed all the required packages, I think, with
Plain Text
pip install llama-index-core llama-index-readers-file llama-index-llms-ollama llama-index-embeddings-huggingface
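The traceback above is telling: the code imports from the new llama_index.core package, but the failing file lives in an old monolithic llama_index install inside the ooba environment. One suggestion from the thread is a fresh virtual environment so the two package layouts don't mix. A minimal sketch (the environment name is illustrative):

```shell
# Create and activate a clean environment, then reinstall only the
# new split packages so no stale `llama_index` install shadows them.
python3 -m venv fresh-env
source fresh-env/bin/activate
pip install llama-index-core llama-index-readers-file \
    llama-index-llms-ollama llama-index-embeddings-huggingface
```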
19 comments
This is a scanned PDF, so unstructured should handle it well. Does SimpleDirectoryReader include unstructured?
Not by default, but you can use:

Plain Text
from llama_index.readers.file import UnstructuredReader

file_extractor = {".pdf": UnstructuredReader()}

documents = SimpleDirectoryReader("./data", file_extractor=file_extractor).load_data()

Plain Text
ImportError                               Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/llama_index/core/readers/file/base.py in load_file(input_file, file_metadata, file_extractor, filename_as_id, encoding, errors)
    324                 # ensure that ImportError is raised so user knows
    325                 # about missing dependencies
--> 326                 raise ImportError(str(e))
    327             except Exception as e:
    328                 # otherwise, just skip the file and report the error

ImportError: No module named 'unstructured'

@Logan M do I need to install it separately?
hmmm I think so (I thought it was in the package deps, I guess not)
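Based on the ImportError above, the open-source unstructured package is not pulled in by llama-index-readers-file and has to be installed on its own:

```shell
# Install the open-source `unstructured` package that UnstructuredReader
# depends on; it is not a declared dependency of the reader package.
pip install unstructured
```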
Apparently, the unstructured reader in the llama package cannot decipher this file, but it worked well when I used their pay-as-you-go SaaS API.
The question now is how to convert it to the format that llama uses to create an index.
I'm guessing their paid offering is more capable than open-source 😅
The output of the reader should be Document objects, right?
VectorStoreIndex.from_documents(documents)
Plain Text
{'type': 'ListItem',
  'element_id': '685c346992da8cb638277234e18455dc',
  'text': '3. Analysis and application of the rule of law to the facts of the case. This step is composed of three parts:',
  'metadata': {'filetype': 'application/pdf',
   'languages': ['eng'],
   'page_number': 48,
   'parent_id': '1f5eb66d4519bd762414d483640a7cd1',
   'filename': 'Legal Research_Part 1.pdf'}},
 {'type': 'NarrativeText',
  'element_id': '269a00dfa7fbab67e141c6c5600e8440',
  'text': 'a. A determination of the elements or requirements of the rule of law b. A matching of the facts of the client’s case to the elements and a determi- nation of how the rule of law applies to the facts ¢. A counteranalysis that addresses any counterarguments to the analysis 4. A conclusion that summarizes the previous steps. The conclusion may also include a weighing of the merits of the case and an identification of other information or avenues of research that should be pursued.',
  'metadata': {'filetype': 'application/pdf',
   'languages': ['eng'],
   'page_number': 48,
... 
 50,
   'parent_id': '8aebadf99302a64184ebba5341df7d89',
   'filename': 'Legal Research_Part 1.pdf'}}] 
@Logan M how do I turn it into a format that can be processed and indexed by llama_index?
Oh you are using raw unstructured, and not the unstructured reader?

Turn it into a document object
from llama_index.core import Document

doc = Document(text=text, metadata=metadata)
Thank you. Will I then lose the chapter and subchapter partitions?
Not if you do that for every partition?
Or you can attach the chapter/subchapter info in the metadata, up to you.
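Putting the two suggestions together, the raw unstructured element dicts shown above can be walked in order while tracking the current chapter heading, so the structure survives in each Document's metadata. A minimal sketch; the `elements_to_docs` helper and the Title-based chapter heuristic are illustrative, not part of either library:

```python
# Sketch: turn raw `unstructured` element dicts (like the output above)
# into (text, metadata) pairs, carrying the current chapter along so the
# section structure survives in the metadata.
def elements_to_docs(elements):
    pairs = []
    current_chapter = None
    for el in elements:
        if el.get("type") == "Title":
            # Assumption: headings in this PDF surface as Title elements.
            current_chapter = el.get("text")
        meta = dict(el.get("metadata", {}))
        meta["element_type"] = el.get("type", "")
        if current_chapter:
            meta["chapter"] = current_chapter
        pairs.append((el.get("text", ""), meta))
    return pairs

# Each pair then becomes a llama_index Document, e.g.:
#   from llama_index.core import Document, VectorStoreIndex
#   docs = [Document(text=t, metadata=m) for t, m in elements_to_docs(elements)]
#   index = VectorStoreIndex.from_documents(docs)
```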
thank you so much!