PDF index

LawSnap

Hi, I have a collection of PDF files and I'm trying to create a Pinecone index where each PDF file is a node in the index.

I have a loop that loads the PDFs from the directory, but I must be missing something because I just cannot create the index using GPTPineconeIndex.

# PDFReader setup
PDFReader = download_loader("PDFReader")
loader = PDFReader()

current_directory = Path(".")
pdf_files = current_directory.glob("*.pdf")
documents = []

# Maximum length of each document and the overlap between chunks
max_length = 2048
overlap = 20

# Loop through and load PDF files
for pdf_file in pdf_files:
    pdf_content_list = loader.load_data(file=pdf_file)
    for pdf_content in pdf_content_list:
        if len(pdf_content) > max_length:
            # Split the document into chunks
            chunks = [pdf_content[i:i + max_length - overlap] for i in range(0, len(pdf_content), max_length - overlap)]
            documents.extend(chunks)
        else:
            documents.append(pdf_content)

# Build the GPTPineconeIndex with the PDF documents
index = GPTPineconeIndex(documents, pinecone_index=pinecone_index)

When I run this I get an error: Error initializing GPTPineconeIndex: Invalid document type: <class 'list'>.

I've also tried:
index = GPTPineconeIndex.from_documents(documents, pinecone_index=pinecone_index)

and I get the same error. Any help appreciated. I'm a noob so probably missing something obvious.

14 comments

Logan M

You need to use GPTPineconeIndex.from_documents() instead 💪

Logan M

Also, LlamaIndex can split the documents for you internally 👍 Check out the prompt helper and chunk_size_limit in the docs, and let me know if you need a hand 👍
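
For intuition about what a chunk size and an overlap mean, here is a minimal pure-Python sketch. It works on characters rather than tokens, and `chunk_text` is a hypothetical helper written for illustration, not part of LlamaIndex and not how its internal splitter is implemented. Note that the slice in the question takes `max_length - overlap` characters and also steps by `max_length - overlap`, so consecutive chunks there do not actually share any text; the sketch below keeps the full `max_length` per chunk so that they do.

```python
def chunk_text(text: str, max_length: int = 2048, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most max_length characters.

    Each chunk after the first starts `overlap` characters before the
    end of the previous chunk, so neighbouring chunks share that text.
    Hypothetical helper for illustration only (character-based).
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be >= 0 and smaller than max_length")

    step = max_length - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_length])
        # Stop once the current window has reached the end of the text,
        # otherwise a trailing chunk made only of overlap would be added.
        if start + max_length >= len(text):
            break
    return chunks
```

With `max_length=4` and `overlap=1`, the string `"abcdefghij"` is split into `"abcd"`, `"defg"` and `"ghij"`: each chunk begins with the last character of the one before it.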

LawSnap

Thanks so much for the pointers.

I think I'm missing something because when I try to use from_documents() I get an error.

My PDFs are stored in a list called "documents". I've verified that documents[1] and documents[2] contain the correct text (I didn't check the rest of them).

But when I try
index = GPTPineconeIndex.from_documents(documents, pinecone_index=pinecone_index)

I get this error: AttributeError: type object 'GPTPineconeIndex' has no attribute 'from_documents'

And when I do print(dir(GPTPineconeIndex))

['__annotations__', '__class__', '__class_getitem__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__orig_bases__', '__parameters__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_add_document_to_index', '_aget_node_embedding_results', '_async_add_document_to_index', '_build_fallback_text_splitter', '_build_index_from_documents', '_delete', '_get_node_embedding_results', '_get_nodes_from_document', '_insert', '_is_protocol', '_preprocess_query', '_process_documents', '_update_index_registry_and_docstore', '_validate_documents', 'aquery', 'build_index_from_documents', 'delete', 'docstore', 'embed_model', 'get_doc_id', 'get_query_map', 'index_registry', 'index_struct', 'index_struct_cls', 'index_struct_with_text', 'insert', 'llm_predictor', 'load_from_dict', 'load_from_disk', 'load_from_string', 'prompt_helper', 'query', 'refresh', 'save_to_dict', 'save_to_disk', 'save_to_string', 'set_doc_id', 'set_extra_info', 'set_text', 'update']

Logan M

hmmm Imma double check the docs, one sec

Logan M

Are you using the latest LlamaIndex version? from_documents should exist.

From looking at your code, I would try something like this, assuming you have the latest version installed:

from llama_index import GPTPineconeIndex, ServiceContext, SimpleDirectoryReader

pinecone_index = ...

documents = SimpleDirectoryReader("./path/to/my_pdfs_dir").load_data()

# this will cut documents into chunks of 2048 tokens. The default overlap is 20 tokens already
service_context = ServiceContext.from_defaults(chunk_size_limit=2048)

index = GPTPineconeIndex.from_documents(documents, service_context=service_context, pinecone_index=pinecone_index)
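
To check which version is actually installed before trying the snippet above, a small sketch using only the standard library may help. `installed_version` is a hypothetical helper name; `llama-index` is the distribution name on PyPI, and the sketch only reports what is installed, it does not know which version first added from_documents.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it only reads local package
    metadata and does not contact PyPI.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed version string, or None if the package is not installed.
print("llama-index:", installed_version("llama-index"))
```

If the reported version is older than expected, `pip install --upgrade llama-index` upgrades it; installed packages stay at their installed version until you do this.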

LawSnap

OK, thanks Logan, I'll give that a try! Much obliged.

LawSnap

Logan, btw, of course you were right that I did not have the latest version installed. TIL that you have to update the packages, and they don't update automatically!!! 🙂 I appreciate all your patience and your help. I'm simultaneously learning zsh + Python + LlamaIndex! Having said that, I'm super excited to see the progress. I actually managed to get a response, and it got it right!