LawSnap

Pdf index

Hi, I have a collection of PDF files and I'm trying to create a Pinecone index where each PDF file is a node in the index.

I have a loop that loads the PDFs from the directory, but I must be missing something because I just cannot create the index using GPTPineconeIndex.

# PDFReader setup
PDFReader = download_loader("PDFReader")
loader = PDFReader()

current_directory = Path(".")
pdf_files = current_directory.glob("*.pdf")
documents = []

# Maximum length of each document and the overlap between chunks
max_length = 2048
overlap = 20

# Loop through and load PDF files
for pdf_file in pdf_files:
    pdf_content_list = loader.load_data(file=pdf_file)
    for pdf_content in pdf_content_list:
        if len(pdf_content) > max_length:
            # Split the document into chunks
            chunks = [pdf_content[i:i + max_length - overlap] for i in range(0, len(pdf_content), max_length - overlap)]
            documents.extend(chunks)
        else:
            documents.append(pdf_content)

# Build the GPTPineconeIndex with the PDF documents
index = GPTPineconeIndex(documents, pinecone_index=pinecone_index)

When I run this, I get an error: Error initializing GPTPineconeIndex: Invalid document type: <class 'list'>.

I've also tried:
index = GPTPineconeIndex.from_documents(documents, pinecone_index=pinecone_index)

and I get the same error. Any help appreciated. I'm a noob, so I'm probably missing something obvious.
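One thing worth checking independently of the index error: the slice in the loop above is `max_length - overlap` characters wide and the loop also advances by `max_length - overlap`, so consecutive chunks sit end to end and never actually overlap. Below is a minimal sketch of overlapping chunking on plain strings. The helper name `split_with_overlap` is hypothetical, and it assumes the text has already been extracted from whatever object the loader returns; each chunk would presumably still need to be wrapped in the library's own document type before being passed to the index, which is not shown here because it depends on the library version.

```python
from typing import List


def split_with_overlap(text: str, max_length: int = 2048, overlap: int = 20) -> List[str]:
    """Split text into chunks of at most max_length characters.

    Consecutive chunks share `overlap` characters. This is a hypothetical
    helper that works on plain strings only, not on loader objects.
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")

    step = max_length - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        # The slice is max_length wide, so it reaches `overlap` characters
        # into the region the next chunk will start from.
        chunks.append(text[start:start + max_length])
        if start + max_length >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

With `max_length=4` and `overlap=1`, each chunk starts on the last character of the previous one, which is the behaviour the `overlap` variable in the post seems to be aiming for.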

14 comments

LawSnap

how do I load a collection of pdf files from a file and then turn them into a pineconeinde

How do I load a collection of PDF files from a directory and then turn them into a Pinecone index?

2 comments

LawSnap

If I've already created an index using GPTPineconeIndex and saved to pinecone -- how do I l

@kapa.ai If I've already created an index using GPTPineconeIndex and saved it to Pinecone, how do I load it next time so I don't have to recreate the index each time?

2 comments

LawSnap

I have constructed an index using GPTPineconeIndex. I have also queried it for response an

@kapa.ai I have constructed an index using GPTPineconeIndex. I have also queried it and received a response. How do I get the sources from the response?

5 comments

LawSnap

How to create a Pinecone index from a set of PDFs using GPTPineconeIndex

@kapa.ai How do I create a Pinecone index from a set of PDFs using GPTPineconeIndex?

2 comments

LawSnap

Large index

Noob question about working with a large (230 MB) index. Hi, I used GPTSimpleVectorIndex to create an index of several hundred PDF files. When I query the index, I get answers, but they are inconsistent. That makes me think(?) the query is only grabbing some of the information each time and then running out of memory?

Right now I'm using index.query as described in the "getting started" tutorial. I'm very new at this and would appreciate pointers.

I reviewed recent videos on YouTube that suggested:

- using Pinecone to increase memory size (but does that improve processing?)
- progressive summarization (the query is applied to one chunk at a time, all the answers are concatenated, and the result is fed back into GPT, which summarizes the concatenated answers)
- using LangChain

Any pointers appreciated. Thanks!
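For reference, the progressive summarization idea from the second video can be sketched roughly as follows. This is only an illustration of the pattern as described above, not how any particular library implements it: `progressive_summary` and the `ask` callable are hypothetical names, and `ask` stands in for whatever function sends a prompt to the model and returns its reply. The prompt wording is also made up.

```python
from typing import Callable, List


def progressive_summary(question: str, chunks: List[str], ask: Callable[[str], str]) -> str:
    """Answer `question` against each chunk, then summarize the partial answers.

    `ask` is a hypothetical stand-in for a call to the language model:
    it takes a prompt string and returns the model's reply as a string.
    """
    # Step 1: apply the query to one chunk at a time.
    partial_answers = [
        ask(f"Context:\n{chunk}\n\nQuestion: {question}") for chunk in chunks
    ]

    # Step 2: concatenate all the partial answers.
    combined = "\n".join(partial_answers)

    # Step 3: feed the concatenation back to the model for a final summary.
    return ask(
        f"Summarize these partial answers to the question '{question}':\n{combined}"
    )
```

Note that this makes one model call per chunk plus one final call, so with several hundred PDFs the number of calls grows with the number of chunks queried.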

6 comments
