LawSnap
Joined September 25, 2024
LawSnap

Pdf index

Hi, I have a collection of PDF files and I'm trying to create a Pinecone index where each PDF file is a node in the index.

I have a loop that loads the pdfs in from the directory, but I must be missing something because I just cannot create the index using GPTPineconeIndex.

# PDFReader setup
PDFReader = download_loader("PDFReader")
loader = PDFReader()

current_directory = Path(".")
pdf_files = current_directory.glob("*.pdf")
documents = []

# Maximum length of each document and the overlap between chunks
max_length = 2048
overlap = 20

# Loop through and load PDF files
for pdf_file in pdf_files:
    pdf_content_list = loader.load_data(file=pdf_file)
    for pdf_content in pdf_content_list:
        if len(pdf_content) > max_length:
            # Split the document into chunks
            chunks = [pdf_content[i:i + max_length - overlap] for i in range(0, len(pdf_content), max_length - overlap)]
            documents.extend(chunks)
        else:
            documents.append(pdf_content)

# Build the GPTPineconeIndex with the PDF documents
index = GPTPineconeIndex(documents, pinecone_index=pinecone_index)

When I run this, I get an error: Error initializing GPTPineconeIndex: Invalid document type: <class 'list'>.

I've also tried:
index = GPTPineconeIndex.from_documents(documents, pinecone_index=pinecone_index)

and I get the same error. Any help appreciated. I'm a noob so probably missing something obvious.
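One detail in the chunking step is worth noting independently of the index error: the slice in the loop is `max_length - overlap` characters long and the loop also advances by `max_length - overlap`, so consecutive chunks do not actually share any text. A minimal sketch of overlapping chunking on plain strings follows. The helper name `chunk_text` is hypothetical, and the sketch assumes the text has already been extracted as a `str`; it does not attempt to wrap the chunks in whatever document type the index constructor expects.

```python
def chunk_text(text: str, max_length: int = 2048, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most max_length characters.

    Each chunk after the first starts with the last `overlap` characters
    of the previous chunk. (Hypothetical helper, not part of any library.)
    """
    if overlap >= max_length:
        raise ValueError("overlap must be smaller than max_length")
    if len(text) <= max_length:
        return [text]

    step = max_length - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_length])
        # Stop once the window has reached the end of the text.
        if start + max_length >= len(text):
            break
    return chunks


# Collect chunks from several texts into one flat list of strings.
texts = ["short text", "x" * 5000]  # stand-ins for extracted PDF text
flat_chunks = []
for text in texts:
    flat_chunks.extend(chunk_text(text))
```

Using `extend` rather than `append` keeps `flat_chunks` a flat list of strings instead of a list that contains other lists.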
14 comments
How do I load a collection of PDF files from a directory and then turn them into a Pinecone index?
2 comments
@kapa.ai If I've already created an index using GPTPineconeIndex and saved it to Pinecone, how do I load it next time so I don't have to recreate the index each time?
2 comments
@kapa.ai I have constructed an index using GPTPineconeIndex. I have also queried it and received a response. How do I get the sources from the response?
5 comments
@kapa.ai How do I create a Pinecone index from a set of PDFs using GPTPineconeIndex?
2 comments
LawSnap

Large index

Noob question about working with a large (230 MB) index. Hi, I used GPTSimpleVectorIndex to create an index of several hundred PDF files. When I query the index, I get answers, but they are inconsistent, which makes me think the query is only grabbing some of the information each time and then running out of memory?

Right now I'm using index.query as described in the "getting started" tutorial. I'm very new at this and would appreciate pointers.
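One possible reading of "only grabbing some information each time" is that a vector index answers from just the few chunks whose embeddings are most similar to the query, rather than from the whole collection. The sketch below illustrates that general top-k retrieval idea in plain Python. It is an assumption about how vector indexes typically behave, not a confirmed description of GPTSimpleVectorIndex internals, and the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=1):
    """Return the positions of the k chunks most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]


# Toy 2-dimensional "embeddings" for three chunks.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
best = top_k([1.0, 0.0], chunk_vecs, k=2)
```

With a small `k`, only a handful of chunks ever reach the language model, so answers can vary with which chunks happen to rank highest for a given phrasing of the question.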

I reviewed recent videos on YouTube that suggested:
  1. using Pinecone to increase memory size (but does that improve processing?)
  2. progressive summarization (the query is applied to one chunk at a time, then all the answers are concatenated and fed back into GPT, which summarizes the concatenated answers)
  3. using LangChain.
Any pointers appreciated. Thanks!
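The progressive summarization idea from the second video can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `ask_llm` is a hypothetical callable that takes a prompt string and returns the model's text reply, and the prompt wording is made up for the example rather than taken from any library.

```python
from typing import Callable, Sequence


def progressive_answer(
    question: str,
    chunks: Sequence[str],
    ask_llm: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> str:
    """Ask the question against each chunk, then summarize all the answers."""
    partial_answers = []
    for chunk in chunks:
        prompt = f"Context:\n{chunk}\n\nQuestion: {question}"
        partial_answers.append(ask_llm(prompt))

    # Concatenate the per-chunk answers and feed them back for a summary.
    combined = "\n\n".join(partial_answers)
    summary_prompt = (
        f"Summarize these partial answers to the question '{question}':\n\n"
        f"{combined}"
    )
    return ask_llm(summary_prompt)
```

This makes one model call per chunk plus one final call, so cost and latency grow with the number of chunks.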
6 comments