Is it normal for extraction from documents to take +20 minutes?
I'm trying to build some embeddings from the PostgreSQL internals documentation:

Plain Text
import os

from llama_index import (SimpleDirectoryReader, StorageContext,
                         VectorStoreIndex, load_index_from_storage)

base_path = os.path.dirname(os.path.realpath(__file__))
persist_dir = os.path.join(base_path, "../index")
postgres_doc_dir = os.path.join(base_path, "../postgres-documents")

if os.path.exists(persist_dir):
    # Reuse the previously persisted index
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    index = load_index_from_storage(storage_context)
else:
    # Parse the PDFs, embed them, and persist the index for next time
    document_loader = SimpleDirectoryReader(input_files=[
        os.path.join(postgres_doc_dir, "postgresql_internals-14_en.pdf"),
        os.path.join(postgres_doc_dir, "postgresql-16-US.pdf")
    ])
    documents = document_loader.load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)

query = input("Enter your query: ")
query_engine = index.as_query_engine()
response = query_engine.query(query)
print(response)


It's been running for ages and still hasn't gotten to the "Enter your query" part
I can hear my fans blowing though, lol
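One way I could probably check which phase is slow is to time the PDF parsing and the indexing separately -- a rough sketch reusing the names from the script above:
Plain Text
import time

t0 = time.perf_counter()
documents = document_loader.load_data()             # PDF parsing
t1 = time.perf_counter()
index = VectorStoreIndex.from_documents(documents)  # chunking + embedding
t2 = time.perf_counter()
print(f"load_data: {t1 - t0:.1f}s, indexing: {t2 - t1:.1f}s")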
that sounds... abnormal haha

Are the PDFs huge? If you do print(len(documents)), how many documents are there?

If there are a ton, you can increase the embedding batch size:
Plain Text
from llama_index import ServiceContext
from llama_index.embeddings.openai import OpenAIEmbedding

# default is 10
embed_model = OpenAIEmbedding(embed_batch_size=50)

service_context = ServiceContext.from_defaults(embed_model=embed_model)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)
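You can also pass show_progress=True when building the index, so you can at least see which step you're stuck on (assuming a llama_index version that supports the flag):
Plain Text
# show_progress prints progress bars for the parsing/embedding steps
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, show_progress=True
)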
Ah okay, ty -- I will try this. The two PDFs aren't huge, a combined total of ~21MB.
Well... 21MB seems like possibly a lot of text 😅 but that could just be PDF bloat too
Wut the hecc:
Plain Text
Loaded 3749 documents
What is considered a "document"? :thinking:
The default PDF loader splits each page into a "Document" object.

Then when you insert, each document is broken into chunks/nodes of 1024 tokens.
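If you want smaller or bigger chunks, that's configurable too -- a minimal sketch, assuming the same ServiceContext API as the batch-size example above:
Plain Text
from llama_index import ServiceContext

# chunk_size sets the token size of each node (default is 1024)
service_context = ServiceContext.from_defaults(chunk_size=512)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)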