Find answers from the community

Chem
Is it normal for extraction from documents to take 20+ minutes?
I'm trying to build some embeddings from the PostgreSQL internals documentation

Plain Text
import os

from llama_index import (SimpleDirectoryReader, StorageContext,
                         VectorStoreIndex, load_index_from_storage)

base_path = os.path.dirname(os.path.realpath(__file__))
persist_dir = os.path.join(base_path, "../index")
postgres_doc_dir = os.path.join(base_path, "../postgres-documents")

index_exists = os.path.exists(persist_dir)
if index_exists:
    # Reload the previously persisted index instead of re-embedding everything.
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    index = load_index_from_storage(storage_context)
else:
    # Parse the two PDFs into Document objects.
    document_loader = SimpleDirectoryReader(input_files=[
        os.path.join(postgres_doc_dir, "postgresql_internals-14_en.pdf"),
        os.path.join(postgres_doc_dir, "postgresql-16-US.pdf")
    ])
    documents = document_loader.load_data()
    # Chunk, embed, and index the documents, then persist the index to disk.
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)

query = input("Enter your query: ")
query_engine = index.as_query_engine()
response = query_engine.query(query)
print(response)


It's been running for ages and still hasn't gotten to the "Enter your query" part
I can hear my fans blowing though, lol
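
For reference, a minimal sketch of how to see where the time is going (the paths are placeholders, and it assumes the installed llama_index version supports the show_progress flag): timing the PDF parse separately from the chunk-and-embed step shows which one is eating the 20+ minutes.

Plain Text
import time

from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Placeholder paths -- substitute the real PDF locations.
pdf_paths = [
    "../postgres-documents/postgresql_internals-14_en.pdf",
    "../postgres-documents/postgresql-16-US.pdf",
]

t0 = time.perf_counter()
documents = SimpleDirectoryReader(input_files=pdf_paths).load_data()
t1 = time.perf_counter()
print(f"PDF parsing: {t1 - t0:.1f}s, {len(documents)} document objects")

# show_progress=True prints a progress bar while nodes are chunked and embedded,
# so a stall here means the wait is in the embedding calls, not the PDF parse.
index = VectorStoreIndex.from_documents(documents, show_progress=True)
print(f"Chunking + embedding + indexing: {time.perf_counter() - t1:.1f}s")

With the default remote embedding model the second step is mostly network-bound; a local embedding model running on CPU would also explain the fans.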
Chem

Memory

So, I took the Postgres repo, dumped all of the source code and documentation files into an index.
It came out to 1.5+ GB, and my WSL2 box sometimes goes OOM when trying to read from it 😬
Plain Text
]$ ls -lh ./index
total 1.7G
-rw-r--r-- 1 user user 121M Jul  2 19:06 docstore.json
-rw-r--r-- 1 user user   18 Jul  2 19:09 graph_store.json
-rw-r--r-- 1 user user 3.8M Jul  2 19:06 index_store.json
-rw-r--r-- 1 user user 1.6G Jul  2 19:09 vector_store.json

How much memory should a machine have to call load_index_from_storage() on this?
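
The default SimpleVectorStore keeps every embedding in that single vector_store.json, and load_index_from_storage() parses the whole file into Python objects, so peak memory can be several times the 1.6 GB on disk. A common workaround is to write the vectors into an on-disk vector store instead; a rough sketch with Chroma (the collection name and paths are made up, and it assumes the chromadb package plus llama_index's ChromaVectorStore integration are installed):

Plain Text
import chromadb
from llama_index import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# Hypothetical on-disk Chroma database and collection name.
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("postgres_repo")
vector_store = ChromaVectorStore(chroma_collection=collection)

# Build: embeddings land in Chroma's own files instead of one giant JSON.
documents = SimpleDirectoryReader("./postgres-repo", recursive=True).load_data()
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Reload later without parsing a multi-GB vector_store.json into memory.
index = VectorStoreIndex.from_vector_store(vector_store)

The Python process then no longer has to hold a JSON-parsed copy of every vector just to open the index.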
Chem

LLM

I thought that llama-index was essentially a way to add extra information on top of ChatGPT or other models
So there's no way that it could perform worse than not having the extra info :THONK:
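
Retrieval quality is the usual culprit when it does perform worse: if the retriever pulls irrelevant chunks, they get pasted into the prompt as authoritative context and can drag the answer below what the bare model would give. A quick sanity check (the query string and top-k are arbitrary, and it assumes the index built in the snippets above) is to print what the retriever actually returned along with its similarity score:

Plain Text
# Assumes `index` from the snippets above.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("How does PostgreSQL pick a join order?")

print(response)
for node_with_score in response.source_nodes:
    # score: similarity between the query and this chunk;
    # the chunk text is what was pasted into the LLM prompt as context.
    print(round(node_with_score.score, 3),
          node_with_score.node.get_content()[:120].replace("\n", " "))

If the printed chunks have little to do with the question, the model is answering with noise bolted onto its prompt, which is exactly the case where the extra info hurts.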