Here's the code:
# Version-dependent imports (llama_index 0.4/0.5-era API; adjust paths to your install)
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from llama_index import LangchainEmbedding, LLMPredictor, PromptHelper, ServiceContext, SimpleDirectoryReader, GPTSimpleVectorIndex, GPTSimpleKeywordTableIndex, GPTKnowledgeGraphIndex
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser

# Embedding model, chunking, and the service context shared by all indexes below
embed_model = LangchainEmbedding(OpenAIEmbeddings(query_model_name="text-embedding-ada-002"))
chunk_len = 256
chunk_overlap = 32
splitter = TokenTextSplitter(chunk_size=chunk_len, chunk_overlap=chunk_overlap)
node_parser = SimpleNodeParser(text_splitter=splitter, include_extra_info=True, include_prev_next_rel=False)
llm_predictor_gpt3 = LLMPredictor(llm=ChatOpenAI(temperature=0.2, model_name='gpt-3.5-turbo', max_tokens=2000))
prompt_helper_gpt3 = PromptHelper.from_llm_predictor(llm_predictor=llm_predictor_gpt3)
service_context_gpt3 = ServiceContext.from_defaults(llm_predictor=llm_predictor_gpt3, prompt_helper=prompt_helper_gpt3, embed_model=embed_model, node_parser=node_parser, chunk_size_limit=chunk_len)
Vector:
reader = JSONReader()
documents = SimpleDirectoryReader('/content/drive/Shareddrives/AI/docs').load_data()
index_conf_vec = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context_gpt3)
Keyword:
reader = JSONReader()
documents = SimpleDirectoryReader('/content/drive/Shareddrives/AI/docs').load_data()
index_conf_kw = GPTSimpleKeywordTableIndex.from_documents(documents, service_context=service_context_gpt3)
Knowledge Graph:
reader = JSONReader()
documents = SimpleDirectoryReader('/content/drive/Shareddrives/AI/docs').load_data()
index_conf_kg = GPTKnowledgeGraphIndex.from_documents(documents, max_triplets_per_chunk=3, service_context=service_context_gpt3)
index_conf_kg_embedded = GPTKnowledgeGraphIndex.from_documents(documents, max_triplets_per_chunk=3, service_context=service_context_gpt3, include_embeddings=True)
Your chunk size is pretty small. Maybe try decreasing the triplets per chunk to 1, or increasing the chunk size?
Each chunk is a call to the LLM, so if you have a large index it can take some time (especially when OpenAI might already be slow).
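For example, a rough sketch of those two knobs, reusing the setup above (the values and the service_context_big name are just illustrative):
# Illustrative only: bigger chunks + fewer triplets per chunk = fewer LLM extraction calls
chunk_len = 512
splitter = TokenTextSplitter(chunk_size=chunk_len, chunk_overlap=chunk_overlap)
node_parser = SimpleNodeParser(text_splitter=splitter, include_extra_info=True, include_prev_next_rel=False)
service_context_big = ServiceContext.from_defaults(llm_predictor=llm_predictor_gpt3, prompt_helper=prompt_helper_gpt3, embed_model=embed_model, node_parser=node_parser, chunk_size_limit=chunk_len)
index_conf_kg = GPTKnowledgeGraphIndex.from_documents(documents, max_triplets_per_chunk=1, service_context=service_context_big)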
Makes sense. Does increasing the triplets per chunk also increase the time?
a little bit, but mostly it's the number of chunks
Can you briefly explain the difference between GPTKnowledgeGraphIndex.from_documents(documents) and GPTKnowledgeGraphIndex.from_documents(documents, include_embeddings=True)?
One will generate embeddings for each triplet; the other just extracts the triplets themselves. Extracting triplets is done with LLM calls, which can be slow-ish.
Then at query time, it extracts keywords from the query and uses those keywords to find triplets that overlap with the query keywords. If you use embeddings, it will also return triplets that have similar embeddings.
If include_text=True is in your query call, it will use the text where those triplets were found to generate an answer. If it's False, it will use only the triplets themselves to generate an answer.
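Roughly, with the old index.query API (exact kwargs can differ between versions, so treat this as a sketch):
# keyword-based triplet retrieval, answer generated from the triplets alone
response = index_conf_kg.query("your question", include_text=False)
# embedding-built index: also retrieve triplets by embedding similarity, and pull in
# the source text the triplets came from when generating the answer
response = index_conf_kg_embedded.query("your question", include_text=True, embedding_mode="hybrid", similarity_top_k=5)
print(response)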
So without include_text=True it won't use the text those triplets were generated from? Sounds like that defeats my purpose.
The doc that took around 1 min to index with the tree, vector, and keyword indexes still isn't done with KG after 30 min; it's still running.
Yea, because all those other indexes don't make as many LLM calls as the KG lol. I wonder if OpenAI is throttling the requests too.
Would you recommend KG for really small docs only? The first doc is still running at 1h35m, and I have 3 docs to do each with and without embeddings, so I guess I'll cancel that lol.
I've really only experimented with the Paul Graham essay example + the NYC Wikipedia page lol
In my opinion, it's more useful when you have a dedicated model that extracts the triplets for you, or an existing ontology. Then you don't have to rely on the LLM to extract triplets (slow and uses tokens), and you can insert the triplets directly.
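Something along these lines; this assumes your version exposes an upsert_triplet method on the KG index (later versions call the class KnowledgeGraphIndex), so double-check before relying on it:
# Hypothetical sketch: start from an empty KG index and insert pre-extracted triplets,
# skipping the LLM extraction step entirely
index_manual_kg = GPTKnowledgeGraphIndex.from_documents([], service_context=service_context_gpt3)
my_triplets = [("LlamaIndex", "is", "a data framework")]  # from your own extractor or ontology
for triplet in my_triplets:
    index_manual_kg.upsert_triplet(triplet)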
Gotcha, skipping KG for now. Have enough to playground-around with.