
Updated 2 years ago

General question on time to index

General question on time to index. I'm trying to index the same documents with different index types (GPTSimpleVectorIndex, GPTKnowledgeGraphIndex, GPTSimpleKeywordTableIndex).

Vector and Keyword were pretty fast, but KnowledgeGraph is so slow that I get the impression it's stuck: it has been running for over 10 minutes! I tried knowledge-graph indexing both with and without embeddings, but it doesn't make a difference!
Here is the code:

embed_model = LangchainEmbedding(OpenAIEmbeddings(query_model_name="text-embedding-ada-002"))

chunk_len = 256
chunk_overlap = 32

splitter = TokenTextSplitter(chunk_size=chunk_len, chunk_overlap=chunk_overlap)
node_parser = SimpleNodeParser(text_splitter=splitter, include_extra_info=True, include_prev_next_rel=False)

llm_predictor_gpt3 = LLMPredictor(llm=ChatOpenAI(temperature=0.2, model_name='gpt-3.5-turbo', max_tokens=2000))

prompt_helper_gpt3 = PromptHelper.from_llm_predictor(llm_predictor=llm_predictor_gpt3)

service_context_gpt3 = ServiceContext.from_defaults(llm_predictor=llm_predictor_gpt3, prompt_helper=prompt_helper_gpt3, embed_model=embed_model, node_parser=node_parser, chunk_size_limit=chunk_len)

Vector:
reader = JSONReader()

documents = SimpleDirectoryReader('/content/drive/Shareddrives/AI/docs').load_data()
index_conf_vec = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context_gpt3)

Keyword:
reader = JSONReader()

documents = SimpleDirectoryReader('/content/drive/Shareddrives/AI/docs').load_data()
index_conf_kw = GPTSimpleKeywordTableIndex.from_documents(documents, service_context=service_context_gpt3)

Knowledge Graph:
reader = JSONReader()
documents = SimpleDirectoryReader('/content/drive/Shareddrives/AI/docs').load_data()
index_conf_kg = GPTKnowledgeGraphIndex.from_documents(documents, max_triplets_per_chunk=3, service_context=service_context_gpt3)
index_conf_kg_embedded = GPTKnowledgeGraphIndex.from_documents(documents, max_triplets_per_chunk=3, service_context=service_context_gpt3, include_embeddings=True)
Your chunk size is pretty small. Maybe try decreasing the triplets per chunk to 1, or increasing the chunk size?

Each chunk is a call to the LLM, so if you have a large index, it can take some time (especially when OpenAI might already be slow).
Makes sense. Does increasing the triplets per chunk also increase the time?
a little bit, but mostly it's the number of chunks
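
A rough back-of-the-envelope sketch of why the chunk count dominates. It assumes, as described above, one LLM call per chunk, and a splitter that advances by chunk_size minus chunk_overlap tokens each step. The function name and the 100,000-token corpus size are made up for illustration, and the real splitter's chunk boundaries may differ slightly.

```python
import math


def estimate_chunks(total_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Approximate how many chunks a sliding-window token splitter produces."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new tokens covered by each extra chunk
    if stride <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


if __name__ == "__main__":
    total_tokens = 100_000  # hypothetical corpus size
    for chunk_size in (256, 512, 1024):
        n_chunks = estimate_chunks(total_tokens, chunk_size, chunk_overlap=32)
        # Assumption from the thread: roughly one LLM call per chunk for the KG index.
        print(f"chunk_size={chunk_size}: ~{n_chunks} chunks, so ~{n_chunks} LLM calls")
```

Under these assumptions, going from a chunk size of 256 to 1024 cuts the number of extraction calls by roughly a factor of four, while max_triplets_per_chunk only changes how much each call has to generate.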
Can you briefly explain the difference between
GPTKnowledgeGraphIndex.from_documents(documents)
and
GPTKnowledgeGraphIndex.from_documents(documents, include_embeddings=True)?
With include_embeddings=True, it generates an embedding for each triplet; without it, it just extracts the triplets themselves. Either way, extracting triplets is done with LLM calls, which can be slow-ish.

Then at query time, it extracts keywords from the query and uses those keywords to find triplets that overlap with the query keywords. If you use embeddings, it will also return triplets whose embeddings are similar to the query's.

If include_text=True is in your query call, it will use the text where those triplets were found to generate an answer. If it's False, it will use only the triplets themselves to generate an answer.
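
To make the query-time behavior a bit more concrete, here is a toy, library-free sketch of keyword lookup over stored triplets with an include_text switch. It is not the library's implementation: the names (RECORDS, build_table, retrieve) and the sample data are hypothetical, keyword extraction from the query is assumed to have happened already, and embedding similarity is left out entirely.

```python
from collections import defaultdict

# Hypothetical pre-extracted triplets, each paired with the chunk text it came from.
RECORDS = [
    (("Alice", "works at", "Acme"), "Alice joined Acme as an engineer."),
    (("Acme", "is based in", "Berlin"), "Acme has its headquarters in Berlin."),
]


def build_table(records):
    """Map each lowercased subject/object keyword to the triplets mentioning it."""
    table = defaultdict(list)
    for triplet, source_text in records:
        subject, _, obj = triplet
        for keyword in (subject.lower(), obj.lower()):
            table[keyword].append((triplet, source_text))
    return table


def retrieve(table, query_keywords, include_text=False):
    """Collect context for the matching triplets.

    include_text=False -> only the triplets, rendered as short strings.
    include_text=True  -> the source text the triplets were extracted from.
    """
    context = []
    for keyword in query_keywords:
        for triplet, source_text in table.get(keyword.lower(), []):
            item = source_text if include_text else " ".join(triplet)
            if item not in context:  # avoid duplicates across keywords
                context.append(item)
    return context


if __name__ == "__main__":
    table = build_table(RECORDS)
    print(retrieve(table, ["Acme"]))
    print(retrieve(table, ["Acme"], include_text=True))
```

In this toy version, the context handed to the answer step is either the bare triplets or the original chunk text, which is the trade-off described above.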
So without include_text=True it won't use the text those triplets were generated from? Sounds like it defeats my purpose 🙂
So it's slow anyway 😄
The doc that took the tree, vector, and keyword indexes around 1 min to index is still not done with KG; it's still running after 30 min.
Yea, because all those other indexes don't make as many LLM calls as the KG lol. I wonder if OpenAI is throttling the requests too.
Would you recommend KG for really small docs only? It's still running at 1h35m on the first doc. I have 3 docs and index each with and without embeddings, so I guess I'll cancel that lol
I've really only experimented with the Paul Graham essay example + the NYC Wikipedia page lol

In my opinion, it's more useful when you have a dedicated model that extracts the triplets for you, or when you have an existing ontology. Then you don't have to rely on the LLM to extract triplets (slow and uses tokens), and you can insert the triplets directly.
Gotcha, skipping KG for now 🙂 have enough to playground-around with