Speed

Default setup; the LLM predictor is gpt-3.5-turbo.
What kind of index? Any other specific settings?

Gpt-3.5 can be very slow depending on the time of day tbh
You could try enabling streaming to make it feel faster at least (and there might be some other tweaks depending on other details)
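(For reference, a minimal sketch of what enabling streaming could look like with the query-engine pattern used in the snippet below; the question string is a placeholder and the streaming methods are assumed from the 0.6.x-era API.)

Plain Text
# minimal sketch: stream tokens as they arrive instead of waiting for the full response
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("What does the document say about X?")  # placeholder question
streaming_response.print_response_stream()  # prints tokens incrementally as they are generated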
This is the code snippet:

Plain Text
# imports for the legacy llama_index 0.6.x / LangChain API used in this snippet
from langchain.chat_models import ChatOpenAI
from llama_index import (
    GPTVectorStoreIndex,
    LLMPredictor,
    PromptHelper,
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

def data_ingestion_indexing(directory_path):

    #constraint parameters
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600

    #allows the user to explicitly set certain constraint parameters
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    #LLMPredictor is a wrapper class around LangChain's LLMChain that allows easy integration into LlamaIndex
    llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.5, model_name="gpt-3.5-turbo", max_tokens=num_outputs))

    #loads data from the specified directory path
    documents = SimpleDirectoryReader(directory_path).load_data()

    #constructs service_context
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

    #when first building the index
    index = GPTVectorStoreIndex.from_documents(
        documents, service_context=service_context
    )

    #persist index to disk, default "storage" folder
    index.storage_context.persist()

    return index

def data_querying(input_text):

    #rebuild storage context
    storage_context = StorageContext.from_defaults(persist_dir="./storage")

    #loads index from storage
    index = load_index_from_storage(storage_context)
    
    #queries the index with the input text
    response = index.as_query_engine().query(input_text)
    
    return response.response
I was thinking I could use the example you published earlier for using a GPU, where we set up our own LLM
It might help, yeah, because then at least you aren't depending on OpenAI's servers (which can be under heavy load at times)

Definitely check out the GPU section here (you'll need at least 15GB of VRAM)

https://colab.research.google.com/drive/16QMQePkONNlDpgiltOi7oRQgmB8dU5fl?usp=sharing
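(A rough sketch of what the local-model route could look like while keeping the rest of the snippet unchanged: wrap a HuggingFace pipeline from LangChain in the same LLMPredictor used above. The model id is a placeholder, and the wiring assumes the 0.6.x-era ServiceContext API.)

Plain Text
# sketch only: local HuggingFace model via LangChain, wrapped like the ChatOpenAI predictor above
from langchain.llms import HuggingFacePipeline
from llama_index import LLMPredictor, ServiceContext

local_llm = HuggingFacePipeline.from_model_id(
    model_id="facebook/opt-iml-max-1.3b",  # placeholder; pick a model that fits your GPU
    task="text-generation",
    device=0,  # run on the first GPU
)
llm_predictor = LLMPredictor(llm=local_llm)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)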
I really liked the results from the OpenAI queries though, even though I'm not usually focused on OpenAI. So you think it's just the bottleneck from OpenAI that causes the delay?

Just curious, is the index query running bi-encoders for similarity search? I'm guessing I can look up the code on GitHub
Yeah, open-source models have not caught up to OpenAI yet.

And it's 100% the LLM calls (OpenAI, or any LLM) that will be the bottleneck

For the similarity search, it embeds the query text using text-embedding-ada-002 (fast/cheap), then runs a cosine similarity search to get the top-k nodes (k is 2 by default)
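(A toy numpy illustration of that retrieval step, just to show the idea; this is not LlamaIndex's actual implementation.)

Plain Text
# toy illustration: top-k retrieval by cosine similarity over precomputed node embeddings
import numpy as np

def top_k_nodes(query_emb, node_embs, k=2):
    query = query_emb / np.linalg.norm(query_emb)
    nodes = node_embs / np.linalg.norm(node_embs, axis=1, keepdims=True)
    scores = nodes @ query              # cosine similarity of each node to the query
    top = np.argsort(scores)[::-1][:k]  # indices of the k most similar nodes
    return top, scores[top]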
There was a recent feature (in v0.6.11) that will show the time of each step, if you are curious
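(For anyone wanting to reproduce that step-by-step timing, a sketch of wiring the debug callback into the service context; it prints a trace like the one shown further down.)

Plain Text
# sketch: enable per-step timing traces via the callbacks API
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager, LlamaDebugHandler

llama_debug = LlamaDebugHandler(print_trace_on_end=True)  # prints the "Trace: query" breakdown
callback_manager = CallbackManager([llama_debug])
service_context = ServiceContext.from_defaults(callback_manager=callback_manager)
# pass this service_context when building the index / query engine, as in the snippet above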
Presumably GPT-4 is way faster? I have been on the waitlist for ages!
Plain Text
**********
Trace: query
    |_query ->  35.440498 seconds
      |_retrieve ->  0.140897 seconds
        |_embedding ->  0.125669 seconds
      |_synthesize ->  35.299303 seconds
        |_llm ->  35.281847 seconds
**********
That is a lot of time. Going to work with mpnet and GPT4All now; the above result is for OpenAI.
Is GPT4All running on GPU?
(I'm glad my tracer works haha)
Setting up as we speak. I am trying to see if I can make a basic function: text embedding with mpnet, then run retrieval with cosine similarity, push that result to gpt-3.5 as the decoder, and test the time. 35 seconds is too long.
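(A rough sketch of that setup: local mpnet embeddings for retrieval, gpt-3.5 kept as the decoder. The sentence-transformers checkpoint name is the standard one, and the wiring assumes the 0.6.x ServiceContext API from the snippet above.)

Plain Text
# sketch: local mpnet embeddings for retrieval, OpenAI gpt-3.5 for synthesis
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, LLMPredictor, ServiceContext

embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.5, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model=embed_model,
)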
Ah I see. That totally makes sense (and I agree, gpt-3.5 seems to be very slow lately)
This code is a piece of art. I am going to use it to test performance with faiss-gpu, mpnet, and a local encoder model.
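(For the faiss-gpu part, the core idea is just moving a flat index onto the GPU; a bare-bones sketch with faiss alone, where the vectors and sizes are placeholders.)

Plain Text
# bare-bones sketch: cosine (inner-product) search on GPU with faiss; vectors are placeholders
import faiss
import numpy as np

dim = 768  # all-mpnet-base-v2 embedding size
node_embs = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(node_embs)  # normalize so inner product equals cosine similarity

cpu_index = faiss.IndexFlatIP(dim)
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # move the index to GPU 0
gpu_index.add(node_embs)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = gpu_index.search(query, 2)  # top-2 nodes, mirroring the default top-k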