Speed

Default setup; the LLM predictor is gpt-3.5-turbo.
What kind of index? Any other specific settings?

Gpt-3.5 can be very slow depending on the time of day tbh
You could try enabling streaming to make it feel faster at least (and there might be some other tweaks depending on other details)
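(For reference, a minimal sketch of what enabling streaming could look like with the query-engine pattern used in the snippet below; the question string is a placeholder and the streaming methods are assumed from the 0.6.x-era API.)

Plain Text
# minimal sketch: stream tokens as they arrive instead of waiting for the full response
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("What does the document say about X?")  # placeholder question
streaming_response.print_response_stream()  # prints tokens incrementally as they are generated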
This is the code snippet:

Plain Text
# imports for the legacy llama_index 0.6.x / LangChain API used in this snippet
from langchain.chat_models import ChatOpenAI
from llama_index import (
    GPTVectorStoreIndex,
    LLMPredictor,
    PromptHelper,
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

def data_ingestion_indexing(directory_path):

    #constraint parameters
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600

    #allows the user to explicitly set certain constraint parameters
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    #LLMPredictor is a wrapper class around LangChain's LLMChain that allows easy integration into LlamaIndex
    llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.5, model_name="gpt-3.5-turbo", max_tokens=num_outputs))

    #loads data from the specified directory path
    documents = SimpleDirectoryReader(directory_path).load_data()

    #constructs service_context
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

    #when first building the index
    index = GPTVectorStoreIndex.from_documents(
        documents, service_context=service_context
    )

    #persist index to disk, default "storage" folder
    index.storage_context.persist()

    return index

def data_querying(input_text):

    #rebuild storage context
    storage_context = StorageContext.from_defaults(persist_dir="./storage")

    #loads index from storage
    index = load_index_from_storage(storage_context)
    
    #queries the index with the input text
    response = index.as_query_engine().query(input_text)
    
    return response.response
I was thinking I could use the example you published earlier for using a GPU, where we set up our own LLM
It might help, yeah, because then at least you aren't depending on OpenAI's servers (which can be under heavy load at times)

Definitely check out the GPU section here (you'll need at least 15GB of VRAM)

https://colab.research.google.com/drive/16QMQePkONNlDpgiltOi7oRQgmB8dU5fl?usp=sharing
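(A rough sketch of what the local-model route could look like while keeping the rest of the snippet unchanged: wrap a HuggingFace pipeline from LangChain in the same LLMPredictor used above. The model id is a placeholder, and the wiring assumes the 0.6.x-era ServiceContext API.)

Plain Text
# sketch only: local HuggingFace model via LangChain, wrapped like the ChatOpenAI predictor above
from langchain.llms import HuggingFacePipeline
from llama_index import LLMPredictor, ServiceContext

local_llm = HuggingFacePipeline.from_model_id(
    model_id="facebook/opt-iml-max-1.3b",  # placeholder; pick a model that fits your GPU
    task="text-generation",
    device=0,  # run on the first GPU
)
llm_predictor = LLMPredictor(llm=local_llm)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)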
I really liked the results from the OpenAI queries though, even though I'm not usually focused on OpenAI. So you think it's just the bottleneck from OpenAI that causes the delay?

Just curious, is the index query running bi-encoders for similarity search? I'm guessing I can look up the code on GitHub
Yeah, open-source models have not caught up to OpenAI yet.

And it's 100% the LLM calls (OpenAI, or any LLM) that will be the bottleneck

For the similarity search, it embeds the query text using text-embedding-ada-002 (fast/cheap), then runs a cosine similarity search to get the top-k nodes (k is 2 by default)
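(A toy numpy illustration of that retrieval step, just to show the idea; this is not LlamaIndex's actual implementation.)

Plain Text
# toy illustration: top-k retrieval by cosine similarity over precomputed node embeddings
import numpy as np

def top_k_nodes(query_emb, node_embs, k=2):
    query = query_emb / np.linalg.norm(query_emb)
    nodes = node_embs / np.linalg.norm(node_embs, axis=1, keepdims=True)
    scores = nodes @ query              # cosine similarity of each node to the query
    top = np.argsort(scores)[::-1][:k]  # indices of the k most similar nodes
    return top, scores[top]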
There was a recent feature (in v0.6.11) that will show the time of each step, if you are curious
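(For anyone wanting to reproduce that step-by-step timing, a sketch of wiring the debug callback into the service context; it prints a trace like the one shown further down.)

Plain Text
# sketch: enable per-step timing traces via the callbacks API
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager, LlamaDebugHandler

llama_debug = LlamaDebugHandler(print_trace_on_end=True)  # prints the "Trace: query" breakdown
callback_manager = CallbackManager([llama_debug])
service_context = ServiceContext.from_defaults(callback_manager=callback_manager)
# pass this service_context when building the index / query engine, as in the snippet above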
Presumably GPT-4 is way faster? I have been on the waitlist for ages!
Plain Text
**********
Trace: query
    |_query ->  35.440498 seconds
      |_retrieve ->  0.140897 seconds
        |_embedding ->  0.125669 seconds
      |_synthesize ->  35.299303 seconds
        |_llm ->  35.281847 seconds
**********
That is a lot of time. Going to work with mpnet and GPT4All now; the above result is for OpenAI.
Is GPT4All running on GPU?
(I'm glad my tracer works haha)
Setting up as we speak. I am trying to see if I can make a basic function: text embedding with mpnet, then run retrieval with cosine similarity, push that result to gpt-3.5 as the decoder, and test the time. 35 seconds is too long.
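(A rough sketch of that setup: local mpnet embeddings for retrieval, gpt-3.5 kept as the decoder. The sentence-transformers checkpoint name is the standard one, and the wiring assumes the 0.6.x ServiceContext API from the snippet above.)

Plain Text
# sketch: local mpnet embeddings for retrieval, OpenAI gpt-3.5 for synthesis
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, LLMPredictor, ServiceContext

embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.5, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model=embed_model,
)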
Ah I see. That totally makes sense (and I agree, gpt-3.5 seems to be very slow lately)
This code is a piece of art. I am going to use it to test performance with faiss-gpu, mpnet, and a local encoder model.
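(For the faiss-gpu part, the core idea is just moving a flat index onto the GPU; a bare-bones sketch with faiss alone, where the vectors and sizes are placeholders.)

Plain Text
# bare-bones sketch: cosine (inner-product) search on GPU with faiss; vectors are placeholders
import faiss
import numpy as np

dim = 768  # all-mpnet-base-v2 embedding size
node_embs = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(node_embs)  # normalize so inner product equals cosine similarity

cpu_index = faiss.IndexFlatIP(dim)
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # move the index to GPU 0
gpu_index.add(node_embs)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = gpu_index.search(query, 2)  # top-2 nodes, mirroring the default top-k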