LlamaIndex_RAG_Memory/example_UI_app.py ...

I am trying to connect multiple clients to the same LLM, so the model gets loaded only once and can serve multiple clients.
I am using this script, but it loads the LLM twice and breaks with more than one connection: https://github.com/Josh-ee/LlamaIndex_RAG_Memory/blob/main/example_UI_app.py
I had it working so that it only loads once, but then it gave me a segmentation fault.
I believe I have to put the LLM into a new async function, but I don't know how to handle that if it is actually the solution.
Question: how can I make the script load one model while letting multiple clients each have their own chat that isn't influenced by other users?
You'll need to create a chat_engine instance per user. That way every chat will be unique to that particular user.
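Something along these lines might work (rough sketch; it assumes index (built with your service_context), QA_TEMPLATE and MEM_PROMPT already exist at module level):

Plain Text
import chainlit as cl
from llama_index.chat_engine import CondenseQuestionChatEngine

@cl.on_chat_start
async def factory():
    # Build a fresh chat engine for this connection, reusing the shared index
    query_engine = index.as_query_engine(
        streaming=True,
        similarity_top_k=2,
        text_qa_template=QA_TEMPLATE,
    )
    chat_engine = CondenseQuestionChatEngine.from_defaults(
        query_engine=query_engine,
        condense_question_prompt=MEM_PROMPT,
        chat_history=[],  # every connection gets its own, isolated history
        verbose=False,
    )
    cl.user_session.set('chat_engine', chat_engine)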
Thank you for your reply. I believe this script is already doing that and more; it also creates a new LLM instance, etc.
I don't know how to change the script so that it only creates a new chat_engine.
I get async errors and I don't know how to handle async correctly.
I couldn't find an example I understand better that applies this solution to a PDF chat_engine.
Not sure if this will work, but I think you can try putting the index part in the Chainlit session and adding a check like:

Plain Text
if not cl.user_session.get("index"):
  # do the instantiation of llm and set the index in user_session
  cl.user_session.set('index', index)
else:
  index = cl.user_session.get('index')
This should stop the index, and ultimately the LLM, from being reloaded.
Should I put this at the start of
Plain Text
@cl.on_chat_start
async def factory():

to block off the function if there is already a chat started for the current user?
But I believe that won't stop the model from being loaded again if a second client opens a connection?
I believe I have to move some functions out of the above function, so they only run once at init, not on every chat start / user connection.
@WhiteFang_Jr But that is where I don't know how to handle it.
If I put the LLM in its own function I get errors like these:
Plain Text
line 139, in <module>
    callback_manager=CallbackManager([cl.LlamaIndexCallbackHandler()]),
  File "/usr/local/lib/python3.10/dist-packages/chainlit/llama_index/callbacks.py", line 31, in __init__
    self.context = context_var.get()
LookupError: <ContextVar name='chainlit' at 0x7fd24f880630>
This part is more related to Chainlit. But let's give it a try. I guess we'll only learn even if we fail. lol πŸ˜…

Plain Text
@cl.on_chat_start
async def factory():
    global QA_TEMPLATE, MEM_PROMPT
    # Detect hardware acceleration device
    if torch.cuda.is_available():
        device = 'cuda'
        gpu_layers = 50
    elif torch.backends.mps.is_available():  # Assuming MPS backend exists
        device = 'mps'
        gpu_layers = 1
    else:
        device = 'cpu'
        gpu_layers = 0

    print(f'Using device: {device}')
    if not cl.user_session.get("index"):
        # do the instantiation of llm and set the index in user_session
        cl.user_session.set('index', index)
    else:
        index = cl.user_session.get('index')
    
    # Do the query engine part from here
    # percentile_cutoff: a measure for using the top percentage of relevant sentences.
    query_engine = index.as_query_engine(streaming=True, similarity_top_k = 2, text_qa_template=QA_TEMPLATE,

If this cl.user_session remains active for as long as the server is running, and doesn't get created with every new connection, then this part should work.
I get this error when I put it there:
Plain Text
line 91, in factory
    cl.user_session.set('index', index)
UnboundLocalError: local variable 'index' referenced before assignment

When I put it after the index is created, it loads the model for every connection.
You'll need to create the index first
and it gives gibberish in that case
[Attachment: image.png]
Yeah, when I put that code snippet after creating the index, it loads the LLM model twice.
So maybe move the index creation into its own separate function?
Can you share the code where you got this error?
No, I mean with the changes mentioned here:
Plain Text
@cl.on_chat_start
async def factory():
    global QA_TEMPLATE, MEM_PROMPT, CHAT_HISTORY
    # Detect hardware acceleration device
    if torch.cuda.is_available():
        device = 'cuda'
        gpu_layers = 50
    elif torch.backends.mps.is_available():  # Assuming MPS backend exists
        device = 'mps'
        gpu_layers = 1
    else:
        device = 'cpu'
        gpu_layers = 0

    print(f'Using device: {device}')

    if not cl.user_session.get("index"):
        # do the instantiation of llm and set the index in user_session
        cl.user_session.set('index', index)
    else:
        index = cl.user_session.get('index')

    embed_model_name = 'BAAI/bge-small-en-v1.5'
    # Create an instance of HuggingFace
    embed_model = HuggingFaceEmbedding(
        model_name=embed_model_name,
        device = device,
        normalize='True'
        )
    # load from disk
    path = 'RAG_VectorDB'
    db = chromadb.PersistentClient(path=path)

    chroma_collection = db.get_collection('arxiv_PDF_DB')

    print(chroma_collection.metadata)
    if embed_model_name != chroma_collection.metadata['embedding_used']:
        raise Warning('Not using the same embedding model!')

    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

 
Oh wait, you said to do the query_engine part earlier?
But the index part is lower in the function
Plain Text
@cl.on_chat_start
async def factory():
So it doesn't know what index is here
Plain Text
cl.user_session.set('index', index)
I have updated the code part, try this once

You may have to correct the indentation, as I typed it directly here.
This part
Plain Text
    # percentile_cutoff: a measure for using the top percentage of relevant sentences.
    query_engine = index.as_query_engine(streaming=True, similarity_top_k = 2, text_qa_template=QA_TEMPLATE,
    node_postprocessors=[SentenceEmbeddingOptimizer(percentile_cutoff=0.5, embed_model=embed_model)]
    )
    
    CHAT_HISTORY = []

    chat_engine = CondenseQuestionChatEngine.from_defaults(
        query_engine=query_engine,
        embed_model=embed_model,
        service_context = service_context,
        condense_question_prompt=MEM_PROMPT,
        chat_history=CHAT_HISTORY,
        verbose=False,
    )

    print('Model Loaded')
    cl.user_session.set('chat_engine', chat_engine)

needs to be outside of the if/else block
Thanks for your help, it still loads the LLM twice.
I am sure it is because
Plain Text
@cl.on_chat_start
gets triggered on every new client connection.
Plain Text
 if not cl.user_session.get("index"):

Only checks the session, but the LLM should be created at init, not per session.
Yeah, I think you'd be better off asking this part of the question in the Chainlit Discord, as it is more related to their code.
Thank you for your help, I do understand. They said to load it before the @cl.on_chat_start:

you can load the model once at the beginning of your app file (before the decorators)
example here https://github.com/Chainlit/cookbook/blob/main/local-llm/llama-cpp.py
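Roughly what I tried, following that pattern (a sketch with a placeholder model path, using LlamaCPP just as an example LLM and the same @cl.cache trick as the cookbook):

Plain Text
import chainlit as cl
from llama_index.llms import LlamaCPP

# Runs once per server process (cached), not once per connection
@cl.cache
def load_llm():
    return LlamaCPP(model_path='models/mistral-7b.gguf')  # placeholder path

llm = load_llm()  # loaded at import time, before any decorator fires

@cl.on_chat_start
async def factory():
    # Only per-connection objects go in here; the heavy model is already loaded
    cl.user_session.set('llm', llm)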

But I get errors when I change this code to load the LLM beforehand. I had it working, but then instead of two models being loaded I got a "segmentation fault" error.
I believe that has to do with async functions, but I am not sure.
You can't use the same LLM concurrently -- requests have to be processed sequentially by each instance of a model

If you want to have multiple chats, you can manage the chat history for each user and control from there, although I'm not sure how chainlit exposes this, I've never used it
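Conceptually something like this, independent of chainlit (just a sketch; query_engine is assumed to be built once at startup from the shared index):

Plain Text
from llama_index.chat_engine import CondenseQuestionChatEngine

# One shared query_engine (and therefore one loaded LLM),
# but a separate chat engine -- and chat history -- per user
chat_engines = {}

def get_chat_engine(user_id):
    if user_id not in chat_engines:
        chat_engines[user_id] = CondenseQuestionChatEngine.from_defaults(
            query_engine=query_engine,  # shared, built once at startup
            chat_history=[],            # isolated history for this user
        )
    return chat_engines[user_id]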
Okay, I was "afraid" of this, but it's also a good thing I guess. I read I could batch requests to speed them up; I was also thinking about using vLLM.
My idea was also to run a FastAPI service in the same Docker container as the model and send requests to it from Chainlit or another UI.
I have been looking for such an API example with PDF chat. The current query engine is pretty okay, but I have also been thinking about agents for more business-oriented RAG.
Do you have a nice example of a business-oriented RAG agent/query API for this use case?
I wouldn't mind starting from scratch since this has been breaking my head for two days now. I'd like to continue with LlamaIndex since I like what I have seen from your side 🙂
From there, you could modify the LLM to point to your vLLM server and off you go
just an easy example to start from though
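Something along these lines, assuming vLLM is serving its OpenAI-compatible endpoint (a sketch only; the URL and model name are placeholders, embed_model is the one from your existing script, and OpenAILike's parameters should be double-checked against your llama-index version):

Plain Text
from llama_index import ServiceContext
from llama_index.llms import OpenAILike

# Point llama_index at a vLLM OpenAI-compatible server instead of a local model
llm = OpenAILike(
    model='mistralai/Mistral-7B-v0.1',
    api_base='http://localhost:8000/v1',  # wherever your vLLM server listens
    api_key='fake',                       # vLLM doesn't check this by default
    is_chat_model=False,
)

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)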
That is looking good, I am going to try that out, thank you!
I had issues setting up create-llama
In the meantime I got vLLM working with Chainlit.
Now I'm trying to use vLLM in my query engine. I believe you can just load the vLLM LLM into the ServiceContext and the rest should work like before?
So I have:
Plain Text
model_path = 'mistralai/Mistral-7B-v0.1'
from llama_index.llms.vllm import Vllm
llm = Vllm(model_path)

Which I then load the same way in the ServiceContext:
Plain Text
service_context = ServiceContext.from_defaults(
        embed_model=embed_model,
        llm=llm,
        # callback manager show progress in UI
        callback_manager=CallbackManager([cl.LlamaIndexCallbackHandler()]),
    )

Error I am getting:
Plain Text
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/websockets/websockets_impl.py", line 247, in run_asgi
    result = await self.app(self.scope, self.asgi_receive, self.asgi_send)
  
  ...

  File "/usr/local/lib/python3.10/dist-packages/engineio/async_drivers/asgi.py", line 247, in send
    await self.asgi_send({'type': 'websocket.send',
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in sender
    await send(message)
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/websockets/websockets_impl.py", line 320, in asgi_send
    await self.send(data)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/protocol.py", line 635, in send
    await self.ensure_open()
  File "/usr/local/lib/python3.10/dist-packages/websockets/legacy/protocol.py", line 948, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedOK: received 1005 (no status received [internal]); then sent 1005 (no status received [internal])
On the first question:
2023-12-15 17:17:10 - Not Implemented
A second error behind it:
Plain Text
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/chainlit/utils.py", line 39, in wrapper
    return await user_function(**params_values)
  File "/home/Josh-ee_Llama_RAG/test-gpu.py", line 165, in main
    response = await cl.make_async(chat_engine._query_engine.query)(question)
  File "/usr/local/lib/python3.10/dist-packages/asyncer/_main.py", line 358, in wrapper
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  
  ...  

  File "/usr/local/lib/python3.10/dist-packages/llama_index/query_engine/retriever_query_engine.py", line 171, in _query
    response = self._response_synthesizer.synthesize(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/response_synthesizers/base.py", line 146, in synthesize
    response_str = self.get_response(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/response_synthesizers/compact_and_refine.py", line 38, in get_response
    return super().get_response(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/response_synthesizers/refine.py", line 127, in get_response
    response = self._give_response_single(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/response_synthesizers/refine.py", line 196, in _give_response_single
    response = self._service_context.llm_predictor.stream(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llm_predictor/base.py", line 251, in stream
    stream_response = self._llm.stream_complete(formatted_prompt)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/base.py", line 313, in wrapped_llm_predict
    f_return_val = f(_self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm.py", line 256, in stream_complete
    raise (ValueError("Not Implemented"))
In case someone wants the vLLM + chainlit script:
Plain Text
@cl.cache
def cachellm():
    # Cached by Chainlit, so the model is only loaded once per server process
    model_path = 'mistralai/Mistral-7B-v0.1'
    from llama_index.llms.vllm import Vllm
    llm = Vllm(model_path)
    return llm

@cl.on_chat_start
async def factory():
    global llm
    llm = cachellm()

@cl.on_message
async def main(message: cl.Message):
    question = message.content
    # Vllm.complete currently returns a list, hence the [0] below
    output = llm.complete(question)
    response_message = cl.Message(content=output[0].text)
    await response_message.send()
Rip, the vllm implementation is missing streaming. Should probably add that 😅
Would be lovely, I don't know if vLLM has it themselves.
Anyway, I am now trying to follow the LlamaIndex page on vLLM: https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/service_context.html
Plain Text
question = "What is the paper about?"
query_engine = index.as_query_engine(service_context=service_context)
response = query_engine.query(question)

Error:
Plain Text
2023-12-15 18:15:29 - 'list' object has no attribute 'text'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/chainlit/utils.py", line 39, in wrapper
    return await user_function(**params_values)
  File "/home/Josh-ee_Llama_RAG/vllm.py", line 102, in factory
    response = query_engine.query(question)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/base_query_engine.py", line 30, in query
    return self._query(str_or_query_bundle)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/query_engine/retriever_query_engine.py", line 171, in _query
    response = self._response_synthesizer.synthesize(

...

  File "/usr/local/lib/python3.10/dist-packages/llama_index/response_synthesizers/refine.py", line 182, in _give_response_single
    program(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/response_synthesizers/refine.py", line 53, in __call__
    answer = self._llm_predictor.predict(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llm_predictor/base.py", line 225, in predict
    output = response.text
AttributeError: 'list' object has no attribute 'text'

Followed by the first error in my previous post
I believe for vLLM the following line should be changed
Plain Text
File "/usr/local/lib/python3.10/dist-packages/llama_index/llm_predictor/base.py", line 225, in predict
    output = response.text

to
Plain Text
output = response[0].text
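Or, instead of editing the installed package, maybe a thin subclass would work (untested sketch; PatchedVllm is just a name I made up, and the CompletionResponse import path may differ between 0.9.x releases):

Plain Text
from llama_index.llms import CompletionResponse
from llama_index.llms.vllm import Vllm

class PatchedVllm(Vllm):
    def complete(self, prompt, **kwargs):
        result = super().complete(prompt, **kwargs)
        # The current integration returns a list; unwrap the first item
        return result[0] if isinstance(result, list) else result

    def stream_complete(self, prompt, **kwargs):
        # Fake streaming: yield the whole completion as a single chunk
        response = self.complete(prompt, **kwargs)
        yield CompletionResponse(text=response.text, delta=response.text)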
Hmm but why is the llm_predictor returning a list

It might be an issue in the vLLM integration code itself? πŸ€”
@Logan M I know from my example above that the Vllm import from llama_index.llms.vllm does return a list
I also checked it on GitHub.
[Attachment: image.png]
But in the end I would like to use the CondenseQuestionChatEngine for now, so I would prefer stream_complete ^^
Also, I was using llama-index version 0.9.13; I upgraded to 0.9.15, and this is the new error:
Plain Text
2023-12-15 21:13:51 - 'list' object has no attribute 'text'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/chainlit/utils.py", line 39, in wrapper
    return await user_function(**params_values)
  File "/home/Josh-ee_Llama_RAG/vllm-gpu.py", line 111, in factory
    response = query_engine.query(question)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/base_query_engine.py", line 30, in query
    return self._query(str_or_query_bundle)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/query_engine/retriever_query_engine.py", line 171, in _query
    response = self._response_synthesizer.synthesize(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/response_synthesizers/base.py", line 146, in synthesize
    response_str = self.get_response(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/response_synthesizers/compact_and_refine.py", line 38, in get_response
    return super().get_response(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/response_synthesizers/refine.py", line 146, in get_response
    response = self._give_response_single(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/response_synthesizers/refine.py", line 202, in _give_response_single
    program(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/response_synthesizers/refine.py", line 64, in __call__
    answer = self._llm.predict(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/llm.py", line 221, in predict
    output = response.text
AttributeError: 'list' object has no attribute 'text'