Hello, I am trying to figure out if it's possible to run the embedding model on my GPU rather than the CPU. I have this simple script where VectorStoreIndex.from_documents(documents) takes a long time to finish while maxing out my CPU.

Plain Text
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext, set_global_service_context
from llama_index.llms import OpenAILike

# LLM served by a local OpenAI-compatible endpoint
llm = OpenAILike(max_tokens=3900)

# Local embedding model -- this is the step that maxes out the CPU
service_context = ServiceContext.from_defaults(
    llm=llm, embed_model="local:BAAI/bge-small-en-v1.5", chunk_size=256, num_output=256
)
set_global_service_context(service_context)

documents = SimpleDirectoryReader('data2').load_data()

index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./vector-storage-esic2")



It seems like one of the following is true:
  1. I haven't configured something properly (in Llama-Index?) that would push the embeddings to the GPU (my unverified guess at what that config might look like is sketched below)
  2. This is just how Llama-Index works, and it can only use the CPU for embeddings
Any wisdom is greatly appreciated!
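
In case it helps clarify what I mean by option 1, here is my unverified guess at what a GPU-enabled config might look like. I'm assuming HuggingFaceEmbedding accepts a device argument and can be passed in place of the "local:..." string shortcut; I haven't actually gotten this working.

Plain Text
# Unverified sketch -- HuggingFaceEmbedding's device argument is my assumption here.
# llm and ServiceContext are the same as in the script above.
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    device="cuda",  # hoping this pushes the embedding model onto the GPU
)

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embed_model, chunk_size=256, num_output=256
)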
8 comments
Hello, I've had a lot of success using Llama-Index in a RAG context, but now I'm trying to reuse some of my RAG code to build a simple non-RAG tool that sends questions directly to a locally running instance of the new CodeLlama-Instruct model.

So I don't actually have any underlying index that I want to pull context from, but I'm trying to reuse some of my code that uses index.as_query_engine, and it seems I'm running into issues with an empty index.

I feel like there's a better way to do this, but I'm a bit stuck. Here's a snippet of my code right now:

Plain Text
from flask import Response, request

from llama_index import VectorStoreIndex
from llama_index.llms import ChatMessage, MessageRole
from llama_index.prompts import ChatPromptTemplate

# Empty index -- I don't have any documents, I just want to talk to the LLM
index = VectorStoreIndex([])


# `app` is the Dash app defined elsewhere in the module
@app.server.route("/code-llama/streaming-chat", methods=["POST"])
def streaming_chat():
  user_prompt = request.json["prompt"]
  user_question = request.json["question"]

  # Wrap the incoming prompt as a user message and build the QA template from it
  user_prompt = ChatMessage(role=MessageRole.USER, content=user_prompt)
  text_qa_template = ChatPromptTemplate(message_templates=[user_prompt])

  query_engine = index.as_query_engine(streaming=True, text_qa_template=text_qa_template)

  def response_stream():
    yield from (line for line in query_engine.query(user_question).response_gen)

  return Response(response_stream(), mimetype="text/response-stream")


At query_engine.query(user_question).response_gen I am getting an AttributeError: 'Response' object has no attribute 'response_gen'.
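
I'm starting to suspect the cleaner route is to skip the (empty) index entirely and stream from the LLM itself. Something like the sketch below is what I have in mind, assuming OpenAILike supports stream_complete the way the other Llama-Index LLMs do; I haven't verified this end to end.

Plain Text
# Unverified sketch: stream straight from the LLM, no query engine involved.
# `llm` is the OpenAILike instance pointing at the local CodeLlama server.
@app.server.route("/code-llama/streaming-chat", methods=["POST"])
def streaming_chat():
  user_question = request.json["question"]

  def response_stream():
    for chunk in llm.stream_complete(user_question):
      yield chunk.delta  # .delta holds just the newly generated text

  return Response(response_stream(), mimetype="text/response-stream")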
3 comments
Hello, I'm getting an 'LLMPredictor' object has no attribute '_llm' error when attempting to perform RAG inference on an index within a demo web app I'm building with Plotly Dash.
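
For context, the inference path looks roughly like the sketch below. This is a simplified, illustrative version rather than my exact code, and it assumes the index was persisted to disk the way I described in my earlier question.

Plain Text
# Simplified, illustrative sketch of the pattern -- not my exact code
from llama_index import StorageContext, load_index_from_storage

# service_context is built the same way as in my indexing script
storage_context = StorageContext.from_defaults(persist_dir="./vector-storage-esic2")
index = load_index_from_storage(storage_context, service_context=service_context)

query_engine = index.as_query_engine()
response = query_engine.query("example question")  # the error surfaces during this call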
23 comments