Hi all, I am trying to run a query from the examples and it is taking more than 30 minutes. I am using 4 A10 GPUs to load the model.

Plain Text
# Imports assumed for the llama_index / langchain versions in use here (approximate paths)
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader, LLMPredictor, PromptHelper, ServiceContext, LangchainEmbedding
from llama_index.node_parser import SimpleNodeParser
from llama_index.optimization.optimizer import SentenceEmbeddingOptimizer
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

max_input_size = 1024
num_output = 64
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
llm_predictor = LLMPredictor(llm=CustomLLM())  # CustomLLM is the user's wrapper class (see the example further down)
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper, embed_model=embed_model)
documents = SimpleDirectoryReader('/home/ubuntu/llama_index/examples/paul_graham_essay/data').load_data()
parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(documents)
index = GPTSimpleVectorIndex(nodes, service_context=service_context)
response = index.query("What did the author do growing up?", service_context=service_context, optimizer=SentenceEmbeddingOptimizer(percentile_cutoff=0.5))
Did you define the LLM inside the CustomLLM class or outside? I always put the LLM model itself as a global; putting it inside the CustomLLM class causes pydantic to do weird stuff.
Thanks, I used CustomLLM. Can you share an example?
Also, can we limit the number of LLM calls made in the query?
There is no limit on the LLM calls. It all depends on your input size and chunk size limit.
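As a rough sketch of what drives the call count (the kwargs below are standard query parameters for this llama_index version; treat it as illustrative, not a drop-in fix): each retrieved chunk that gets packed into a response/refine prompt costs an LLM call, so retrieval size and response packing are the main knobs.

Plain Text
# Sketch: fewer retrieved chunks + tighter packing => fewer completion calls
response = index.query(
    "What did the author do growing up?",
    similarity_top_k=1,        # send only the single closest chunk to the LLM
    response_mode="compact",   # pack as much retrieved text per prompt as possible
)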

I have an example, give me one sec
This is a pretty specific example; you can ignore the prompt template thing and call the model however it works for Vicuna.

Plain Text
# Imports for this snippet (paths for the transformers / langchain / llama_index versions in use)
import torch
from typing import Any, List, Mapping, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer
from langchain.llms.base import LLM
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, LLMPredictor, PromptHelper, ServiceContext

# define prompt helper
# set maximum input size
max_input_size = 2048
# set number of output tokens
num_output = 256
# set maximum chunk overlap
max_chunk_overlap = 20

model_name = "Writer/camel-5b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16
)
PROMPT_TEMPLATE = (
  "Below is an instruction that describes a task. "
  "Write a response that appropriately completes the request.\n\n"
  "### Instruction:\n{instruction}\n\n### Response:"
)

class CustomLLM(LLM):
  model_name = "Writer/camel-5b-hf"

  def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
    prompt = prompt.strip()
    text = PROMPT_TEMPLATE.format(instruction=prompt)
    # truncate to the model's context window; max_length is ignored without truncation=True
    model_inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_input_size).to("cuda")
    output_ids = model.generate(**model_inputs, max_new_tokens=num_output)  # greedy decoding; pass do_sample=True / temperature to sample
    output_text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
    clean_output = output_text.split("### Response:")[1].strip()
    return clean_output

  @property
  def _identifying_params(self) -> Mapping[str, Any]:
    return {"name_of_model": self.model_name}

  @property
  def _llm_type(self) -> str:
    return "custom"

llm_predictor = LLMPredictor(llm=CustomLLM())
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())

prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap, chunk_size_limit=512)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, embed_model=embed_model, prompt_helper=prompt_helper, chunk_size_limit=512)
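For reference, wiring this service context back into the index from the original question would look roughly like this (same data path and query as the first snippet, imports as in that snippet; a sketch only):

Plain Text
# Sketch: build and query the index from the question with the service context above
documents = SimpleDirectoryReader('/home/ubuntu/llama_index/examples/paul_graham_essay/data').load_data()
nodes = SimpleNodeParser().get_nodes_from_documents(documents)
index = GPTSimpleVectorIndex(nodes, service_context=service_context)
response = index.query("What did the author do growing up?")
print(response)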
Thanks @Logan M, will try with this.
@Logan M I made the changes and now I am running into this error:

ValueError: Got a larger chunk overlap (20) than chunk size (-139), should be smaller.
Ha classic. What prompt helper/service context settings do you have now?
Plain Text
# set maximum input size
max_input_size = 2048
# set number of output tokens
num_output = 256
# set maximum chunk overlap
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap, chunk_size_limit=512)
Plain Text
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper, embed_model=embed_model, chunk_size_limit=512)
Hmm... just for fun, try removing the chunk size limit altogether and let the library try to figure it out lol

The math gets a little complicated sometimes when these parameters are tweaked
Or maybe num_output needs to be a little smaller... it might take some fiddling around
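For what it's worth, the negative chunk size falls out of the prompt helper's token budget; very roughly (illustrative numbers, not the library's exact formula):

Plain Text
# Illustrative arithmetic, not the exact PromptHelper formula:
# per-chunk budget ~= max_input_size - num_output - tokens used by the prompt template + query
# e.g. if the template + query happened to eat ~1931 tokens:
#   2048 - 256 - 1931 = -139    # -> "chunk size (-139)" in the error
# Lowering num_output (or raising max_input_size / dropping chunk_size_limit)
# frees up room for the actual text chunks.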
@Logan M okay, I started the query around 8 minutes ago and it is still running. Does this example query take this much time?
Will reducing num_output make it faster?
Hmm I feel like something weird is happening. Maybe check if the GPU is actually being used?
Definitely not normal if it's on gpu
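A quick sanity check, as a sketch: confirm the weights actually landed on the GPU and were not offloaded to CPU or disk by device_map="auto".

Plain Text
# Sketch: verify the model really is on GPU after from_pretrained(..., device_map="auto")
import torch
print(torch.cuda.is_available())              # should be True
print(next(model.parameters()).device)        # expect cuda:0 (or similar), not cpu
print(getattr(model, "hf_device_map", None))  # layer -> device map; "cpu"/"disk" entries mean offloading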
The query is still running and all my GPUs are being used, checked with nvidia-smi.
Spooky.

In the CustomLLM class, in the _call function,
maybe print how long the text is that's going into it:

print(len(prompt.split(' ')))

Definitely shouldn't take that long lol
Yea that sounds about right. Probably close to 1700 tokens... tbh idk man, debugging this stuff is hard, even harder remotely haha
Try printing the prompt and then using that prompt with the model outside of llama_index maybe?
Take llama_index out of the equation; it might make it easier to narrow down the issue.
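Something like this isolates raw generation speed from llama_index (a sketch; it reuses tokenizer, model, and num_output from the example above, and the prompt string is whatever _call prints out):

Plain Text
# Sketch: time a single generate() call with no llama_index involved
import time

prompt = "...paste the printed prompt here..."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

start = time.time()
output_ids = model.generate(**inputs, max_new_tokens=num_output)
elapsed = time.time() - start

print(f"{inputs['input_ids'].shape[1]} input tokens -> {elapsed:.1f}s")
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])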
Thanks, will try that.