Hi all, I am trying to run a query from the examples and it is taking more than 30 minutes. I am using 4 A10 GPUs to load the model.

Plain Text
# Imports assumed for the llama_index / langchain versions in use here (approximate paths)
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader, LLMPredictor, PromptHelper, ServiceContext, LangchainEmbedding
from llama_index.node_parser import SimpleNodeParser
from llama_index.optimization.optimizer import SentenceEmbeddingOptimizer
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

max_input_size = 1024
num_output = 64
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
llm_predictor = LLMPredictor(llm=CustomLLM())  # CustomLLM is the user's wrapper class (see the example further down)
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper, embed_model=embed_model)
documents = SimpleDirectoryReader('/home/ubuntu/llama_index/examples/paul_graham_essay/data').load_data()
parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(documents)
index = GPTSimpleVectorIndex(nodes, service_context=service_context)
response = index.query("What did the author do growing up?", service_context=service_context, optimizer=SentenceEmbeddingOptimizer(percentile_cutoff=0.5))
Did you define the LLM inside the CustomLLM class or outside? I always put the LLM model itself as a global; putting it inside the CustomLLM class causes pydantic to do weird stuff.
Thanks, I used CustomLLM. Can you share an example?
Also, can we limit the number of LLM calls made in the query?
There is no limit on the LLM calls. It all depends on your input size and chunk size limit.
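As a rough sketch of what drives the call count (the kwargs below are standard query parameters for this llama_index version; treat it as illustrative, not a drop-in fix): each retrieved chunk that gets packed into a response/refine prompt costs an LLM call, so retrieval size and response packing are the main knobs.

Plain Text
# Sketch: fewer retrieved chunks + tighter packing => fewer completion calls
response = index.query(
    "What did the author do growing up?",
    similarity_top_k=1,        # send only the single closest chunk to the LLM
    response_mode="compact",   # pack as much retrieved text per prompt as possible
)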

I have an example, give me one sec
This is a pretty specific example; you can ignore the prompt template thing and call the model however it works for Vicuna.

Plain Text
# Imports for this snippet (paths for the transformers / langchain / llama_index versions in use)
import torch
from typing import Any, List, Mapping, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer
from langchain.llms.base import LLM
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, LLMPredictor, PromptHelper, ServiceContext

# define prompt helper
# set maximum input size
max_input_size = 2048
# set number of output tokens
num_output = 256
# set maximum chunk overlap
max_chunk_overlap = 20

model_name = "Writer/camel-5b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16
)
PROMPT_TEMPLATE = (
  "Below is an instruction that describes a task. "
  "Write a response that appropriately completes the request.\n\n"
  "### Instruction:\n{instruction}\n\n### Response:"
)

class CustomLLM(LLM):
  model_name = "Writer/camel-5b-hf"

  def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
    prompt = prompt.strip()
    text = PROMPT_TEMPLATE.format(instruction=prompt)
    # truncate to the model's context window; max_length is ignored without truncation=True
    model_inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_input_size).to("cuda")
    output_ids = model.generate(**model_inputs, max_new_tokens=num_output)  # greedy decoding; pass do_sample=True / temperature to sample
    output_text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
    clean_output = output_text.split("### Response:")[1].strip()
    return clean_output

  @property
  def _identifying_params(self) -> Mapping[str, Any]:
    return {"name_of_model": self.model_name}

  @property
  def _llm_type(self) -> str:
    return "custom"

llm_predictor = LLMPredictor(llm=CustomLLM())
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())

prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap, chunk_size_limit=512)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, embed_model=embed_model, prompt_helper=prompt_helper, chunk_size_limit=512)
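For reference, wiring this service context back into the index from the original question would look roughly like this (same data path and query as the first snippet, imports as in that snippet; a sketch only):

Plain Text
# Sketch: build and query the index from the question with the service context above
documents = SimpleDirectoryReader('/home/ubuntu/llama_index/examples/paul_graham_essay/data').load_data()
nodes = SimpleNodeParser().get_nodes_from_documents(documents)
index = GPTSimpleVectorIndex(nodes, service_context=service_context)
response = index.query("What did the author do growing up?")
print(response)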
Thanks @Logan M, will try with this.
@Logan M I made the changes and now I am running into this error:

ValueError: Got a larger chunk overlap (20) than chunk size (-139), should be smaller.
Ha classic. What prompt helper/service context settings do you have now?
Plain Text
# set maximum input size
max_input_size = 2048
# set number of output tokens
num_output = 256
# set maximum chunk overlap
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap, chunk_size_limit=512)
Plain Text
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper, embed_model=embed_model, chunk_size_limit=512)
Hmm... just for fun, try removing the chunk size limit altogether and let the library try to figure it out lol

The math gets a little complicated sometimes when these parameters are tweaked
Or maybe num_output needs to be a little smaller... it might take some fiddling around
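For what it's worth, the negative chunk size falls out of the prompt helper's token budget; very roughly (illustrative numbers, not the library's exact formula):

Plain Text
# Illustrative arithmetic, not the exact PromptHelper formula:
# per-chunk budget ~= max_input_size - num_output - tokens used by the prompt template + query
# e.g. if the template + query happened to eat ~1931 tokens:
#   2048 - 256 - 1931 = -139    # -> "chunk size (-139)" in the error
# Lowering num_output (or raising max_input_size / dropping chunk_size_limit)
# frees up room for the actual text chunks.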
@Logan M okay, I started the query around 8 minutes ago and it is still running. Does this example query take this much time?
Will reducing num_output make it faster?
Hmm I feel like something weird is happening. Maybe check if the GPU is actually being used?
Definitely not normal if it's on gpu
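A quick sanity check, as a sketch: confirm the weights actually landed on the GPU and were not offloaded to CPU or disk by device_map="auto".

Plain Text
# Sketch: verify the model really is on GPU after from_pretrained(..., device_map="auto")
import torch
print(torch.cuda.is_available())              # should be True
print(next(model.parameters()).device)        # expect cuda:0 (or similar), not cpu
print(getattr(model, "hf_device_map", None))  # layer -> device map; "cpu"/"disk" entries mean offloading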
The query is still running and all my GPUs are being used, checked with nvidia-smi.
Spooky.

In the CustomLLM class, in the _call function,
maybe print how long the text is that's going into it:

print(len(prompt.split(' ')))

Definitely shouldn't take that long lol
Yea that sounds about right. Probably close to 1700 tokens... tbh idk man, debugging this stuff is hard, even harder remotely haha
Try printing the prompt and then using that prompt with the model outside of llama_index maybe?
Take llama_index out of the equation; it might make it easier to narrow down the issue.
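Something like this isolates raw generation speed from llama_index (a sketch; it reuses tokenizer, model, and num_output from the example above, and the prompt string is whatever _call prints out):

Plain Text
# Sketch: time a single generate() call with no llama_index involved
import time

prompt = "...paste the printed prompt here..."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

start = time.time()
output_ids = model.generate(**inputs, max_new_tokens=num_output)
elapsed = time.time() - start

print(f"{inputs['input_ids'].shape[1]} input tokens -> {elapsed:.1f}s")
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])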
Thanks, will try that.