Jaqen 8180: stupid question, what is 4090?

@Jaqen Stupid question - what is a 4090? Do you have any insights into why inference would be so slow?
An RTX 4090, which is a powerful (but expensive) consumer graphics card πŸ™‚
Running on an M1 MacBook Pro is going to be slow for any local LLM tbh, especially compared to something like OpenAI
Thanks @Logan M for the clarification. I thought 4090 was another model offered in the GPT4All group of models.
Do you guys want to create a work chat? I'd love to group together to work on something or help each other.
@ashishsha @Logan M
I am trying the Writer 5B LLM (camel-5b) from Hugging Face on a 16GB GPU machine. However, the response is taking an average of 20-25 seconds to generate. Is this normal?

The model takes around 11GB and I have a good 5GB remaining.
Sounds about right. It's likely making more than one LLM call though. I think that model has a smaller input size, right?
No, just one LLM call.
I tried with .chat() as well, in which there are two LLM calls; it was taking around 30 seconds, which is fine since there are two LLM calls.

But if I'm trying with .query(), there is one instance of LLM interaction, and that is to synthesize the response.
Okay wait, you mean that to synthesize the response it is making multiple calls to the LLM?
yea that's what I meant! Depending on your settings (chunk size, num_outputs, max input size, top k), it might be retrieving more text than can fit into a single LLM call (in which case, it refines an answer). Pretty common for models with smaller input sizes to run into this
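(Rough sketch of the budget being described here; the numbers are placeholders and the "fits in one call vs. refine" rule is a simplification for illustration, not actual llama_index internals.)

Python
# Back-of-the-envelope token budget: if the retrieved chunks plus the prompt
# template plus the reserved output don't fit in the model's context window,
# the answer has to be refined across multiple LLM calls.
chunk_size = 512        # tokens per retrieved chunk
top_k = 2               # chunks retrieved per query
prompt_overhead = 200   # rough size of the prompt template + question
num_output = 256        # tokens reserved for the generated answer
max_input_size = 2048   # model context window

needed = top_k * chunk_size + prompt_overhead + num_output
print(f"{needed} tokens needed vs {max_input_size} available")
print("single LLM call" if needed <= max_input_size else "multiple (refine) LLM calls")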
For me it's
Plain Text
Chunk size: 512
num_outputs: it's the default
max_input_size: 2048
max_new_tokens: 256
top_k: didn't set it, so it's the default
Didn't change a thing
Python
HuggingFaceLLMPredictor(
    max_input_size=2048,
    max_new_tokens=256,
    temperature=0.25,
    do_sample=False,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="Writer/camel-5b-hf",
    model_name="Writer/camel-5b-hf",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    model_kwargs={"torch_dtype": torch.bfloat16},
)

From your example when you created the HF wrapper πŸ˜…
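(Note: query_wrapper_prompt isn't shown in this snippet. The LlamaIndex custom-LLM example for camel-5b wraps the query in the model's instruction format along these lines; the exact import path and template wording below are assumptions that depend on your llama_index version, not something taken from this thread.)

Python
from llama_index.prompts.prompts import SimpleInputPrompt  # import path may differ by version

# Instruction-style wrapper for camel-5b; the template text is illustrative only.
query_wrapper_prompt = SimpleInputPrompt(
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)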
Nice! And how/where did you set the chunk size?
I set the chunk size while creating the service_context; I hope that's what you meant.
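(For concreteness, a minimal sketch of what that looks like; depending on the llama_index version the argument is chunk_size or the older chunk_size_limit, so treat the exact kwarg as an assumption.)

Python
from llama_index import ServiceContext

# Plug the HuggingFaceLLMPredictor from above into the service context
# and set the chunk size used when splitting documents into nodes.
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,   # the HuggingFaceLLMPredictor defined earlier
    chunk_size_limit=512,          # or chunk_size=512 on newer versions
)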
Okay, so if the retrieved chunks add up to more than 2048 tokens, will there be more than one LLM call?
Yea like directly in the service context right?

Hmm interesting, with these settings, a query would retrieve two chunks (2*512 tokens), the prompt template + query is probably ~200 tokens, and we need to leave room for 256 output tokens. So, ~1480 tokens, which is way less than 2048

With these settings, I think it's only making one LLM call? Camel might just be really that slow

You can test it directly by doing something like this

Python
pred, prompt = llm_predictor.predict("really long string asking to tell a joke or something idk")


And use a string that's like 1500 tokens lol and see how fast it is
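(A minimal way to time that, assuming llm_predictor is the HuggingFaceLLMPredictor above and that predict accepts a plain string exactly as in the snippet; the filler string is only a rough approximation of 1500 tokens.)

Python
import time

long_prompt = "tell me a joke about programmers " * 200  # very roughly ~1500 tokens of filler

start = time.perf_counter()
pred, formatted_prompt = llm_predictor.predict(long_prompt)  # same call as above
elapsed = time.perf_counter() - start

print(f"Generated {len(pred)} characters in {elapsed:.1f} seconds")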
Yes, let me try this. I'm really hoping it's camel though πŸ˜‚