Hey again, I'm running into some issues

Hey again, I'm running into some issues running a GPTQ model locally with llama-index.

I'm seeing a CUBLAS_STATUS_NOT_SUPPORTED error when trying to make a query. It's really weird because I'm able to use transformers to run this same GPTQ model without llama-index, but running it within llama-index is giving me this error.

Plain Text
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

The only options I gave to HuggingFaceLLM are model_name and device_map="auto".
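
(For reference, that setup presumably looks roughly like the sketch below; the GPTQ repo id is a placeholder, not the model actually used in this thread.)

Python
from llama_index.llms import HuggingFaceLLM

# Placeholder GPTQ checkpoint; substitute the repo id actually being used.
llm = HuggingFaceLLM(
    model_name="TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)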

Any idea what steps I should take?
6 comments
If you can use it outside of llamaindex, then load the model outside of llamaindex in the way that works for you, and pass it in (instead of passing in the model name)

Plain Text
HuggingFaceLLM(..., model=model)
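
Concretely, the suggestion is along these lines (a sketch; the repo id is a placeholder, and it assumes the GPTQ dependencies that already work for direct transformers inference are installed):

Python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms import HuggingFaceLLM

model_id = "TheBloke/Llama-2-7B-GPTQ"  # placeholder repo id
# Load the quantized model the same way it already works outside llama-index.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hand the already-loaded objects to llama-index instead of a model name.
llm = HuggingFaceLLM(model=model, tokenizer=tokenizer)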
Oh? I didn't see that. Let me give that a shot.
Hrm, no, that didn't seem to do it. Still seeing that same error, but I can still do inference directly from the model.
hmm you might have to debug the source code tbh. All it's doing is applying the auto-tokenizer and passing that to the model 🤷‍♂️ Maybe you can spot the difference
https://github.com/run-llama/llama_index/blob/main/llama_index/llms/huggingface.py
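
One way to spot the difference is to replicate, outside of llama-index, roughly what that file does (tokenize the prompt, move the inputs to the model's device, call generate) and compare it with the direct-inference call that already works. A rough sketch; the prompt, generation settings, and the token_type_ids handling are illustrative assumptions, not copied from the linked source:

Python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Hello, world"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Some tokenizers emit extra keys (e.g. token_type_ids) that generate() rejects;
# HuggingFaceLLM has a tokenizer_outputs_to_remove option for exactly this.
inputs.pop("token_type_ids", None)

output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))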
Time to go spelunking then
🔦 🥾