Hey again, I'm running into some issues

Hey again, I'm running into some issues running a GPTQ model locally with llama-index.

I'm seeing a CUBLAS_STATUS_NOT_SUPPORTED error when trying to make a query. It's really weird because I'm able to use transformers to run this same GPTQ model without llama-index, but running it within llama-index is giving me this error.

Plain Text
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

The only options I gave to HuggingFaceLLM are model_name and device_map="auto".
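
(For reference, that setup presumably looks roughly like the sketch below; the GPTQ repo id is a placeholder, not the model actually used in this thread.)

Python
from llama_index.llms import HuggingFaceLLM

# Placeholder GPTQ checkpoint; substitute the repo id actually being used.
llm = HuggingFaceLLM(
    model_name="TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)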

Any idea what steps I should take?
6 comments
If you can use it outside of llamaindex, then load the model outside of llamaindex in the way that works for you, and pass it in (instead of passing in the model name)

Plain Text
HuggingFaceLLM(..., model=model)
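
Concretely, the suggestion is along these lines (a sketch; the repo id is a placeholder, and it assumes the GPTQ dependencies that already work for direct transformers inference are installed):

Python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms import HuggingFaceLLM

model_id = "TheBloke/Llama-2-7B-GPTQ"  # placeholder repo id
# Load the quantized model the same way it already works outside llama-index.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hand the already-loaded objects to llama-index instead of a model name.
llm = HuggingFaceLLM(model=model, tokenizer=tokenizer)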
Oh? I didn't see that. Let me give that a shot.
Hrm, no, that didn't seem to do it. Still seeing that same error, but I can still do inference directly from the model.
hmm you might have to debug the source code tbh. All it's doing is applying the auto-tokenizer and passing that to the model 🤷‍♂️ Maybe you can spot the difference
https://github.com/run-llama/llama_index/blob/main/llama_index/llms/huggingface.py
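
One way to spot the difference is to replicate, outside of llama-index, roughly what that file does (tokenize the prompt, move the inputs to the model's device, call generate) and compare it with the direct-inference call that already works. A rough sketch; the prompt, generation settings, and the token_type_ids handling are illustrative assumptions, not copied from the linked source:

Python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Hello, world"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Some tokenizers emit extra keys (e.g. token_type_ids) that generate() rejects;
# HuggingFaceLLM has a tokenizer_outputs_to_remove option for exactly this.
inputs.pop("token_type_ids", None)

output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))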
Time to go spelunking then
🔦 🥾