The community member is using the LlamaIndex postprocessor with a Hugging Face cross-encoder to rerank their results, which takes around 3 seconds. However, running the same model via a Hugging Face inference endpoint on an NVIDIA T4 GPU takes less than 1 second. The community member would prefer the LlamaIndex version because it includes a score in the reranked results (NodeWithScore), and is wondering whether there is a way to use an available GPU to speed up the LlamaIndex postprocessing.
In the comments, another community member suggests that if CUDA is installed, the LlamaIndex postprocessor should use the GPU automatically.
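One way to rule out a silent CPU fallback is to check CUDA visibility and pass the device explicitly. A minimal sketch, assuming the `llama_index.core` import path, that `SentenceTransformerRerank` accepts a `device` argument in the installed version, and a placeholder cross-encoder model (the original thread does not name one):

```python
import torch
from llama_index.core.postprocessor import SentenceTransformerRerank

# If this prints False, CUDA isn't visible to PyTorch and the
# reranker will fall back to CPU regardless of the device setting.
print(torch.cuda.is_available())

reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",  # placeholder model choice
    top_n=3,
    device="cuda",  # request the GPU explicitly instead of relying on auto-detection
)

# Usage: rerank retrieved nodes against the query; each returned
# node is a NodeWithScore carrying the cross-encoder score.
# reranked_nodes = reranker.postprocess_nodes(nodes, query_str="your query")
```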
hey, I've been using the LlamaIndex postprocessor with a Hugging Face cross-encoder to rerank my results. I've noticed it usually takes around 3 seconds, whereas the same model called via API on a Hugging Face inference endpoint takes <1s (on an NVIDIA T4). I'd still love to use LlamaIndex's version because it includes a score in the rerank as well (NodeWithScore), so I was wondering if there's a way to have the postprocessing utilize an available GPU for speed?