How can I use local models for summarization from HF in llama-index?
Thanks Logan! I'm referring to models like facebook/bart-large-cnn, not LLMs
I'm struggling to set up a FastAPI app that calls a summarization model locally
but it seems like it's not easy to make it work asynchronously
whereas I remember that local embedding models from HF do work async with llama-index
so I was wondering how to do the same with other models
Ah, those types of models aren't really compatible -- llama index is intended for LLMs and embedding models

You'd have to run the model yourself and generate summaries (which tbh isn't too hard to do)
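For example, a rough sketch of calling facebook/bart-large-cnn directly through the transformers summarization pipeline, outside of llama-index (the generation lengths here are just illustrative defaults, not required settings):

```python
from transformers import pipeline

# Load the summarization model mentioned in the thread once, at startup
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # your scraped article text goes here

# max_length / min_length are illustrative; tune them for your articles
summary = summarizer(article, max_length=130, min_length=30, do_sample=False)
print(summary[0]["summary_text"])
```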
I'm struggling to make it handle async requests
but maybe these models aren't meant for this purpose, idk
any suggestions on this?
Async is tough.

One copy of a model can only handle requests sequentially.

So you can set up a queue and process requests as they come. You can also duplicate the model in memory to scale

There should be some packages out there that handle this for you (TorchServe comes to mind)
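A minimal sketch of that queue idea in FastAPI, assuming one copy of the model: a single background worker owns the pipeline and processes queued requests sequentially, while each request awaits its result (the `/summarize` route and `SummarizeRequest` model are made up for illustration, not from any particular package):

```python
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue()

# One copy of the model, owned by one worker -> sequential processing
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


class SummarizeRequest(BaseModel):
    text: str


async def worker():
    # Pull jobs one at a time; the blocking model call runs in a thread
    # so the event loop stays free to accept new requests meanwhile.
    while True:
        text, future = await queue.get()
        try:
            result = await asyncio.to_thread(summarizer, text)
            future.set_result(result[0]["summary_text"])
        except Exception as exc:
            future.set_exception(exc)
        finally:
            queue.task_done()


@app.on_event("startup")
async def start_worker():
    asyncio.create_task(worker())


@app.post("/summarize")
async def summarize(req: SummarizeRequest):
    # Enqueue the job and wait for the worker to fill in the result
    future = asyncio.get_running_loop().create_future()
    await queue.put((req.text, future))
    return {"summary": await future}
```

To get the "duplicate the model in memory" part, you could start N worker tasks, each with its own pipeline instance, all pulling from the same queue.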
Yeah that’s hard! One approach I was thinking about is to spin up N>1 Docker containers, each running its own instance of the model, and then somehow manage request routing to the available ones
Our server has tons of RAM and hopefully enough cores to handle this, even if it's clearly subsubsubsuboptimal
What do you think?
My goal is to process a large number of requests in order to generate metadata for each article I’m scraping from the web
So it definitely needs to be fast, and sequential processing isn’t really helpful
Yea, I think something like TorchServe should handle this well.

Otherwise you need to set up something like Kubernetes with autoscaling and an API -- you'd have to do a lot more yourself