How can I use local models for summarization from HF in LlamaIndex?
emmepra
last year
How can I use local models for summarization from HF in LlamaIndex?
18 comments
Logan M
last year
We support a ton of other LLM libraries (including huggingface)
https://docs.llamaindex.ai/en/stable/module_guides/models/llms.html
Here's an example with Zephyr
https://colab.research.google.com/drive/1UoPcoiA5EOBghxWKWduQhChliMHxla7U?usp=sharing
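(For reference, a minimal sketch of the local-HF-LLM setup those links describe, assuming llama-index with the HuggingFace LLM integration installed; the import path, package name, and kwargs vary by version, and the model/prompt below are only illustrative:)

```python
# Hedged sketch: pointing LlamaIndex at a local HuggingFace LLM, per the docs linked above.
# Assumes `pip install llama-index llama-index-llms-huggingface` (0.10+ layout);
# older releases expose HuggingFaceLLM from llama_index.llms instead.
from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
    context_window=3900,
    max_new_tokens=256,
    device_map="auto",
)

print(llm.complete("Summarize in one sentence: LlamaIndex can run local HuggingFace LLMs."))
```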
emmepra
last year
Thanks Logan! I'm referring to models like facebook/bart-large-cnn, not LLMs
emmepra
last year
I'm struggling to set up a FastAPI app that calls a summarization model locally
emmepra
last year
but it seems like it's not easy to get it to work asynchronously
emmepra
last year
whereas I remember that local embedding models from HF do work async with llama-index
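(For context, this is roughly the async embedding path being referred to; a sketch assuming llama-index 0.10+ with the llama-index-embeddings-huggingface package, with the usual default model name:)

```python
# Hedged sketch of the async embedding call with a local HF model.
# Assumes: pip install llama-index-embeddings-huggingface
import asyncio
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

async def main():
    embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
    # aget_text_embedding is the async counterpart of get_text_embedding
    vector = await embed_model.aget_text_embedding("some article text")
    print(len(vector))

asyncio.run(main())
```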
emmepra
last year
so I was wondering how to do it with other models
Logan M
last year
Ah, those types of models aren't really compatible -- llama index is intended for LLMs and embedding models
You'd have to run the model yourself and generate summaries (which tbh isn't too hard to do)
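(Running it yourself looks roughly like this; a sketch using plain transformers with the exact checkpoint mentioned above, where the article text and length limits are just placeholders:)

```python
# Hedged sketch of "run the model yourself": a standard transformers summarization pipeline.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # the scraped article text goes here
result = summarizer(article, max_length=130, min_length=30, do_sample=False)
print(result[0]["summary_text"])
```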
emmepra
last year
Yeah, sure!
emmepra
last year
I'm struggling with getting it to handle async requests
emmepra
last year
but probably these models are not meant for this purpose, idk
emmepra
last year
Any suggestions on this?
Logan M
last year
Async is tough.
One copy of a model can only handle requests sequentially.
So you can set up a queue and process requests as they come. You can also duplicate the model in memory to scale
There should be some packages out there that handle this for you (TorchServe comes to mind)
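(A minimal sketch of that queue idea in FastAPI terms; the endpoint name, schema, and length limits are illustrative. One lock per model copy serializes inference, and the blocking pipeline runs in a worker thread so the event loop stays responsive:)

```python
# Hedged sketch: queue up requests for a single model copy behind FastAPI.
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
model_lock = asyncio.Lock()  # requests queue up here and run one at a time

class Article(BaseModel):
    text: str

@app.post("/summarize")
async def summarize(article: Article):
    async with model_lock:
        # run the blocking pipeline in a worker thread (Python 3.9+)
        result = await asyncio.to_thread(
            summarizer, article.text, max_length=130, min_length=30
        )
    return {"summary": result[0]["summary_text"]}
```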
emmepra
last year
Yeah, that's hard! One approach I was thinking about is to spin up N>1 Docker containers, each running its own instance of the model, and then somehow manage request routing to the available ones
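(A very rough sketch of the routing piece of that idea, assuming each container exposes the /summarize endpoint from the FastAPI sketch above; the worker hostnames are hypothetical and round-robin is the simplest possible policy:)

```python
# Hedged sketch: naive round-robin dispatch across N summarizer containers.
import itertools
import httpx

WORKERS = itertools.cycle([
    "http://summarizer-1:8000/summarize",  # hypothetical container hostnames
    "http://summarizer-2:8000/summarize",
    "http://summarizer-3:8000/summarize",
])

async def dispatch(text: str) -> str:
    url = next(WORKERS)
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(url, json={"text": text})
        resp.raise_for_status()
        return resp.json()["summary"]
```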
emmepra
last year
Our server has tons of RAM and hopefully enough cores to handle this, even if it's clearly very suboptimal
emmepra
last year
What do you think about it?
emmepra
last year
My purpose is to process a large number of requests in order to generate metadata for each article I'm scraping from the web
emmepra
last year
So it definitely needs to be fast, and sequential processing is not really helpful
Logan M
last year
Yeah, I think something like TorchServe should handle this well.
Otherwise you need to set up something like Kubernetes with autoscaling and an API -- you have to do a lot more yourself
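(For orientation, a custom TorchServe handler for this kind of model might look roughly like the sketch below; the BaseHandler hooks are real, but the request format, length limits, and the packaging step with torch-model-archiver are covered in the TorchServe docs, so treat this purely as a sketch:)

```python
# Hedged sketch of a custom TorchServe handler wrapping the BART summarization pipeline.
# TorchServe loads one handler instance per worker, so adding workers duplicates the
# model in memory, which is the scaling idea mentioned above.
from ts.torch_handler.base_handler import BaseHandler
from transformers import pipeline

class SummarizerHandler(BaseHandler):
    def initialize(self, context):
        # load the summarization pipeline once per worker
        self.model = pipeline("summarization", model="facebook/bart-large-cnn")
        self.initialized = True

    def preprocess(self, data):
        # each batch item typically arrives as {"data": ...} or {"body": ...}
        texts = []
        for row in data:
            text = row.get("data") or row.get("body")
            if isinstance(text, (bytes, bytearray)):
                text = text.decode("utf-8")
            texts.append(text)
        return texts

    def inference(self, texts):
        return self.model(texts, max_length=130, min_length=30)

    def postprocess(self, outputs):
        return [o["summary_text"] for o in outputs]
```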