How can I use local models for summarization from HF in llama-index?
Thanks Logan! I'm referring to models like facebook/bart-large-cnn, not LLMs
I'm struggling to set up a FastAPI app that calls a summarization model locally
but it seems like it's not easy to make it work asynchronously
whereas I remember that local embedding models from HF do work async with llama-index
so I was wondering how to do the same with other models
Ah, those types of models aren't really compatible -- llama index is intended for LLMs and embedding models

You'd have to run the model yourself and generate summaries (which tbh isn't too hard to do)
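For example, a rough sketch of calling facebook/bart-large-cnn directly through the transformers summarization pipeline, outside of llama-index (the generation lengths here are just illustrative defaults, not required settings):

```python
from transformers import pipeline

# Load the summarization model mentioned in the thread once, at startup
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # your scraped article text goes here

# max_length / min_length are illustrative; tune them for your articles
summary = summarizer(article, max_length=130, min_length=30, do_sample=False)
print(summary[0]["summary_text"])
```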
I'm struggling to make it handle async requests
but maybe these models aren't meant for this purpose, idk
any suggestions on this?
Async is tough.

One copy of a model can only handle requests sequentially.

So you can set up a queue and process requests as they come. You can also duplicate the model in memory to scale

There should be some packages out there that handle this for you (TorchServe comes to mind)
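A minimal sketch of that queue idea in FastAPI, assuming one copy of the model: a single background worker owns the pipeline and processes queued requests sequentially, while each request awaits its result (the `/summarize` route and `SummarizeRequest` model are made up for illustration, not from any particular package):

```python
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue()

# One copy of the model, owned by one worker -> sequential processing
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


class SummarizeRequest(BaseModel):
    text: str


async def worker():
    # Pull jobs one at a time; the blocking model call runs in a thread
    # so the event loop stays free to accept new requests meanwhile.
    while True:
        text, future = await queue.get()
        try:
            result = await asyncio.to_thread(summarizer, text)
            future.set_result(result[0]["summary_text"])
        except Exception as exc:
            future.set_exception(exc)
        finally:
            queue.task_done()


@app.on_event("startup")
async def start_worker():
    asyncio.create_task(worker())


@app.post("/summarize")
async def summarize(req: SummarizeRequest):
    # Enqueue the job and wait for the worker to fill in the result
    future = asyncio.get_running_loop().create_future()
    await queue.put((req.text, future))
    return {"summary": await future}
```

To get the "duplicate the model in memory" part, you could start N worker tasks, each with its own pipeline instance, all pulling from the same queue.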
Yeah that’s hard! One approach I was thinking about is to spin up N>1 Docker containers, each running its own instance of the model, and then somehow manage request routing to the available ones
Our server has tons of RAM and hopefully enough cores to handle this, even if it's clearly subsubsubsuboptimal
What do you think?
My goal is to process a large number of requests in order to generate metadata for each article I’m scraping from the web
So it definitely needs to be fast, and sequential processing isn’t really helpful
Yea, I think something like TorchServe should handle this well.

Otherwise you need to set up something like Kubernetes with autoscaling and an API -- you'd have to do a lot more yourself