Hi all - I'm trying to make the simplest possible RAG pipeline calling a local model. If I simply use 'local' for the model name, I get back the expected results from the model query, but if I hardcode the model name to point at my local '/ai/Mistral-7B-v0.1' directory, I get:

Plain Text
/site-packages/transformers/tokenization_utils_base.py", line 2707, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'})

The code is:
Plain Text
local_model = '/ai/Mistral-7B-v0.1'
llm = HuggingFaceLLM(model_name=local_model)
embed_model = HuggingFaceEmbedding(model_name=local_model, tokenizer_name=local_model)
chroma_client = chromadb.PersistentClient()
chroma_collection = chroma_client.create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
documents = SimpleDirectoryReader("data").load_data()
VectorStoreIndex.from_documents(documents, storage_context=storage_context, service_context=service_context)
I think you need to do what the error is suggesting.

Load the tokenizer outside of llama-index, configure the pad token, then pass it into the LLM class

Plain Text
from transformers import AutoTokenizer

# load the tokenizer yourself and set a pad token (the error suggests reusing the eos token)
tokenizer = AutoTokenizer.from_pretrained("/ai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token

llm = HuggingFaceLLM(tokenizer=tokenizer, ...)
embed_model = HuggingFaceEmbedding(tokenizer=tokenizer, ...)


One thing I noticed, though: you are using Mistral for both the LLM and the embeddings. I would expect that to perform very badly (if it even works) 😅 Use BGE or something more performant for the embeddings.
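For illustration, a minimal sketch of that split, keeping Mistral as the LLM and pointing the embeddings at a BGE checkpoint (BAAI/bge-small-en-v1.5 is just the usual small BGE model; swap in a local path if you have one):

Plain Text
llm = HuggingFaceLLM(model_name='/ai/Mistral-7B-v0.1', tokenizer=tokenizer)
embed_model = HuggingFaceEmbedding(model_name='BAAI/bge-small-en-v1.5')
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)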
Thanks! If I set embed_model to 'local' or 'local:/ai/bge-small-en-v1.5', I get this when I call query_engine.query():
File "...site-packages/torch/nn/functional.py", line 2233, in embedding return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ IndexError: index out of range in self
πŸ€” what version of llama-index do you have? For me, doing embed_model="local" or embed_model="BAAI/bge-small-en-v1.5" works πŸ€”
Just installed 0.9.21 today with pip install.

Could you post a full script querying a local, non-default model for reference? This is my first time using llama-index; I'm probably doing something dumb.
Now that I look at this again, I think this error is likely coming from the LLM, not the embeddings.
Do you mean the error is coming specifically from the Mistral model? I'll be happy with any example using any non-default model to start with.
Yea, I think it's because at some point an input got too large 🤔
One remedy is to try setting the context_window in the service context a bit lower.
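Something like this rough sketch, assuming the 0.9.x ServiceContext.from_defaults signature (2048 is just an example value to try):

Plain Text
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    context_window=2048,  # lower this if inputs overflow the model's window
)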
Going even as low as context_window=512 results in the same error
Mmm, hard to say without debugging it myself. I would say follow that Colab notebook and see what happens.
There's a similar notebook for Mistral too.
Thank you, this example works for me. If I then switch to my document, I get back junk results, but at least the call no longer crashes, which is progress.
Sorry for jumping into this. I followed the same notebook, but I want a chatbot that can run on my own machine (uni project), and after quantizing I don't see a way to store the quantized model locally :( I read the documentation and there is no obvious command. Would you happen to know what I should do? Thank you in advance! 🥲
I'm assuming you can save/load it the same as any other Hugging Face model?

model.save_pretrained("./path/to/save") ?
That is what I thought, but I just get an error that HuggingFaceLLM doesn't have a save_pretrained function 😦
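Right, save_pretrained lives on the underlying transformers model rather than on the llama-index wrapper. A minimal sketch of working around that, assuming you load the model with transformers yourself and then hand the objects to HuggingFaceLLM via its model/tokenizer arguments (the model id and save path here are placeholders, and whether save_pretrained can serialize your particular quantization depends on your transformers version):

Plain Text
from transformers import AutoModelForCausalLM, AutoTokenizer

# load (and quantize) with transformers directly; add your quantization config here
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# save_pretrained is a transformers method, so call it on these objects
model.save_pretrained("./path/to/save")
tokenizer.save_pretrained("./path/to/save")

# then wrap the already-loaded objects for llama-index
llm = HuggingFaceLLM(model=model, tokenizer=tokenizer)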
Hey Simon, I was wondering how you loaded the local LLM? I downloaded a quantized model, and when I try to load it with

Plain Text
local_model = "/zephyr-7B-beta-GPTQ"
llm = HuggingFaceLLM(model_name=local_model)

I end up getting this error:

Plain Text
PackageNotFoundError: No package metadata was found for auto-gptq
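That PackageNotFoundError usually just means the auto-gptq package isn't installed; transformers needs it (along with optimum) to load GPTQ checkpoints. Assuming a plain pip environment, something like:

Plain Text
pip install auto-gptq optimum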