I implemented this example (https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom.html#example-using-a-custom-llm-model-advanced), except I am using the index as a chat engine:
Plain Text
# chat_engine = index.as_chat_engine()
chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=system_prompt,
    service_context=service_context
)

response = chat_engine.chat("Tell me a joke.")
print(f"Agent: {response}")

but when I put in an input, it returns no output and gives this error:
Plain Text
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Anyone know why this might be happening?
Edit: now it's giving this error: ValueError: shapes (384,) and (1536,) not aligned: 384 (dim 0) != 1536 (dim 0)
That second error means embeddings are being created with two different models.
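One quick way to see which dimensions are in play is to embed a short test string and check the length. A minimal sketch, assuming the pre-0.10 llama_index API used in this thread; the model name is just an example:
Plain Text
from llama_index.embeddings import HuggingFaceEmbedding

# each embedding model has a fixed output dimension; mixing models in one index breaks retrieval
small = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
print(len(small.get_text_embedding("hello")))  # 384 for bge-small

If the number printed here does not match the dimension of the vectors already stored in the index on disk, queries fail with exactly the shapes-not-aligned error above.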

I'm not sure which LLM you are using (sounds like HuggingFace), but HuggingFace is famous for returning zero output when the input gets too big.
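One way to sanity-check the "input too big" theory is to count the tokens in whatever prompt actually reaches the model and compare against the context window. A rough sketch; prompt_text here is a stand-in for the final prompt string, which is an assumption about what you can get hold of:
Plain Text
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("StabilityAI/stablelm-tuned-alpha-3b")
prompt_text = "..."  # whatever string is ultimately sent to the model
print(len(tok.encode(prompt_text)), "tokens vs. context_window=4096")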
Can you share the LLM implementation?
Plain Text
from transformers import AutoTokenizer
from llama_index import ServiceContext, set_global_service_context, set_global_tokenizer
from llama_index.llms import HuggingFaceLLM

set_global_tokenizer(
    AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf").encode
)

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    # note: temperature has no effect while do_sample=False (greedy decoding)
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",  # hbacard/Nous-Hermes-Llama2-13b-GGUF
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)

# use HuggingFace embeddings
from llama_index.embeddings import HuggingFaceEmbedding
# intfloat/e5-mistral-7b-instruct
# BAAI/bge-small-en-v1.5
# BAAI/bge-large-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="intfloat/e5-mistral-7b-instruct")

service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    embed_model=embed_model,
)
set_global_service_context(service_context)
I had embed_model set to "local" before. I just changed it to this one ^ and am trying again.
If you switch embedding models, just make sure you re-create the entire index
That's gotta be it.
I'm loading one from disk that I pre-made.
Yea so that will fix the dim error
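For reference, re-creating the index with the new embedding model might look roughly like this (a sketch in the pre-0.10 llama_index API used above; the data and storage paths are illustrative):
Plain Text
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# rebuild from the raw documents so every stored vector comes from the new embed_model
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
index.storage_context.persist(persist_dir="./storage")  # overwrite the old on-disk index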
I'm not totally sure why the output is empty 🤔 If you are running on CPU, though, it will be very slow.
Thanks! I would not have known to look for that
It should be CUDA, but I don't know how to check whether my pip install included the CUDA build.
Plain Text
import torch
print(torch.cuda.is_available())

That should show whether torch sees your GPU.
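Separately, on the right-padding warning from the first error: one thing worth trying is forwarding padding_side='left' through tokenizer_kwargs when building the HuggingFaceLLM. This is a sketch based on the constructor call shared above, not a confirmed fix:
Plain Text
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    # forwarded to AutoTokenizer.from_pretrained; addresses the right-padding warning
    tokenizer_kwargs={"max_length": 4096, "padding_side": "left"},
)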
Hi @segfault, I am Charlene, a very new newbie to LLMs and all of this. I am struggling to understand which embedding model I should choose for "StabilityAI/stablelm-tuned-alpha-3b". I searched the web many times but was still confused. I thought the LLM has to use the same embeddings as the embed_model, but here you are using something different, "intfloat/e5-mistral-7b-instruct". Would you mind sharing how to know whether two different models' embeddings are compatible? Thank you!
The LLM and the embedding model are completely unrelated. You could have any combination 🙂
Oh @Logan M, thanks for the quick reply! I read "By default LlamaIndex uses text-embedding-ada-002, which is the default embedding used by OpenAI. If you are using different LLMs you will often want to use different embeddings." here, and thus thought there must be some rules:

https://docs.llamaindex.ai/en/stable/understanding/indexing/indexing.html.

So technically, I can use any embedding + LLM combination, even though each maps words into vector space differently? I guess I am failing to understand: if the embed_model maps to a different space than the LLM does, how does the LLM know what is relevant?
Oh, that's a weird sentence to be in the docs. Will make sure that gets removed haha
Embedding models are just used to represent and retrieve text.
Once you have the text, it goes to the LLM.
So you can use any embedding model with any LLM.
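To make that concrete, retrieval and generation are separate steps: the embed_model only decides which chunks of text get pulled back, and the LLM then reads those chunks as plain text. A small sketch against the index from earlier in the thread (the query string is just an example):
Plain Text
# step 1: the embedding model retrieves relevant text chunks
retriever = index.as_retriever(similarity_top_k=2)
nodes = retriever.retrieve("What does the document cover?")

# step 2: the LLM answers using those chunks as context
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query("What does the document cover?")
print(response)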
got it! Thank you!