Embedding

At a glance

The community member is building a question-answering app that uses embeddings from a vector store as context in the prompt. They are using a Hugging Face LLM and are encountering an "AuthenticationError: No API key provided" when trying to use the GPTVectorStoreIndex. The community members suggest that the issue may be due to the community member only setting the LLM and not the embedding model, which is still defaulting to OpenAI. They provide an example of how to use a custom embedding model, such as a SentenceTransformer (SBERT) model, and advise the community member to set the tokenizer for the embedding part as well.

Hi everyone. I would like to build a question-answering app that retrieves embeddings from a vector store and uses them as context in the prompt to answer a question. In this app I am not using OPENAI_API_KEY, as my LLM is from the Hugging Face Hub. Specifically, I created my LLM instance ("llm") with HuggingFacePipeline and passed it to the following:
Plain Text
llm_predictor = LLMPredictor(llm=llm)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context) 

However, GPTVectorStoreIndex throws
Plain Text
AuthenticationError: No API key provided.

Could anyone help me implement a vector store index without OPENAI_API_KEY? (Or is a vector store index even necessary if I am going to use an external vector store like FAISS or Pinecone?) Thank you in advance 🙏
There are two models in LlamaIndex: the LLM (for generating text and answering queries) and the embedding model (for generating embeddings).

Here you've only set the LLM, so the embed model is still defaulting to OpenAI.
We support any embeddings offered by LangChain; you just need to wrap them with our LangchainEmbedding wrapper.
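For reference, a minimal sketch of that wrapper, assuming the 0.6-era LlamaIndex API used elsewhere in this thread (the SBERT model id is only an example):
Plain Text
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import GPTVectorStoreIndex, LangchainEmbedding, LLMPredictor, ServiceContext

# Wrap a LangChain embedding model so LlamaIndex uses it instead of OpenAI
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)

llm_predictor = LLMPredictor(llm=llm)  # llm is the HuggingFacePipeline instance
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model=embed_model,  # without this line, embeddings default to OpenAI
)
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)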
@Logan M
Thanks for your kind responses! However, the URL you provided returns "404 Not Found". Could you give me the link?
Lol ya we just refactored the docs
@Logan M Thanks, I can see it now!!
@Logan M
As the embedding model, I would like to use a SentenceTransformer (SBERT) model, and I wonder if I need to change the tokenizer for the embedding part as well. Would you let me know if the following is correct?

  1. For my node parser, I need to use the tokenizer from my SBERT model.
  2. Since my SBERT model accepts at most 128 tokens as input (i.e., the model's max_seq_length is 128), I need to set chunk_size == 128.
As a result, my node parser looks like this:
Plain Text
# Node parser: split documents into chunks sized for the SBERT embedding model
from transformers import AutoTokenizer
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser

# EMBED_MODEL_ID is the Hugging Face id of the SBERT model
embed_tokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL_ID)
text_splitter = TokenTextSplitter(
    chunk_size=128,  # match the model's max_seq_length
    chunk_overlap=20,
    tokenizer=embed_tokenizer.encode,
)
node_parser = SimpleNodeParser(text_splitter=text_splitter)

Thanks in advance πŸ™
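The node parser would then be passed into the same service context as the LLM and the embed model. A short sketch under the same API assumptions as above, reusing the names defined earlier in the thread:
Plain Text
from llama_index import GPTVectorStoreIndex, ServiceContext

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,  # HuggingFacePipeline LLM from the first message
    embed_model=embed_model,      # SBERT model via the LangchainEmbedding wrapper
    node_parser=node_parser,      # token-aware splitter defined above
)
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)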