I am new to LlamaIndex. The customization tutorial still uses ServiceContext. It would help newbies if that section could be updated too.
ah that it does. Great call out (it technically still works, but yea, should be updated)
@Logan M Do you have an ETA on the tutorial documentation updates?
I have not had a chance lol its on my todo
Anything specific I can give an example of right now?
Maybe you can just add the revised statements for the first customization tutorial (i.e., the smaller-chunks example)? If I can have the before and after examples, I should be able to follow. Thanks
Already updated the docs (just working on fixing a bug with docs generation)
Here's the example for that section
# Global settings
from llama_index.core import Settings

Settings.chunk_size = 512

# Local settings
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# documents loaded earlier, e.g. with SimpleDirectoryReader
index = VectorStoreIndex.from_documents(
    documents, transformations=[SentenceSplitter(chunk_size=512)]
)
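And for comparison, the "before" with ServiceContext looked roughly like this (from memory, now deprecated -- treat it as a sketch rather than the exact current API):
# Old, deprecated pattern (approximate, for comparison only)
from llama_index.core import ServiceContext, VectorStoreIndex

service_context = ServiceContext.from_defaults(chunk_size=512)
# documents loaded earlier, e.g. with SimpleDirectoryReader
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)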
Nothing against OpenAI, but I wish LlamaIndex didn't assume that everyone is using OpenAI by default. Almost all the examples and docs assume we're using OpenAI with API keys, etc., with only scattered mentions of how to do things with local and public LLMs
@Logan M , thank you. I understand the need for transformations=[SentenceSplitter(chunk_size=512)]. But what does Settings.chunk_size = 512 do? How do we know if it is used in the SentenceSplitter() or something else?
It's modifying the default (which is the SentenceSplitter)
This is the same as the previous service context
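Roughly, setting the global chunk_size just reconfigures the default node parser. A small sketch of my understanding (assuming SentenceSplitter is the default parser -- double-check against the docs):
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

# Setting the global chunk size...
Settings.chunk_size = 512

# ...is roughly the same as swapping in a SentenceSplitter with that chunk size
# (assuming the default node parser is SentenceSplitter).
Settings.node_parser = SentenceSplitter(chunk_size=512)

# Indexes built without explicit transformations pick this default up.
print(Settings.node_parser.chunk_size)  # 512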
Right. It is the same as the previous service context. But I thought the reason for the change was to make it less ambiguous?
Maybe a section on how to use other LLMs would be useful....
Fair enough.
The simplest version I can think of
Download and start ollama (be sure to do ollama pull to download the model you want to run after the server starts): https://ollama.com/
I might do ollama pull starling-lm for this example
pip install llama-index llama-index-llms-ollama llama-index-embeddings-huggingface
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbeddings
Settings.llm = Ollama(model="starling-lm", request_timeout=3000.0)
Settings.embed_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")
response = chat_engine.chat("Hello!")
print(response)
Very good. Thanks @Logan M. Even I can understand it well.
@Logan M just a small typo:
should be:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
not
from llama_index.embeddings.huggingface import HuggingFaceEmbeddings
Just curious... not sure why but I keep running into an error that something is trying to reach out and gets a connection refusal when I call chat_engine.chat
...site-packages/httpx/_transports/default.py", line 83, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ConnectError: [Errno 61] Connection refused
any idea why/what is reaching out?
ah... never mind... it's ollama... it's trying to make an http call to the ollama server
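Right -- the LLM calls go to the local Ollama server (by default at http://localhost:11434), so it has to be running before you call chat. If it runs somewhere else, you can point the client at it; a sketch of how I'd do it (base_url as I understand the Ollama integration, double-check the docs):
from llama_index.llms.ollama import Ollama

# The Ollama server must already be running (e.g. ollama serve or the desktop app).
# base_url defaults to the local server; override it if Ollama lives elsewhere.
llm = Ollama(
    model="starling-lm",
    base_url="http://localhost:11434",
    request_timeout=3000.0,
)
print(llm.complete("Hello!"))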
@Logan M How do I match LLM model vs embedding model? In the above example, you have starling-lm as the LLM and BAAI/bge-small-en-v1.5 as the embedding model. Thank you.
there is no relation between an embedding model and an LLM. You can use any combination you want
@Logan M These pre-trained LLM models have been trained with certain embedding models. When I use LlamaIndex to perform RAG applications, don't I need to use the same embedding that a model was pre-trained with?
I think there is a misunderstanding
LLMs are decoders. Given a sequence of input tokens, generate the next X tokens. For example, given the input "Hello!", an LLM trained on chat messages might respond with "How are you?"
Embedding models are trained to take input tokens and project them into vectors that have some semantic meaning. I.e. Dog might have the vector [0, 0, 1], while Cat might have the vector [1, 0, 0] -- these vectors are different, because the words mean different things. So we can take this principle, embed documents, and then also embed queries and compare vectors to find the most relevant data to a query. And people will finetune embedding models specifically for this purpose, to better capture semantic meaning in vectors.
So in both of these processes, there is no connection between the LLM and the embedding model. The embedding model helps retrieve relevant chunks of text, and then we prompt an LLM with those retrieved chunks + a query, and the LLM responds with an answer.
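To make that concrete, here's a minimal sketch of the retrieval idea, reusing the BAAI/bge-small-en-v1.5 model from the example above (the sentences are just made-up sample data):
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Embed a couple of "documents" and a query, then compare vectors.
docs = ["Dogs are loyal pets.", "The stock market fell today."]
doc_vecs = [embed_model.get_text_embedding(d) for d in docs]
query_vec = embed_model.get_query_embedding("What animals make good pets?")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# The dog sentence should score higher than the unrelated one.
for doc, vec in zip(docs, doc_vecs):
    print(round(cosine(query_vec, vec), 3), doc)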