You can use the `context` chat mode in the chat engine configuration. This mode ensures that the chatbot's responses are strictly based on the context provided.

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.memory import ChatMemoryBuffer

# Load data and build index
data = SimpleDirectoryReader(input_dir="./data/paul_graham/").load_data()
index = VectorStoreIndex.from_documents(data)

# Configure chat engine
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)  # Adjust token limit as needed

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=(
        "You are a chatbot, able to have normal interactions, as well as talk"
        " about an essay discussing Paul Graham's life."
    ),
)
```
Here, `chat_mode` is set to `"context"`, which means the chatbot will only use the provided context to generate responses. The `token_limit` in `ChatMemoryBuffer.from_defaults(token_limit=1500)` limits the length of the chat history the chatbot can remember; you can adjust this value as needed to further constrain the output of your chatbot.

You can then use the `chat` method of the `chat_engine` to interact with your chatbot:

```python
response = chat_engine.chat("Your question here")
print(response)
```
To clear the conversation history, call:

```python
chat_engine.reset()
```

If you only want to show the first line of a response, you can split the response text on newlines:

```python
response = chat_engine.chat("Your question here")
first_line = str(response).split("\n")[0]
print(first_line)
```

Here, `str(response).split("\n")[0]` converts the response to text, splits it into lines, and takes the first line.
The `ResponseSynthesizer` module is responsible for generating a response from an LLM (Large Language Model), using a user query and a given set of text chunks. For summarization, the `TreeSummarize` synthesizer can be used to generate a summarized response. Here's an example of how you can use it:

```python
from llama_index.core.response_synthesizers import TreeSummarize
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")
summarizer = TreeSummarize(llm=llm, verbose=True)

response = summarizer.get_response("Your question here", texts)
```
In this example, `TreeSummarize` is used to generate a summarized response. The `get_response` method takes a query and a list of text chunks (`texts`) and returns the summary.
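The example above assumes a `texts` list of chunk strings already exists. As a minimal sketch (the variable names and splitter settings here are illustrative, not required by `TreeSummarize`), such chunks could be prepared from the documents loaded earlier:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Load documents and split them into text chunks for the summarizer
documents = SimpleDirectoryReader(input_dir="./data/paul_graham/").load_data()
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
texts = [node.get_content() for node in nodes]
```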
For local inference, you can use the `LlamaCPP` module from LlamaIndex, which is designed to work with the `llama-cpp-python` library. This library is optimized for multi-core CPUs and can significantly speed up inference.

```python
from llama_index.llms.llama_cpp import LlamaCPP

# Point model_path at a local GGUF model file
# (e.g., a GGUF conversion of NousResearch/Llama-2-7b-chat-hf)
llm = LlamaCPP(model_path="/path/to/llama-2-7b-chat.gguf")
```
Here, `LlamaCPP` is used to create an instance of the LLM. The `model_path` parameter specifies the local model file to use; alternatively, `model_url` can be given to download a model.
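Once created, the instance can be used like any other LlamaIndex LLM. A short usage sketch (the prompt text is just an example):

```python
# Run a single completion with the local model
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)
```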
When installing `LlamaCPP`, it is recommended to install the package so that it is compiled with GPU support where available. Depending on your hardware, use:

- CuBLAS if you have CUDA and an NVidia GPU
- METAL if you are running on an M1/M2 MacBook
- CLBLAST if you are running on an AMD/Intel GPU

If you are running on a plain CPU, the default build works as well. The LlamaIndex packages themselves are installed with:

```
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-llama-cpp
```
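As a rough sketch only (these `CMAKE_ARGS` flags come from the `llama-cpp-python` build documentation and may change between versions), the backend is typically selected when installing `llama-cpp-python` itself, e.g. in a notebook cell:

```
# Pick the ONE line matching your hardware (flags are illustrative and version-dependent):
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --no-cache-dir    # NVidia GPU (CuBLAS)
# !CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --no-cache-dir   # Apple M1/M2 (METAL)
# !CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python --no-cache-dir # AMD/Intel GPU (CLBLAST)
```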
To control how many CPU cores are used, you can set the `OMP_NUM_THREADS` environment variable to the number of cores you want to use before running your Python script. This can be done in the terminal or directly in your Python script using the `os` module:

```python
import os

# Set the number of OpenMP threads to use for parallelization
os.environ["OMP_NUM_THREADS"] = "8"  # Replace "8" with the number of cores you want to use
```

Here, `os.environ["OMP_NUM_THREADS"] = "8"` sets the number of OpenMP threads to 8; replace `"8"` with the number of cores you want to use.
Based on the available information, there is no dedicated option in the `LlamaCPP` constructor to specify the number of cores to use for parallel processing. The `LlamaCPP` class takes the following parameters:

```python
llm = LlamaCPP(
    model_url=model_url,
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    generate_kwargs={},
    model_kwargs={"n_gpu_layers": 1},
    verbose=True,
)
```

The `model_kwargs` parameter allows you to pass arguments to the underlying model, but it's not clear from the extracts whether this includes parameters for multi-core processing.
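As an untested sketch (this assumes `model_kwargs` is forwarded to the underlying `llama_cpp.Llama` constructor, which exposes an `n_threads` option; verify against your installed versions), you could try passing a thread count through `model_kwargs`:

```python
llm = LlamaCPP(
    model_path="/path/to/llama-2-7b-chat.gguf",  # placeholder path to a local GGUF model
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    # Assumption: these kwargs reach llama_cpp.Llama, whose n_threads argument
    # controls how many CPU threads are used for inference.
    model_kwargs={"n_threads": 8},
    verbose=True,
)
```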
The `LlamaCPP` constructor does not expose a specific parameter for the number of cores. However, many machine learning libraries, including those likely used by `LlamaCPP`, are designed to automatically utilize all available CPU cores for computation.

You can also set the `OMP_NUM_THREADS` environment variable, which is often used by libraries such as NumPy, SciPy, and pandas to control multi-threading. Here's how you can set it in your Python script:

```python
import os

# Set the number of OpenMP threads to use for parallelization
os.environ["OMP_NUM_THREADS"] = "8"  # Replace "8" with the number of cores you want to use
```

As above, `os.environ["OMP_NUM_THREADS"] = "8"` sets the number of OpenMP threads to 8; replace `"8"` with the number of cores you want to use.
To use a compact response mode, you can use the `get_response_synthesizer` function from the `llama_index.core` module. Here's an example of how you can do it:

```python
from llama_index.core.data_structs import Node
from llama_index.core.schema import NodeWithScore
from llama_index.core import get_response_synthesizer

# Get a response synthesizer in compact mode
response_synthesizer = get_response_synthesizer(response_mode="compact")

# Use the response synthesizer to generate a response
response = response_synthesizer.synthesize(
    "query text", nodes=[NodeWithScore(node=Node(text="text"), score=1.0), ...]
)
```

Here, `get_response_synthesizer(response_mode="compact")` gets a response synthesizer in compact mode, and the `synthesize` method is then used to generate a response from a query and a list of nodes.

You can also attach the response synthesizer to a query engine:

```python
# Create a query engine with the response synthesizer
query_engine = index.as_query_engine(response_synthesizer=response_synthesizer)

# Use the query engine to generate a response
response = query_engine.query("query_text")
```
Here, `index.as_query_engine(response_synthesizer=response_synthesizer)` creates a query engine with the response synthesizer, and the `query` method is then used to generate a response from a query.
The `ChatEngine` in LlamaIndex does not directly accept a `ResponseSynthesizer` as a parameter. The `ChatEngine` is designed to handle the conversation flow, and it uses its own internal methods to generate responses.

You can, however, influence the behavior of the `ChatEngine` by setting the `chat_mode` parameter when creating it. Here's an example:

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.memory import ChatMemoryBuffer

# Load data and build index
data = SimpleDirectoryReader(input_dir="./data/paul_graham/").load_data()
index = VectorStoreIndex.from_documents(data)

# Configure chat engine
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)  # Adjust token limit as needed

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=(
        "You are a chatbot, able to have normal interactions, as well as talk"
        " about an essay discussing Paul Graham's life."
    ),
)
```
As before, `chat_mode` is set to `"context"`, which means the chatbot will only use the provided context to generate responses, and the `token_limit` in `ChatMemoryBuffer.from_defaults(token_limit=1500)` limits the length of the chat history the chatbot can remember; adjust this value as needed to further constrain the output of your chatbot.

Note that there is no direct way to set a response mode on the `ChatEngine`. The response mode is determined by the underlying LLM and the prompts you use, so you might need to experiment with different prompts and configurations to get the desired output length.
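For example (a sketch only; the exact wording is up to you), one way to experiment is to make brevity part of the system prompt:

```python
chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    # Illustrative prompt: asking for short answers nudges the LLM toward
    # shorter output, though it is not a hard limit.
    system_prompt=(
        "You are a chatbot that answers questions about an essay on Paul Graham's life. "
        "Keep every answer to one or two sentences."
    ),
)

print(chat_engine.chat("What did Paul Graham work on before college?"))
```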