```python
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    EmbeddingQAFinetuneDataset,
)

# Generate query/context pairs from the train and validation nodes
train_dataset = generate_qa_embedding_pairs(train_nodes)
val_dataset = generate_qa_embedding_pairs(val_nodes)

# Persist the datasets for later use
train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")
```
When building the index, both the text and the metadata of each `TextNode` are considered. The text is converted into vector embeddings, and the metadata fields are stored as such in the underlying vector database. Here is the relevant code snippet:

```python
from llama_index import VectorStoreIndex, StorageContext
from llama_index.vector_stores import ChromaVectorStore
from llama_index.schema import TextNode
```
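The snippet above only shows the imports. Below is a minimal sketch of how nodes with metadata might be indexed into Chroma; the in-memory client, the collection name, and the sample `category`/`country` values are illustrative assumptions, not part of the original answer.

```python
import chromadb
# (uses the llama_index imports from the snippet above)

# Illustrative setup: an in-memory Chroma client and an assumed collection name
chroma_client = chromadb.Client()
chroma_collection = chroma_client.get_or_create_collection("quickstart")

# Nodes carry both text and metadata; the metadata is stored alongside the embedding
nodes = [
    TextNode(
        text="The capital of Germany is Berlin.",
        metadata={"category": "geography", "country": "Germany"},
    ),
    TextNode(
        text="Mount Fuji is the highest mountain in Japan.",
        metadata={"category": "geography", "country": "Japan"},
    ),
]

# Wire the Chroma collection into a storage context and build the index
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)
```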
The `TextNode` not only contains the text, but also metadata, e.g. `category` and `country`. These metadata fields will get converted and stored as such in the underlying vector database.

To turn a CSV file with `questions` and `context` columns into a JSON file of `query`/`response` records, you can use pandas:

```python
import pandas as pd
import json

# Load your CSV file
df = pd.read_csv('your_file.csv')

# Convert the DataFrame to a list of dictionaries
data = df.to_dict('records')

# Prepare the data for LlamaIndex: rename the columns to 'query' and 'response'
for record in data:
    record['query'] = record.pop('questions')
    record['response'] = record.pop('context')

# Save the data to a JSON file
with open('output.json', 'w') as f:
    json.dump(data, f)
```
Replace `'your_file.csv'` with the path to your CSV file, and `'output.json'` with the desired output path for the JSON file. Make sure pandas is installed:

```
pip install pandas
```
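As a quick sanity check, you can reload the converted file (using the `output.json` path from the snippet above) and inspect a record:

```python
import json

with open('output.json') as f:
    data = json.load(f)

print(len(data))          # one record per CSV row
print(data[0].keys())     # should include 'query' and 'response'
print(data[0]['query'])   # text from the original 'questions' column
```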
Use the `generate_qa_embedding_pairs` function to prepare your data, then use the `SentenceTransformersFinetuneEngine` to fine-tune your model. You can specify the model from Hugging Face using the `model_id` parameter:

```python
from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",  # replace with your Hugging Face model
    model_output_path="test_model",
    val_dataset=val_dataset,
)

finetune_engine.finetune()

embed_model = finetune_engine.get_finetuned_model()
```
"BAAI/bge-small-en"
with the ID of your Hugging Face model. The train_dataset
and val_dataset
variables should be your training and validation datasets, respectively.model_id
parameter of the SentenceTransformersFinetuneEngine
. However, please ensure that the model is compatible with the Sentence Transformers library, as not all Hugging Face models are.max_length
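One quick way to check compatibility is simply to try loading the model with the `sentence-transformers` package directly; the model ID below is just an example:

```python
from sentence_transformers import SentenceTransformer

# If this loads without errors, the model can generally also be used
# with SentenceTransformersFinetuneEngine.
model = SentenceTransformer("BAAI/bge-small-en")
print(model.get_sentence_embedding_dimension())
```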
As for the `max_length` parameter for the tokenizer, it's not explicitly mentioned in the provided extracts how to pass this parameter during the fine-tuning process in LlamaIndex. However, in general with Hugging Face models, you can specify this when you initialize the tokenizer. Here's an example:

```python
from transformers import AutoTokenizer

# The recognized tokenizer kwarg for this is `model_max_length`
tokenizer = AutoTokenizer.from_pretrained(
    "intfloat/e5-mistral-7b-instruct", model_max_length=512
)
```
Replace `512` with your desired maximum length.
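If you load the model through LlamaIndex's `HuggingFaceEmbedding` wrapper, it also exposes a `max_length` argument (at least in the legacy `llama_index` package; treat this as an assumption and check your installed version):

```python
from llama_index.embeddings import HuggingFaceEmbedding

# max_length here controls truncation of the input text before embedding
embed_model = HuggingFaceEmbedding(
    model_name="intfloat/e5-mistral-7b-instruct",
    max_length=512,
)
```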
The `EmbeddingAdapterFinetuneEngine` in LlamaIndex does not explicitly provide a way to pass the `max_length` parameter to the tokenizer, or to pass an already initialized tokenizer. Here is the basic usage of the `EmbeddingAdapterFinetuneEngine`:

```python
from llama_index.finetuning import EmbeddingAdapterFinetuneEngine
from llama_index.embeddings import resolve_embed_model
import torch

# Load the base embedding model to adapt
base_embed_model = resolve_embed_model("local:BAAI/bge-small-en")

finetune_engine = EmbeddingAdapterFinetuneEngine(
    train_dataset,
    base_embed_model,
    model_output_path="model_output_test",
    epochs=4,
    verbose=True,
)
```
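After constructing the engine, fine-tuning and retrieving the adapted model typically follows the same pattern as the Sentence Transformers example shown earlier:

```python
# Run the adapter fine-tuning and get back a usable embedding model
finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model()
```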
The `resolve_embed_model` function is used to load the base embedding model, but it's not clear from the extracts whether this function accepts a tokenizer or `max_length` parameter.

You can load a Hugging Face model with the `resolve_embed_model` function in LlamaIndex. Here's how you can do it:

```python
from llama_index.embeddings import resolve_embed_model

# The "local:" prefix tells LlamaIndex to load the Hugging Face model locally
base_embed_model = resolve_embed_model("local:intfloat/e5-mistral-7b-instruct")
```
To use your fine-tuned model with the `SubQuestionQueryEngine` of LlamaIndex, you would first need to load your fine-tuned model as an embedding model. Then, you can pass this embedding model to the `ServiceContext`, which is then used to create the `SubQuestionQueryEngine`. Here's a general outline of the process:

1. Use the `HuggingFaceEmbedding` or `resolve_embed_model` function to load your fine-tuned model.
2. Use the `ServiceContext.from_defaults` function to create a service context, passing in your fine-tuned model as the `embed_model` parameter.
3. Use the `VectorStoreIndex.from_documents` function to create a vector store index, and then use the `as_query_engine` method to create the query engine.

```python
from llama_index import set_global_tokenizer, ServiceContext
from llama_index.embeddings import HuggingFaceEmbedding
from transformers import AutoTokenizer

# Set the global tokenizer
set_global_tokenizer(
    AutoTokenizer.from_pretrained("YourFineTunedModel").encode
)

# Load your fine-tuned model
embed_model = HuggingFaceEmbedding(model_name="YourFineTunedModel")

# Create a service context (`llm` is assumed to be defined elsewhere)
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
)

# Create the query engine (`index` is the vector store index built earlier)
query_engine = index.as_query_engine(service_context=service_context)
```
"YourFineTunedModel"
with the ID or path of your fine-tuned model.EmbeddingAdapterFinetuneEngine
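The snippet above stops at a plain query engine. A minimal, hedged sketch of wiring it into a `SubQuestionQueryEngine` could look like the following; the tool name and description are made up for illustration:

```python
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

# Wrap the vector query engine as a tool the sub-question engine can call
query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name="docs",  # hypothetical tool name
            description="Answers questions over the indexed documents.",
        ),
    )
]

# The service context carries the fine-tuned embedding model configured above
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    service_context=service_context,
)

response = sub_question_engine.query("Compare the two topics covered in the documents.")
print(response)
```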
To convert your JSON file into a dataset that the `EmbeddingAdapterFinetuneEngine` can use, you need to load the JSON file and convert it into an `EmbeddingQAFinetuneDataset`. Here's a Python code snippet that can help you do this:

```python
import json
from llama_index.finetuning import EmbeddingQAFinetuneDataset

# Load the JSON file
with open('your_file.json', 'r') as f:
    data = json.load(f)

# Convert the data into an EmbeddingQAFinetuneDataset
dataset = EmbeddingQAFinetuneDataset.from_list(data)
```
Replace `'your_file.json'` with the path to your JSON file. The `from_list` method of `EmbeddingQAFinetuneDataset` takes a list of dictionaries, where each dictionary has a `'query'` key and a `'response'` key.
The `EmbeddingAdapterFinetuneEngine` in LlamaIndex should automatically use a GPU if one is available, since the training runs on PyTorch. However, the provided extracts do not explicitly show a device option on the `EmbeddingAdapterFinetuneEngine`. In general, you can ensure PyTorch uses a GPU by setting the device to `'cuda'` when one is available:

```python
import torch

# Select the GPU when one is present, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
```

How to pass this device on to the `EmbeddingAdapterFinetuneEngine` is not specified in the provided extracts.
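If your installed version of `EmbeddingAdapterFinetuneEngine` accepts a `device` keyword argument (this is an assumption; check the constructor signature of the version you have installed), the selection above could be passed through like this:

```python
finetune_engine = EmbeddingAdapterFinetuneEngine(
    train_dataset,
    base_embed_model,
    model_output_path="model_output_test",
    epochs=4,
    verbose=True,
    device=device,  # assumed keyword; verify it exists in your version
)
```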
The `EmbeddingQAFinetuneDataset` class in LlamaIndex does not have a `from_list` method. Instead, it has a `from_json` method that you can use to load your dataset from a JSON file. Here's how you can do it:

```python
from llama_index.finetuning import EmbeddingQAFinetuneDataset

# Load the dataset from a JSON file
dataset = EmbeddingQAFinetuneDataset.from_json('your_file.json')
```
Replace `'your_file.json'` with the path to your JSON file.

The `EmbeddingQAFinetuneDataset` class in LlamaIndex does not have a `from_json` method according to the provided extracts. `EmbeddingQAFinetuneDataset` is initialized with a list of dictionaries, where each dictionary represents a data sample with `'query'` and `'response'` keys. However, the exact format of these dictionaries is not specified in the provided extracts. You can at least load the JSON file into a list of dictionaries:

```python
import json

# Load the JSON file
with open('your_file.json', 'r') as f:
    data = json.load(f)
```
Replace `'your_file.json'` with the path to your JSON file. This will give you a list of dictionaries stored in the `data` variable. To create an `EmbeddingQAFinetuneDataset`, you would typically pass this list to the class constructor. However, the provided extracts do not specify how to do this.
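In recent versions of LlamaIndex, `EmbeddingQAFinetuneDataset` is typically constructed from `queries`, `corpus`, and `relevant_docs` dictionaries rather than a flat list. The following conversion of the `query`/`response` records is a hedged sketch; verify the constructor signature against your installed version:

```python
import uuid

from llama_index.finetuning import EmbeddingQAFinetuneDataset

# Build the three dictionaries the dataset expects from the list of records
queries, corpus, relevant_docs = {}, {}, {}
for record in data:
    query_id = str(uuid.uuid4())
    doc_id = str(uuid.uuid4())
    queries[query_id] = record["query"]
    corpus[doc_id] = record["response"]
    relevant_docs[query_id] = [doc_id]

dataset = EmbeddingQAFinetuneDataset(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
)
```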