The `generate_qa_embedding_pairs` function from LlamaIndex's finetuning module generates a structured dataset of question/context pairs. It takes any set of text nodes (chunks) as input:

```python
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    EmbeddingQAFinetuneDataset,
)

train_dataset = generate_qa_embedding_pairs(train_nodes)
val_dataset = generate_qa_embedding_pairs(val_nodes)

train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")
```
Here, `train_nodes` and `val_nodes` are the sets of text nodes that you want to generate questions from. `generate_qa_embedding_pairs` generates questions using each text chunk in the corpus as context; each pair of (generated question, text chunk used as context) becomes a datapoint in the fine-tuning dataset. The saved datasets can be reloaded later:

```python
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")
```
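The saved JSON mirrors the `EmbeddingQAFinetuneDataset` fields: a `queries` dict mapping question ids to question text, a `corpus` dict mapping node ids to chunk text, and a `relevant_docs` dict linking each question id to the node ids it was generated from. A minimal pure-Python sketch of that shape (the ids and text below are made up for illustration, not produced by LlamaIndex):

```python
import json

# Illustrative contents in the EmbeddingQAFinetuneDataset layout
dataset = {
    "queries": {"q1": "What does the finetuning module generate?"},
    "corpus": {"node1": "The finetuning module generates question-context pairs."},
    "relevant_docs": {"q1": ["node1"]},
}

# save_json / from_json round-trip this structure through disk
restored = json.loads(json.dumps(dataset))
print(restored["relevant_docs"]["q1"])  # ['node1']
```

Each query id pointing at the chunk it came from is what lets the fine-tuning loop treat that chunk as the positive example for the question.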
Alternatively, you can generate questions with the `DatasetGenerator` class. Here is an example of how you can use it:

```python
from llama_index import SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI
from llama_index.evaluation import DatasetGenerator

# build service context
llm = OpenAI(model="gpt-4", temperature=0.0)
service_context = ServiceContext.from_defaults(llm=llm)

# build documents
documents = SimpleDirectoryReader("./data").load_data()

# define generator (using the service context), generate questions
data_generator = DatasetGenerator.from_documents(
    documents, service_context=service_context
)
eval_questions = data_generator.generate_questions_from_nodes()
```
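Conceptually, `generate_questions_from_nodes` loops over the text chunks and asks the LLM to write questions answerable from each chunk. A rough sketch of that loop, with `fake_llm` as a stand-in for the real LLM call (the actual prompt and method internals differ):

```python
def fake_llm(prompt: str) -> str:
    # Stand-in for an LLM call; returns one canned question per prompt
    return "What topic does this chunk cover?"

def generate_questions(chunks, questions_per_chunk=2):
    """Mimic the generate-questions loop: prompt the LLM per chunk."""
    questions = []
    for chunk in chunks:
        for _ in range(questions_per_chunk):
            prompt = f"Write a question answerable from this context:\n{chunk}"
            questions.append(fake_llm(prompt))
    return questions

chunks = [
    "LlamaIndex supports embedding fine-tuning.",
    "Datasets are saved as JSON.",
]
questions = generate_questions(chunks)
print(len(questions))  # 4
```

Because each question is grounded in a specific chunk, the generated questions end up stylistically close to what a user would ask about your corpus.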
Here, `DatasetGenerator` is initialized with a set of documents, and the `generate_questions_from_nodes` method generates questions from the nodes (chunks of text) in those documents. The generated questions will be similar to the ones in your question corpus. Note that the `eval_questions` produced this way are typically used for evaluation, not for fine-tuning the embeddings.

For fine-tuning, the `generate_qa_embedding_pairs` function is used to generate both the training and validation datasets. Here's an example:

```python
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    EmbeddingQAFinetuneDataset,
)

# Generate training and validation datasets
train_dataset = generate_qa_embedding_pairs(train_nodes)
val_dataset = generate_qa_embedding_pairs(val_nodes)

# Save datasets
train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")
```
Here, `train_nodes` and `val_nodes` are your training and validation sets of text nodes, respectively, and

```python
eval_questions = data_generator.generate_questions_from_nodes()
```

generates a set of questions based on the data in the nodes; these questions are intended to be similar to the ones in your question corpus.

Note that `generate_qa_embedding_pairs` generates (question, context) pairs, not (question, answer) pairs, and it expects text nodes as input rather than a list of question/answer tuples. If you already have your own questions and answers, you can instead construct an `EmbeddingQAFinetuneDataset` directly from its `queries`, `corpus`, and `relevant_docs` dictionaries:

```python
from llama_index.finetuning import EmbeddingQAFinetuneDataset

# Assume eval_questions are your generated questions and eval_answers are
# the corresponding answer/context texts (parallel lists)
queries = {f"q{i}": q for i, q in enumerate(eval_questions)}
corpus = {f"d{i}": a for i, a in enumerate(eval_answers)}
relevant_docs = {f"q{i}": [f"d{i}"] for i in range(len(eval_questions))}

# Build the fine-tuning dataset
finetune_dataset = EmbeddingQAFinetuneDataset(
    queries=queries, corpus=corpus, relevant_docs=relevant_docs
)

# Save the dataset
finetune_dataset.save_json("finetune_dataset.json")
```

Here, `eval_answers` would be the set of answer (context) texts corresponding to your `eval_questions`. Each question is mapped to the text it should retrieve, producing the fine-tuning dataset.
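When the dictionaries are built by hand like this, it is worth checking that every question points at a document id that actually exists in the corpus before fine-tuning. A small validation helper (hypothetical, not part of LlamaIndex):

```python
def validate_dataset(queries, corpus, relevant_docs):
    """Check every query has relevant docs and each doc id is in the corpus."""
    for qid in queries:
        doc_ids = relevant_docs.get(qid, [])
        assert doc_ids, f"query {qid} has no relevant docs"
        for did in doc_ids:
            assert did in corpus, f"doc {did} missing from corpus"
    return True

queries = {"q0": "What is fine-tuned?"}
corpus = {"d0": "We fine-tune the embedding model on QA pairs."}
relevant_docs = {"q0": ["d0"]}
print(validate_dataset(queries, corpus, relevant_docs))  # True
```

A dangling doc id would otherwise surface only as a confusing error (or silently missing positive example) during training.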