data_generator.generate_questions_from_nodes()
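# `data_generator` is not defined in this snippet. For reference, a minimal
# sketch of how it is typically constructed with llama_index's DatasetGenerator
# (the ./data path and `documents` name are placeholders, not from the original;
# by default question generation uses the globally configured LLM, and a custom
# one can be supplied via the service_context argument of from_documents):

from llama_index import SimpleDirectoryReader
from llama_index.evaluation import DatasetGenerator

documents = SimpleDirectoryReader("./data").load_data()
data_generator = DatasetGenerator.from_documents(documents)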
# V2
import os
import json
import torch
from transformers import AutoTokenizer, StoppingCriteria, StoppingCriteriaList, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantized_model_dir = os.path.join(llm_model_path, "TheBloke_WizardLM-30B-GPTQ")
model_basename = "wizardlm-30b-GPTQ-4bit.act.order"
use_triton = False

tokenizer_config_path = os.path.join(quantized_model_dir, "tokenizer_config.json")

# Load the tokenizer config as a dict
with open(tokenizer_config_path, "r") as f:
    tokenizer_config = json.load(f)

# Now initialize the tokenizer with the config
tokenizer = AutoTokenizer.from_pretrained(
    quantized_model_dir,
    use_fast=True,
    return_token_type_ids=False,
    **tokenizer_config,
)

# Verify the start and stop tokens
print(f"Start token: {tokenizer.bos_token}, ID: {tokenizer.bos_token_id}")
print(f"End token: {tokenizer.eos_token}, ID: {tokenizer.eos_token_id}")

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=False,
    device="cuda:0",
    use_triton=use_triton,
    quantize_config=None,
)

# Note: check the prompt template is correct for this model.
prompt = "Tell me about AI"

print("\n\n*** Generate:")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Set the bos_token_id and eos_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)
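# StoppingCriteria / StoppingCriteriaList are imported above but not used in
# this cell. A minimal sketch of how they could be wired in, e.g. to stop
# generation once the tokenizer's EOS token is produced; the class name and
# the choice of stop token here are illustrative assumptions, not part of the
# original code.

class StopOnTokens(StoppingCriteria):
    def __init__(self, stop_ids):
        self.stop_ids = stop_ids

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        # Stop as soon as the most recently generated token is a stop id.
        return input_ids[0, -1].item() in self.stop_ids


stopping_criteria = StoppingCriteriaList([StopOnTokens([tokenizer.eos_token_id])])

# Example usage: pass the criteria to generate() alongside the settings above.
# output = model.generate(
#     inputs=input_ids,
#     temperature=0.7,
#     max_new_tokens=50,
#     stopping_criteria=stopping_criteria,
# )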
# Set up prompts
from llama_index.prompts.prompts import SimpleInputPrompt

# System prompt: the WizardLM (Vicuna-style) preamble only. The USER/ASSISTANT
# turn markers and the query itself are added by the query wrapper prompt
# below, so they are not repeated here.
system_prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
)

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = SimpleInputPrompt("USER: {query_str}\nASSISTANT: ")
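# The service context below expects an `hf_predictor` and an `embed_model`,
# neither of which is defined in this snippet. A minimal sketch of how they
# could be built around the quantized model loaded above, assuming a
# 0.6/0.7-era llama_index whose HuggingFaceLLMPredictor accepts pre-loaded
# `model`/`tokenizer` objects (otherwise pass `model_name`/`tokenizer_name`),
# plus a local sentence-transformers embedding wrapped via LangChain. The
# specific values and the embedding model name are assumptions.

from llama_index import LangchainEmbedding
from llama_index.llm_predictor import HuggingFaceLLMPredictor
from langchain.embeddings import HuggingFaceEmbeddings

hf_predictor = HuggingFaceLLMPredictor(
    max_input_size=2048,
    max_new_tokens=256,
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    model=model,          # AutoGPTQ model loaded above
    tokenizer=tokenizer,  # tokenizer loaded above
)

# Any sentence-transformers model works here; all-mpnet-base-v2 is just an example.
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)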
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    llm_predictor=hf_predictor,
    embed_model=embed_model,
)
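# With the service context in place, a typical next step is to build a vector
# index over the documents and query it with the local model. A sketch,
# assuming the `documents` list from the DatasetGenerator sketch near the top
# and a 0.6/0.7-era llama_index (older releases name the class
# GPTVectorStoreIndex).

from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
response = query_engine.query("Tell me about AI")
print(response)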