Has anyone used the
data_generator.generate_questions_from_nodes()
pattern to generate synthetic question/answer pairs for finetuning datasets? I was working with some folks in another discord (local-LLM focused) on generating synthetic instruction data for qlora finetuning, and realized that all the data I need is already indexed in a vector store. The pattern works great with gpt4, but it's hit or miss with local models - the issues mostly seem to be between AutoGPTQ (a 4-bit GPU quantization library) and the transformers library and/or the HuggingFaceLLMPredictor in llama_index (borrowed from langchain?).
I'm working on a solution across a few different threads; just curious if anyone has gone down this path yet.
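For anyone who hasn't seen the pattern: the core idea is just to pull the text out of each indexed node, wrap it in a "write N questions about this" prompt, and feed that to whatever LLM you have. Here's a minimal stdlib-only sketch of that loop - the `llm` callable and prompt wording are stand-ins for illustration, not the actual llama_index internals:

```python
# Minimal sketch of the generate_questions_from_nodes() idea, stdlib only.
# The real llama_index DatasetGenerator does this with its own prompt
# templates and LLM abstraction; everything here is a hypothetical stand-in.
from typing import Callable, List

def generate_questions_from_nodes(
    node_texts: List[str],
    llm: Callable[[str], str],          # any text-in/text-out model call
    num_questions_per_node: int = 2,
) -> List[str]:
    questions: List[str] = []
    for text in node_texts:
        prompt = (
            f"Context:\n{text}\n\n"
            f"Write {num_questions_per_node} questions that the context "
            "above can answer, one per line."
        )
        # Split the completion into one question per line, keep the first N.
        lines = [q.strip() for q in llm(prompt).splitlines() if q.strip()]
        questions.extend(lines[:num_questions_per_node])
    return questions
```

Swapping `llm` between a gpt4 call and a local model behind HuggingFaceLLMPredictor is exactly where the hit-or-miss behavior shows up.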
edit: by pairing this with
https://github.com/OpenAccess-AI-Collective/axolotl for prompt strategies (converting to jsonl formats for a given instruction set), it's a pretty great solution. Just costly to use gpt4 to generate them. 🙂