I am trying to generate a dataset for fine-tuning on HF. Is it possible to also extract 'context' when running this code? I presume that would be the top chunk from vector search if I run a question, or 3-5 sentences surrounding the answer.
Or should I do it as a separate step (go through the list of generated questions to find the closest vector)?
Thank you!
PS: I tried to amend the prompt to ensure the output contains 'context', 'question', and 'answer', but I am getting a nonsensical response and format.
Plain Text
from llama_index.evaluation import DatasetGenerator

question_gen_query = (
    "You are a Teacher/Professor. Your task is to set up "
    "a quiz/examination. Using the provided context, formulate "
    "a single question that captures an important fact from the "
    "context. Restrict the question to the context information provided."
)

# `documents` and `gpt_35_context` (a ServiceContext) are assumed defined earlier
dataset_generator = DatasetGenerator.from_documents(
    documents[:50],
    question_gen_query=question_gen_query,
    service_context=gpt_35_context,
)
Probably using the fine-tuning callback handler will give more desired results? https://docs.llamaindex.ai/en/stable/examples/finetuning/openai_fine_tuning.html#gpt-4-to-collect-training-data

It collects all LLM inputs/outputs. Run your questions through a query engine and collect training data

finetuning_handler.save_finetuning_events("./output_path")

It's called OpenAIFineTuningHandler, but only because it saves in OpenAI's JSONL format; you could convert that to any format
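Converting the saved events is straightforward. Here's a minimal sketch, assuming the file follows OpenAI's chat fine-tuning JSONL format (one `{"messages": [...]}` object per line); `openai_jsonl_to_pairs`, the output shape, and the sample record are all hypothetical, just for illustration:

```python
import json

# Hypothetical sample event in OpenAI's chat fine-tuning JSONL format,
# written here so the sketch is self-contained.
sample = {"messages": [
    {"role": "system", "content": "Answer using only the context."},
    {"role": "user", "content": "Context: ...\n\nQuestion: What is a primary legal source?"},
    {"role": "assistant", "content": "A statute, regulation, or case."},
]}
with open("finetuning_events.jsonl", "w") as f:
    f.write(json.dumps(sample) + "\n")

def openai_jsonl_to_pairs(path):
    """Convert OpenAI-format chat JSONL into plain prompt/completion pairs."""
    pairs = []
    with open(path) as f:
        for line in f:
            msgs = json.loads(line)["messages"]
            prompt = "\n".join(m["content"] for m in msgs if m["role"] != "assistant")
            completion = next(m["content"] for m in msgs if m["role"] == "assistant")
            pairs.append({"prompt": prompt, "completion": completion})
    return pairs

pairs = openai_jsonl_to_pairs("finetuning_events.jsonl")
```

The same pairs could then be reshaped into whatever format a non-OpenAI trainer expects.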
Thanks. So, request 'context' in the prompt?
not quite

Use the dataset generator to generate questions

Ask those questions to a query engine, with the finetuning handler attached

The fine-tuning handler records the LLM inputs and outputs, which include the retrieved context
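In other words, each query produces one training example: the LLM's full input (prompt plus retrieved context) and its output. A rough pure-Python sketch of what gets recorded, where `record_event` and `save_events` are hypothetical stand-ins for what the handler does internally:

```python
import json

def record_event(events, question, context, answer):
    # One training example per query: the LLM input (context + question)
    # and the LLM output (the answer), in OpenAI-style chat format.
    events.append({"messages": [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        {"role": "assistant", "content": answer},
    ]})

def save_events(events, path):
    with open(path, "w") as f:
        for e in events:
            f.write(json.dumps(e) + "\n")

events = []
record_event(
    events,
    "What is a primary legal source?",
    "A primary legal source is a statute, regulation, or case.",  # retrieved chunk
    "A statute, regulation, or case.",                            # LLM answer
)
save_events(events, "events.jsonl")
```

So the 'context' you asked about ends up inside the user message of each saved example, with no extra prompt engineering needed.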
Got it. Thanks much
On another thought: it's not very economical to extract the closest vector/context with GPT-4, no?
it's not, and llamaindex doesn't use gpt-4 for that 🙂

There are two models: the LLM and the embedding model
the default embedding model is text-embedding-ada-002 from openai
you can also use local embedding models
The LLM can change at any time, but the embedding model has to stay constant across indexing and querying

If you change embed models, you need to re-index all your data
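A toy illustration of why: vectors produced by different embedding models live in different spaces, often with different dimensions, so comparing a new query vector against old stored vectors is meaningless. The numbers and dimensions below are made up:

```python
def dot(a, b):
    # Similarity math only makes sense when both vectors come from the same
    # embedding model; different models often differ even in dimension, and
    # even same-dimension vectors from different models are not comparable.
    if len(a) != len(b):
        raise ValueError("dimension mismatch: re-index with the new embed model")
    return sum(x * y for x, y in zip(a, b))

stored = [0.1, 0.9, 0.3]   # chunk indexed with embed model A (3-dim, made up)
query = [0.5, 0.5]         # question embedded with model B (2-dim, made up)

try:
    dot(stored, query)
except ValueError as err:
    failure = str(err)
```

Re-indexing simply re-embeds every chunk with the new model so that all vectors live in the same space again.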
Ok, so it's just a simple embedding of the question and retrieval to extract context, right?
I got worried when I saw gpt-4 as a model in that function
well, it's using gpt-4 to answer the query using the retrieved context
you can change it to gpt-3.5 if you want
gpt-4 will generate higher-quality training data though
depends on how many questions you want to run 🙂
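The retrieval step itself really is just embed-the-question plus nearest-neighbor search; the LLM only comes in to write the answer from the retrieved chunk. A toy sketch of that retrieval step, where `fake_embed` is a made-up word-count stand-in for a real embedding model like text-embedding-ada-002:

```python
import math

# Toy sketch of RAG retrieval: embed the question, then rank stored chunks
# by cosine similarity. Vocabulary and chunks are invented for illustration.
VOCAB = ["primary", "legal", "source", "brief", "case", "facts", "issue"]

def fake_embed(text):
    words = text.lower().replace(",", " ").replace(".", " ").replace("?", " ").split()
    return [words.count(w) for w in VOCAB]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

chunks = [
    "A primary legal source is a statute, regulation, or case.",
    "To brief a case, state the facts, issue, holding, and reasoning.",
]
index = [(chunk, fake_embed(chunk)) for chunk in chunks]

def retrieve(question, top_k=1):
    q = fake_embed(question)
    ranked = sorted(index, key=lambda cv: cosine(q, cv[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

top_chunk = retrieve("What is a primary legal source?")[0]
```

Swapping gpt-4 for gpt-3.5 changes only the answering step that consumes `top_chunk`, not this retrieval.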
i don't know, i've gone through the code, and i am now using the fine-tuned model in the OpenAI playground, and it just gives me general answers, or does not give me the answers that were in the training set
it is clearly not fine-tuned
i'll keep trying
the thing here is that the goal is to fine-tune for RAG, not to fine-tune for general knowledge.

Fine-tuning for general knowledge generally does not work too well, and is usually not a good idea either.

I would only fine-tune to inherit some kind of personality, or to train it to understand domain specific terms. But I would continue using RAG
your point being, fine-tune the model to use it later on RAG-based Q&A?
have you done the research, say retrieve top 3 vectors + summary by the existing model vs. retrieve top 3 vectors + summary by a fine-tuned model? is it really worth it?
I would say fine-tuning is not really worth it, unless you have a super specific use-case or domain
especially with openai, because the LLM costs are quite a bit higher for fine-tuned models
here is where i am coming from: say i have a legal textbook, 'how to conduct legal research, analysis and write legal briefs'. it contains both content (what: 'a primary legal source is ...') and instructions (how: 'to brief the case, here are the steps ...')
and you probably saw, i want to figure out how to teach an LLM to reason legally
but before that, i wanted to get the right setup in place.
fine-tuned model 1: how to do x, y, z
fine-tuned model 2: what is x, y, z
the latter could be RAG + a fine-tuned model
but i definitely need model 1
i tried the routing and agents; once they find the first hit, that's it
i need to 'embed' the 'how' element into a model
how to find a case
how to cite a case
how to apply a rule by analogy, etc.
sorry for my verbose background
got this result
Plain Text
{'ragas_score': 0.9058, 'answer_relevancy': 0.9560, 'faithfulness': 0.8606}
seems pretty good to me tbh for ragas
the initial one: {'ragas_score': 0.8664, 'answer_relevancy': 0.9721, 'faithfulness': 0.7814}
That seems like a good improvement then!
doesn't ragas need to be fine-tuned on legal knowledge too? 🙂
i wonder how it assesses the relevancy
plus isn't ragas for rag retrievals?
i'll keep digging, and maybe try to fine-tune mistral or llama 2 for comparison.
thank you for all the help Logan, and sorry i kept you busy
based on the description, this looks like the one i need to "bake in" knowledge: https://gpt-index.readthedocs.io/en/latest/examples/finetuning/knowledge/finetune_knowledge.html
omission?
[Attachment: Screenshot_2023-10-14_at_8.46.45_AM.png]
Must be a typo, yeah