Your options are either creating expected input/output pairs to evaluate against (using ROUGE score or similar), or using a larger LLM to generate questions and evaluate responses for you
llama-index has the latter in the repo!
There is also the ragas repo for evaluating responses:
https://github.com/explodinggradients/ragas
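
For the first option (expected input/output pairs), here is a rough sketch of what the scoring could look like. It is a simplified ROUGE-L F1 in plain Python (whitespace tokens, lowercased, no stemming), not the official ROUGE implementation, so scores may differ from a proper ROUGE package. `evaluate` and `answer_fn` are made-up names for illustration; `answer_fn` stands in for whatever function calls your own pipeline.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(expected, actual):
    """Simplified ROUGE-L F1 between an expected answer and an actual answer."""
    ref = expected.lower().split()
    hyp = actual.lower().split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)  # share of the answer that matches the reference
    recall = lcs / len(ref)     # share of the reference covered by the answer
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs, answer_fn):
    """Average the score over (question, expected_answer) pairs.

    answer_fn is a hypothetical callable: question string in, answer string out.
    """
    scores = [rouge_l_f1(expected, answer_fn(question)) for question, expected in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data; replace the lambda with a call into your own query engine.
    pairs = [
        ("What is the capital of France?", "The capital of France is Paris"),
        ("Who wrote Hamlet?", "Hamlet was written by William Shakespeare"),
    ]
    canned = {
        "What is the capital of France?": "Paris is the capital of France",
        "Who wrote Hamlet?": "William Shakespeare wrote Hamlet",
    }
    print(evaluate(pairs, lambda q: canned[q]))
```

Word-overlap scores like this penalize answers that are correct but phrased differently, which is one reason the LLM-as-evaluator route can be worth the extra cost.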