A community member tried using the DatasetGenerator class and generate_questions() method from a notebook, but ran into issues when trying to generate questions from a collection of ~1000 blog posts. The process ran for an hour without returning any results, and the community member eventually stopped it out of concern about using too many tokens. Other community members suggested trying the method on a smaller subset of documents first, since the number of API calls grows with the dataset and can greatly increase the runtime. They also noted that the method allows specifying the number of questions to generate per chunk, which could help with larger datasets. One community member found a workaround by breaking the 1000 blog posts into 50 directories of 20 posts each, setting up a new reader and question generator for each directory, and appending the results to a list of questions, which allowed them to generate over 500 questions in 10-15 minutes.
Tried to use it today on a collection of ~1000 blog posts. It ran for an hour without returning anything. It never errored, but I eventually stopped it out of worry that I was using a crazy amount of tokens. I don't see any docs on it.
That's a good idea. Will try tomorrow on a smaller set and report back. Was wondering if there was a parameter to specify the number of questions to be generated.
Ah, good move, I should have just done that. Just tried on a smaller dataset and it works. So I should be able to handle a larger dataset with 1 question per chunk. Thanks for pointing this out.
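For anyone following along, here's a minimal sketch of what the small-scale test looks like, assuming llama_index's SimpleDirectoryReader and DatasetGenerator APIs. The directory path is hypothetical, and depending on your installed version the generation method may be generate_questions_from_nodes() rather than generate_questions(), so adjust accordingly:

```python
# Sketch only: small subset first, 1 question per chunk to keep token usage down.
# Imports may differ by llama_index version (e.g. llama_index.core in newer releases).
from llama_index import SimpleDirectoryReader
from llama_index.evaluation import DatasetGenerator

# Load only a handful of posts to sanity-check runtime and token usage.
documents = SimpleDirectoryReader("./blogposts_sample/").load_data()  # hypothetical path

# num_questions_per_chunk=1 limits the number of LLM calls per chunk.
generator = DatasetGenerator.from_documents(documents, num_questions_per_chunk=1)
questions = generator.generate_questions_from_nodes()
print(f"Generated {len(questions)} questions")
```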
I had a directory /blogposts/ with 1000 text files. I was unable to generate questions from this when putting all posts into one data loader, even with questions per chunk set to 1.
However, I found a workaround.
I broke my 1000 blog posts into 50 directories of 20 posts each.
I then iterated through the directories, set up a new reader and question generator for each, and appended the results to a list of questions. This worked fine and I was able to generate >500 questions in about 10-15 minutes.
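A rough sketch of that loop, assuming the same llama_index SimpleDirectoryReader / DatasetGenerator APIs and a hypothetical ./blogposts_split/ layout with one subdirectory per batch of ~20 posts (method names may differ across versions):

```python
# Sketch of the batching workaround: one reader and one generator per small batch.
import os
from llama_index import SimpleDirectoryReader
from llama_index.evaluation import DatasetGenerator

base_dir = "./blogposts_split"  # hypothetical: contains 50 subdirectories of ~20 posts each
all_questions = []

for batch in sorted(os.listdir(base_dir)):
    batch_path = os.path.join(base_dir, batch)
    if not os.path.isdir(batch_path):
        continue

    # Fresh reader and generator for each batch keeps every run small.
    documents = SimpleDirectoryReader(batch_path).load_data()
    generator = DatasetGenerator.from_documents(documents, num_questions_per_chunk=1)
    questions = generator.generate_questions_from_nodes()
    all_questions.extend(questions)

print(f"Generated {len(all_questions)} questions total")
```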