A community member tried using the DatasetGenerator class and generate_questions() method from a notebook, but ran into issues when trying to generate questions from a collection of ~1000 blog posts. The process ran for an hour without returning any results, and the community member eventually stopped it out of concern about using too many tokens. Other community members suggested trying the method on a smaller subset of documents first, since the number of API calls grows with the dataset and can greatly increase the runtime. They also noted that the method allows specifying the number of questions to generate per chunk, which could help with larger datasets. One community member found a workaround by breaking the 1000 blog posts into 50 directories of 20 posts each, setting up a new reader and question generator for each directory, and appending the results to a list of questions, which allowed them to generate over 500 questions in 10-15 minutes.
Tried to use it today on a collection of ~1000 blog posts. It ran for an hour without returning anything. It never errored, but I eventually stopped it out of worry that I was using a crazy amount of tokens. I don't see any docs on it.
That's a good idea. Will try tomorrow on a smaller set and report back. Was wondering if there was a parameter to specify the number of questions to be generated.
Ah, good move, I should have just done that. Just tried on a smaller dataset and it works. So I should be able to handle a larger dataset with 1 question per chunk. Thanks for pointing this out.
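For anyone following along, here's a minimal sketch of what the small-scale test looks like, assuming llama_index's SimpleDirectoryReader and DatasetGenerator APIs. The directory path is hypothetical, and depending on your installed version the generation method may be generate_questions_from_nodes() rather than generate_questions(), so adjust accordingly:

```python
# Sketch only: small subset first, 1 question per chunk to keep token usage down.
# Imports may differ by llama_index version (e.g. llama_index.core in newer releases).
from llama_index import SimpleDirectoryReader
from llama_index.evaluation import DatasetGenerator

# Load only a handful of posts to sanity-check runtime and token usage.
documents = SimpleDirectoryReader("./blogposts_sample/").load_data()  # hypothetical path

# num_questions_per_chunk=1 limits the number of LLM calls per chunk.
generator = DatasetGenerator.from_documents(documents, num_questions_per_chunk=1)
questions = generator.generate_questions_from_nodes()
print(f"Generated {len(questions)} questions")
```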
I had a directory /blogposts/ with 1000 text files. I was unable to generate questions from this when putting all posts into one data loader, even with questions per chunk set to 1.
However, I found a workaround.
I broke my 1000 blog posts into 50 directories of 20 posts each.
I then iterated through the directories, set up a new reader and question generator for each, and appended the results to a list of questions. This worked fine and I was able to generate >500 questions in about 10-15 minutes.
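A rough sketch of that loop, assuming the same llama_index SimpleDirectoryReader / DatasetGenerator APIs and a hypothetical ./blogposts_split/ layout with one subdirectory per batch of ~20 posts (method names may differ across versions):

```python
# Sketch of the batching workaround: one reader and one generator per small batch.
import os
from llama_index import SimpleDirectoryReader
from llama_index.evaluation import DatasetGenerator

base_dir = "./blogposts_split"  # hypothetical: contains 50 subdirectories of ~20 posts each
all_questions = []

for batch in sorted(os.listdir(base_dir)):
    batch_path = os.path.join(base_dir, batch)
    if not os.path.isdir(batch_path):
        continue

    # Fresh reader and generator for each batch keeps every run small.
    documents = SimpleDirectoryReader(batch_path).load_data()
    generator = DatasetGenerator.from_documents(documents, num_questions_per_chunk=1)
    questions = generator.generate_questions_from_nodes()
    all_questions.extend(questions)

print(f"Generated {len(all_questions)} questions total")
```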