Updated 2 years ago

At a glance

A community member tried using the DatasetGenerator class and generate_questions() method from a notebook, but ran into issues when trying to generate questions from a collection of ~1000 blog posts. The process ran for an hour without returning any results, and the community member eventually stopped it out of concern about using too many tokens. Other community members suggested trying the method on a smaller subset of documents first, as the API calls may increase the runtime for larger datasets. They also noted that the method allows specifying the number of questions to generate per chunk, which could help with larger datasets. One community member found a workaround by breaking the 1000 blog posts into 50 directories of 20 posts each, setting up a new reader and question generator for each directory, and appending the results to a list of questions, which allowed them to generate over 500 questions in 10-15 minutes.

Useful resources
Has anyone used the DatasetGenerator class or generate_questions() method from this notebook?

https://github.com/jerryjliu/llama_index/blob/main/examples/evaluation/QuestionGeneration.ipynb

Tried to use it today on a collection of ~1000 blog posts. It ran for an hour without returning anything. It never errored, but I eventually stopped it out of worry that I was using a crazy amount of tokens. I don't see any docs on it.
5 comments
hmmm, this is super new, I haven't tried it yet.

Maybe try it on a smaller subset of documents first? I have a feeling for large datasets the API calls probably increase the runtime quite a bit
That's a good idea. Will try tomorrow on a smaller set and report back. I was also wondering if there's a parameter to specify the number of questions to generate.
Oh, good point. I just had a peek at the source code; it looks like you can specify the number of questions per chunk (the default is 10):

https://github.com/jerryjliu/llama_index/blob/main/gpt_index/evaluation/dataset_generation.py#L55
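A default of 10 questions per chunk adds up fast at this scale. A rough back-of-envelope sketch (the chunks-per-post figure is an assumption for illustration, and the parameter name follows the linked source):

```python
# Back-of-envelope for why a default of 10 questions per chunk stalls
# on a large corpus. The chunks-per-post figure below is an assumption
# for illustration, not measured from the real dataset.

def total_questions(num_posts: int, chunks_per_post: int,
                    questions_per_chunk: int) -> int:
    """Each chunk yields `questions_per_chunk` questions, so the total
    (and the LLM token usage) scales linearly with this setting."""
    return num_posts * chunks_per_post * questions_per_chunk

# Default of 10 questions per chunk, ~1000 posts, assuming ~3 chunks per post:
default_run = total_questions(1000, 3, 10)   # 30000 questions requested
# Dropping num_questions_per_chunk to 1 cuts the workload tenfold:
reduced_run = total_questions(1000, 3, 1)    # 3000 questions requested
```

Whatever the real chunk count per post, the ratio between the two runs is the same: the setting is a direct multiplier on the number of questions (and calls) requested.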
Ah, good move, I should have just done that. Just tried it on a smaller dataset and it works, so I should be able to handle the larger dataset with 1 question per chunk. Thanks for pointing this out.
For anyone interested in question generation:

I had a directory /blogposts/ with 1000 text files. I was unable to generate questions from this when putting all posts into one data loader, even with questions per chunk set to 1.

However, I found a workaround.

I broke my 1000 blog posts into 50 directories of 20 posts each.

I then iterated through the directories, set up a new reader and question generator for each, and appended the results to a list of questions. This worked fine and I was able to generate >500 questions in about 10-15 minutes.
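The batching loop above can be sketched in plain Python. The per-batch reader/generator setup is left as an injected function, since the exact llama_index API from that era (e.g. `SimpleDirectoryReader` and `DatasetGenerator`) may differ from current releases:

```python
from typing import Callable, Iterator, List, Sequence

def batches(paths: Sequence[str], batch_size: int = 20) -> Iterator[Sequence[str]]:
    """Yield successive groups of `batch_size` file paths
    (50 batches for 1000 posts at the default size)."""
    for i in range(0, len(paths), batch_size):
        yield paths[i:i + batch_size]

def generate_all_questions(
    post_paths: Sequence[str],
    questions_for: Callable[[Sequence[str]], List[str]],
    batch_size: int = 20,
) -> List[str]:
    """Run question generation batch by batch and collect the results.

    `questions_for` stands in for the per-batch work described above:
    in the real setup it would load the batch with a directory reader,
    build a fresh question generator for it, and return that batch's
    generated questions.
    """
    questions: List[str] = []
    for batch in batches(sorted(post_paths), batch_size):
        questions.extend(questions_for(batch))
    return questions
```

With a stub `questions_for` this is easy to check locally before spending any tokens; swapping in the real reader-plus-generator call per batch reproduces the workaround described above without needing 50 physical directories.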