Updated 2 years ago

llama_index/QuestionGeneration.ipynb at ...

Has anyone used the DatasetGenerator class or generate_questions() method from this notebook?

https://github.com/jerryjliu/llama_index/blob/main/examples/evaluation/QuestionGeneration.ipynb

Tried to use it today on a collection of ~1000 blog posts. It ran for an hour without returning anything. It never errored, but I eventually stopped it out of worry that I was using a crazy amount of tokens. I don't see any docs on it.
5 comments
hmmm, this is super new, I haven't tried it yet.

Maybe try it on a smaller subset of documents first? I have a feeling that for large datasets the API calls increase the runtime quite a bit.
That's a good idea. Will try tomorrow on a smaller set and report back. I was also wondering if there's a parameter to specify the number of questions to be generated.
Oh good point. Just had a peek at the source code; it looks like you can specify the number of questions per chunk (the default is 10):

https://github.com/jerryjliu/llama_index/blob/main/gpt_index/evaluation/dataset_generation.py#L55
Ah, good move, I should have just done that. Just tried it on a smaller dataset and it works. So I should be able to handle a larger dataset with 1 question per chunk. Thanks for pointing this out.
For anyone interested in question generation:

I had a directory /blogposts/ with 1000 text files. I was unable to generate questions from these when putting all posts into one data loader, even with questions per chunk set to 1.

However, I found a workaround.

I broke my 1000 blog posts into 50 directories of 20 posts each.

I then iterated through the directories, set up a new reader and question generator for each, and appended the results to a list of questions. This worked fine, and I was able to generate >500 questions in about 10-15 minutes.
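In case it helps anyone, here's a rough sketch of that batching loop. The batching logic itself is plain Python; the llama_index calls in the comment (SimpleDirectoryReader, DatasetGenerator, num_questions_per_chunk) match the linked source at the time but may have changed since, so treat them as an untested assumption, not the exact API.

```python
# Sketch of the workaround: split the file list into small batches and run
# the (slow, API-calling) question generator on each batch in turn.

def batched(items, size):
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def generate_all_questions(paths, batch_size, generate_fn):
    """Run generate_fn on each batch of paths and collect one flat
    list of generated questions."""
    questions = []
    for batch in batched(paths, batch_size):
        questions.extend(generate_fn(batch))
    return questions

# With the real library, generate_fn would look something like this
# (untested, based on the old llama_index API in the linked source):
#
#   from llama_index import SimpleDirectoryReader
#   from llama_index.evaluation import DatasetGenerator
#
#   def generate_fn(paths):
#       docs = SimpleDirectoryReader(input_files=paths).load_data()
#       gen = DatasetGenerator.from_documents(docs, num_questions_per_chunk=1)
#       return gen.generate_questions_from_nodes()
```

With 1000 posts and a batch size of 20 you get the 50 batches described above, and a failure partway through only loses the current batch rather than the whole run.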