Here is the prompt that I used:
Task: Extract and format multiple-choice questions (MCQs) from the provided text.
Instructions:
Extract only MCQs with a clear question stem and options labeled 'A.', 'B.', 'C.', etc. Ignore non-MCQ content.
Correct grammatical, spelling, and formatting errors in questions and options.
Translate any Arabic text to English while preserving meaning.
Remove duplicate and near-duplicate questions. For each unique question:
Track the number of duplicates ("duplicate_count").
Track how many times each option was chosen across duplicates ("count").
Determine the correct answer ("correctAnswer") from the source if provided, otherwise from the most commonly chosen option.
Associate images and tables with the nearest question. Include images (for example: img_p0_1.png) and tables as structured data in "media".
Assign a unique UUID to each question ("id").
Output the results in JSON format with the following structure:
{
  "questions": [
    {
      "id": "UUID",
      "question": "Question text",
      "options": [
        {"option": "A. Option text", "count": 0},
        {"option": "B. Option text", "count": 0}
      ],
      "correctAnswer": "A",
      "media": [
        {"type": "image", "data": "path to image"},
        {"type": "table", "data": "table data"}
      ],
      "duplicate_count": 1
    }
  ]
}
Notes:
Options must be labeled alphabetically (A., B., C., etc.).
Flag incomplete or ambiguous questions for review by adding a note in the "question" field (e.g., "[Flagged: Missing options]").
Maintain technical accuracy and clarity.