General info

  • The input of a question generator may or may not contain the answer (e.g., a span in the input document). The former setting is answer-aware, while the latter is answer-agnostic, also called QG without answer supervision. The answer-agnostic case is similar to summarization: the model must decide what is worth asking (see the sketch after this list).
  • Shallow vs. deep QG: a question is shallow, also called low cognitive demanding (LCD), if its answer can be found in a single sentence and/or is given explicitly. Recently, the focus has shifted to multi-hop QG, where the answer can only be obtained by reasoning over multiple sentences; such questions are considered high cognitive demanding (HCD), or deep.
  • The output of a question generator can be free-form sentences or multiple-choice questions. In the multiple-choice case, a key challenge is generating good distractors.
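
To make the distinction between the two input settings concrete, below is a minimal sketch of how they differ, assuming a T5-style sequence-to-sequence QG model. The checkpoint name and the `<hl>` highlight convention are illustrative assumptions (an off-the-shelf t5-base is not fine-tuned for QG, so its output will be poor):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint: substitute a seq2seq model fine-tuned for QG.
MODEL_NAME = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

document = "Marie Curie won the Nobel Prize in Physics in 1903."
answer = "1903"

# Answer-aware: the answer span is marked (here with <hl> tags) so the
# model knows what the question must ask about.
aware_input = document.replace(answer, f"<hl> {answer} <hl>")

# Answer-agnostic: only the document is given; the model must decide
# what is worth asking, much like a summarizer decides what to keep.
agnostic_input = document

for text in (aware_input, agnostic_input):
    ids = tokenizer("generate question: " + text, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```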

Question generation papers

Low cognitive demanding QG

  1. Unsupervised multiple-choice question generation for out-of-domain Q&A fine-tuning, ACL-2022. Input: a topic word and a supporting document. Approach: uses the jsRealB text realizer on the constituency parse tree. Distractors are the highest-ranking answer candidates other than the correct answer (see the ranking sketch after this list). Dataset: SciQ.
  2. Learning to Generate Questions by Learning What not to Generate (CGC-QG), WWW-2019. Answer-aware. Approach: similar to extractive summarization, it predicts whether each token is a “clue” word that should be used to construct the question. To predict the “clueness”, a GCN is applied on the parse tree (see the labeling sketch after this list).
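
For item 1, a sketch of the distractor-selection step under a simple reading: rank the answer candidates against the supporting document and keep the top ones that are not the correct answer. The bag-of-words similarity here is an illustrative stand-in, not the paper's actual ranking model:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pick_distractors(candidates, document, answer, k=3):
    """Rank candidates by bag-of-words similarity to the document and
    return the top-k that are not the correct answer."""
    doc = Counter(document.lower().split())
    ranked = sorted(
        (c for c in candidates if c.lower() != answer.lower()),
        key=lambda c: cosine(Counter(c.lower().split()), doc),
        reverse=True,
    )
    return ranked[:k]

print(pick_distractors(
    candidates=["mitochondrion", "ribosome", "nucleus", "chloroplast"],
    document="The nucleus stores DNA; the mitochondrion powers the cell.",
    answer="mitochondrion",
))  # ['nucleus', 'ribosome', 'chloroplast']
```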

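For item 2, a simplified sketch of how token-level “clue” labels can be derived for training: mark document tokens that reappear in the gold question. This overlap heuristic is an assumption for illustration; CGC-QG's actual labeling and its GCN over the parse tree are more involved:

```python
STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "to", "and", "did"}

def clue_labels(document_tokens, question_tokens):
    """Binary label per document token: 1 if the (non-stopword) token
    also appears in the question, i.e., it was likely copied as a clue."""
    q = {t.lower() for t in question_tokens} - STOPWORDS
    return [int(t.lower() in q) for t in document_tokens]

doc = "Marie Curie won the Nobel Prize in Physics in 1903".split()
qst = "When did Marie Curie win the Nobel Prize ?".split()
print(list(zip(doc, clue_labels(doc, qst))))
# Marie, Curie, Nobel, Prize -> 1; the answer "1903" and the rest -> 0
```
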
High cognitive demanding QG

  1. Capturing Greater Context for Question Generation, AAAI-2020. Approach: non-transformer attention networks between the document and the answer. Dataset: SQuAD, MS MARCO, and NewsQA. Evaluation: ROUGE, BLEU, and METEOR, plus human evaluation on naturalness (grammar) and difficulty.
  2. Educational Question Generation of Children Storybooks via Question Type Distribution Learning and Event-Centric Summarization, ACL-2022. Answer-agnostic. Approach: BERT predicts a question type, BART generates a summary from the document and the question type, and finally BART generates the question from the summary (see the pipeline sketch after this list). Dataset: FairytaleQA. Evaluation: ROUGE-L and BERTScore, plus human evaluation on four aspects: question type, validity, readability, and child appropriateness.
  3. CQG: A Simple and Effective Controlled Generation Framework for Multi-hop Question Generation, ACL-2022. Answer-aware. Approach: extracts key entities with a Graph Attention Network (GAT), predicts a flag for each token, and finally uses cross attention between the input and the flag sequence to generate the question. Dataset: HotpotQA. Evaluation: ROUGE-L, METEOR, and human evaluation on fluency, relevance, and complexity.
  4. A Feasibility Study of Answer-Agnostic Question Generation for Education, ACL-2022. Answer-agnostic. Finding: QG improves when the input is not the original document but a summary of it, whether human-written or machine-generated. Evaluation: human evaluation. Refs: this paper discusses the difference between answer-agnostic and answer-aware QG.
  5. Question Generation for Reading Comprehension Assessment by Modeling How and What to Ask, ACL-2022. Approach: “Thus, in how to ask (HTA), we train (fine-tune) a model on large QG datasets, and then, we further train the model to teach the model what to ask (WTA).” The paper also discusses educational applications.
  6. Semantic Graphs for Generating Deep Questions, ACL-2020. Answer-aware. Approach: first builds a semantic graph from the input document (using semantic role labeling, SRL, and dependency parsing), then jointly attends over the graph and the input document, and finally decodes the question. Evaluation: ROUGE, BLEU, and METEOR, plus human evaluation on fluency, relevance, and complexity. Dataset: HotpotQA.
  7. Exploring Question-Specific Rewards for Generating Deep Questions, COLING-2020. Answer-agnostic. Approach: uses reward functions to train the question generator. The rewards cover three aspects: fluency, relevance, and answerability. For fluency, the reward is perplexity under a language model; for relevance, a BERT model is fine-tuned to discriminate correct from negative answers (see the reward sketch after this list).
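
For item 2 above (educational QG for storybooks), a sketch of the three-stage data flow: question-type prediction, event-centric summarization, then question generation. The checkpoints below are off-the-shelf placeholders used only to show the data flow; the paper fine-tunes its own BERT and BART models on FairytaleQA:

```python
from transformers import pipeline

# Placeholder checkpoints; none of these are the paper's fine-tuned models.
type_predictor = pipeline("text-classification", model="bert-base-uncased")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
question_gen = pipeline("text2text-generation", model="facebook/bart-base")

def storybook_question(section: str) -> str:
    # Stage 1: predict a question type (e.g., causal, character).
    q_type = type_predictor(section[:512])[0]["label"]
    # Stage 2: condense the section into an event-centric summary.
    summary = summarizer(section, max_length=60, min_length=10)[0]["summary_text"]
    # Stage 3: generate the question conditioned on type and summary.
    prompt = f"question type: {q_type} context: {summary}"
    return question_gen(prompt, max_length=48)[0]["generated_text"]
```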

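For item 7 (reward-based training), a sketch of the fluency reward only: negative perplexity under a pretrained language model (GPT-2 is an assumed stand-in). The relevance and answerability rewards require the fine-tuned BERT discriminator and a QA model, which are omitted here:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def fluency_reward(question: str) -> float:
    """Negative perplexity of the question under GPT-2: a fluent,
    well-formed question receives a higher (less negative) reward."""
    enc = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        loss = lm(**enc, labels=enc["input_ids"]).loss  # mean token NLL
    return -torch.exp(loss).item()

print(fluency_reward("Who discovered radium?"))
print(fluency_reward("radium who ? discovered the"))  # much lower reward
```
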
Datasets

The evaluation of question generation