Research: Summarization metrics
The Transformer architecture, invented by Google in 2017, has triggered a boom in text generation (natural language generation, NLG), including summarization, simplification, and translation. A resulting problem is how to judge the quality of generated text or of a generator (summarizer, translator, etc.). Hence we are now seeing a boom in NLG metrics; for example, ACL 2022 has about 10 papers on NLG metrics. I feel fortunate to be part of this trend. My effort so far has been on summarization metrics.
View this page with properly rendered math and diagrams on GitHub
Table of contents
- Our publications
- Background: Summarization vs. Summarization evaluation/metrics
- Background: Reference-based vs. reference-free summary evaluation/metrics
- System-level vs. Summary-level evaluation
- Datasets
- Supervised approach is hard
- Papers to read
Our publications
- Ge Luo, Hebi Li, Youbiao He and Forrest Sheng Bao, PrefScore: Pairwise Preference Learning for Reference-free Summarization Quality Assessment, COLING 2022, See you in Korea!
- Forrest Sheng Bao, Ge Luo, Hebi Li, Cen Chen, Yinfei Yang, Youbiao He, Minghui Qiu, A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling, NAACL 2022 (Trailer Video)
Background: Summarization vs. Summarization evaluation/metrics
They sound similar but they differ hugely; a code sketch after the list below makes the difference concrete.
- Summarization:
- By: a summarizer, also called a system
- Input: a document
- Output: a summary, which is usually much shorter than the document. Also called a system summary.
```mermaid
graph LR; A(Document) --> B((Summarizer)) --> C(Summary) ;
```
- Summary evaluation/metric:
- Input:
- a system summary to be judged
- the corresponding document, OR a reference summary
- Output: a score.
```mermaid
graph LR; A(System Summary) --> B((metric)) --> C(score) ; D(Document <br> OR <br> Reference Summary) --> B ;
```
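To make the two roles concrete, here is a minimal Python sketch. The truncation-based summarizer and the word-overlap metric are toy illustrations, not real systems or published metrics.

```python
def summarize(document: str, max_words: int = 30) -> str:
    """A summarizer maps a document to a (much shorter) system summary.
    Toy example: keep only the first max_words words."""
    return " ".join(document.split()[:max_words])


def evaluate(system_summary: str, doc_or_ref: str) -> float:
    """A summary metric maps a system summary, plus either the document
    (reference-free) or a reference summary (reference-based), to a score.
    Toy example: fraction of summary words that also appear in doc_or_ref."""
    summary_words = system_summary.lower().split()
    source_words = set(doc_or_ref.lower().split())
    if not summary_words:
        return 0.0
    return sum(w in source_words for w in summary_words) / len(summary_words)


if __name__ == "__main__":
    doc = "The quick brown fox jumps over the lazy dog near the river bank."
    sys_sum = summarize(doc, max_words=6)
    print(sys_sum)                 # the system summary
    print(evaluate(sys_sum, doc))  # the score, 1.0 under this toy metric
```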
Background: Reference-based vs. reference-free summary evaluation/metrics
An introductory video for our NAACL 2022 paper
Depending on the second input of summary evaluation, there are two branches (see the code sketch after this list):
- Reference-based: Compares the system summary against a reference summary written by a human.
- Pros: Accurate
- Cons: Laborious to obtain reference summaries.
- Examples: ROUGE, BLEURT, MoverScore, BERTScore
```mermaid
graph LR; A(System Summary) --> B((metric)) --> C(score) ; D(Document) --> E{human} --> F(Reference Summary) --> B; D --> G((summarizer)) -->A;
```
- Mathematically, $f(\text{reference summary}, \text{system/generated summary})$, short as $f(\text{ref}, \text{sys})$.
- Reference-free: Compares the system summary directly against the document
- Pros: Not relying on reference summaries, which are costly to obtain
- Cons: Less accurate
- Examples: BLANC, SummaQA, SUPERT, LS-Score, SueNes, PrefScore
```mermaid
graph LR; A(System Summary) --> B((metric)) --> C(score) ; D(Document) --> B; D --> G((summarizer)) --> A ;
```
- Mathematically, $f(\text{document}, \text{system/generated summary})$, short as $f(\text{doc}, \text{sys})$.
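Both signatures can be sketched in a few lines of Python. The reference-based example assumes the `rouge-score` package (`pip install rouge-score`) is available; the reference-free example is a made-up bag-of-words overlap, not any published metric.

```python
from rouge_score import rouge_scorer  # pip install rouge-score


def reference_based(ref: str, sys: str) -> float:
    """f(ref, sys): compare the system summary against a human reference.
    Here: ROUGE-1 F1 computed with Google's rouge-score package."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    return scorer.score(ref, sys)["rouge1"].fmeasure


def reference_free(doc: str, sys: str) -> float:
    """f(doc, sys): compare the system summary directly against the document.
    Toy metric: fraction of summary words that occur in the document."""
    doc_words = set(doc.lower().split())
    sys_words = sys.lower().split()
    return sum(w in doc_words for w in sys_words) / max(len(sys_words), 1)


if __name__ == "__main__":
    doc = "The city council approved the new budget after a long debate."
    ref = "Council approves new budget."
    sys = "The council approved the budget."
    print(reference_based(ref, sys))  # needs a human-written reference
    print(reference_free(doc, sys))   # needs only the document
```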
System-level vs. Summary-level evaluation
To be continued
Datasets
There are two kinds of datasets associated with summarization research:
- summarization datasets (pairs of documents and summaries, i.e., (doc1, sum1), (doc2, sum2), ...). They contain no system summaries generated by different summarizers and thus no human ratings of system summaries.
```mermaid
graph LR; A(Document 1) --- B(Reference Summary 1); C(Document 2) --- D(Reference Summary 2); E(... ); G(Document N) --- H(Reference Summary N);
```
- CNN/Dailymail (CNNDM)
- BigPatent
- Billsum
- Newsroom (not a good one, as each summary is just one sentence)
- ScientificPapers (it has two subsets, arXiv and PubMed)
and
- summarization evaluation datasets (tuples of one document, multiple summaries generated by different summarizers, and human ratings for each summary, i.e., (doc1, sum1A, sum1B, ..., rate1A, rate1B, ...), (doc2, sum2A, sum2B, ..., rate2A, rate2B, ...), ...). Visualized, the content looks like the diagram below:
```mermaid
graph TD;
  D1(Document 1) --> |Summarizer 1|S11(Summary 1,A)
  D1 --> |Summarizer 2|S12(Summary 1,B)
  D1 --> |Summarizer 3|S13(Summary 1,C)
  S11 --> SS11(Score for Summary 1,A)
  S12 --> SS12(Score for Summary 1,B)
  S13 --> SS13(Score for Summary 1,C)
  DB(Document 2) --> |Summarizer 1|SB1(Summary 2,A)
  DB --> |Summarizer 2|SB2(Summary 2,B)
  DB --> |Summarizer 3|SB3(Summary 2,C)
  SB1 --> SSB1(Score for Summary 2,A)
  SB2 --> SSB2(Score for Summary 2,B)
  SB3 --> SSB3(Score for Summary 2,C)
```
In summarization evaluation/quality studies, the second type of dataset always serves as the test set, because human evaluation is the ground truth on summary quality.
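For concreteness, here is a minimal sketch of loading a summarization dataset with the Hugging Face `datasets` library; the CNN/DailyMail config name `3.0.0` and the field names `article`/`highlights` are assumptions to verify against the current hub. The toy `evaluation_record` at the end only illustrates the shape of an evaluation dataset, not any real one.

```python
# A minimal sketch, assuming the Hugging Face `datasets` library is installed
# and that the CNN/DailyMail config is named "3.0.0" with fields
# "article" (document) and "highlights" (reference summary).
from datasets import load_dataset

cnndm = load_dataset("cnn_dailymail", "3.0.0", split="test[:3]")
for example in cnndm:
    doc = example["article"]      # the document
    ref = example["highlights"]   # the human-written reference summary
    print(len(doc.split()), "->", len(ref.split()), "words")

# A summarization *evaluation* dataset, by contrast, attaches multiple system
# summaries and human ratings to each document. A toy record might look like:
evaluation_record = {
    "document": "Full text of document 1 ...",
    "system_summaries": {"summarizer_A": "...", "summarizer_B": "..."},
    "human_ratings": {"summarizer_A": 4.2, "summarizer_B": 3.1},
}
```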
Supervised approach is hard
Because a summarization evaluation dataset (TAC2010, RealSumm, Newsroom) is usually very small, say 100 samples, training a model on it with human ratings as targets/labels is prone to overfitting. Instead, an unsupervised approach, like ROUGE, BLEU, or BERTScore, or a weakly/self-/semi-supervised approach, like SueNes or BLEURT, is preferred.
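Since such datasets serve only as test sets, a metric is typically meta-evaluated by correlating its scores with the human ratings. Below is a minimal sketch assuming SciPy is installed; the rating and score arrays are made-up numbers for illustration.

```python
# A minimal sketch of meta-evaluation: correlate a metric's scores with human
# ratings on a (small) summarization evaluation dataset.
from scipy.stats import kendalltau, spearmanr

human_ratings = [4.5, 3.0, 2.0, 4.0, 1.5]       # one human rating per system summary
metric_scores = [0.81, 0.55, 0.40, 0.70, 0.35]  # one metric score per system summary

rho, _ = spearmanr(human_ratings, metric_scores)
tau, _ = kendalltau(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```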
Papers to read
ACL 2022
- Spurious Correlations in Reference-Free Evaluation of Text Generation
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation
- FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation
- RoMe: A Robust Metric for Evaluating Natural Language Generation
- Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons
- UniTE: Unified Translation Evaluation
- Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics (Findings)
- Just Rank: Rethinking Evaluation with Word and Sentence Similarities (weakly related)