Research: Summarization metrics
The Transformer architecture, invented by Google in 2017, has triggered a boom in text generation (natural language generation, NLG), including summarization, simplification, and translation. A resulting problem is how to judge the quality of generated text or of a generator (summarizer, translator, etc.). Hence we are now seeing a boom in NLG metrics; for example, ACL 2022 has about 10 papers on NLG metrics. I feel fortunate to be part of this trend. My effort so far has been on summarization metrics.
View this page with properly rendered math and diagrams on GitHub
Table of contents
- Our publications
- Background: Summarization vs. Summarization evaluation/metrics
- Background: Reference-based vs. reference-free summary evaluation/metrics
- System-level vs. Summary-level evaluation
- Datasets
- Supervised approach is hard
- Papers to read
Our publications
- Ge Luo, Hebi Li, Youbiao He and Forrest Sheng Bao, PrefScore: Pairwise Preference Learning for Reference-free Summarization Quality Assessment, COLING 2022, See you in Korea!
- Forrest Sheng Bao, Ge Luo, Hebi Li, Cen Chen, Yinfei Yang, Youbiao He, Minghui Qiu, A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling, NAACL 2022 (Trailer Video)
Background: Summarization vs. Summarization evaluation/metrics
They sound similar but they differ hugely; a code sketch after the list below makes the difference concrete.
- Summarization:
- By: a summarizer, also called a system
- Input: a document
- Output: a summary, which is usually much shorter than the document. Also called a system summary.
```mermaid
graph LR; A(Document) --> B((Summarizer)) --> C(Summary) ;
```
- Summary evaluation/metric:
- Input:
- a system summary to be judged
- the corresponding document, OR a reference summary
- Output: a score.
```mermaid
graph LR; A(System Summary) --> B((metric)) --> C(score) ; D(Document <br> OR <br> Reference Summary) --> B ;
```
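To make the two roles concrete, here is a minimal Python sketch. The truncation-based summarizer and the word-overlap metric are toy illustrations, not real systems or published metrics.

```python
def summarize(document: str, max_words: int = 30) -> str:
    """A summarizer maps a document to a (much shorter) system summary.
    Toy example: keep only the first max_words words."""
    return " ".join(document.split()[:max_words])


def evaluate(system_summary: str, doc_or_ref: str) -> float:
    """A summary metric maps a system summary, plus either the document
    (reference-free) or a reference summary (reference-based), to a score.
    Toy example: fraction of summary words that also appear in doc_or_ref."""
    summary_words = system_summary.lower().split()
    source_words = set(doc_or_ref.lower().split())
    if not summary_words:
        return 0.0
    return sum(w in source_words for w in summary_words) / len(summary_words)


if __name__ == "__main__":
    doc = "The quick brown fox jumps over the lazy dog near the river bank."
    sys_sum = summarize(doc, max_words=6)
    print(sys_sum)                 # the system summary
    print(evaluate(sys_sum, doc))  # the score, 1.0 under this toy metric
```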
Background: Reference-based vs. reference-free summary evaluation/metrics
An introductory video for our NAACL 2022 paper
Depending on the second input of summary evaluation, there are two branches (see the code sketch after this list):
- Reference-based: Compares the system summary against a reference summary written by a human.
- Pros: Accurate
- Cons: Laborious to obtain reference summaries.
- Examples: ROUGE, BLEURT, MoverScore, BERTScore
```mermaid
graph LR; A(System Summary) --> B((metric)) --> C(score) ; D(Document) --> E{human} --> F(Reference Summary) --> B; D --> G((summarizer)) -->A;
```
- Mathematically, $f(\text{reference summary}, \text{system/generated summary})$, short as $f(\text{ref}, \text{sys})$.
- Reference-free: Compares the system summary directly against the document
- Pros: Not relying on reference summaries, which are costly to obtain
- Cons: Less accurate
- Examples: BLANC, SummaQA, SUPERT, LS-Score, SueNes, PrefScore
```mermaid
graph LR; A(System Summary) --> B((metric)) --> C(score) ; D(Document) --> B; D --> G((summarizer)) --> A ;
```
- Mathematically, $f(\text{document}, \text{system/generated summary})$, short as $f(\text{doc}, \text{sys})$.
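Both signatures can be sketched in a few lines of Python. The reference-based example assumes the `rouge-score` package (`pip install rouge-score`) is available; the reference-free example is a made-up bag-of-words overlap, not any published metric.

```python
from rouge_score import rouge_scorer  # pip install rouge-score


def reference_based(ref: str, sys: str) -> float:
    """f(ref, sys): compare the system summary against a human reference.
    Here: ROUGE-1 F1 computed with Google's rouge-score package."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    return scorer.score(ref, sys)["rouge1"].fmeasure


def reference_free(doc: str, sys: str) -> float:
    """f(doc, sys): compare the system summary directly against the document.
    Toy metric: fraction of summary words that occur in the document."""
    doc_words = set(doc.lower().split())
    sys_words = sys.lower().split()
    return sum(w in doc_words for w in sys_words) / max(len(sys_words), 1)


if __name__ == "__main__":
    doc = "The city council approved the new budget after a long debate."
    ref = "Council approves new budget."
    sys = "The council approved the budget."
    print(reference_based(ref, sys))  # needs a human-written reference
    print(reference_free(doc, sys))   # needs only the document
```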
System-level vs. Summary-level evaluation
To be continued
Datasets
There are two kinds of datasets associated with summarization research:
- summarization datasets (pairs of documents and summaries, i.e., (doc1, sum1), (doc2, sum2), ...). They contain no system summaries generated by different summarizers and thus no human ratings of system summaries.
```mermaid
graph LR; A(Document 1) --- B(Reference Summary 1); C(Document 2) --- D(Reference Summary 2); E(... ); G(Document N) --- H(Reference Summary N);
```
- CNN/Dailymail (CNNDM)
- BigPatent
- Billsum
- Newsroom (not a good one, as each summary is just one sentence)
- ScientificPapers (it has two subsets, arXiv and PubMed)
and
- summarization evaluation datasets (tuples of one document, multiple summaries generated by different summarizers, and human ratings for each summary, i.e., (doc1, sum1A, sum1B, ..., rate1A, rate1B, ...), (doc2, sum2A, sum2B, ..., rate2A, rate2B, ...), ...). Visualized, the content looks like the diagram below:
```mermaid
graph TD;
  D1(Document 1) --> |Summarizer 1|S11(Summary 1,A)
  D1 --> |Summarizer 2|S12(Summary 1,B)
  D1 --> |Summarizer 3|S13(Summary 1,C)
  S11 --> SS11(Score for Summary 1,A)
  S12 --> SS12(Score for Summary 1,B)
  S13 --> SS13(Score for Summary 1,C)
  DB(Document 2) --> |Summarizer 1|SB1(Summary 2,A)
  DB --> |Summarizer 2|SB2(Summary 2,B)
  DB --> |Summarizer 3|SB3(Summary 2,C)
  SB1 --> SSB1(Score for Summary 2,A)
  SB2 --> SSB2(Score for Summary 2,B)
  SB3 --> SSB3(Score for Summary 2,C)
```
In summarization evaluation/quality studies, the second type of dataset always serves as the test set, because human evaluation is the ground truth on summary quality.
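For concreteness, here is a minimal sketch of loading a summarization dataset with the Hugging Face `datasets` library; the CNN/DailyMail config name `3.0.0` and the field names `article`/`highlights` are assumptions to verify against the current hub. The toy `evaluation_record` at the end only illustrates the shape of an evaluation dataset, not any real one.

```python
# A minimal sketch, assuming the Hugging Face `datasets` library is installed
# and that the CNN/DailyMail config is named "3.0.0" with fields
# "article" (document) and "highlights" (reference summary).
from datasets import load_dataset

cnndm = load_dataset("cnn_dailymail", "3.0.0", split="test[:3]")
for example in cnndm:
    doc = example["article"]      # the document
    ref = example["highlights"]   # the human-written reference summary
    print(len(doc.split()), "->", len(ref.split()), "words")

# A summarization *evaluation* dataset, by contrast, attaches multiple system
# summaries and human ratings to each document. A toy record might look like:
evaluation_record = {
    "document": "Full text of document 1 ...",
    "system_summaries": {"summarizer_A": "...", "summarizer_B": "..."},
    "human_ratings": {"summarizer_A": 4.2, "summarizer_B": 3.1},
}
```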
Supervised approach is hard
Because a summarization evaluation dataset (TAC2010, RealSumm, Newsroom) is usually very small, say 100 samples, training a model on it with human ratings as targets/labels is prone to overfitting. Instead, an unsupervised approach, like ROUGE, BLEU, or BERTScore, or a weakly/self-/semi-supervised approach, like SueNes or BLEURT, is preferred.
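Since such datasets serve only as test sets, a metric is typically meta-evaluated by correlating its scores with the human ratings. Below is a minimal sketch assuming SciPy is installed; the rating and score arrays are made-up numbers for illustration.

```python
# A minimal sketch of meta-evaluation: correlate a metric's scores with human
# ratings on a (small) summarization evaluation dataset.
from scipy.stats import kendalltau, spearmanr

human_ratings = [4.5, 3.0, 2.0, 4.0, 1.5]       # one human rating per system summary
metric_scores = [0.81, 0.55, 0.40, 0.70, 0.35]  # one metric score per system summary

rho, _ = spearmanr(human_ratings, metric_scores)
tau, _ = kendalltau(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```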
Papers to read
ACL 2022
- Spurious Correlations in Reference-Free Evaluation of Text Generation
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation
- FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation
- RoMe: A Robust Metric for Evaluating Natural Language Generation
- Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons
- UniTE: Unified Translation Evaluation
- Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics (Findings)
- Just Rank: Rethinking Evaluation with Word and Sentence Similarities (weakly related)