Research: Summarization metrics

The Transformer architecture, introduced by Google in 2017, has triggered a boom in text generation (natural language generation, NLG), including summarization, simplification, and translation. A consequent problem is how to judge the quality of generated text or of a generator (summarizer, translator, etc.). As a result, we are now also seeing a boom in NLG metrics. For example, ACL 2022 has about 10 papers on NLG metrics. I feel fortunate to be part of the trend. My effort so far has been on summarization metrics.

View this page with properly rendered math and diagrams on GitHub

Table of contents

Our publications

Background: Summarization vs. Summarization evaluation/metrics

They sound similar but they differ hugely.

Background: Reference-based vs. reference-free summary evaluation/metrics

An introductory video, for our NAACL 2022 paper

Depending on the 2nd input of summary evaluation, there are two branches:

System-level vs. Summary-level evaluation

To be continued


There are two kinds of datasets associated with summarization research:

  1. summarization datasets (pairs of documents and summaries, i.e., (doc1, sum1), (doc2, sum2), ...). These contain no system summaries generated by different summarizers, and thus no human ratings of system summaries.
    graph LR;
     A(Document 1) --- B(Reference Summary 1);
     C(Document 2) --- D(Reference Summary 2);
     E(... );
     G(Document N) --- H(Reference Summary N);


  2. summarization evaluation datasets (tuples of one document, multiple summaries generated by different summarizers, and a human rating for each summary, i.e., (doc1, sum1A, sum1B, ..., rate1A, rate1B, ...), (doc2, sum2A, sum2B, ..., rate2A, rate2B, ...), ...). Visualized, the content looks like this:
     graph TD;
       D1(Document 1) --> |Summarizer 1|S11(Summary 1,A)
       D1 --> |Summarizer 2|S12(Summary 1,B)
       D1 --> |Summarizer 3|S13(Summary 1,C)
       S11 --> SS11(Score for Summary 1,A)
       S12 --> SS12(Score for Summary 1,B)
       S13 --> SS13(Score for Summary 1,C)
       DB(Document 2) --> |Summarizer 1|SB1(Summary 2,A)
       DB --> |Summarizer 2|SB2(Summary 2,B)
       DB --> |Summarizer 3|SB3(Summary 2,C)
       SB1 --> SSB1(Score for Summary 2,A)
       SB2 --> SSB2(Score for Summary 2,B)
       SB3 --> SSB3(Score for Summary 2,C)
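
The two dataset layouts above can be sketched as plain Python data classes. This is only an illustration: the class and field names (`SummarizationSample`, `RatedSummary`, `EvaluationSample`) and the rating values are hypothetical, and real datasets such as TAC2010 or RealSumm ship in their own formats.

```python
from dataclasses import dataclass, field


@dataclass
class SummarizationSample:
    """One entry of a summarization dataset: a document and its reference summary."""
    document: str
    reference_summary: str


@dataclass
class RatedSummary:
    """A system summary together with the human rating it received."""
    summarizer: str
    summary: str
    human_rating: float


@dataclass
class EvaluationSample:
    """One entry of a summarization evaluation dataset:
    one document, several system summaries, one human rating per summary."""
    document: str
    rated_summaries: list[RatedSummary] = field(default_factory=list)


# Illustrative values only; the ratings below are made up.
sample = EvaluationSample(
    document="Document 1 text ...",
    rated_summaries=[
        RatedSummary("Summarizer 1", "Summary 1,A text ...", 4.0),
        RatedSummary("Summarizer 2", "Summary 1,B text ...", 2.5),
        RatedSummary("Summarizer 3", "Summary 1,C text ...", 3.0),
    ],
)
```

The first kind of dataset has no counterpart to `RatedSummary`: it carries a single reference summary per document and no human ratings.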

In summarization evaluation/quality studies, datasets of the second type always serve as test sets, because human evaluation is the ground truth for summary quality.
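
A common way to use such a test set is to check how well a metric's scores correlate with the human ratings. The sketch below is a minimal illustration of that idea, assuming Pearson correlation computed per document and then averaged; the function names and all numbers are hypothetical, and published studies often report Spearman or Kendall correlations as well.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def summary_level_correlation(metric_scores, human_ratings):
    """Correlate metric scores with human ratings within each document,
    then average the per-document correlations."""
    per_document = [pearson(m, h) for m, h in zip(metric_scores, human_ratings)]
    return sum(per_document) / len(per_document)


# Toy numbers: each row is a document, each column a summarizer (A, B, C).
metric_scores = [[0.2, 0.5, 0.9], [0.4, 0.3, 0.8]]
human_ratings = [[1.0, 3.0, 5.0], [2.0, 3.0, 5.0]]
print(summary_level_correlation(metric_scores, human_ratings))
```

A value close to 1 would mean the metric ranks summaries of the same document much as the human raters do. Note that `pearson` is undefined when all scores for a document are identical, a case this sketch does not handle.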

Supervised approach is hard

Because a summarization evaluation dataset (e.g., TAC2010, RealSumm, Newsroom) is usually very small, say 100 samples, a model trained on it with human ratings as targets/labels is prone to overfitting. Instead, an unsupervised approach, like ROUGE, BLEU, or BERTScore, or a weakly/self/semi-supervised approach, like SueNes or BLEURT, is preferred.
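
To make "unsupervised" concrete: such a metric needs no human ratings for training and scores a summary by comparing it directly with a reference. Below is a minimal sketch in the spirit of ROUGE-1, using clipped unigram overlap. It is not the official ROUGE implementation (no stemming, stopword handling, or proper tokenization), and the function name `unigram_overlap_f1` is made up for this example.

```python
from collections import Counter


def unigram_overlap_f1(candidate: str, reference: str) -> float:
    """F1 over clipped unigram matches between a candidate and a reference summary."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipped matches).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = unigram_overlap_f1("the cat on the mat", "the cat sat on the mat")
print(round(score, 3))
```

Because nothing is fitted to human ratings, the small size of the evaluation dataset does not cause overfitting here; the dataset is needed only for testing the metric.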

Papers to read

ACL 2022