Indian Language Summarization
In this edition, the text summarization task is offered in three Dravidian languages (Kannada, Tamil and Telugu) in addition to the languages included in previous editions. Each language except Kannada has over 15,000 document-summary pairs extracted from leading newspapers; Kannada has about 6,000 pairs.
Category | Train Dataset | Val Dataset | Test Dataset |
---|---|---|---|
Hindi | Download | Download | Download |
Gujarati | Please use ILSUM 2023 data for Gujarati from the link provided below | Download | |
Bengali | Please use ILSUM 2023 data for Bengali from the link provided below | Download | |
Tamil | Download | Download | Download |
Telugu | Download | Download | Download |
Kannada | Download | Download | Download |
English | Download | Download | Download |
The dataset contains the following:
Original news articles
Factually correct abstractive summary in English, Hindi and Gujarati
Factually incorrect abstractive summary in English, Hindi and Gujarati (for a subset of the data)
Type of incorrectness
An incorrect summary can have multiple types associated with it, but only one label is provided in the training data. The test data will have all the applicable labels.
For Task 1, the standard ROUGE metrics will be used to evaluate automatic summarization in the live submission system. Additionally, once all submissions have been received, BERTScore will be computed offline to gain insight into abstractive summarization methods. BERTScore will not be part of the leaderboard scores because of its high computational requirements. We also plan to manually evaluate part of the dataset.
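To give a rough idea of what the ROUGE scores measure, here is a minimal sketch of ROUGE-N and ROUGE-L F1 in plain Python. It is not the official scorer used by the submission system: the whitespace tokenization, the absence of stemming, and the function names `rouge_n_f1` and `rouge_l_f1` are all assumptions made for illustration, so the numbers it produces may differ from leaderboard values.

```python
from collections import Counter


def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1 over whitespace tokens (assumption: no stemming)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference.split()), ngrams(candidate.split())
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L F1 based on the longest common subsequence (LCS)."""
    ref, cand = reference.split(), candidate.split()
    # Row-by-row dynamic programming for the LCS length.
    prev = [0] * (len(cand) + 1)
    for r in ref:
        curr = [0]
        for j, c in enumerate(cand, start=1):
            curr.append(prev[j - 1] + 1 if r == c else max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Whitespace splitting is used here only because it works without extra dependencies; a real evaluation of Indian-language text would likely need a tokenizer suited to each script.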
Macro F1 Score will be used for Task 2.
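As an illustration of how macro F1 weights every class equally, the sketch below computes it for single-label predictions in plain Python. The function name `macro_f1` and the label strings in the example are hypothetical, and how the organizers score test items that carry several applicable labels is not specified here, so this covers only the single-label case.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # Per-class F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical label names, for illustration only.
gold = ["type_a", "type_a", "type_b", "type_c"]
pred = ["type_a", "type_b", "type_b", "type_c"]
score = macro_f1(gold, pred)
```

Because each class contributes equally to the average, a rare incorrectness type that is always misclassified lowers the score as much as a frequent one would.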