ILSUM

ILSUM 2024

Indian Language Summarization

Automatic text summarization for Indian languages has received surprisingly little attention from the NLP research community. While large-scale datasets exist for languages such as English, Chinese, French, German, and Spanish, no comparable datasets exist for any Indian language. Most existing datasets are either not public or too small to be useful. Through this shared task we aim to bridge this gap by creating reusable corpora for Indian language summarization. The third edition of ILSUM introduces three Dravidian languages, Kannada, Tamil and Telugu, in addition to Hindi, Gujarati, Bengali and Indian English from the last edition.

The third edition of ILSUM consists of two tasks:

  • Task 1: Language Summarization For Indian Languages

  • Task 2: Detecting Factual Incorrectness in Machine Generated Cross Lingual Summaries


The dataset for Task 1 is built from article-headline pairs from several leading newspapers of the country. We provide over 15,000 news articles for each language (except Tamil). The task is to generate a meaningful fixed-length summary, either extractive or abstractive, for each article. While several previous works in other languages use news article-headline pairs, the current dataset poses the unique challenge of code-mixing and script-mixing. It is very common for news articles to borrow phrases from English, even if the article itself is written in an Indian language.

Examples like these are common in both the headlines and the articles:

  • "IND vs SA, 5મી T20 તસવીરોમાં: વરસાદે વિલન બની મજા બગાડી" (India vs SA, 5th T20 in pictures: rain spoils the match)

  • "LIC के IPO में पैसा लगाने वालों का टूटा दिल, आई एक और नुकसानदेह खबर" (Investors in the LIC IPO left broken-hearted; yet another piece of bad news)

  • "ಹುಬ್ಬಳ್ಳಿಯಿಂದ ದೇಶದ ಈ ನಗರಕ್ಕೆ ವಿಶೇಷ ರೈಲು ಆರಂಭ" (Hubballi Special Trains: Special train starts from Hubballi to this city of the country)
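The script-mixing seen in these headlines can be detected mechanically. The following is a minimal sketch, not part of the official task tooling: it assumes that the first word of a character's Unicode name (for example LATIN, GUJARATI, DEVANAGARI) is a good enough proxy for its script, and the function names `script_counts` and `is_script_mixed` are hypothetical.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count letters per script, using the first word of the Unicode name.

    Assumption: names such as "GUJARATI LETTER KA" or "LATIN SMALL LETTER A"
    identify the script well enough for a rough check.
    """
    counts = Counter()
    for ch in text:
        # Only letters are counted; digits, punctuation, spaces and
        # combining vowel signs are skipped.
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    return counts


def is_script_mixed(text: str) -> bool:
    """Return True if the text contains letters from more than one script."""
    return len(script_counts(text)) > 1


if __name__ == "__main__":
    headline = "IND vs SA, 5મી T20 તસવીરોમાં: વરસાદે વિલન બની મજા બગાડી"
    print(script_counts(headline))
    print(is_script_mixed(headline))
```

Applied to the first headline above, `script_counts` reports both LATIN and GUJARATI letters, while the Kannada headline contains only KANNADA letters. A check like this could be used to measure how much of a corpus is script-mixed before choosing a tokenizer or model.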

Task 2 is a continuation from the previous edition. Given a machine-generated summary, participants are asked to identify factual errors. We introduce a cross-lingual setup this year, where the source document is in English but the target summary is in Gujarati or Hindi. For the training set, we provide the source document in English and summaries in English, Hindi and Gujarati. For the test set, only the source English document and summaries in Hindi and Gujarati will be provided. The objective is to classify each datapoint according to the type of factual incorrectness present in the summary. We cover four types of factual error in this edition.

Possible types of factual incorrectness:

  • Misrepresentation: This involves presenting information in a way that is misleading or that gives a false impression. This could be done by exaggerating certain aspects, understating others, or twisting facts to fit a particular narrative.

  • Inaccurate Quantities or Measurements: Factual incorrectness can occur when precise quantities, measurements, or statistics are misrepresented, whether through error or intent.

  • False Attribution: Incorrectly attributing a statement, idea, or action to a person or group is another form of factual incorrectness.

  • Fabrication: Making up data, sources, or events is a severe form of factual incorrectness. This involves creating "facts" that have no basis in reality.
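One way to represent this labelling problem in code is sketched below. The class and field names, the language codes `hi` and `gu`, and the extra `CORRECT` label for summaries without errors are all assumptions made for illustration; the official data format is not specified here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FactualLabel(Enum):
    """The four error types listed above, plus an assumed 'correct' class."""
    CORRECT = "correct"  # assumption: used for summaries without errors
    MISREPRESENTATION = "misrepresentation"
    INACCURATE_QUANTITIES = "inaccurate_quantities"
    FALSE_ATTRIBUTION = "false_attribution"
    FABRICATION = "fabrication"


# The four error types covered in this edition.
ERROR_TYPES = [lab for lab in FactualLabel if lab is not FactualLabel.CORRECT]

# Assumed language codes for the target summaries (Hindi, Gujarati).
TARGET_LANGS = {"hi", "gu"}


@dataclass(frozen=True)
class Datapoint:
    """One cross-lingual example: English source, Hindi or Gujarati summary."""
    source_en: str
    summary: str
    summary_lang: str
    label: Optional[FactualLabel] = None  # None for unlabelled test items

    def __post_init__(self) -> None:
        if self.summary_lang not in TARGET_LANGS:
            raise ValueError(f"unsupported summary language: {self.summary_lang}")


if __name__ == "__main__":
    dp = Datapoint(
        source_en="At least 13 people, including five children, ...",
        summary="...",
        summary_lang="hi",
        label=FactualLabel.INACCURATE_QUANTITIES,
    )
    print(dp.label.value)
```

Keeping the label optional lets the same structure hold both training items and unlabelled test items.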


EXAMPLE

ARTICLE

At least 13 people, including five children, have been admitted to health facilities in Gandhinagar after a suspected outbreak of cholera in the Kalol municipality.

According to Gandhinagar Collector Hitesh Koya, two cases of cholera were confirmed on June 22 following which a notification was issued the same day declaring a two-kilometre radius of the outbreak epicentre as ‘cholera-affected’ under the Epidemic Diseases Act. It is to remain status quo for a month. The areas declared as cholera-affected include the vicinity of Matva Kuva road of the municipality.

State Health Minister Rushikesh Patel apprised of the situation Union Home Minister Amit Shah, in whose electoral constituency the municipality falls under, stated an official release from the BJP.

“A total of 13 persons, including five children, are admitted at present and all are stable. We believe the outbreak could have been due to water contamination owing to leaks detected in the drainage pipes as well as water supply pipes,” said Koya.

In early July, 2021, at least four had died and over 100 were suspected to be infected by cholera in Kalol. Officials had then suspected that water pipelines, which are nearly 40 years old, could have been the source of contamination saying sewage water could have mixed with potable water pipelines owing to leakages.

In March 2022, over 200 were detected with diarrhoea in ward 4 of Kalol and the source of contamination was similarly suspected to have been due to leakage in potable water pipelines, damaged owing to digging work for laying cables.

Different Types of Factual Incorrectness

Gujarati Summary: ગાંધીનગરમાં કલોલ નગરપાલિકામાં કોલેરાની શંકાસ્પદ મહામારીને કારણે પાંચ બાળકો સહિત તેર જણને આરોગ્ય સુવિધાઓમાં દાખલ કરવામાં આવ્યા હતા. બે કેસની પુષ્ટિ થતાં આ ઉતરાણ વિસ્તારથી બે કિલોમીટર વ્યાસ ધરાવતો વિસ્તાર 'કોલેરા પીડિત' જાહેર કરવામાં આવ્યો હતો. રાજ્યના આરોગ્ય મંત્રી રુશિકેશ પટેલે કેન્દ્રીય ગૃહ મંત્રી અમિત શાહને આ પરિસ્થિતિની જાણકારી આપી હતી.

English Summary: Thirteen individuals, including five children, were admitted to health facilities in Gandhinagar due to a suspected cholera outbreak in the Kalol municipality. Two cases were confirmed, prompting a two-kilometer radius of the outbreak area to be declared 'cholera-affected.' The State Health Minister, Rushikesh Patel, communicated the situation to Union Home Minister Amit Shah.

(Explanation: This is an accurate summarization of the facts presented in the original text.)

Pro-Establishment Summary Gujarati: કલોલ નગર પાલિકા, ગાંધીનગર માં કોલેરાના બે કેસોની પુષ્ટિના પ્રતિસાદમાં સ્થાનિક સત્તાવાળાઓએ ઝડપી કાર્યવાહી કરી, જેમાં ગાંધીનગર કલેક્ટર હિતેશ કોયાએ તાત્કાલિક પ્રકટ કર્યું કે આઉટબ્રેક આસપાસના બે કિલોમીટર વિસ્તારને 'કોલેરા પ્રભાવિત' તરીકે જાહેર કરાયો છે. આ ઉપરાંત, રાજ્યના આરોગ્ય મંત્રી ઋષિકેશ પટેલે કેન્દ્રિય ગૃહ મંત્રી અમિત શાહને આ પરિસ્થિતિની અસરકારક રીતે માહિતી આપી. સરકારે આપેલી કાર્યક્ષમ પ્રતિસાદ અને પારદર્શક સંચાર તેમની લોક આરોગ્ય અને સુરક્ષાપ્રતિ પ્રતિબદ્ધતાને દર્શાવે છે

Anti-Establishment Summary Gujarati: કોલેરાના અગાઉના પ્રકોપો અને કલોલ નગરપાલિકામાં પાણીની પાઇપલાઇનની સ્થિતિ વિશે સતત ચેતવણીઓ છતાં, સ્થાનિક સત્તાવાળાઓ ફરીથી પ્રકોપ થતો અટકાવવા નિષ્ફળ ગયા, જેના પરિણામે 13 લોકો, જેમાં પાંચ બાળકોનો સમાવેશ થાય છે, ગાંધીનગરની આરોગ્ય સુવિધામાં દાખલ કરવામાં આવ્યા. રાજ્યના આરોગ્ય મંત્રી રુશિકેશ પટેલે કેન્દ્રીય ગૃહ મંત્રી અમિત શાહને માહિતી આપી હોવા છતાં, પ્રતિક્રિયા બે કેસની પુષ્ટિ થયા પછી આવી. જાહેર આરોગ્ય જોખમોને અસરકારક રીતે ઉકેલવામાં આ વારંવારની નિષ્ફળતા સ્થાપનાની પહેલ અને પૂર્વદૃષ્ટિની કમીને રેખાંકિત કરે છે.

Pro-Establishment Summary English: In response to the confirmation of two cholera cases in Kalol municipality, Gandhinagar, swift action was taken by the local authorities, with Hitesh Koya, the Gandhinagar Collector, promptly declaring a two-kilometer radius around the outbreak as 'cholera-affected'. In addition, State Health Minister Rushikesh Patel effectively communicated the situation to Union Home Minister Amit Shah. The efficient response and transparent communication by the government demonstrates their commitment to public health and safety.

Anti-Establishment Summary English: Despite previous cholera outbreaks and consistent warnings about the condition of water pipelines in Kalol municipality, the local authorities failed to prevent another outbreak, leading to 13 people, including five children, being admitted to health facilities in Gandhinagar. Although the State Health Minister, Rushikesh Patel, informed Union Home Minister Amit Shah, the reaction came after the confirmation of two cases. This repeated failure to address public health risks effectively underlines a lack of initiative and foresight from the establishment.

(Explanation: This summary misrepresents the information by mischaracterizing the timing and the reaction of the authorities. Summaries with misrepresentation are usually biased.)

Hindi Summary: गांधीनगर के कलोल नगर पालिका क्षेत्र में हजारों लोगों को हैजा प्रभावित कर चुका है। पूरे क्षेत्र को 'हैजा प्रभावित' क्षेत्र घोषित कर दिया गया है। 100 से अधिक मामलों की पुष्टि हो चुकी है, जिनमें कई बच्चे भी शामिल हैं।

English Summary: A cholera outbreak has affected thousands in the Kalol municipality, Gandhinagar. The entire region has been declared a 'cholera-affected' area. Over 100 cases have been confirmed, with many children being affected.

(Explanation: The summary inaccurately increases the number of affected individuals and the geographical scope of the cholera outbreak.)

Hindi Summary: विश्व स्वास्थ्य संगठन (WHO) ने गांधीनगर के कलोल नगरपालिका में हैजा के प्रकोप की पुष्टि की है, जिसमें 13 लोगों को स्वास्थ्य सुविधाओं में भर्ती कराया गया है। WHO ने प्रकोप क्षेत्र के चारों ओर दो किलोमीटर के दायरे को 'हैजा प्रभावित' घोषित किया है। भारत के प्रधानमंत्री को स्थिति के बारे में जानकारी दी गई है।

English Summary: The World Health Organization (WHO) confirmed a cholera outbreak in Gandhinagar's Kalol municipality, with 13 people admitted to healthcare facilities. The WHO has declared a two-kilometer radius around the outbreak area as 'cholera-affected.' The Prime Minister of India has been briefed about the situation.

(Explanation: The summary falsely attributes the confirmation of the outbreak and the response to the WHO and the Prime Minister of India.)

Hindi Summary: गांधीनगर में हैजा का प्रकोप एक जैव-इंजीनियरिंग प्रयोगशाला से निकली एक आनुवंशिक रूप से संशोधित स्ट्रेन के कारण हुआ है। इससे 13 लोगों को अस्पताल में भर्ती कराया गया है, जिनमें पांच बच्चे शामिल हैं। सरकार ने नगरपालिका के दो किलोमीटर के दायरे को 'हैजा प्रभावित' घोषित किया है।

English Summary: A cholera outbreak in Gandhinagar has resulted from a genetically modified strain that escaped a bioengineering lab. This has led to 13 hospitalizations, with five children affected. The government has declared a two-kilometer radius of the municipality as 'cholera-affected.'

(Explanation: The outbreak being linked to a genetically modified strain and an escape from a lab is entirely fabricated and not supported by the original text.)


  1. Each team can have at most 4 participants.
  2. A team can submit up to 3 different runs for each language and up to 3 different runs for Task 2, but only one working note.
  3. Each team is required to submit a detailed description of their algorithm(s).
  4. Participants are allowed to use any external pretrained models.