Indian Language Summarization
Automatic text summarization for Indian languages has received surprisingly little attention from the NLP research community. While large-scale datasets exist for languages such as English, Chinese, French, German, and Spanish, no such datasets exist for any Indian language. The datasets that do exist are either not public or too small to be useful. Through this shared task we aim to bridge this gap by creating reusable corpora for Indian Language Summarization.
In the first edition we cover two major Indian languages, Hindi and Gujarati, which have over 350 million and over 50 million speakers respectively. We also include Indian English, a widely recognized dialect that can be substantially different from English spoken elsewhere.
The dataset for this task is built from article-headline pairs taken from several leading newspapers of the country. We provide ~10,000 news articles for each language. The task is to generate a meaningful fixed-length summary, either extractive or abstractive, for each article. While several previous works in other languages use news article-headline pairs, the current dataset poses a unique challenge of code-mixing and script-mixing. It is very common for news articles to borrow phrases from English, even if the article itself is written in an Indian language.
Examples like the following are a common occurrence in both headlines and articles:
"IND vs SA, 5મી T20 તસવીરોમાં: વરસાદે વિલન બની મજા બગાડી" (India vs SA, 5th T20 in pictures: rain spoils the match)
"LIC के IPO में पैसा लगाने वालों का टूटा दिल, आई एक और नुकसानदेह खबर" (Investors of LIC IPO left broken hearted, yet another bad news).
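To make the script-mixing in these headlines concrete, the following is a minimal Python sketch that reports which scripts the letters of a string belong to. It is an illustration only, not part of the task's tooling: the function names `scripts_in` and `is_script_mixed` are hypothetical, and the script is inferred from the first word of each character's Unicode name, a heuristic that works for Latin, Devanagari, and Gujarati letters.

```python
import unicodedata


def scripts_in(text):
    """Return the set of scripts used by the letters in `text`.

    Assumption: the script is taken from the first word of the Unicode
    character name (e.g. "LATIN", "DEVANAGARI", "GUJARATI"). Digits,
    punctuation, and combining vowel signs are ignored.
    """
    scripts = set()
    for ch in text:
        # Only consider letters (Unicode general categories L*).
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def is_script_mixed(text):
    """True if the letters in `text` come from more than one script."""
    return len(scripts_in(text)) > 1


gujarati_headline = "IND vs SA, 5મી T20 તસવીરોમાં: વરસાદે વિલન બની મજા બગાડી"
hindi_headline = "LIC के IPO में पैसा लगाने वालों का टूटा दिल, आई एक और नुकसानदेह खबर"

print(scripts_in(gujarati_headline))
print(scripts_in(hindi_headline))
print(is_script_mixed("rain spoils the match"))
```

A check like this only detects script-mixing. Code-mixing in which a borrowed English word is transliterated into the Indian script, or an Indian-language word is written in Latin script, would need word-level language identification instead.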