Last Updated INEX Tweet Contextualization resource
Wikipedia XML corpus for summary generation
Tuesday 18 October 2016
Wikipedia is under a Creative Commons license, and its contents can be used to contextualize tweets or to build complex queries referring to Wikipedia entities.
Since 2012, we have extracted an average of 10 million XML documents per year from Wikipedia in the four main Twitter languages: en, es, fr and pt.
These documents reproduce the contents of the main Wikipedia pages in an easy-to-use XML structure: title, abstract, sections and subsections, as well as Wikipedia internal links. Other contents such as images, footnotes and external links are stripped out to make the corpus easier to process with standard NLP tools.
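As an illustration of how such a document might be read, here is a minimal Python sketch using only the standard library. The element names (`page`, `title`, `abstract`, `s`, `h`, `p`, `link`) and the sample text are assumptions made for this example; they are not the corpus's documented schema and should be adapted to the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The tag names below are assumptions
# for illustration only, not the real schema of the corpus.
SAMPLE = """<page>
  <title>Tweet</title>
  <abstract><p>A tweet is a short message posted on <link>Twitter</link>.</p></abstract>
  <s><h>History</h>
    <p>The service launched in <link>2006</link>.</p>
    <s><h>Growth</h><p>Usage grew quickly.</p></s>
  </s>
</page>"""


def summarize(xml_text):
    """Extract title, abstract text, section headings and internal links."""
    root = ET.fromstring(xml_text)
    title = root.findtext("title", default="")

    # Flatten the abstract to plain text and normalize whitespace.
    abstract_el = root.find("abstract")
    abstract = ""
    if abstract_el is not None:
        abstract = " ".join("".join(abstract_el.itertext()).split())

    # iter() walks the tree in document order, so nested subsections
    # are visited right after their parent section.
    headings = [h.text for h in root.iter("h")]
    links = [link.text for link in root.iter("link")]

    return {
        "title": title,
        "abstract": abstract,
        "headings": headings,
        "links": links,
    }


info = summarize(SAMPLE)
print(info["title"], info["headings"], info["links"])
```

Because images, footnotes and external links are already stripped, a flat pass like this is often enough to feed the text to a tokenizer or an entity linker.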
By comparing contents over the years, it is possible to detect long-term trends.
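One simple way to compare yearly snapshots, sketched here under stated assumptions, is to count how often each entity is the target of an internal link in each year and then follow that count over time. The function names and the toy data below are invented for illustration; they are not part of the distributed tools and the numbers are not measurements from the corpus.

```python
from collections import Counter


def link_counts(docs_links):
    """Count link targets over one snapshot.

    docs_links: iterable of lists of link targets, one list per document.
    """
    counts = Counter()
    for links in docs_links:
        counts.update(links)
    return counts


def trend(counts_by_year, entity):
    """Return (year, count) pairs for one entity, sorted by year."""
    return [(year, counts_by_year[year][entity])
            for year in sorted(counts_by_year)]


# Toy snapshots (invented data): each inner list holds the internal
# links found in one document of that year's extraction.
snapshots = {
    2012: [["Twitter", "Blog"], ["Blog"]],
    2013: [["Twitter"], ["Twitter", "Blog"]],
    2014: [["Twitter"], ["Twitter"], ["Twitter", "Blog"]],
}

counts_by_year = {year: link_counts(docs) for year, docs in snapshots.items()}
print(trend(counts_by_year, "Twitter"))
```

Raw counts depend on the size of each snapshot, so in practice one would probably normalize by the number of documents per year before reading anything into the curve.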
View online: Micro Blog Contextualization CLEF & INEX tracks data and tools