MC2 2018 Lab

TimeLine Illustration based on Microblogs

Lorraine, Philippe — 2016-10-19T19:42:26Z

This paper by Nayanika DOGRA, Philippe MULHEM, Nawal OULD AMER, and Lorraine GOEURIOT presents the approach used by the LIG-MRIM research group for its participation in the pilot task TimeLine illustration based on Microblogs of the 2016 CLEF Cultural Microblog Contextualization Workshop, which led to the 2017 lab.

View online : LIG at CLEF 2016 Cultural Microblog Contextualization: TimeLine Illustration based on Microblogs

Wikipedia XML corpus for summary generation

sanjuan — 2016-10-18T16:44:45Z

Wikipedia is under Creative Commons license, and its contents can be used to contextualize tweets or to build complex queries referring to Wikipedia entities.

We have extracted an average of 10 million XML documents from Wikipedia per year since 2012 in the four main Twitter languages: en, es, fr and pt.

These documents reproduce, in an easy-to-use XML structure, the contents of the main Wikipedia pages: title, abstract, sections and subsections, as well as Wikipedia internal links. Other contents such as images, footnotes and external links are stripped out in order to obtain a corpus that is easier to process using standard NLP tools.

By comparing contents over the years, it is possible to detect long-term trends.

View online : Micro Blog Contextualization CLEF & Inex tracks data and tools

Microblog Cultural Contextualization 2017 lab introduction

sanjuan — 2016-10-18T12:38:54Z

These are the slides presented at CLEF 2016 in Evora to introduce the CM2 lab.

Overall Procedure

Take a microblog about an event with a URL.
Identify its language.
Identify a related cultural event or filter it out.
Reveal When, Where, Who ...
Relate it to Wikipedia entities.

2017 Organization

Task 1: language, filtering and localization, led by Toulouse, Montréal and Paris, starts … now!
Task 2: entity extraction, summarization and linking, led by Avignon, London University and Syllabs, starts in November 2016.
Task 3: time-line illustration, led by Grenoble, starts in January 2017.

Cultural Microblog Contextualization based on Wikipedia

Jian-Yun Nie, josiane, Liana Ermakova — 2016-03-31T21:40:51Z

Organizers:

Liana Ermakova, Josiane Mothe, Jian-Yun Nie (cmct1@irit.fr)

Task 1 participation deadline extended to 23 May, 2016

Objective

The aim of this task is to generate a short summary providing background information for a tweet to help a user understand it. For instance, if a microblog announces a cultural event, participants have to provide a short summary extracted from Wikipedia that provides extensive background about this event. The summary must contain information about the context of the event, drawn from a recent cleaned dump of Wikipedia, in order to help answer questions like "what is this tweet about?". The context should be in the form of a readable summary, not exceeding 500 words, composed of passages from the provided Wikipedia corpus.

Any open-access resource can be used in addition to the data we provide to participants, subject to describing it and providing a valid URL.

Data

Tweets to contextualize: We selected a set of 1001 tweets to be contextualized by the participants using the English version of Wikipedia. These tweets in English are collected from a set of public microblogs on Twitter and are related to the keyword “festival”. The microblogs are in UTF-8 CSV format with various fields. In this task, the tweets do not contain URLs. The other tasks will use additional information.

Wikipedia Crawl: Unlike tweets, Wikipedia is under Creative Commons license, and its content can be used to contextualize tweets or to build complex queries referring to Wikipedia entities. We have extracted from Wikipedia an average of 10 million XML documents per year since 2012 in the four main Twitter languages: en, es, fr and pt. These documents reproduce, in an easy-to-use XML structure, the contents of the main Wikipedia pages: title, abstract, sections and subsections, as well as Wikipedia internal links.
Other contents such as images, footnotes and external links are stripped out in order to obtain a corpus that is easy to process with standard NLP tools. By comparing contents over the years, it is possible to detect long-term trends.

Format of the results

Results should be provided in CSV format:

<tweet id> Q0 <file name> <position> <score> <run tag> <passage text>

where:

The first column is the tweet id (id field of the JSON format).
The second column is currently unused and should always be Q0.
The third column is the name (without .xml) of the file from which the result is retrieved; it is identical to the one in the Wikipedia document. Alternatively, the Wikipedia page title can be used.
The fourth column is the position number of the passage in the summary, independent of its informativeness.
The fifth column shows the score (integer or floating point) that should reflect the estimated informativeness of the passage. This score is used in the pooling process to build informativeness q-rels.
The sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used.
The seventh column is the raw text of the Wikipedia passage. Text is given without XML tags and without formatting characters (avoid "\n","\r","\l"). The resulting word sequence has to appear in the file indicated in the third field.
The columns are separated by tabs.
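As a minimal sketch, assuming Python and a hypothetical helper named `format_result_line`, one row with the seven columns listed above could be produced as follows. The run tag `myteam_run1` is invented for illustration and is not a value prescribed by the task.

```python
def format_result_line(tweet_id, doc, position, score, run_tag, passage):
    """Build one tab-separated result row with the seven columns listed above."""
    # Collapse newlines, carriage returns and tabs inside the passage into
    # single spaces, since formatting characters should be avoided.
    clean = " ".join(passage.split())
    fields = [str(tweet_id), "Q0", str(doc), str(position), str(score), run_tag, clean]
    return "\t".join(fields)


line = format_result_line(
    "610507526174601216", "1693938", 0, 14.0,
    "myteam_run1",  # hypothetical run tag
    "Marionnettissimo is a puppet festival,\ncreated by the association Et Qui Libre.",
)
print(line)
```

Keeping the tweet id as a string avoids any loss of precision if the file is later read by tools that treat large integers as floating-point numbers.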

Example:

Topic 610507526174601216:

Classes "The scenic writings to the manipulated object." Francis and Peter were very promising in the art of manipulation. Some pictures of the live performances at Usine Tournefeuille.

Possible abstract:

Marionnettissimo is a puppet festival, created by the association Et Qui Libre / Marionnettissimo (or EQL / Marionnettissimo), whose objective is the development of "puppet culture", considering the public, artists, and cultural actors. The Marionnettissimo festival is part of a series of cultural actions, programming, training, conducted by the association since 1990. It takes place in the Toulouse area and the Midi-Pyrenees region, annually since 2006. The “scenic writings to the manipulated object” training was presented by Francis Monty from the La Pire Espèce group (Quebec) and Pier Porcheron from the Elvis Alatac troupe (Poitou-Charentes) at Marionnettissimo festival from the 8th to the 19th of February 2016.

Formatted result:

610507526174601216 Q0 1693938 0 14.0	Marionnettissimo is a puppet festival, created by the association Et Qui Libre / Marionnettissimo (or EQL / Marionnettissimo), whose objective is the development of "puppet culture", considering the public, artists, and cultural actors.
610507526174601216 Q0 1693938 1 12.0	The Marionnettissimo festival is part of a series of cultural actions, programming, training, conducted by the association since 1990.
610507526174601216 Q0 1693938 2 11.0	The “scenic writings to the manipulated object” training was presented by Francis Monty from the La Pire Espèce group (Quebec) and Pier Porcheron from the Elvis Alatac troupe(Poitou-Charentes) at Marionnettissimo festival from the 8th to the 19th of february 2016.

Evaluation

The summaries will be evaluated according to:

Informativeness: the way they overlap with relevant passages (number of them, vocabulary and bi-grams included or missing). For each tweet, all passages from all participants will be merged and displayed to the assessor in alphabetical order. Therefore, each passage's informativeness will be evaluated independently from the others, even within the same summary. Assessors will have to provide a binary judgment on whether or not the passage should appear in a summary on the topic.

Readability: assessed by evaluators and participants. Each participant will have to evaluate readability for a pool of summaries on an online web interface. Each summary consists of a set of passages and, for each passage, assessors will have to tick four kinds of check boxes:

Syntax (S): tick the box if the passage contains a syntactic problem (bad segmentation for example),
Anaphora (A): tick the box if the passage contains an unsolved anaphora,
Redundancy (R): tick the box if the passage contains redundant information, i.e. information that has already been given in a previous passage,
Trash (T): tick the box if the passage does not make any sense in its context (i.e. after reading the previous passages). These passages must then be considered as trashed, and the readability of following passages must be assessed as if these passages were not present.
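To make the informativeness criterion above more concrete, here is a rough Python sketch of unigram and bigram overlap between a summary and a passage judged relevant. It rests on simplifying assumptions (lowercased whitespace tokenization, plain set overlap), the function names are hypothetical, and it is not the official evaluation measure.

```python
def ngrams(text, n):
    """Return the set of word n-grams of a text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(summary, reference, n):
    """Fraction of the reference n-grams that also occur in the summary."""
    ref = ngrams(reference, n)
    if not ref:
        return 0.0
    return len(ref & ngrams(summary, n)) / len(ref)


reference = "marionnettissimo is a puppet festival"
summary = "Marionnettissimo is a festival"
print(overlap(summary, reference, 1))  # unigram overlap
print(overlap(summary, reference, 2))  # bigram overlap
```

In this toy case the summary covers most of the reference vocabulary but fewer of its bigrams, which is the kind of difference a bigram-based comparison is meant to expose.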

Download the data:

Tweets to contextualize (download)
Wikipedia collection to use to contextualize the tweets (download)

Submission

Participants should be registered at http://clef2016-labs-registration.dei.unipd.it/registrationForm.php. The personal access to the submission form is sent after the registration.

2016 Schedule

Topics and task guidelines released: 1 April
Run submission deadline: 23 May (extended)
Informativeness Evaluation results sent out: 5 June
Readability Evaluation results sent out: 5 June
Participant papers (CLEF proceedings) due: 7 June
Overview paper due: 30 June

TimeLine illustration of a festival based on Microblogs

Lorraine, Philippe — 2015-11-03T13:10:11Z

Objective

The goal of this task is to link the events of a festival program to related microblog posts. This information is valuable to festival attendees, and it gives organizers feedback.

Microblog posts will be provided with their timestamps, which are crucial as a basis for the requested linking.

Participants will have to provide a timetable for each event using the 10 best tweets based on their relevance and diversity. In this task, diversity is a must because retrieving the same post several times is of no benefit.
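One common way to balance relevance and diversity is a greedy, MMR-style selection. The Python sketch below is only an illustration: the function names, the Jaccard word-overlap similarity, the `trade_off` weight and the sample tweets are all assumptions, not something the task prescribes.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def select_diverse(scored_tweets, k=10, trade_off=0.7):
    """Greedily pick up to k tweets, trading relevance against redundancy.

    scored_tweets: list of (text, relevance) pairs, relevance assumed in [0, 1].
    """
    remaining = list(scored_tweets)
    selected = []
    while remaining and len(selected) < k:
        def mmr(item):
            text, relevance = item
            # Redundancy is the similarity to the closest tweet already picked.
            redundancy = max((jaccard(text, s) for s, _ in selected), default=0.0)
            return trade_off * relevance - (1 - trade_off) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return [text for text, _ in selected]


tweets = [
    ("Muse live at Vieilles Charrues", 0.9),
    ("Muse live at Vieilles Charrues", 0.9),  # exact duplicate
    ("Great crowd for Muse tonight in Carhaix", 0.6),
]
print(select_diverse(tweets, k=2))
```

With these made-up scores, the duplicate is passed over in favour of the less relevant but different third tweet.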

Data

Microblogs collection:

We collected all public microblog posts from Twitter containing the keyword “festival” from June to September 2015, using a private archive service based on the streaming API, with Twitter's agreement. The average number of unique microblog posts (i.e. without retweets) is 2,616,008 per month. The total number of collected posts is 13,167,910 without retweets and 24,228,699 with retweets.
These posts are provided in UTF-8 CSV format with various fields (tweet id, author name, language, …).
Because of privacy issues, this data cannot be publicly released, but it can be analyzed inside the organization that purchases these archives and among collaborators under a privacy agreement. The CLEF 2016 CMC workshop will provide the opportunity to share this data among participants. These archives can be indexed and analyzed, and general results acquired from them can be published without restriction.

Participants for this task will be provided with a subset of the microblogs collection, matching the months of targeted festivals (July and December 2015).

Festival programme:

Two French music festivals have been selected: the Festival des Vieilles Charrues and the Transmusicales de Rennes.
The timelines provided cover a subset of each festival program, chosen by the organizers: for each stage and time slot, the list of artists playing.

The participants are free to use any additional data to provide results: social (popularity, …) or not (knowledge bases, …); it should be described in the related paper and specified when submitting the runs.

Example

We have selected 3 events from the Festival des Vieilles Charrues. Three example tweets are listed under each event.

16-juil-15 18:45-19:45 Anna Calvi
- Anna Calvi Festival les Vieilles Charrues jeudi 16 juillet 2015 par Herve Le Gall via @shotsfr - http://t.co/qL5lmRkZCb
- Du Nouveau sur Taste Of Indie : Anna Calvi – Festival des Vieilles Charrues 2015 http://t.co/oZfS2jOKUJ http://t.co/B4DnyGxVll
- Ouest-France Vieilles Charrues. DIRECT - Doux débuts avec Anna Calvi, Soprano ... Ouest-France Le festival des… http://t.co/G0VfPEnrs8
16-juil-15 20:10-21:45 Soprano
- RT @Sopranopsy4: Extraordinaire merci les vieilles es charrues merci la Bretagne!!!!
- RT @Laura_AnneT: #charrues @soprano dingue surtout avec le maillot psg @MaxLaMendz3 t'es un client @GuillermNicola1 #rienafoutrederien
- aux vieilles charrues on a tellement bien fait de pas aller voir soprano pour gratter des places pour muse putain
16-juil-15 22:00-23:30 Muse
- MUSE Festival des Vieilles Charrues 2015 - Carhaix - Live HD https://t.co/Qzokxb40V4 via @YouTube
- RT @Charrues: .@muse retourne littéralement le public de Kerampuilh ! #charrues15 Crédit photo : @PierreHennequin http://t.co/MRoC8aTetr
- Aux Vieilles Charrues il y avait 1,7% de chance que Muse jouent The Groove. Et ils l'ont fait PUTAIN

Format of the results

The results will be submitted as standard trec_eval top files. As in classical trec_eval top files, each event will be associated with one query/topic identifier.
Each submission also needs to give details on the type of run, the resources used, the system used…
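For reference, the conventional run layout read by trec_eval has six whitespace-separated columns: topic id, the literal Q0, document id, rank, score and run tag. The Python sketch below writes such lines; the helper name `to_trec_run`, the tweet ids and the run tag are hypothetical, and the exact layout expected for this task should be checked with the organizers.

```python
def to_trec_run(topic_id, ranked_tweets, run_tag):
    """Format (tweet_id, score) pairs as lines: topic Q0 docno rank score run_tag."""
    # Rank by decreasing score, starting at rank 1.
    ordered = sorted(ranked_tweets, key=lambda pair: pair[1], reverse=True)
    return [
        f"{topic_id} Q0 {tweet_id} {rank} {score:.4f} {run_tag}"
        for rank, (tweet_id, score) in enumerate(ordered, start=1)
    ]


lines = to_trec_run("1", [("t42", 0.31), ("t7", 0.88)], "myteam_baseline")
print("\n".join(lines))
```

Here each festival event plays the role of a topic, and each retrieved tweet plays the role of a document.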

Evaluation

The evaluation will be carried out on selected parts of the program chosen by the task organizers depending on the number of relevant tweets per event. The evaluation measures planned are recall/precision based. Several types of runs will be proposed: time-only, content-only, time&content.
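A time-only run could, for instance, keep the tweets posted during an event's time slot. The Python sketch below illustrates this idea; the function name, the 30-minute margin and the sample tweet ids and timestamps are assumptions made for the example, while the event slot is the Anna Calvi entry from the example above.

```python
from datetime import datetime, timedelta


def time_only_candidates(tweets, start, end, margin=timedelta(minutes=30)):
    """Keep tweets whose timestamp falls within the event slot, widened by a margin."""
    lo, hi = start - margin, end + margin
    return [tweet_id for tweet_id, ts in tweets if lo <= ts <= hi]


# Event slot: Anna Calvi, 16 July 2015, 18:45-19:45.
start = datetime(2015, 7, 16, 18, 45)
end = datetime(2015, 7, 16, 19, 45)
tweets = [
    ("a", datetime(2015, 7, 16, 19, 0)),   # during the concert
    ("b", datetime(2015, 7, 16, 20, 10)),  # within the 30-minute margin
    ("c", datetime(2015, 7, 17, 9, 0)),    # next morning
]
print(time_only_candidates(tweets, start, end))
```

A content-only run would ignore the timestamps and rank on text alone, and a time&content run could apply a text ranking to the candidates kept by such a filter.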

How to get the data?

To get access to the tweets, email eric_dot_sanjuan_at_univ-avignon.fr
The topics (corresponding to the programs) can be downloaded here.

Participants should submit up to 3 runs in the TREC format, named as follows:
_Run.dat
One of them should be a baseline. Other runs can use any additional information.

A text file should also describe the runs and give the priority order.

The runs should be submitted by the 31st of May. The submission website is TBD.

Contact Information

If you have any questions, email us: lorraine_dot_goeuriot_at_imag.fr and philippe_dot_mulhem_at_imag.fr

Microblog Data Set

sanjuan — 2015-11-02T08:08:38Z

The document collection provided by the GAFES project consists of a pool of more than 70M unique microblogs from different sources, with their meta-information and expanded URLs, on a MySQL server. For legal reasons, access to this database is restricted to registered participants under a privacy agreement.
Along with the microblog corpus, a clean, simplified XML dump of Wikipedia that is easy to index and to process with state-of-the-art NLP tools is made available to participants. Ground truth (…)


Evaluation Methodology

sanjuan — 2015-11-02T07:40:36Z

Systems will be evaluated mainly on informativeness and relevance, but readability and ergonomics will also be checked. Informativeness evaluation will rely on textual references established by experts in the GAFES project, following the strict methodology of the CLEF-INEX tweet contextualization track (http://inex.mmci.uni-saarland.de/tracks/qa/). Readability and ergonomics will be assessed on the output for specific festivals, based on questionnaires to be filled out by lab participants. The best systems will have the opportunity to be tested in real conditions in July 2016 with the support of the French Tech Culture label (http://frenchculture.org/digital-cultures).

Therefore, informativeness and relevance evaluation will be automatic and reproducible, while readability and ergonomics evaluation will only be available to lab participants. All systems will be required to run on a dedicated Linux server (allowing virtual machines) provided by the organizers, and will have to run in real time (maximum 5 s per query). Access to the full microblog data will only be authorized for applications running on this server.