MC2 2018 Lab

Content Analysis Results: Language identification 2017

Malek Hajjem — 2018-03-15T09:00:54Z

Results

Topics are a random selection of original microblogs posted in June 2016, without external links and with more than 80 characters.
Submissions and scores for the two best teams can be found here: Syllabs and Lia.
The task paper can be found here:

@inproceedings{DBLP:conf/clef/ErmakovaMS17,
  author    = {Liana Ermakova and Josiane Mothe and Eric SanJuan},
  title     = {{CLEF} 2017 Microblog Cultural Contextualization Content Analysis task Overview},
  booktitle = {Working Notes of {CLEF} 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017.},
  year      = {2017},
  crossref  = {DBLP:conf/clef/2017w},
  url       = {http://ceur-ws.org/Vol-1866/invited_paper_14.pdf},
  timestamp = {Thu, 16 Nov 2017 14:36:59 +0100},
  biburl    = {https://dblp.org/rec/bib/conf/clef/ErmakovaMS17},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Evaluation process

The evaluation process assesses the reliability of the language information attached to tweets.
Tweet objects have a long list of "root-level" attributes, including fundamental attributes such as "lang". When present, this attribute indicates a BCP 47 language identifier corresponding to the machine-detected language of the microblog. The machine-detected language may of course differ from the actual language of the microblog.
Scores in this evaluation are assigned by a human expert. Only the tweets for which the output of a participant's language detection system differs from the tweet's "lang" attribute were examined. Tweets written in several languages receive a graduated score describing how much of each language is present in them.
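
The selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the dictionary layout of a tweet and the `predictions` mapping are hypothetical, and treating a missing "lang" attribute as a disagreement is a choice made for this sketch, not something stated by the organizers.

```python
def tweets_needing_review(tweets, predictions):
    """Return the ids of tweets whose predicted language differs from "lang".

    tweets: iterable of dicts with an "id" key and an optional "lang" key
            (hypothetical layout, "lang" being a BCP 47 tag).
    predictions: dict mapping a tweet id to the tag returned by a
                 participant's language detector (hypothetical).
    """
    to_review = []
    for tweet in tweets:
        predicted = predictions.get(tweet["id"])
        if predicted is None:
            continue  # no answer from the participant for this tweet
        twitter_lang = tweet.get("lang")
        # Assumption: a tweet without "lang" counts as a disagreement.
        if twitter_lang is None or predicted.lower() != twitter_lang.lower():
            to_review.append(tweet["id"])
    return to_review


sample_tweets = [{"id": 1, "lang": "fr"}, {"id": 2, "lang": "en"}, {"id": 3}]
sample_predictions = {1: "fr", 2: "es", 3: "en"}
print(tweets_needing_review(sample_tweets, sample_predictions))
```

Only the tweets returned by such a filter would then be passed to the human expert for scoring.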

Available resources CLEF 2018: detailed description

Malek Hajjem — 2018-03-14T16:09:54Z

The festival galleries dataset

A massive collection of microblogs and URLs related to cultural festivals is provided to registered participants here.
To make such a large dataset easier to handle, we provide it in different formats:

A CSV format: a tab-separated CSV file, useful when managing the dataset through a MySQL database or with the Python programming language.

An XML format for Indri: this format can be indexed directly with Indri if needed. Along with the textual content of each tweet, some metadata (see description above) is also provided. Note that the XML files are grouped by author.

The festival galleries dataset is available either partially or in full. In the partial format, each CSV file contains the tweets gathered for one month. Original tweets are separated from reposted tweets to keep the files lighter.

festival
     Originals:        Re posts:
 1-  2015-05 (72M)     2015-05 (54M)
 2-  2015-06 (235M)    2015-06 (190M)
 3-  2015-07 (220M)    2015-07 (162M)
...  ...               ...
18-  2016-10 (102M)    2016-10 (148M)
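
As a rough illustration of how one of the monthly tab-separated files could be read in Python, here is a minimal sketch. The column layout (id, author, text) in the sample is purely hypothetical, since the actual fields are given in the dataset description, and disabling quote handling is an assumption made because tweet text often contains quotation marks.

```python
import csv
import io


def read_tweets(handle):
    """Yield one list of fields per tweet from a tab-separated file object."""
    # QUOTE_NONE: quotation marks inside the tweet text are kept as plain
    # characters instead of being interpreted as CSV quoting.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip empty lines
            yield row


# In-memory sample with a hypothetical (id, author, text) layout.
sample = io.StringIO('1\talice\tJazz festival "tonight"\n2\tbob\tHiphop festival\n')
for fields in read_tweets(sample):
    print(fields)
```

For a real monthly file, the same function could be given a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to one of the downloaded files. Reading row by row keeps memory use low even for the larger months.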

HTML form to test queries: this form lets you test the Microblog search baseline system using an Indri query.

*Simple queries:
For a basic query, just type in the terms you wish to search on. Each term will be weighted equally and combined in an "or" fashion.
- hiphop jazz
#combine(hiphop jazz)

*Phrase Matching:
To search for a specific phrase (i.e. "hiphop jazz"), you can wrap your terms using the ordered window operator #n (where n is the window size, in number of terms).
#1(hiphop jazz)
Your search results would return only those documents where the terms "hiphop" and "jazz" appear in order.

*Unordered Windows:
The #uwN operator performs a search on terms that occur within a certain window size. For example, to look for the terms "hiphop" and "jazz" occurring within 2 terms of each other, regardless of which of the two comes first, we would write:
#uw2(hiphop jazz)

*Boolean Searches:
By default, Indri will return a document if any of the terms occur in the document; documents that contain more terms will generally be ranked above documents that contain fewer terms. If you wish to specify that all of your search terms must be included, you can use the "boolean and" operator (#band). For example, to ensure that the terms "hiphop" and "jazz" both occur, use:
#band(hiphop jazz)
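
The four query forms above can be generated programmatically before being sent to the form or the web service. The helper names below are hypothetical and only build the query strings shown in this section; they do not call Indri.

```python
def combine(*terms):
    """Simple query: terms weighted equally, "or" fashion."""
    return "#combine(" + " ".join(terms) + ")"


def phrase(*terms):
    """Exact phrase: ordered window of size 1."""
    return "#1(" + " ".join(terms) + ")"


def unordered_window(size, *terms):
    """Terms within `size` terms of each other, in any order."""
    return "#uw" + str(size) + "(" + " ".join(terms) + ")"


def boolean_and(*terms):
    """All terms must occur in the document."""
    return "#band(" + " ".join(terms) + ")"


print(combine("hiphop", "jazz"))              # #combine(hiphop jazz)
print(phrase("hiphop", "jazz"))               # #1(hiphop jazz)
print(unordered_window(2, "hiphop", "jazz"))  # #uw2(hiphop jazz)
print(boolean_and("hiphop", "jazz"))          # #band(hiphop jazz)
```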

Perl API to query the web service locally with a suitable query in the Indri query language
Indri parameter files: a parameter file in XML format, useful to reindex the collection with Indri
Compressed Indri indexes per month
Programs to generate XML repositories from CSV ordered data
Root of Indri indexes and data

Links

A list of uncompressed tweet URLs is available to participants in CSV format. This metadata can be used to explore the tweet content further.

The festival galleries dataset

sanjuan — 2016-10-18T16:31:57Z

This data set supports experiments in microblog search and stream summarization.

Microblog collection

The document collection is provided to registered participants by the ANR GAFES project. It consists of a pool of more than 50M unique micro-blogs from different sources with their meta-information, as well as ground truth for the evaluation.

The microblog collection contains a very large pool of public posts on Twitter using the keyword festival since June 2015. These micro-blogs are collected using private archive services based on the streaming API. The average number of unique microblog posts (i.e. without retweets) between June and September is 2,616,008 per month. The total number of micro-blog posts collected after one year (from May 2015 to May 2016) is 50,490,815 (24,684,975 without re-posts). These micro-blog posts are available online in a relational database with associated fields.

Because of privacy issues, they cannot be publicly released, but they can be analyzed inside the organization that purchased these archives and among collaborators under a privacy agreement. The MC2 lab provides the opportunity to share this data among academic participants. These archives can be indexed and analyzed, and general results acquired from them can be published without restriction.

Linked web pages

66% of the collected micro-blog posts contain Twitter t.co compressed (shortened) URLs. Sometimes these URLs refer to other online services such as adf.ly, cur.lv, dlvr.it or ow.ly that hide the real URL. We used the spider mode of the GNU wget tool to get the real URL; this process required multiple DNS requests.

The number of unique uncompressed URLs collected in one year is 11,580,788, from 641,042 distinct domains.
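
The URL resolution step can be approximated with the sketch below. It is not the script used to build the collection: the function names are hypothetical, and the wget options chosen here (`--spider`, `--server-response`, `--max-redirect`) are only one plausible way to run spider mode. The idea is to let wget follow the redirect chain without downloading any page body, then take the last `Location` header from its log.

```python
import re
import subprocess

# Matches both the raw response header ("  Location: URL") and wget's own
# summary line ("Location: URL [following]").
LOCATION = re.compile(r"^\s*Location:\s*(\S+)", re.IGNORECASE | re.MULTILINE)


def last_location(wget_log, default=None):
    """Return the last Location value found in a wget log, or `default`."""
    matches = LOCATION.findall(wget_log)
    return matches[-1] if matches else default


def resolve(short_url, timeout=30):
    """Follow redirects with wget in spider mode (requires wget installed)."""
    result = subprocess.run(
        ["wget", "--spider", "--server-response", "--max-redirect=10", short_url],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # wget writes its log, including the response headers, to stderr.
    # If no redirect happened, the input URL is already the real one.
    return last_location(result.stderr, default=short_url)
```

Calling `resolve` on a shortened URL would return the last redirect target reported by wget. Servers may send relative `Location` values, which this sketch returns unchanged. Each hop in the chain (for example t.co, then ow.ly, then the final site) involves a separate host lookup, which is consistent with the multiple DNS requests mentioned above.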

Getting access to the data set for scholars

Register your institution with CLEF.
Send a request by email to admin@talne.eu, from the same domain as your institution, with full contact information.
If accepted, you will receive a confidentiality agreement to be approved by your institution.
Once we receive the signed agreement, you will receive personal credentials to access the lab data servers.

View online : Cultural micro-blog Contextualization 2016 Workshop Overview: data and pilot tasks