<?xml 
version="1.0" encoding="utf-8"?><?xml-stylesheet title="XSL formatting" type="text/xsl" href="https://clef2018.clef-initiative.eu/mc2/spip.php?page=backend.xslt" ?>
<rss version="2.0" 
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:atom="http://www.w3.org/2005/Atom"
>

<channel xml:lang="fr">
	<title>MC2 2018 Lab</title>
	<link>https://clef2018.clef-initiative.eu/mc2/</link>
	<description>MC2 CLEF Lab is centered on mining the social media sphere surrounding cultural events such as festivals and movies. It provides registered participants with access to the microblog collection of the GAFES project, funded by the French National Research Agency and led by the University of Avignon.</description>
	<language>fr</language>
	<generator>SPIP - www.spip.net</generator>
	<atom:link href="https://clef2018.clef-initiative.eu/mc2/spip.php?id_mot=2&amp;page=backend" rel="self" type="application/rss+xml" />




<item xml:lang="en">
		<title>Wikipedia XML corpus for summary generation</title>
		<link>https://clef2018.clef-initiative.eu/mc2/spip.php?page=article&amp;id_article=13</link>
		<guid isPermaLink="true">https://clef2018.clef-initiative.eu/mc2/spip.php?page=article&amp;id_article=13</guid>
		<dc:date>2016-10-18T16:44:45Z</dc:date>
		<dc:format>text/html</dc:format>
		<dc:language>en</dc:language>
		<dc:creator>sanjuan</dc:creator>


		<dc:subject>data</dc:subject>
		<dc:subject>CLEF 2016</dc:subject>

		<description>
&lt;p&gt;Wikipedia is under a Creative Commons license, and its contents can be used to contextualize tweets or to build complex queries referring to Wikipedia entities. &lt;br class='autobr' /&gt;
We have extracted an average of 10 million XML documents from Wikipedia per year since 2012 in the four main Twitter languages: en, es, fr and pt. &lt;br class='autobr' /&gt;
These documents reproduce in an easy-to-use XML structure the contents of the main Wikipedia pages: title, abstract, sections and subsections, as well as Wikipedia internal links. Other (&#8230;)&lt;/p&gt;


-
&lt;a href="https://clef2018.clef-initiative.eu/mc2/spip.php?page=rubrique&amp;id_rubrique=5" rel="directory"&gt;1 - Content Analysis&lt;/a&gt;

/ 
&lt;a href="https://clef2018.clef-initiative.eu/mc2/spip.php?page=mot&amp;id_mot=2" rel="tag"&gt;data&lt;/a&gt;, 
&lt;a href="https://clef2018.clef-initiative.eu/mc2/spip.php?page=mot&amp;id_mot=3" rel="tag"&gt;CLEF 2016&lt;/a&gt;

		</description>


 <content:encoded>&lt;div class='rss_texte'&gt;&lt;p&gt;Wikipedia is under a Creative Commons license, and its contents can be used to contextualize tweets or to build complex queries referring to Wikipedia entities.&lt;/p&gt;
&lt;p&gt;We have extracted an average of 10 million XML documents from Wikipedia per year since 2012 in the four main Twitter languages: en, es, fr and pt.&lt;/p&gt;
&lt;p&gt;These documents reproduce in an easy-to-use XML structure the contents of the main Wikipedia pages: title, abstract, sections and subsections, as well as Wikipedia internal links. Other contents such as images, footnotes and external links are stripped out to obtain a corpus that is easier to process using standard NLP tools.&lt;/p&gt;
&lt;p&gt;By comparing contents over the years, it is possible to detect long-term trends.&lt;/p&gt;&lt;/div&gt;
		&lt;div class="hyperlien"&gt;View online : &lt;a href="http://tc.talne.eu/" class="spip_out"&gt;Micro Blog Contextualization CLEF &amp; Inex tracks data and tools&lt;/a&gt;&lt;/div&gt;
		
		</content:encoded>


		

	</item>
<item xml:lang="en">
		<title>The festival galleries dataset</title>
		<link>https://clef2018.clef-initiative.eu/mc2/spip.php?page=article&amp;id_article=12</link>
		<guid isPermaLink="true">https://clef2018.clef-initiative.eu/mc2/spip.php?page=article&amp;id_article=12</guid>
		<dc:date>2016-10-18T16:31:57Z</dc:date>
		<dc:format>text/html</dc:format>
		<dc:language>en</dc:language>
		<dc:creator>sanjuan</dc:creator>


		<dc:subject>data</dc:subject>

		<description>
&lt;p&gt;This data set supports experiments in microblog search and stream summarization. &lt;br class='autobr' /&gt;
Microblog collection &lt;br class='autobr' /&gt;
The document collection is provided to registered participants by the ANR GAFES project. It consists of a pool of more than 50M unique microblogs from different sources, with their meta-information as well as ground truth for the evaluation. &lt;br class='autobr' /&gt;
The microblog collection contains a very large pool of public Twitter posts containing the keyword festival, collected since June 2015. These microblogs are (&#8230;)&lt;/p&gt;


-
&lt;a href="https://clef2018.clef-initiative.eu/mc2/spip.php?page=rubrique&amp;id_rubrique=9" rel="directory"&gt;Data&lt;/a&gt;

/ 
&lt;a href="https://clef2018.clef-initiative.eu/mc2/spip.php?page=mot&amp;id_mot=2" rel="tag"&gt;data&lt;/a&gt;

		</description>


 <content:encoded>&lt;div class='rss_texte'&gt;&lt;p&gt;This data set supports experiments in microblog search and stream summarization.&lt;/p&gt;
&lt;h2 class=&#034;spip&#034;&gt;Microblog collection&lt;/h2&gt;
&lt;p&gt;The document collection is provided to registered participants by the ANR GAFES project. It consists of a pool of more than 50M unique microblogs from different sources, with their meta-information as well as ground truth for the evaluation.&lt;/p&gt;
&lt;p&gt;The microblog collection contains a very large pool of public Twitter posts containing the keyword festival, collected since June 2015. These microblogs are collected using private archive services based on the streaming API. The average number of unique microblog posts (i.e. excluding retweets) between June and September is 2,616,008 per month. The total number of microblog posts collected after one year (from May 2015 to May 2016) is 50,490,815 (24,684,975 without reposts). These microblog posts are available online in a relational database with associated fields.&lt;/p&gt;
&lt;p&gt;Because of privacy issues, they cannot be publicly released but can be analyzed inside the organization that purchased these archives and among collaborators under a privacy agreement. The MC2 lab provides the opportunity to share this data among academic participants. These archives can be indexed and analyzed, and general results acquired from them can be published without restriction.&lt;/p&gt;
&lt;h2 class=&#034;spip&#034;&gt;Linked web pages&lt;/h2&gt;
&lt;p&gt;66% of the collected microblog posts contain Twitter t.co compressed URLs. Sometimes these URLs refer to other online services, such as adf.ly, cur.lv, dlvr.it and ow.ly, that hide the real URL. We used the spider mode of the GNU wget tool to get the real URL; this process required multiple DNS requests.&lt;/p&gt;
&lt;p&gt;The number of unique uncompressed URLs collected in one year is 11,580,788, from 641,042 distinct domains.&lt;/p&gt;
&lt;h2 class=&#034;spip&#034;&gt;Getting access to the data set for scholars&lt;/h2&gt;&lt;ol class=&#034;spip&#034; role=&#034;list&#034;&gt;&lt;li&gt; register your institution with CLEF&lt;/li&gt;&lt;li&gt; send a request by email to admin@talne.eu from the same domain as your institution, with full contact information.&lt;/li&gt;&lt;li&gt; if accepted, you will receive a confidentiality agreement to be approved by your institution.&lt;/li&gt;&lt;li&gt; once we receive the approved agreement, you will receive personal access information for the lab data servers.&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;
		&lt;div class="hyperlien"&gt;View online : &lt;a href="http://ceur-ws.org/Vol-1609/16091197.pdf" class="spip_out"&gt;Cultural micro-blog Contextualization 2016 Workshop Overview: data and pilot tasks &lt;/a&gt;&lt;/div&gt;
		
		</content:encoded>


		

	</item>
<item xml:lang="en">
		<title>Microblog Data Set</title>
		<link>https://clef2018.clef-initiative.eu/mc2/spip.php?page=article&amp;id_article=4</link>
		<guid isPermaLink="true">https://clef2018.clef-initiative.eu/mc2/spip.php?page=article&amp;id_article=4</guid>
		<dc:date>2015-11-02T08:08:38Z</dc:date>
		<dc:format>text/html</dc:format>
		<dc:language>en</dc:language>
		<dc:creator>sanjuan</dc:creator>


		<dc:subject>data</dc:subject>

		<description>
&lt;p&gt;The document collection provided by the GAFES project consists of a pool of more than 70M unique microblogs from different sources, with their meta-information and expanded URLs, on a MySQL server. Due to legal terms, access to this database is restricted to registered participants under a privacy agreement. &lt;br class='autobr' /&gt;
Along with the microblog corpus, a clean, simplified XML dump of Wikipedia that is easy to index and to process with state-of-the-art NLP tools is made available to participants. Ground truth (&#8230;)&lt;/p&gt;


-
&lt;a href="https://clef2018.clef-initiative.eu/mc2/spip.php?page=rubrique&amp;id_rubrique=6" rel="directory"&gt;2 - MicroBlog Search&lt;/a&gt;

/ 
&lt;a href="https://clef2018.clef-initiative.eu/mc2/spip.php?page=mot&amp;id_mot=2" rel="tag"&gt;data&lt;/a&gt;

		</description>


 <content:encoded>&lt;div class='rss_texte'&gt;&lt;p&gt;The document collection provided by the GAFES project consists of a pool of more than 70M unique microblogs from different sources, with their meta-information and expanded URLs, on a MySQL server. Due to legal terms, access to this database is restricted to registered participants under a privacy agreement.&lt;/p&gt;
&lt;p&gt;Along with the microblog corpus, a clean, simplified XML dump of Wikipedia that is easy to index and to process with state-of-the-art NLP tools is made available to participants. Ground truth material is the following:&lt;/p&gt;&lt;/div&gt;
		
		</content:encoded>


		

	</item>



</channel>

</rss>
