Datasets

    This page contains the open-source datasets that have been created and used in the MULTISENSOR Project.

    Feel free to download and use the datasets as you like. We’d love to get your feedback.

    No. Title Description Creator
    01 WikiRef220 220 news articles, which are references to specific Wikipedia pages. The selected topics of the WikiRef220 dataset (and the number of articles per topic) are:

    Paris Attacks November 2015 (36), Barack Obama (5), Premier League (37), Cypriot Financial Crisis 2012-2013 (5), Rolling Stones (1), Debt Crisis in Greece (5), Samsung Galaxy S5 (35), Greek Elections June 2012 (5), smartphone (5), Malaysia Airlines Flight 370 (39), Stephen Hawking (1), Michelle Obama (38), Tohoku earthquake and tsunami (5), NBA draft (1), U2 (1), Wall Street (1). Eleven topics (Barack Obama, Cypriot Financial Crisis 2012-2013, Rolling Stones, Debt Crisis in Greece, Greek Elections June 2012, smartphone, Stephen Hawking, Tohoku earthquake and tsunami, NBA draft, U2 and Wall Street) each appear no more than 5 times and are therefore regarded as noise. The remaining 5 topics of WikiRef220 are Paris Attacks November 2015, Premier League, Samsung Galaxy S5, Malaysia Airlines Flight 370 and Michelle Obama.

    The WikiRef186 dataset (4 topics) is WikiRef220 without 34 of the 39 documents related to “Malaysia Airlines Flight 370”, so the 5 remaining documents on that topic count as noise. The WikiRef150 dataset (3 topics) is WikiRef186 without the 36 documents related to “Paris Attacks”.
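The relationship between the three WikiRef datasets can be checked with a few lines of Python. The following is a minimal sketch that uses only the per-topic counts listed above. The names `main_topics` and `remove_documents` are illustrative helpers and are not part of the dataset distribution.

```python
# Per-topic article counts of WikiRef220, copied from the list above.
WIKIREF220 = {
    "Paris Attacks November 2015": 36,
    "Barack Obama": 5,
    "Premier League": 37,
    "Cypriot Financial Crisis 2012-2013": 5,
    "Rolling Stones": 1,
    "Debt Crisis in Greece": 5,
    "Samsung Galaxy S5": 35,
    "Greek Elections June 2012": 5,
    "smartphone": 5,
    "Malaysia Airlines Flight 370": 39,
    "Stephen Hawking": 1,
    "Michelle Obama": 38,
    "Tohoku earthquake and tsunami": 5,
    "NBA draft": 1,
    "U2": 1,
    "Wall Street": 1,
}

NOISE_THRESHOLD = 5  # topics with at most this many articles count as noise


def main_topics(counts, threshold=NOISE_THRESHOLD):
    """Return the topics that have more than `threshold` articles."""
    return sorted(t for t, n in counts.items() if n > threshold)


def remove_documents(counts, topic, n_removed):
    """Return a copy of `counts` with `n_removed` fewer articles for `topic`."""
    reduced = dict(counts)
    reduced[topic] -= n_removed
    return reduced


# WikiRef186: WikiRef220 without 34 "Malaysia Airlines Flight 370" documents.
WIKIREF186 = remove_documents(WIKIREF220, "Malaysia Airlines Flight 370", 34)
# WikiRef150: WikiRef186 without the 36 "Paris Attacks" documents.
WIKIREF150 = remove_documents(WIKIREF186, "Paris Attacks November 2015", 36)

for name, counts in [
    ("WikiRef220", WIKIREF220),
    ("WikiRef186", WIKIREF186),
    ("WikiRef150", WIKIREF150),
]:
    total = sum(counts.values())
    print(name, total, "articles,", len(main_topics(counts)), "main topics")
```

The totals and main-topic counts printed by this sketch should match the dataset names and the topic counts given in the descriptions.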

    If you use this dataset, please cite: Gialampoukidis, I., Vrochidis, S., & Kompatsiaris, I. (2016). A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA. In Machine Learning and Data Mining in Pattern Recognition (pp. 170-184). Springer International Publishing

    CERTH
    02 WikiRef150 150 web news articles, which are references to specific Wikipedia pages, so as to ensure reliable ground-truth. The selected topics and the corresponding number of articles per topic are:

    • Barack Obama (5),
    • Premier League (37),
    • Cypriot Financial Crisis 2013 (5),
    • Rolling Stones (1),
    • Debt Crisis in Greece (5),
    • Samsung Galaxy S5 (35),
    • Greek Elections June 2012 (5),
    • smartphone (5),
    • Malaysia Airlines Flight 370 (5),
    • Stephen Hawking (1),
    • Michelle Obama (38),
    • Tohoku earthquake and tsunami (5),
    • NBA draft (1),
    • U2 (1),
    • Wall Street (1)

    If you use this dataset, please cite: Gialampoukidis, I., Vrochidis, S., & Kompatsiaris, I. (2016). A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA. In Machine Learning and Data Mining in Pattern Recognition (pp. 170-184). Springer International Publishing

    CERTH
    03 ArticlesNewsSitesData_1043 1043 web pages/articles retrieved from three well-known news sites (BBC, The Guardian and Reuters) and annotated with the following four topics found in the IPTC news codes taxonomy:

    • Economy-Business-Finance,
    • Lifestyle-Leisure,
    • Science-Technology,
    • Sports.

    Note that each article is assigned to exactly one topic.

    If you use this dataset in your research, please cite the following article:

    D. Liparas, Y. Hacohen-Kerner, A. Moumtzidou, S. Vrochidis and I. Kompatsiaris, “News articles classification using Random Forests and weighted multimodal features”, 3rd Open Interdisciplinary MUMIA Conference and 7th Information Retrieval Facility Conference (IRFC2014), Copenhagen, Denmark, November 10-12, 2014.

    CERTH
    04 ArticlesNewsSitesData_2382 2382 web pages/articles retrieved from several sites. The web pages were annotated with the following six topics found in the IPTC news codes taxonomy:

    • Nature_Environment,
    • Politics,
    • Science_Technology,
    • Economy_Business_Finance,
    • Health,
    • Lifestyle_leisure.

    Note that each article is assigned to exactly one topic.

    CERTH
    05 NewsArticlesData_12073 12073 news articles retrieved from several sites. The news articles were annotated with the following six topics found in the IPTC news codes taxonomy:

    • Nature_Environment,
    • Politics,
    • Science_Technology,
    • Economy_Business_Finance,
    • Health,
    • Lifestyle_Leisure.

    Note that each article is assigned to exactly one topic.

    CERTH
    06 YahooNewsQualityDataset The News Quality Dataset provides over 500 news articles annotated with 14 editorial quality aspects.

    EURECAT
    07 Event_Detection_Dataset_MS This dataset is the example set for the multimedia concept and event detection module available on the code page.

    The dataset contains 106 videos from news reports, categorised into nine concepts/events. Keyframes are extracted from the videos for concept and event detection; the dataset contains 2826 keyframes in total. DCNN features are extracted from the keyframes using the Caffe models trained in (Markatopoulou et al., 2016). For each concept/event, the videos are divided into three chunks using a random balanced split, and three-fold cross-validation is performed: two chunks are used for training and the remaining chunk for testing. The classifier is an SVM whose “c” parameter is tuned by grid search. The module outputs the per-concept/event evaluation on the videos in terms of accuracy and F-score.
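The evaluation protocol described above can be sketched as follows. This is an illustrative, standard-library-only sketch and not the project's code. The labels are made up, the SVM training and the grid search over “c” are left out, the helper names (`balanced_three_chunks`, `three_fold_splits`, `accuracy_and_f_score`) are hypothetical, and the F-score is assumed to be F1, which the description does not state.

```python
import random


def balanced_three_chunks(labels, seed=0):
    """Randomly split video indices into three chunks so that each chunk gets
    a similar share of positive and negative videos for one concept/event."""
    rng = random.Random(seed)
    chunks = [[], [], []]
    for value in (1, 0):  # deal out positives first, then negatives
        indices = [i for i, y in enumerate(labels) if y == value]
        rng.shuffle(indices)
        for k, i in enumerate(indices):
            chunks[k % 3].append(i)
    return chunks


def three_fold_splits(chunks):
    """Yield (train, test) index lists: two chunks train, one chunk tests."""
    for k in range(3):
        test = list(chunks[k])
        train = [i for j, chunk in enumerate(chunks) if j != k for i in chunk]
        yield train, test


def accuracy_and_f_score(y_true, y_pred):
    """Accuracy and F1 score for binary labels (1 = concept/event present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denominator = 2 * tp + fp + fn
    f_score = 2 * tp / denominator if denominator else 0.0
    return accuracy, f_score


# Made-up labels for 12 videos and one concept (1 = present, 0 = absent).
labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
chunks = balanced_three_chunks(labels)

for train, test in three_fold_splits(chunks):
    # A real run would train an SVM on `train` (tuning "c" by grid search)
    # and predict on `test`; a majority-class guess stands in for it here.
    majority = round(sum(labels[i] for i in train) / len(train))
    y_true = [labels[i] for i in test]
    y_pred = [majority] * len(test)
    print(accuracy_and_f_score(y_true, y_pred))
```

Splitting at the video level, rather than at the keyframe level, keeps all keyframes of one video in the same chunk, which is consistent with the description of the evaluation being reported per concept/event on videos.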

    CERTH