Datasets

Here is a non-exhaustive list of openly accessible datasets released along with some of our publications.

| Publication | Link | # Posts | Timeline |
| --- | --- | --- | --- |
| Mean Birds: Detecting Aggression and Bullying on Twitter (ACM WebSci’17) [PDF] | Zenodo | 1.65M tweets | June - August, 2016 |
| Kek, Cucks, and God Emperor Trump: A Measurement Study of 4chan’s Politically Incorrect Forum and Its Effects on the Web (ICWSM’17) [PDF] | Zenodo | 11.1M 4chan /pol/, /sp/, and /int/ posts | June 30, 2016 - September 12, 2016 |
| The Web Centipede: Understanding How Web Communities Influence Each Other Through the Lens of Mainstream and Alternative News Sources (IMC’17) [PDF] | Zenodo | 487k tweets, 1.8M Reddit posts/comments, 97k 4chan /pol/, /sp/, /int/, and /sci/ posts | Twitter: June 30, 2016 - February 28, 2017; Reddit: June 30, 2016 - February 28, 2017; 4chan: June 30, 2016 - February 28, 2017 |
| What is Gab? A Bastion of Free Speech or an Alt-Right Echo Chamber? (CyberSafety’18) [PDF] | Zenodo | 22.1M Gab posts from 336.7K users | August 2016 - January 2018 |
| Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior (ICWSM’18) [PDF] | Zenodo, GitHub | 100K labeled tweet text | N/A |
| On the Origins of Memes by Means of Fringe Web Communities (IMC’18) [PDF] | Zenodo | 158.5M URLs and Phashes for images from Twitter, Reddit, 4chan’s /pol/, and Gab | July 2016 - July 2017 |
| Who Let The Trolls Out? Towards Understanding State-Sponsored Trolls (ACM WebSci’19) [PDF] | Zenodo | 10.1M tweets and 21K subreddit posts | February 2012 - August 2018 |
| The Pushshift Telegram Dataset (ICWSM’20) [PDF] | Zenodo | 2.2M Telegram users, 28K Telegram channels, and 317M Telegram messages | September 2015 - November 2019 |
| The Pushshift Reddit Dataset (ICWSM’20) [PDF] | Pushshift | Reddit subreddits posts and users | Full history |
| Disturbed YouTube for Kids: Characterizing and Detecting Disturbing Content on YouTube (ICWSM’20) [PDF] | Zenodo | 844.7K YouTube videos’ metadata | N/A |
| Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board (ICWSM’20) [PDF] | Zenodo | 134.5M 4chan /pol/ posts | June 29, 2016 - November 1, 2019 |
| A Large Open Dataset from the Parler Social Network (ICWSM’21) [PDF] | Zenodo | 183M Parler posts | August 1, 2018 - November 1, 2021 |
| The Evolution of the Manosphere Across the Web (ICWSM’21) [PDF] | Zenodo | 6.7M posts from Incel Forums and 22.1M posts from Incel subreddits | Incel Forums: June 19-30, 2019. Subreddits: June 2005 - December 2018 |
| How over is it? Understanding the Incel Community on YouTube (CSCW’21) [PDF] | Zenodo | 6.4K Incel derived videos, 5.8K random videos (control), 37.7K Incel derived recommended videos, and 29.3K control recommended videos | N/A |
| It is just a flu: Assessing the Effect of Watch History on YouTube’s Pseudoscientific Video Recommendations (ICWSM’22) [PDF] | Zenodo | 1.1K science, 1.3K pseudoscience, and 3.2K irrelevant videos | N/A |
| “I Can’t Keep It Up.” A Dataset from the Defunct Voat.co News Aggregator (ICWSM’22) [PDF] | Zenodo | 2.3M submissions, 16.2M comments, 113.4K user profiles, and 7K subverse profiles from Voat.co | November 08, 2013 - December 25, 2020 |