Using Scalable Machine Learning to
Understand Violent Collective Action
APSA 2016 Annual Meeting & Exhibition
Panel: Coding and Validating Political Event Data
September 1st, 2016
Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University
Lefteris Jason Anastasopoulos
Assistant Professor
School of Public and International Affairs
Georgia Informatics Institute
University of Georgia, Athens
Overview
Focus: Understanding social action through social media text.
- Why?
- The use of social media has become ubiquitous.
- Users can record even the most mundane actions.
- Other important data may accompany the text.
Twitter: a grassroots tool for mobilization
Traditional mass media requires significant resources to employ.
Cheap and accessible, social media has empowered many.
- For those unfamiliar with Twitter:
- Tweets are short (140 character limit) conversational messages.
- These can be geo-located, allowing for movement tracking.
- Twitter has a huge user base (more than 300 million active).
Dataset 1: location-tagged messages
- The 'PullTweets' database:
- Collected from Twitter's public (spritzer, 1%) API.
- At the time (2014), over 1% of tweets were geo-tagged.
- Targeted filtering yielded over 600 million geo-tagged tweets.
- Data were enriched with polygon tags (country, state, county).
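The polygon tagging above can be illustrated with a standard ray-casting point-in-polygon test. This is only a minimal sketch: the slides do not say how the enrichment was implemented, and the function names, the `regions` dictionary, and the rectangular "boundaries" are hypothetical toy values, not real geography.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point toward +longitude.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def tag_tweet(lon, lat, regions):
    """Return the names of all regions whose polygon contains the point."""
    return [name for name, poly in regions.items()
            if point_in_polygon(lon, lat, poly)]


# Toy rectangles standing in for real boundary polygons (assumption).
regions = {
    "state_A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "county_A1": [(0, 0), (5, 0), (5, 5), (0, 5)],
}
tags = tag_tweet(2.0, 3.0, regions)
```

At the scale of hundreds of millions of tweets, a spatial index would likely be needed in front of a test like this; the sketch only shows the tagging logic.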
Dataset 2: space-time tagged events
- The AP images database:
- A database of news-covered Associated Press images.
- Images are searchable, tagged by keywords, e.g., 'protest.'
- Along with captions, the data include space-time tags.
The language of modern social action
We've gathered 2014 protest times & locations from the AP.
Using these, we've filtered tweets by protest times and locations.
- We code these tweets as representatives of social action types:
- collective peace, e.g., vigils, trials, singing/chanting;
- collective force, e.g., mass-arrests, looting, blockades;
- singular peace, e.g., discourse, care, disgust; and
- singular force, e.g., physical assault, instigation.
Most tweet language is entirely unrelated to social action,
but a single tweet can represent any number of action types at once.
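Filtering tweets by protest times and locations might look like the sketch below. The 5 km radius, the 12-hour window, the dictionary field names, and the coordinates are all assumptions made for illustration; the slides do not state the thresholds that were actually used.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * r * asin(sqrt(a))


def near_event(tweet, event, max_km=5.0, window=timedelta(hours=12)):
    """True if the tweet is close to the event in both time and space."""
    close_in_time = abs(tweet["time"] - event["time"]) <= window
    close_in_space = haversine_km(tweet["lat"], tweet["lon"],
                                  event["lat"], event["lon"]) <= max_km
    return close_in_time and close_in_space


def filter_tweets(tweets, events, **kwargs):
    """Keep tweets that fall near at least one event."""
    return [t for t in tweets
            if any(near_event(t, e, **kwargs) for e in events)]


# Toy records (not real data).
events = [{"lat": 40.0, "lon": -75.0, "time": datetime(2014, 8, 10, 18)}]
tweets = [
    {"id": "a", "lat": 40.01, "lon": -75.01, "time": datetime(2014, 8, 10, 20)},
    {"id": "b", "lat": 41.0, "lon": -75.0, "time": datetime(2014, 8, 10, 18)},
    {"id": "c", "lat": 40.0, "lon": -75.0, "time": datetime(2014, 8, 15, 18)},
]
kept_ids = [t["id"] for t in filter_tweets(tweets, events)]
```

Tweet "b" is too far away and tweet "c" is too late, so only "a" survives the filter.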
Examples
Methodology
Task: Classify tweets for action types independently.
- Algorithm: binary, Naive Bayes (NB) classifiers with enhancements:
- Instead of single words alone, we use phrases (e.g., 'tear gas') that capture context.
- We decompose NB classifications to help answer 'why?'
- We also:
- cluster tweets spatially to track movements of groups, and
- track time series and look for aberrant signatures.
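A binary Naive Bayes classifier with phrase features, and a decomposition of its score into per-feature contributions, can be sketched as follows. This is an illustrative stand-in, not the authors' classifier: the bigram features, Laplace smoothing, function names, and toy training tweets are all assumptions. The decomposition relies on the fact that the NB log-odds is the prior log-odds plus a sum of per-feature log-likelihood ratios.

```python
from collections import Counter
from math import log


def features(text):
    """Unigrams plus bigrams, so phrases like 'tear gas' become features."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def train(docs):
    """docs: list of (text, label) with label in {0, 1}; needs both classes."""
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter()
    for text, label in docs:
        counts[label].update(features(text))
        n_docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    totals = {c: sum(counts[c].values()) for c in (0, 1)}
    prior = log(n_docs[1] / n_docs[0])  # prior log-odds
    weights = {}
    for f in vocab:
        # Laplace-smoothed likelihoods under each class
        p1 = (counts[1][f] + 1) / (totals[1] + len(vocab))
        p0 = (counts[0][f] + 1) / (totals[0] + len(vocab))
        weights[f] = log(p1 / p0)
    return prior, weights


def decompose(text, prior, weights):
    """Return (log-odds score, per-feature contributions, largest first)."""
    contrib = Counter()
    for f in features(text):
        if f in weights:  # unseen features are ignored
            contrib[f] += weights[f]
    score = prior + sum(contrib.values())
    return score, contrib.most_common()


# Toy training tweets (not real data); 1 = forceful action, 0 = unrelated.
docs = [
    ("police fired tear gas at the crowd", 1),
    ("looting and tear gas downtown", 1),
    ("coffee with friends downtown", 0),
    ("great gas prices today", 0),
]
prior, weights = train(docs)
score, contrib = decompose("tear gas near the crowd", prior, weights)
```

A positive score favors the action class, and the sorted contributions show which words and phrases drove that decision, which is one way to answer 'why?'.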
In-domain
Out-of-domain
Prototypical applications
We elucidate social action at very fine temporal and spatial scales.
With this, we can aggregate all collective action at higher levels of geography (e.g., Census tract, county, etc.).
This enables researchers to study protest activity spatially.
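Rolling classified tweets up to a higher level of geography can be as simple as counting by region tag. The sketch below assumes each tweet already carries a `county` tag and a set of predicted action-type `labels`; both field names, the label strings, and the toy records are hypothetical.

```python
from collections import defaultdict


def aggregate_by_county(tweets):
    """Count collective-action tweets, and the forceful subset, per county."""
    counts = defaultdict(lambda: {"collective": 0, "collective_force": 0})
    for t in tweets:
        row = counts[t["county"]]  # ensures every county appears in the output
        labels = t["labels"]
        if "collective_peace" in labels or "collective_force" in labels:
            row["collective"] += 1
        if "collective_force" in labels:
            row["collective_force"] += 1
    return dict(counts)


# Toy classified tweets (not real data).
tweets = [
    {"county": "county_A1", "labels": {"collective_peace"}},
    {"county": "county_A1", "labels": {"collective_force", "singular_force"}},
    {"county": "county_B2", "labels": set()},
]
by_county = aggregate_by_county(tweets)
```

The same grouping works for any polygon tag (Census tract, state, country) by swapping the key.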
Collective action by county
Forceful collective action by county
Building a real-time application
Our tweets were collected from the public API over 2014.
Back then, geo-tagging was prevalent (>1% of all tweets).
- Policy decisions (integration with Foursquare) led to a decline.
- Currently, roughly 0.2% of tweets are geo-tagged.
- Replaced by soft locations, e.g., San Francisco, Hawaii.
Upcoming challenges
Coding remaining data and building stronger coding guidelines.
Building explorable, online interactive graphics.
Acquisition of and scaling for larger-stream data.
Re-engaging Twitter's community for geo-location participation.
Collating actions into events.