APSA 2016

Using Scalable Machine Learning to
Understand Violent Collective Action

APSA 2016 Annual Meeting & Exhibition
Panel: Coding and Validating Political Event Data
September 1st, 2016

Jake Ryland Williams
Assistant Professor
Department of Information Science
College of Computing and Informatics
Drexel University

Lefteris Jason Anastasopoulos
Assistant Professor
School of Public and International Affairs
Georgia Informatics Institute
University of Georgia, Athens

Overview

Focus: Understanding social action through social media text.

Why?
- The use of social media has become ubiquitous.
- Users can record even the most mundane actions.
- Other important data (e.g., time and location) may accompany the text.

Twitter: a grassroots tool for mobilization

Traditional mass media requires significant resources to employ.

Cheap and accessible, social media has empowered many.

For those unfamiliar with Twitter:
- Tweets are short (140 character limit) conversational messages.
- These can be geo-located, allowing for movement tracking.
- Twitter has a huge user base (more than 300 million active).

Dataset 1: location-tagged messages

The 'PullTweets' database:
- Collected from Twitter's public (spritzer, 1%) API.
- At the time (2014), over 1% of tweets were geo-tagged.
- Targeted filtering yielded over 600 million geo-tagged tweets.
- Data were enriched with polygon tags (country, state, county).
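
A minimal sketch of how polygon tagging could work, assuming each region is a simple list of (lon, lat) vertices. The names `point_in_polygon` and `tag_region` are hypothetical, and the ray-casting test here stands in for whatever method the actual pipeline used:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside the polygon's vertex list?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def tag_region(lon, lat, regions):
    """Return the name of the first region containing the point, else None."""
    for name, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None
```

Running `tag_region` once per level (country, state, county) would give each tweet its full set of tags.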

Tweets

Dataset 2: space-time tagged events

The AP images database:
- A database of Associated Press (AP) images from news coverage.
- Images are searchable, tagged by keywords, e.g., 'protest.'
- Along with captions, the data include space-time tags.

AP images

The language of modern social action

We've gathered 2014 protest times & locations from the AP.

Using these, we've filtered tweets by protest times and locations.
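
A rough sketch of this space-time filter, under stated assumptions: the `Event` and `Tweet` records, the 10 km radius, and the 12-hour window are all illustrative choices, not values from the study.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


@dataclass
class Event:
    when: datetime
    lat: float
    lon: float


@dataclass
class Tweet:
    text: str
    when: datetime
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


def near_event(tweet, event, max_km=10.0, window=timedelta(hours=12)):
    """True if the tweet falls inside the event's time window and radius."""
    close_in_time = abs(tweet.when - event.when) <= window
    close_in_space = haversine_km(
        tweet.lat, tweet.lon, event.lat, event.lon) <= max_km
    return close_in_time and close_in_space


def filter_tweets(tweets, events, **kwargs):
    """Keep tweets that are near at least one event."""
    return [t for t in tweets if any(near_event(t, e, **kwargs) for e in events)]
```

Widening `max_km` or `window` trades precision for recall, so these thresholds would need tuning against coded data.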

We code these tweets as representatives of social action types:
- collective peace, e.g., vigils, trials, singing/chanting;
- collective force, e.g., mass-arrests, looting, blockades;
- singular peace, e.g., discourse, care, disgust; and
- singular force, e.g., physical assault, instigation.

A tweet's language is most often entirely unrelated to social action, but it can represent any number of action types at once.

Examples

Methodology

Task: Classify tweets for action types independently.

Algorithm: binary, Naive Bayes (NB) classifiers with enhancements:
- Instead of single words alone, we use phrases, e.g., 'tear gas,' to capture context.
- We decompose NB classifications into per-phrase terms,

  $f(w_i)\left|\log{\mathcal{L}(w_i \mid c_\text{pos})} - \log{\mathcal{L}(w_i \mid c_\text{neg})}\right|$,

  to help answer 'why?'
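
A minimal sketch of one such binary classifier with its per-phrase decomposition. This is not the authors' implementation: the class `BinaryNB`, the add-one smoothing, the unigram-plus-bigram features, and the reading of $f(w_i)$ as the phrase's count in the tweet are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def phrases(text, max_n=2):
    """Lowercased word n-grams (n = 1..max_n), so 'tear gas' is one feature."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


class BinaryNB:
    """Multinomial Naive Bayes for one action type (positive vs. negative)."""

    def fit(self, texts, labels, alpha=1.0):
        # Both classes must appear at least once in the training data.
        self.alpha = alpha
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = Counter()
        for text, label in zip(texts, labels):
            self.counts[bool(label)].update(phrases(text))
            self.n_docs[bool(label)] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        self.totals = {c: sum(self.counts[c].values()) for c in (True, False)}
        return self

    def log_lik(self, w, c):
        """Smoothed log-likelihood of phrase w under class c."""
        return log((self.counts[c][w] + self.alpha)
                   / (self.totals[c] + self.alpha * len(self.vocab)))

    def contributions(self, text):
        """Signed terms f(w) * (log L(w|pos) - log L(w|neg)) for known phrases."""
        freq = Counter(w for w in phrases(text) if w in self.vocab)
        return {w: f * (self.log_lik(w, True) - self.log_lik(w, False))
                for w, f in freq.items()}

    def predict(self, text):
        """Positive if the log prior ratio plus all contributions exceeds zero."""
        prior = log(self.n_docs[True]) - log(self.n_docs[False])
        return prior + sum(self.contributions(text).values()) > 0

    def explain(self, text, k=3):
        """The k phrases with the largest absolute contribution."""
        contrib = self.contributions(text)
        return sorted(contrib, key=lambda w: abs(contrib[w]), reverse=True)[:k]
```

Training one `BinaryNB` per action type keeps the four classifications independent, and `explain` ranks phrases by the absolute value used in the formula above.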

We also:
- cluster tweets spatially to track movements of groups, and
- track time series and look for aberrant signatures.
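
One simple way to flag aberrant signatures in a time series of tweet counts, offered as a sketch only: the trailing-window z-score, the 7-bin window, the threshold of 3, and the floor on the standard deviation are all assumed choices, and `aberrant_bins` is a hypothetical name.

```python
from statistics import mean, pstdev


def aberrant_bins(counts, window=7, threshold=3.0, min_sigma=1.0):
    """Indices whose count sits more than `threshold` standard deviations
    above the mean of the preceding `window` bins."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # Floor sigma so a flat history does not cause division by zero.
        sigma = max(pstdev(history), min_sigma)
        if (counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged
```

Applied to hourly or daily counts of tweets coded for one action type, the flagged bins are candidates for closer inspection, not confirmed events.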

Expected performance

In-domain

Out-of-domain

Prototypical applications

We elucidate social action at very fine temporal and spatial scales.

With this, we can aggregate all collective action to higher levels of geography (e.g., Census tract or county).

This enables researchers to study protest activity spatially.
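
The aggregation step can be sketched as a simple count, assuming each coded tweet is a dict with a geography tag and a list of action labels. The field names and the `aggregate` helper are hypothetical:

```python
from collections import Counter


def aggregate(coded_tweets, level="county"):
    """Count coded action types per geographic unit at the given level."""
    totals = Counter()
    for tweet in coded_tweets:
        for action in tweet["actions"]:
            totals[(tweet[level], action)] += 1
    return totals


# Toy records for illustration only; these are not study data.
example = [
    {"county": "A", "actions": ["collective force"]},
    {"county": "A", "actions": ["collective force", "collective peace"]},
    {"county": "B", "actions": []},
]
county_totals = aggregate(example)
```

A tweet coded with several action types counts once toward each, and a tweet with none contributes nothing.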

Collective action by county

Forceful collective action by county

Building a real-time application

Our tweets were collected from the public API over 2014.

Back then, geo-tagging was prevalent (>1% of all tweets).

Policy decisions (integration with Foursquare) led to a decline.
- Currently, roughly 0.2% of tweets are geo-tagged.
- Precise geo-tags have been replaced by soft locations, e.g., San Francisco, Hawaii.

Upcoming challenges

Coding remaining data and building stronger coding guidelines.

Building explorable, online interactive graphics.

Acquiring, and scaling to, larger data streams.

Re-engaging Twitter's community for geo-location participation.

Collating actions into events.

Questions? Thanks!