It's likely now a motivational understatement to say that "Deceptive agents and content, and the adversarial utilization of [social] media platforms by institutions in information warfare campaigns throughout the 2010s will have lasting geopolitical impacts that stack up over time." One could go further: advances in generative artificial intelligence (AI) in the 2020s will likely make the media-driven information wars of the 2010s feel like little more than tremors in the social fabric, presaging what will likely become even more formidable misuses of information technologies. This sounds sensationalist, and its accuracy highlights the ease with which one can provoke this period's public interest in AI. As defined by the zeitgeist, AI's not all bad, maybe scary; but it gets worse if we accept the words of preeminent AI experts explaining their need to operate for profit and with the source of their technologies closed. To advance AI, there are still lots of tabletop experiments left to be conducted. So, alongside work investigating malicious (and, more than ever, malfeasant) uses of social data and information technologies, my research increasingly focuses on leveraging insights from quantitative linguistics to improve the design and scrutability of linguistic foundation models. This research tends to direct mathematical analysis and software development towards improving the precision with which neural network architectures learn, i.e., to match or improve learning capabilities while using less data and computational power. Where possible, this work organizes insights from its process to develop general knowledge and theories intended to build understanding of what these algorithms learn, as well as what high-level functions might emerge within them (and from their use).
Bit Cipher -- A Simple yet Powerful Word Representation System that Integrates Efficiently with Language Models. H. Zhou and J. R. Williams. Arxiv Preprint (2023).
Explicit Foundation Model Optimization with Self-Attentive Feed-Forward Neural Units. J. R. Williams and H. Zhou. Arxiv Preprint (2023).
Reducing the Need for Backpropagation and Discovering Better Optima With Explicit Optimizations of Neural Networks. J. R. Williams, H. Zhao. Arxiv Preprint (2023).
Which tweets 'deserve' to be included in news stories? Chronemics of tweet embedding. M. I. Mujib, A. Zelenkauskaite, J. R. Williams. In Proceedings of the 56th Hawaii International Conference on System Sciences (2023).
To Know by the Company Words Keep and What Else Lies in the Vicinity. J. R. Williams and H. S. Heidenreich. Arxiv Preprint (2022).
EigenNoise: A Contrastive Prior to Warm-Start Representations. H. S. Heidenreich and J. R. Williams. Arxiv Preprint (2022).
MICA: Motivational Interviewing Conversational Agent for Parents as Proxies for their Children in Healthy Eating. D. Smriti, T.-S. A. Kao, R. Rathod, J. Y. Shin, W. Peng, J. R. Williams, M. I. Mujib, M. Colosimo, J. Huh-Yoo. JMIR Human Factors Preprint (2022).
The Mixing Law and Experiments in Document Malformation. J. R. Williams and D. Solano-Oropeza. Proceedings of the Fourth Northeast Regional Conference on Complex Systems (2021).
The Earth Is Flat and the Sun Is Not a Star: The Susceptibility of GPT-2 to Universal Adversarial Triggers. H. S. Heidenreich and J. R. Williams. Proceedings of the Fourth International AAAI/ACM Conference on Artificial Intelligence, Ethics and Society (2021).
An Evaluation of Generative Pre-Training Model-based Chatbot in Therapist-Patient Dialogue Context. L. Wang, M. I. Mujib, J. R. Williams, G. Demiris, and J. Huh-Yoo. Arxiv Preprint (2021).
A general solution to the preferential selection model. J. R. Williams, D. Solano-Oropeza, and J. R. Hunsberger. Arxiv Preprint (2020).
NewsTweet: A Dataset of Social Media Embedding in Online Journalism. M. I. Mujib, H. S. Heidenreich, C. J. Murphy, G. C. Santia, A. Zelenkauskaite, and J. R. Williams. Arxiv Preprint (2020).
Investigating Coordinated ‘Social’ Targeting of High-Profile Twitter Accounts. H. S. Heidenreich, M. I. Mujib, and J. R. Williams. Arxiv Preprint (2020).
Investigating Coordinated 'Social' Targeting of High-Profile Twitter Accounts. H. S. Heidenreich, M. I. Mujib, and J. R. Williams. International Conference on Computational Social Science (2020).
Tailorable Autonomous Motivational Interviewing Conversational Agent. D. Smriti, J. Y. Shin, M. I. Mujib, M. Colosimo, T.-S. Kao, J. R. Williams, and J. Huh-Yoo. Conference on Human Factors in Computing Systems (2020).
A scalable machine learning approach for measuring violent and peaceful forms of political protest participation with social media data. L. J. Anastasopoulos and J. R. Williams. PLoS ONE (2019).
Latent semantic network induction in the context of linked example senses. H. S. Heidenreich and J. R. Williams. Proceedings of the 2019 EMNLP Workshop W-NUT: The 5th Workshop on Noisy User-generated Text (2019).
Detecting Social Bots on Facebook in an Information Veracity Context. G. C. Santia, M. I. Mujib, and J. R. Williams. Proceedings of the Thirteenth International AAAI Conference on Web and Social Media (2019).
Making Sense of Clinical Trial Descriptions: A Text Analysis Approach. M. I. Mujib, J. R. Williams, A. Gottsegen, Y. Sharma, A. Chatterjee, O. Gologorskaya. Text Analysis Across Domains Conference (2019).
BuzzFace: A News Veracity Dataset with Facebook User Commentary and Egos. G. C. Santia and J. R. Williams. Proceedings of the Twelfth International AAAI Conference on Web and Social Media (2018).
Expanding Consumer Health Vocabularies with Frequency-Conserving Internal Context Models. M. I. Mujib, C. C. Yang, M. Zhao, and J. R. Williams. IEEE International Conference on Healthcare Informatics (2018).
Empowering targeted tenant organizing: geographic forecasting of housing insecurity. A. Gottsegen and J. R. Williams. Women in Data Science Conference (2018).
Understanding disciplinary vocabularies using a full-text enabled domain-independent term extraction approach. E. Yan, J. R. Williams, and Z. Chen. PLoS ONE (2017).
The Lexicocalorimeter: Gauging public health through caloric input and output on social media. S. E. Alajajian, J. R. Williams, A. J. Reagan, S. C. Alajajian, M. R. Frank, L. Mitchell, J. Lahne, C. M. Danforth, and P. S. Dodds. PLoS ONE (2017).
Benchmarking sentiment analysis methods for large-scale texts: A case for using continuum-scored words and word shift graphs. A. J. Reagan, B. Tivnan, J. R. Williams, C. M. Danforth, and P. S. Dodds. EPJ Data Science (2017).
Simon's fundamental rich-gets-richer model entails a dominant first-mover advantage. P. S. Dodds, D. R. Dewhurst, F. F. Hazlehurst, C. M. Van Oort, L. Mitchell, A. J. Reagan, J. R. Williams, C. M. Danforth. Physical Review E (2017).
Context-Sensitive Recognition for Emerging and Rare Entities. J. R. Williams and G. C. Santia. Proceedings of the 3rd Workshop on Noisy User-generated Text (2017).
Boundary-Based MWE Segmentation With Text Partitioning. J. R. Williams. Proceedings of the 3rd Workshop on Noisy User-generated Text (2017).
Is space a word, too? J. R. Williams and G. C. Santia. Preprint (2017).
Identifying violent protest activity with scalable machine learning. L. Anastasopoulos and J. R. Williams. Annual Meeting of the American Political Science Association (2016).
Vaporous Marketing: Uncovering Pervasive Electronic Cigarette Advertisements on Twitter. E. M. Clark, C. A. Jones, J. R. Williams, A. N. Kurti, M. C. Norotsky, C. M. Danforth, P. S. Dodds. PLoS ONE (2016).
Sifting robotic from organic text: A natural language approach for detecting automation on Twitter. E. M. Clark, J. R. Williams, C. A. Jones, R. A. Galbraith, C. M. Danforth, P. S. Dodds. Journal of Computational Science (2016).
Photographic home styles in Congress: a computer vision approach. L. J. Anastasopoulos, D. Badani, C. Lee, S. Ginosar, J. R. Williams. Arxiv Preprint (2016).
Zipf's law is a consequence of coherent language production. J. R. Williams, J. P. Bagrow, A. J. Reagan, S. E. Alajajian, C. M. Danforth, and P. S. Dodds. Arxiv Preprint (2016).
Identifying missing dictionary entries with frequency-conserving context models. J. R. Williams, E. M. Clark, J. P. Bagrow, C. M. Danforth, and P. S. Dodds. Physical Review E (2015).
Zipf's law holds for phrases, not words. J. R. Williams, P. R. Lessard, S. Desu, E. M. Clark, J. P. Bagrow, C. M. Danforth, P. S. Dodds. Nature Scientific Reports (2015).
Reply to Garcia et al.: Common mistakes in measuring frequency-dependent word characteristics. P. S. Dodds, E. M. Clark, S. Desu, M. R. Frank, A. J. Reagan, J. R. Williams, L. Mitchell, K. D. Harris, I. M. Kloumann, J. P. Bagrow, K. Megerdoomian, M. T. McMahon, B. F. Tivnan, and C. M. Danforth. PNAS (2015).
Text mixing shapes the anatomy of rank-frequency distributions. J. R. Williams, J. P. Bagrow, C. M. Danforth, and P. S. Dodds. Physical Review E (2015).
Human language reveals a universal positivity bias. P. S. Dodds, E. M. Clark, S. Desu, M. R. Frank, A. J. Reagan, J. R. Williams, L. Mitchell, K. D. Harris, I. M. Kloumann, J. P. Bagrow, K. Megerdoomian, M. T. McMahon, B. F. Tivnan, and C. M. Danforth. PNAS (2015).
Constructing a taxonomy of fine-grained human movement and activity motifs through social media. M. R. Frank, J. R. Williams, L. Mitchell, J. P. Bagrow, P. S. Dodds, C. M. Danforth. Arxiv Preprint (2015).
Low-power, phase-preserving 2R Amplitude Regenerator. T. I. Lakoba, J. R. Williams, and M. Vasilyev. Optics Communications (2011).
NALM–based, phase–preserving 2R regenerator of high–duty–cycle pulses. T. I. Lakoba, J. R. Williams, and M. Vasilyev. Optics Express (2011).
I'm a natural scientist trained in physics, mathematics, and scientific programming, and I develop and teach data science coursework at a variety of levels at Drexel's College of Computing and Informatics. My recent work has focused on establishing undergraduate and graduate core curricula and is transitioning into the development of specialized electives, particularly those focused on social computing applications, bias and data, and advanced social data processing methods. To see samples, please reach out via email.
Foundations of Data Science (INFO 825). Develops foundations for research practice in data science (DS) through guided and collaborative literature review activities and light research reproduction efforts. Students will gain knowledge about current and emerging trends in DS research methodologies and disciplinary applications. Discusses how critical works and student-selected publications spanning DS research areas interact with different DS-related publishing venues, as well as how to align writing and presentation styles to meet diverse norms and standards of different venues. Specific readings and topics will be selected by students, who must identify a DS-related research subject whose literature they wish to master.
Data Acquisition and Pre-Processing (DSCI 511). Introduces the breadth of data science through a project lifecycle perspective. Covers early-stage data-life cycle activities in depth for the development and dissemination of data sets. Provides technical experience with data harvesting, acquisition, pre-processing, and curation. Concludes with an open-ended term project where students explore data availability, scale, variability, and reliability.
Data Analysis and Interpretation (DSCI 521). Introduces methods for data analysis and their quantitative foundations in application to pre-processed data. Covers reproducibility and interpretation for project life cycle activities, including data exploration, hypothesis generation and testing, pattern recognition, and task automation. Provides experience with analysis methods for data science from a variety of quantitative disciplines. Concludes with an open-ended term project focused on the application of data exploration and analysis methods with interpretation via statistical, algorithmic, and mathematical reasoning.
Natural Language Processing with Deep Learning (DSCI 691). Natural Language Processing (NLP) is one of the most important technologies of the information age and is a critical component of AI. Recently, deep learning approaches have overtaken the domain. This course explores the basis of these neural models with a heavy emphasis on research.
Introduction to Data Science (INFO 103). A first course in data science. Introduces data science as a field, describes the roles and services that various members of the community play and the life cycle of data science projects. Provides an overview of common types of data, where they come from, and the challenges that practitioners face in the modern world of “Big Data.” Provides an introduction to the interdisciplinary mixture of skills that the practice requires.
Text Processing Working Group (TPWG). Would you like to learn how to make a computer understand English? Are you starting out in Natural Language Processing and need teams to work with? The Drexel Data Science Club's (DSC's) Text Processing Working Group (TPWG) features interactive demos, projects, tutorials, and discussions about text processing. We will start from the basics (TF-IDF, cosine similarity scores, regular expressions, topic modeling, stemming, tokenization, and lemmatization) and then explore more advanced topics.
The CODED lab's research mission in data science focuses on social information and language processing, with the perspective that technological systems that facilitate public communication and knowledge sharing can be designed with open data for meta-purposes that benefit both participants and organizations. The lab engages with collaborators from a broad range of areas, including mathematics, computer science, physics, chemical engineering, psychology, political science, linguistics, sociology, communications, and health and clinical informatics. Advisees most often pursue technical degrees at CCI, though many collaborations begin with ad hoc conversations at CCI, often through our interdisciplinary data science cohorts. New collaborations are always welcome; please inquire via email :).
Danielle Boccelli. Ph.D. in Information Science (current). Advising: research and graduate curriculum.
Jennifer Bochenek. Ph.D. in Information Science (current). Advising: research and graduate curriculum.
Elizabeth Sheffield. Ph.D. in Information Science (current). Advising: research and graduate curriculum.
Elizabeth Campbell. Ph.D. in Information Science. Advised: dissertation and graduate curriculum.
Munif Mujib. Ph.D. in Information Science. Advised: research and graduate curriculum.
Giovanni Santia. Graduate student in Information Science. Advised: research and graduate curriculum.
Zeyu (Andrew) Chen. M.S. Data Science, AI & ML. Advised: research and graduate curriculum.
Colin Murphy. M.S. Data Science. Advised: research and graduate curriculum.
Meghan Colosimo. Ph.D. Clinical Psychology. Advised: research and graduate data science specialization.
Jacob Hunsberger. M.S. Chemical Engineering. Advised: research and graduate data science specialization.
Haoran Zhou. B.S. Data Science. Advising (current): research and undergraduate curriculum.
Diana Solano-Oropeza. B.S. Physics. Advised: research.
Hunter Heidenreich. B.S. Computer Science. Advised: research and undergraduate curriculum.
Jessica Hoban. B.S. Data Science. Advised: research and undergraduate curriculum.
Palash Pandey. B.S. Data Science. Advised: research and undergraduate curriculum.
Yuvraj Sharma. B.S. Informatics. Advised: research and undergraduate curriculum.
Amy Gottsegen. B.S. Computer Science. Advising: research and undergraduate curriculum.