Large-scale Content Analysis


In the social sciences, the scope of a quantitative study is often limited by the number of assistants available to annotate the data by hand, a process known as coding. Larger studies, such as those with sample sizes of a few thousand articles, typically require a coding team working over several months to produce the required data.

Coding is slow: the raw data, often text, must be analysed manually and transformed into a format that statistical software can process. Because the human attention available for coding is limited, recent years have seen strong interest in computational methods that process the raw data directly, without a human coding step.

Using data-driven techniques, we have completed numerous large-scale content analysis studies, with sample sizes ranging into the tens of millions. This work shows how automated approaches can perform content analysis at a scale that was previously impossible.
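
To give a flavour of what replacing hand coding with computation can look like, here is a minimal illustrative sketch in Python. It is not the pipeline used in the studies above. It assumes a toy corpus of (year, text) pairs and uses plain whitespace tokenisation, and the function name `peak_year` is hypothetical. It computes the relative frequency of a term in each year and reports the year in which the term received the most attention.

```python
from collections import Counter


def peak_year(articles, term):
    """Return (year, frequencies) where year has the highest relative
    frequency of `term`, and frequencies maps each year to that value.

    `articles` is an iterable of (year, text) pairs. Tokenisation is a
    naive lowercase whitespace split, an assumption made for brevity.
    """
    hits = Counter()    # occurrences of the term per year
    totals = Counter()  # total tokens per year
    needle = term.lower()
    for year, text in articles:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(needle)
    # Normalise by corpus size so that years with more text are not favoured.
    freqs = {y: hits[y] / totals[y] for y in totals if totals[y] > 0}
    return max(freqs, key=freqs.get), freqs


# Toy corpus, invented purely for illustration.
articles = [
    (1850, "the steam engine and the steam ship"),
    (1850, "a horse drawn cart"),
    (1900, "electricity replaces steam in the factory"),
    (1900, "electricity and light"),
]

steam_peak, steam_freqs = peak_year(articles, "steam")
electricity_peak, _ = peak_year(articles, "electricity")
print(steam_peak, electricity_peak)
```

Normalising by the total number of tokens per year matters when the amount of text varies widely between years, since raw counts would otherwise mostly reflect the size of the corpus. A real study would also need proper tokenisation, handling of multi-word terms and of errors in digitised text, none of which this sketch attempts.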

Screenshot of our “data playground” interface for mentions of Elon Musk in online science news.


Changes in geography over time. Maps of the United Kingdom showing the changes in geographical focus of locations extracted from articles containing the terms (A) British and English, (B) Liberal Party and Labour Party, (C) steam and electricity, and (D) horse and train for the years in which each concept received its peak attention.

Screenshot of our “data playground” interface for subject-verb-object triplets involving Artificial Intelligence extracted from online science news.


Words associated with “Nuclear power” before the Fukushima disaster in mainstream news media.


Words associated with “Nuclear power” after the Fukushima disaster in mainstream news media.


Screenshot of our “data playground” interface for subject-verb-object triplets involving Elon Musk extracted from online science news.


Related Publications

Thomas Lansdall-Welfare, Saatviga Sudhahar, James Thompson, Justin Lewis, FindMyPast Newspaper Team, Nello Cristianini: Content analysis of 150 years of British periodicals. In: Proceedings of the National Academy of Sciences of the United States of America, 2017.

Thomas Lansdall-Welfare: Discovering Culturomic Trends in Large-Scale Textual Corpora. University of Bristol, 2015.

Thomas Lansdall-Welfare, Saatviga Sudhahar, Giuseppe Veltri, Nello Cristianini: On the Coverage of Science in the Media: A Big Data Study on the Impact of the Fukushima Disaster. In: Proceedings of the 2014 IEEE International Conference on Big Data, IEEE, 2014, ISBN: 978-1-4799-5666-1.

Ilias Flaounas, Omar Ali, Thomas Lansdall-Welfare, Tijl De Bie, Nick Mosdell, Justin Lewis, Nello Cristianini: RESEARCH METHODS IN THE AGE OF DIGITAL JOURNALISM: Massive-scale automated analysis of news-content—topics, style and gender. In: Digital Journalism, 1 (1), pp. 102–116, 2013.

Sen Jia, Thomas Lansdall-Welfare, Saatviga Sudhahar, Cynthia Carter, Nello Cristianini: Women Are Seen More than Heard in Online Newspapers. In: PLoS One, 11 (2), 2016.

Ilias Flaounas, Thomas Lansdall-Welfare, Panagiota Antonakaki, Nello Cristianini: The Anatomy of a Modular System for Media Content Analysis. In: arXiv preprint arXiv:1402.6208, 2014.

Thomas Lansdall-Welfare, Ilias Flaounas, Nello Cristianini: Automatic Annotation of a Dynamic Corpus by Label Propagation. In: Mathematical Methodologies in Pattern Recognition and Machine Learning, pp. 19–32, Springer New York, 2013.

Saatviga Sudhahar, Thomas Lansdall-Welfare, Ilias Flaounas, Nello Cristianini: ElectionWatch: Detecting Patterns in News Coverage of US Elections. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 82–86, Association for Computational Linguistics, Avignon, France, 2012.

Thomas Lansdall-Welfare, Ilias Flaounas, Nello Cristianini: Scalable corpus annotation by graph construction and label propagation. In: Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, pp. 25–34, 2012.