About a year ago I was really convinced that automatic content analysis (ACA) was part of the future for political science. There is a lot of political text out there, and hand-coding it all is difficult and time-consuming. ACA seemed to offer a lot of potential for new insights about agenda setting, political communication, etc.
Last week I attended a workshop on text analysis and migration politics organised by COMPAS. The team there had made a great effort to bring together people working on the technical aspects of ACA (from the field of, as I believe it is known, corpus linguistics) with people trying to apply it to interesting political science questions, especially in the field of migration. I presented a paper with my colleague Tom Nicholls.
Overall the workshop was really interesting. However, it did make me wonder whether ACA is going to play quite as key a role in the future of political science as I thought. Several things struck me:
- ACA in theory eliminates the need for hand coding. But in practice doing ACA properly requires a lot of hand coding to create a training and validation dataset.
- Getting relatively good results with a naïve Bayes classifier on a simple problem (e.g. topic classification) isn't too technically challenging. But getting very good results is much more complex. Furthermore, the field of corpus linguistics is still very much experimenting to find the best techniques.
- I'm not really sure how best to present or interpret the measures of accuracy (precision and recall) that ACA produces, nor how to feed them into a more typical statistical analysis.
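To make the points above concrete, here is a minimal sketch of the whole pipeline: a hand-coded training set, a naïve Bayes topic classifier with Laplace smoothing, and precision/recall computed on a hand-coded validation set. The tiny "migration" vs "economy" documents are invented for illustration, not real data; a serious application would need thousands of hand-coded documents, which is exactly the hidden cost noted above.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label) pairs, hand-coded by a human."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        for t in tokens:
            word_counts[label][t] += 1
            vocab.add(t)
    return label_counts, word_counts, vocab

def predict(model, tokens):
    """Pick the label with the highest (log) posterior probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label in label_counts:
        lp = math.log(label_counts[label] / total_docs)  # prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            # Laplace smoothing: unseen words get a count of 1
            lp += math.log((word_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

def precision_recall(gold, pred, positive):
    """Precision = tp/(tp+fp); recall = tp/(tp+fn) for one class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented hand-coded examples (the expensive part in practice):
train_docs = [
    ("migrants cross border asylum".split(), "migration"),
    ("asylum seekers border policy".split(), "migration"),
    ("budget tax economy growth".split(), "economy"),
    ("tax cuts economic growth".split(), "economy"),
]
val_docs = [
    ("border asylum claim".split(), "migration"),
    ("economy tax budget".split(), "economy"),
]

model = train(train_docs)
gold = [label for _, label in val_docs]
pred = [predict(model, tokens) for tokens, _ in val_docs]
p, r = precision_recall(gold, pred, "migration")
```

On this toy validation set the classifier labels both documents correctly, so precision and recall for "migration" are both 1.0; with a realistic corpus they would diverge, and it is those two numbers, rather than a familiar standard error, that would have to be carried into any downstream statistical analysis.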
All in all I feel there's a lot of potential here. But doing good ACA also requires a lot of hard work, and its unfamiliar accuracy statistics mean that I'm not sure the results will be accepted at face value by many political scientists.