What can Python Know? Lessons from the (Digital) Field
Programming languages like Python and R have revolutionised empirical research, and they are increasingly used in Socio-Legal Studies for both quantitative and qualitative work. Indisputably, these tools open up spaces for the research of text, images, and videos, be it in social media, news articles, or legal decisions. They provide the opportunity to analyse huge amounts of data and to unearth connections that might be missed by the human researcher. The promise of being able to undertake research on an entire body of jurisprudence in a jurisdiction has never been closer. Likewise, sentiment analysis makes it possible to study the emotions underlying content, such as narratives expressing agreement, disagreement, happiness, or anger. However, in qualitative studies adopting a grounded approach to inductive research, a central question arises: what can Python know? Can Python account for complexity and nuance in ways we have come to expect of high-quality qualitative research?
One appeal of Python has been its usefulness for the study of opinion-sharing platforms. Social media can arguably be understood as the new public fora of discussion: it provides a virtual/digital space within which diverse people and publics come into connection, conversation, and conflict. Python has been relevant not only in understanding the influence of social media on government elections, but also in uncovering how it became a tool for protest, mobilisation, and the spread of hate speech. Additionally, it has provided insights into the ways users experience law. While such knowledge can be produced through traditional methods such as discourse analysis or content analysis, these might fall short when it comes to analysing vast amounts of data. Natural Language Processing, a subfield of AI that applies machine learning methods to text and uses tools such as Python or R, provides a solution to this challenge.
In my doctorate, I merge, and draw from, content analysis and Natural Language Processing as methods to research narratives of the #MeToo movement on Twitter through two case studies. In the pilot of my 'netnography', I tested the use of Topic Modelling with Python as a tool for content analysis. This method adopts an unsupervised machine learning approach that extracts topics from text data. Topic Modelling is an example of a human-machine partnership and of unsupervised machine learning in qualitative studies, in which the computer makes inferences about the text through prevalence, frequency, and the parameters set in the model. Each topic is a collection of dominant keywords that serve as its typical representatives.
In the results I obtained, Python extracted the overall themes of the conversation as topics: sexual harassment, injustice, rights, or voice, amongst others. I did not face problems with choosing the number of topics or with pre-processing the data, both common hurdles in Natural Language Processing. However, I observed that the results did not contain the nuance offered by the manual coding exercise I carried out in parallel. Python's selection of dominant words removed context, sentiment, and references to the aim of the posted opinion: it split sentences into tokens (tokenization), returned each word to its dictionary form (lemmatization), and discarded common words that add little to a topic (stopwords). In other words, while dominant words created general themes, they did not necessarily capture the complexities of public opinion conversations.
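The three pre-processing steps named above can be illustrated with a deliberately simplified sketch. The tiny stopword set and lemma table here are hand-written stand-ins for what a full NLP library such as NLTK or spaCy would supply; the point is to show how word order, grammatical markers, and function words drop out before the model ever sees the text.

```python
# Illustrative pre-processing pipeline: tokenization, lemmatization,
# and stopword removal. The lemma table and stopword set are toy
# stand-ins for a real NLP library's resources.
import re

STOPWORDS = {"the", "is", "a", "an", "of", "to", "and", "was", "were", "my"}
LEMMAS = {"stories": "story", "voices": "voice",
          "silenced": "silence", "speaking": "speak"}

def preprocess(text):
    # Tokenization: split the sentence into lowercase word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Lemmatization: return each token to its dictionary form.
    lemmas = [LEMMAS.get(t, t) for t in tokens]
    # Stopword removal: drop common words that add little to a topic.
    return [t for t in lemmas if t not in STOPWORDS]

print(preprocess("The silenced voices were speaking of my stories"))
# → ['silence', 'voice', 'speak', 'story']
```

What survives is a bag of dominant words; who was silenced, by whom, and with what feeling is exactly the kind of context the pipeline strips away.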
These issues point to the limits of Python and Topic Modelling for qualitative research of this nature, which uses an interpretative framework to analyse the interactions of tweets within the #MeToo movement. As Mulcahy and Wheeler have noted, the ‘interpretive tradition of textual analysis involves a thorough and fine-grained analysis of text in order to grasp meaning through hermeneutic understanding.’ When I first approached Topic Modelling, I assumed that Python, running a Latent Dirichlet Allocation (LDA) algorithm, could unearth differentiation nuanced enough to ground a Socio-Legal approach to the narratives on Twitter. The pilot proved me wrong.
Topic Modelling and the use of Python have many virtues for qualitative research, as explained above. However, deferring to the software to identify relationships between words, categorise themes, and thereby perform content analysis in the Socio-Legal field cannot, as yet, replace the researcher. Python misses the complexity of the context beyond word prevalence and, in doing so, misses the central objective of this type of research: the portrayal of lived experiences of the law representing a form of legal consciousness from the bottom up, manifesting in social media narratives. This does not mean Python cannot do the job if fed with parameters derived from old-fashioned manual coding structures. It means that coding and analysis are still an iterative process, requiring critical engagement and sustained reflexivity on the part of the researcher.