Following his guest lecture on "The Economic Significance of Laws Relating to Employment Protection and Different Forms of Employment", Simon Deakin (University of Cambridge) speaks to us about the algorithmic nature of legal rules, the large-scale quantitative analysis of laws from over 100 countries, and its role in informing evidence-based policy-making on labour law and regulation. He also hints at the potential of AI in the field of legal studies.
Prof. Simon Deakin, University of Cambridge

You stated that “law is an algorithm in a way.” Could you elaborate on this?

An ‘algorithm’ can be defined as an ordered process for converting inputs into outputs. In computing, algorithms are essentially mathematical functions of various kinds. However, an algorithm does not have to be mathematical. When constructing our dataset we used coding ‘protocols’ or ‘algorithms’ which defined a process for converting text into numerical values. This type of ‘algorithm’ is a set of instructions expressed in verbal form; the raw material, or input, consists of the texts of laws and judgments; and the output is a numerical value expressing the content of the law in mathematical form. Think of it as one way to translate text into code. To see in more detail how we did this, consult the codebook for the dataset, which can be viewed online: https://www.repository.cam.ac.uk/handle/1810/263766.
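
To give a flavour of what such a coding ‘algorithm’ does, here is a minimal, hypothetical sketch in Python. The variable, thresholds and scoring scale below are invented for illustration; the real protocols are set out in the codebook linked above.

```python
# Hypothetical example of a verbal coding rule expressed as a function:
# input = a feature read from the legal text, output = a numerical value.
# The variable and thresholds are illustrative, not taken from the actual codebook.

def code_notice_period(weeks_of_notice: float) -> float:
    """Convert a statutory notice period (read from the text of the law)
    into a score between 0 (no protection) and 1 (strong protection)."""
    if weeks_of_notice <= 0:
        return 0.0                          # no statutory notice requirement
    if weeks_of_notice >= 12:
        return 1.0                          # twelve weeks or more: maximum score
    return round(weeks_of_notice / 12, 2)   # otherwise scale linearly

print(code_notice_period(4))  # a law granting 4 weeks' notice would be coded 0.33
```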

Legal rules themselves are algorithmic in a different sense. Think of basic liability rules, such as those defining what a ‘tort’ is. These are algorithmic in the sense of consisting of an ordered process resulting in a legal determination: duty + breach + causation + damage = tort.
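
The same point can be put in a few lines of code. This is only a schematic rendering of the formula above, not a statement of tort doctrine:

```python
def is_tort(duty: bool, breach: bool, causation: bool, damage: bool) -> bool:
    """An ordered process resulting in a legal determination:
    duty + breach + causation + damage = tort."""
    for element in (duty, breach, causation, damage):
        if not element:
            return False   # the enquiry stops as soon as one element fails
    return True

print(is_tort(duty=True, breach=True, causation=True, damage=False))  # False: no damage, no tort
```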

You can think of the ‘leximetric’ data we created as lying somewhere between a purely mathematical function of the kind used in machine learning and AI, and legal liability rules which are entirely verbal or textual in content.  Our method has elements of both verbal and mathematical processing.

Introducing your research project, you distinguished between empirical research (“computing regressions”) and doctrinal research (“interpreting”). In your project, do you rely more on one than the other?

I understand the term ‘systematic interpretation’ to describe what lawyers do when they analyse legal rules and apply them to particular clusters of social facts (‘fact situations’). Lawyers use their knowledge and training to arrive at an interpretation of a legal text which is systematic in the sense of ‘fitting’ into a wider pattern of meaning attributed to various inter-related texts (statutes and judgments which sum to something we might recognise as ‘labour law’). Through doctrinal analysis, lawyers aim to arrive at a standard or accepted meaning of a legal rule. This is a meaning which will be regarded as possibly authoritative, and at least provisionally stable, by members of the wider ‘cognitive community’ of lawyers. Regression analysis is something else: the use of mathematical techniques to identify patterns in data, including correlations between variables, and causal inferences. This method is also partly conventional since it depends on shared understandings among social scientists of what constitutes a statistically significant result or a sufficiently meaningful correlation coefficient. In our research we used elements of both techniques. We first of all interpreted legal texts, using our labour law training, to arrive at what we considered would be their standard or accepted meaning; we then applied our coding algorithm to convert these meanings into a numerical form; and we then applied regression analysis to the resulting variables with a view to testing various claims about correlation and causation (to do this last part we worked with experts in econometrics). For example, once we had the legal data, we could use statistical techniques to test claims that labour laws tended to cause unemployment.
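
As a purely illustrative sketch of that final step: once laws are expressed as numbers, a regression can be run against an outcome variable such as unemployment. The data below are synthetic and the single-regressor model is far simpler than the panel econometrics used in the project itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
law_index = rng.uniform(0, 1, n)                               # coded strength of employment protection
unemployment = 6.0 + 0.2 * law_index + rng.normal(0, 1.5, n)   # synthetic unemployment rate (%)

# Ordinary least squares: unemployment = a + b * law_index
X = np.column_stack([np.ones(n), law_index])
(a, b), *_ = np.linalg.lstsq(X, unemployment, rcond=None)
r = np.corrcoef(law_index, unemployment)[0, 1]

print(f"intercept = {a:.2f}, slope = {b:.2f}, correlation = {r:.2f}")
```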

You and your team are coding large datasets from a large repertoire of laws. How do you manage this complex coding process?

The online sources accompanying the dataset detail the process we followed, which other researchers can also follow if they are interested in replicating or extending our results. This, however, does not convey the full sense of what was involved: a small team of people spending several months at a time tracking down legal texts (over 100 countries were coded), mostly via the internet but sometimes through ‘physical’ text retrieval (visiting a law library or archive); reading and interpreting these texts; applying the coding protocol to them; and then checking and re-checking the data to eliminate error as far as possible. The project began on a small scale in 2005-7, was then put to one side (partly because we didn’t have funding to extend it), and then resumed with the help of new grants and the support of the ILO in 2014. The dataset was mostly complete by 2016 and since then there have been periodic updates. When the work was at its most intense in 2014-15, we would be exchanging data between us more or less every day for weeks on end. It was exciting to see the dataset take shape and then to see the results of the econometric analysis. But it’s like climbing a mountain when you can’t see the summit. If you had known the difficulties you would encounter, you might not have started. I’m glad we did.

To what extent is there room for subjectivity in the coding of laws? And how does the coding process relate to the algorithmic properties of laws?

Some subjectivity is unavoidable. You have to define the coding protocol for a start; there is no single, predetermined way to do this. The best you can do is to be as transparent as possible about the assumptions you used. In applying the coding protocol, you have to reduce error as far as possible. The protocol was very tightly defined; there was a more or less ‘correct’ answer to be arrived at for every law. We minimised error by iteratively comparing each other’s codings and assessments until we were satisfied with the outcome. In that sense the process was somewhat deliberative. You can’t entirely eliminate bias at this point. But the virtue of our approach is that, since the coding protocol is published along with a very extensive sourcebook listing all the primary sources we relied on (it is nearly 1,000 pages long), other researchers can check the data and tell us if they think we got it wrong. This element of transparency and verifiability means that the dataset has an inbuilt error-correction mechanism. We do get regular feedback now from third parties and we correct errors where they can be identified, but in practice this happens very rarely. It seems that we got it mostly right.
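
A small sketch of the kind of cross-check described here, with invented variable names and values: two coders apply the same protocol independently, and any disagreement is flagged for discussion before a value enters the dataset.

```python
coder_a = {"notice_period": 0.33, "procedural_constraints": 1.0, "severance_pay": 0.5}
coder_b = {"notice_period": 0.33, "procedural_constraints": 0.67, "severance_pay": 0.5}

# Flag every variable on which the two codings diverge, for joint review.
disagreements = {
    variable: (coder_a[variable], coder_b[variable])
    for variable in coder_a
    if coder_a[variable] != coder_b[variable]
}

print(disagreements)  # {'procedural_constraints': (1.0, 0.67)} -> resolved by deliberation
```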

Can you give examples of challenges you encounter in constructing data sets?

Tracking down legal texts is one problem, although the ILO’s NATLEX database has made this much more straightforward. Languages are another issue. Between us we knew half a dozen languages, which, with the help of translations, covered most systems. Sometimes we would have to take advice from a national expert. Today we might make more use of machine translation applications such as Google Translate.

Your project has been going on for a long time. Is there an underlying motivation or vision driving the project over a long period of time?

All researchers want to shape the field they work in, and they may hope that their research will benefit society. It has taken more than a decade to see the project through to its current point, but as I have explained, there were periods when we had to set it to one side. We were fortunate in having institutional support, from Cambridge University, for a long-term project, and to have funders, including the Economic and Social Research Council and the ILO, prepared to commit significant resources to it at the critical points. The project is still not finished: there is updating of the dataset to be done, and we could extend and deepen its scope. There is also much more econometric analysis to be done with the data. It has huge potential to reshape understandings and policy debates. The work is also producing new insights and opening up unexpected pathways, including the link between AI and algorithmic reasoning in law discussed above. We didn’t anticipate this at all when we began, but given the huge interest in AI, the methodology underlying the project takes on a new significance for research seeking to explore similarities and, just as important, differences between law and machine learning.

What did the funders make of the research?

The ESRC has been promoting evidence-based policy for several decades and the ILO has a similar commitment to improving the evidence base for labour laws and labour market policies. So I hope we met their objectives. Of course, labour law is a hot potato politically and evidence alone will never be enough to ensure that we get more effective and fairer laws. There is only so much that researchers can do. The knowledge gained through empirical research is necessarily tentative. It is always open to challenge, but the challenge must be on the right grounds: for example, that the methods used were inappropriate, or incorrectly applied. To see expertise itself being challenged is worrying. If we lose sight of the idea that the social sciences are a critical part of society’s knowledge base, we are going to find ourselves in a very difficult place.