
Dataset Essays

Date: 2022
Index: dataset-critique writing

Context

These are excerpts from a 2022 DECRA proposal draft. The proposal grew out of ideas on the Expanded Essay and, although it didn't go anywhere, a few parts of it informed subsequent work, including Machine Listening's discussion of dataset theatre.

Dataset essays: Creative documentary approaches to machine learning systems

This project develops creative documentary approaches to representing the datasets used to train machine learning algorithms and to reflecting on the complex social problems they encode and reproduce. By applying artistic methods of media analysis, storytelling, and social and material engagement, it aims to contribute to the much-needed advancement of critical approaches to 'Responsible AI' and 'Ethical AI'.

Research context

Machine learning models are being deployed across almost every domain - including health, sport, education, government, environment, military, mining, and logistics - to automate classifications, recommendations, decisions, and actions that impact lives, economies, and ecologies. These models are notoriously difficult for people to understand, particularly those who are most directly affected by the consequences of automation. Activists and scholars have critiqued the biases that algorithms express when existing social and economic inequalities are programmed into the models. Large tech companies, which are best positioned to utilize and profit from big-data-based AI, have tried to address and even preempt these social critiques through industry-led Ethical AI and Responsible AI initiatives. One pillar of these efforts is directed at the inscrutability of 'black box' algorithms and machine learning models, attempting to make them transparent or explainable.

If understanding the performance of an algorithm is one approach to explainability, then another would be to understand how it was made in the first place. Datasets are an integral part of machine learning because they contain the actual pieces of data used to train and test models. Until recently they have escaped scrutiny as cultural objects because they seem unremarkable, are highly specialized, and are technically demanding to work with. Reframed as a cultural object and archive, a dataset can be forensically examined to expand popular literacy and critical awareness of the sociotechnical systems within which it is used.
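As a deliberately minimal sketch of where such a forensic examination might begin, the snippet below tallies the label categories and provenance of a handful of invented annotation records using only Python's standard library. The records, labels, and sources are hypothetical and do not refer to any real dataset; the point is simply that the categories a dataset imposes, and where its raw material was taken from, are legible in its metadata.

```python
from collections import Counter

# Hypothetical annotation records of the kind a close reading might start from:
# each entry pairs an image identifier with a crowd-sourced label and the
# (invented) source it was scraped from.
records = [
    {"image_id": "img_0001", "label": "person", "source": "flickr"},
    {"image_id": "img_0002", "label": "criminal", "source": "news_scrape"},
    {"image_id": "img_0003", "label": "person", "source": "flickr"},
    {"image_id": "img_0004", "label": "labourer", "source": "stock_photos"},
    {"image_id": "img_0005", "label": "criminal", "source": "news_scrape"},
]

# Counting labels and sources is the most basic forensic gesture: it surfaces
# the categories the dataset imposes on the world and the provenance of its
# raw material.
label_counts = Counter(r["label"] for r in records)
source_counts = Counter(r["source"] for r in records)

print("Label categories:", dict(label_counts))
print("Data provenance:", dict(source_counts))
```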

Research question/s

  • In what ways would a close reading of datasets, their contents, their metadata, and their application generate critical insights into the politics of machines that are perceiving and automating our world? How might these insights contribute to the critical approaches within the developing field of AI Ethics?

  • If experimental, critical documentary approaches have worked with archives in order to contest official histories and memory, then what happens when these approaches are applied to datasets, whose purpose is to actively shape the future? What affective outcomes result when these artistic approaches appropriate and mobilise the very technologies that they represent?

  • How could these affective encounters, social engagement, public exhibitions and performances be guided to construct an activated, discursive space for reflecting on datasets and their use?

Gap in knowledge

A new overview of AI Ethics literature concludes that "critical approaches to AI ethics hold the most fruitful and promising way forward." (@birhane_forgotten_2022, 2) Nanna Bonde Thylstrup offers "critical dataset studies" as the name for recently emerging explorations of the ethical and political dimensions of datasets. (@thylstrup_ethics_2022) An industry perspective has noted, however, that "[i]t is very hard to publish a dataset documentary that dives deeply into a dataset, documenting its sources, nuances, limitations and strengths." (@leetaru_how_2018)

There has been an "archival impulse" and a "documentary turn" in important politically engaged contemporary art in recent decades, but this tendency has rarely been directed towards machine learning datasets. The archives that creative, critical documentary practice in art has dealt with are typically drawn from national collections, mass media, and more recently social media. What if these tools and sensibilities were applied to datasets? My contention is that artistic research in the documentary tradition of the essay film (@alter_political_21), which has long foregrounded matters of ethics, subjectivity, and the politics of representation, would make a significant contribution to the search for critical forms.

Within art practice, there have been occasional projects that uncover the datasets underlying AI systems, including Exposing.ai (Adam Harvey and Jules LaPlace) and Excavating AI (Kate Crawford and Trevor Paglen), typically focused on biometric data. This project is aligned with them in many ways, but rather than pursue a logic of "exposure" it aims for depth: an innovative application of the essay form (@adorno_essay_1991; @corrigan_essay_2011; @fausty_trinh_2010), following in the tradition of the essay film but expanding it through video installations, multimedia essays, and live essay performances.

An affective, self-reflective handling of datasets that attends to their social and technical production, as well as to how they are situated in power relations, would offer a different approach to ethics than the one that has so far governed the development of AI Ethics.

Conceptual/ theoretical framework, design and methods

My conceptual framework for understanding the interactions between technical systems and social systems has long drawn from media theory, law, and philosophy, particularly Wendy Chun, Mark Andrejevic, Vilém Flusser, Antoinette Rouvroy, Mireille Hildebrandt, and Bernard Stiegler. More recent critiques of machine learning's perpetuation of race, gender, and class inequalities from activists, scholars, and professionals such as Timnit Gebru, Abeba Birhane, Ruha Benjamin, and Virginia Eubanks have implicated technology in ongoing capitalism, colonialism, and systemic injustice.

An investigation of datasets might be conducted from within STS, philosophy, or computer science. I frame it within socially engaged, research-based art partly because art's transdisciplinary orientation is well suited to this sort of boundary object, but also because art's methods allow for both affect and critical self-reflection as means of raising ethical questions. I have pursued this kind of research through a methodology of the Expanded Essay.

The essay film, as described by Timothy Corrigan, Phillip Lopate, Laura Rascaroli, and Nora Alter, demonstrates that the essay can be considered a methodology that operates beyond the printed page. Moreover, the methodology of the essay seems to hold some appeal for filmmakers and artists examining new technologies, in part because it allows those technologies to be incorporated into the form. Chris Marker’s Level Five (1996), Harun Farocki’s Parallel (2012), Rabih Mroué’s The Pixelated Revolution (2012), and Hito Steyerl’s How Not to Be Seen. A Fucking Didactic Educational .MOV File (2013) each exemplify this tendency for me.