In spite of a wide range of diversity and inclusion reviews, women remain systematically under-represented in academia, with fewer progressing from PhD to Professor than their (overwhelmingly white) male colleagues. Even in the absence of a moral case for tackling systemic bias, this represents not just an enormous waste of talent, but an undermining of the quality and extent of innovation at a time when government priorities (including a proposed increase to R&D spending of £60 billion) imply recruiting 260,000 more researchers by 2027. This project therefore seeks to enhance our understanding of how a ‘gendering’ of the research pipeline might offer insight into the challenges (and, hopefully, opportunities) faced by women as they make the transition to independent researchers. We know quite a bit (though not nearly enough) about the kinds of negative personal experiences that drive women out of academia, and we have useful snapshots of the overall composition of the academic workforce, but we know next to nothing about the research environment formed by the combination of discipline, institution, and department, or about how this shapes doctoral research and researchers. Natural Language Processing (NLP), which uses computers to extract data from text, and Data Science (DS) approaches give us a way to bridge this gap: structured metadata on more than 520,000 completed PhDs, collected by the British Library (BL) in order to promote public access to doctoral research, can give us insight into who was studying what, where, and when. We propose to use this, together with data extracted from the unstructured text of more than 240,000 full theses held by the BL and Institutional Repositories, to develop a picture of the research landscape that allows us to estimate gender effects in doctoral research.
We recognise that this is just one aspect of a much larger problem: the challenges faced by BAME students and LGBTQ+ students, and the intersectional challenges encountered by, for instance, women of colour, are profound. The scale and complexity of this issue lie beyond the scope of a single PhD. The project will also raise, and must actively engage with, the ethical issues implied by the need to infer gender from input features, together with the uncertain impact on any conclusions that this implies, but this represents an opportunity for wider public engagement with the limits of ‘ethical AI’. The focus on women in this work with ‘data exhaust’ therefore represents only a first step, but it is hoped that successes here will lead to follow-on work engaging more widely with ‘Research and Innovation Culture’ (UKRI Delivery Plan 2019) and ‘Talent, methods and leadership’ (ESRC Delivery Plan 2019). Where the interests and skills of the student allow, or opportunities for broader collaboration emerge, we would hope to conduct follow-up targeted qualitative investigation of the behaviours, such as supervision or support, of diverse research units.