AI-assisted metadata harmonisation for mouse models of human disease

Dr Ross Gray & Prof Crispin Miller


Labs: Computational Biology and Data Science and Data Management Team
Duration: 4 years, starting October 2026
Closing Date: 5 June 2026

Application Instructions - please read before applying

Please note, for your application to be considered, you must upload your CV and a completed CRUK EDI Recruitment Form (107 KB).

  • We ask that you do not add your name or any Institution details to the CRUK EDI Recruitment Form.
  • Applications will be shortlisted initially based on the CRUK EDI Recruitment Form only. CVs will be used in further rounds of shortlisting to invite candidates to interview.
  • Please upload your CV and Recruitment Form as two separate documents.
  • References will only be requested after an initial shortlisting stage. 

 APPLY HERE

Background

Modern life sciences research increasingly depends on the ability to integrate complex, multimodal datasets across institutions, disease areas, platforms and experimental systems. However, much of the value in historical and newly generated datasets remains locked behind inconsistent metadata, fragmented standards, laboratory-specific terminology and variable data capture practices. These problems limit reproducibility, reduce confidence in downstream analyses, and make it difficult to reuse data across projects.

This PhD project aims to make significant advances in addressing this critical problem and supports an existing CRUK data platform project. The central challenge is to transform incomplete and inconsistently described research metadata into structured metadata that supports downstream search, analysis, comparison and reuse. The project will investigate how large language models can support this process by mapping source metadata to established standards such as OMOP and to an emerging mouse model metadata standard being developed through the National Mouse Genetics Network, while ensuring that outputs are reliable, auditable, validated and suitable for use in secure biomedical research environments.

The second part of the project will apply these methods to large historical datasets that are not currently harmonised. Once mapped into a standard representation, these datasets will be used for cross-species disease positioning, linking mouse cancer models to human cancer data. This will enable researchers to investigate which mouse models best represent particular human disease states, improving model selection and translation.

Research Question

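The project asks how large language models can be used to harmonise research metadata reliably and securely, and will address this through four linked objectives.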

First, the candidate will define the metadata harmonisation problem in the context of a CRUK data platform and the wider National Mouse Genetics Network. This will involve characterising the types of metadata encountered across mouse genetics, cancer, imaging, histopathology, omics and experimental model datasets. They will review existing common data models, including OMOP and relevant biological metadata standards, and investigate how these can be used alongside the developing National Mouse Genetics Network standard for mouse model metadata. A key output will be a practical mapping strategy that allows heterogeneous source metadata to be aligned with these target standards in a consistent and auditable way.

Second, the candidate will develop a programmatic evaluation framework for assessing large commercial LLMs on metadata harmonisation tasks. This will include benchmark dataset construction, automated scoring, error injection and analysis, reproducibility checks, cost and latency profiling, and systematic comparison of model performance across different prompt, retrieval and structured-output strategies. This phase will establish which model capabilities are genuinely useful for harmonisation and where risks such as hallucination, incorrect mappings, poor uncertainty calibration or inconsistent outputs limit their safe use.
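
As an illustration only, the automated scoring and error injection steps could be sketched as below. This is a minimal sketch under stated assumptions, not the project's framework: the field names in GOLD, the inject_errors and accuracy helpers, and the use of plain accuracy as the score are all hypothetical.

```python
import random

# Hypothetical gold-standard mappings: source field name -> target field name.
GOLD = {
    "mouse_strain": "strain",
    "Sex (M/F)": "sex",
    "tumor site": "tissue",
    "age_wks": "age_weeks",
}


def inject_errors(mapping, rate, rng):
    """Return a copy of `mapping` in which each target is replaced by a
    different, wrong target with probability `rate`."""
    targets = sorted(set(mapping.values()))
    corrupted = {}
    for source, target in mapping.items():
        if rng.random() < rate:
            wrong = [t for t in targets if t != target]
            corrupted[source] = rng.choice(wrong)
        else:
            corrupted[source] = target
    return corrupted


def accuracy(predicted, gold):
    """Fraction of gold source fields that were mapped to the correct target."""
    correct = sum(1 for s, t in gold.items() if predicted.get(s) == t)
    return correct / len(gold)


rng = random.Random(0)  # fixed seed so the check is reproducible
perfect = accuracy(GOLD, GOLD)
fully_corrupted = accuracy(inject_errors(GOLD, rate=1.0, rng=rng), GOLD)
print(perfect, fully_corrupted)
```

Injecting errors at a known rate gives a simple sanity check on the scorer itself: a score that does not fall as the injected error rate rises is not measuring mapping quality.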

Third, the candidate will investigate secure local LLM implementation for use in environments suitable for sensitive health and biomedical research data. This will include evaluating open-weight or locally deployable models, designing workflows that avoid exposing sensitive metadata to external services, and implementing controls around access, logging, provenance, validation and human review. The aim will be to translate lessons from commercial model benchmarking into a secure, auditable and practically deployable LLM-assisted harmonisation workflow.
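
One way to picture the validation and human review controls is a triage step applied to each mapping a model proposes. The sketch below is hypothetical throughout: ALLOWED_TARGETS, the ProposedMapping structure, the model-reported confidence score and the 0.8 threshold are assumptions made for illustration, not part of any existing platform.

```python
from dataclasses import dataclass

# Hypothetical target vocabulary; a real one would come from the chosen standard.
ALLOWED_TARGETS = {"strain", "sex", "tissue", "age_weeks"}


@dataclass
class ProposedMapping:
    source_field: str
    target_field: str
    confidence: float  # assumed model-reported score in [0, 1]


def triage(proposal, threshold=0.8):
    """Return 'reject', 'review' or 'accept' for one proposed mapping."""
    if proposal.target_field not in ALLOWED_TARGETS:
        return "reject"  # target is not in the schema, possibly hallucinated
    if not 0.0 <= proposal.confidence <= 1.0:
        return "reject"  # malformed confidence value
    if proposal.confidence < threshold:
        return "review"  # valid target, but uncertain: send to a human curator
    return "accept"


proposals = [
    ProposedMapping("mouse_strain", "strain", 0.97),
    ProposedMapping("tumor site", "tissue", 0.55),
    ProposedMapping("cage_id", "housing_unit", 0.90),
]
decisions = {p.source_field: triage(p) for p in proposals}
print(decisions)
```

Recording each decision alongside the proposal that produced it would give the audit trail that the logging and provenance controls call for.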

Fourth, the candidate will apply the resulting harmonisation workflow to large-scale historical datasets. These harmonised datasets will then be used to support disease positioning between mouse cancer models and human cancer datasets, for example by linking model genotype, tissue and tumour type to human cancer cohorts such as TCGA. The project will test whether improved metadata harmonisation increases the scale, reliability and interpretability of cross-species model comparison.

The expected outcome is both a practical contribution to an existing CRUK data platform and a generalisable framework for secure AI-assisted metadata harmonisation in biomedical research infrastructure.

Skills/Techniques that will be gained

The candidate will sit within both a CRUK computational biology research team and a data science team, gaining experience of cutting-edge research in both areas.

The candidate will gain advanced skills in biomedical data science, metadata engineering, common data models, ontology mapping and FAIR data infrastructure. They will learn to design and evaluate LLM-based workflows for scientific metadata harmonisation, including prompt engineering, structured outputs, schema validation, uncertainty scoring and human-in-the-loop review.

They will develop software engineering experience in the context of a production research data platform, including Python package development, API integration, relational databases, S3/object storage, Airflow-driven ETL/ELT workflows, testing, CI/CD, versioned schemas and reproducible reporting.

Biologically, the candidate will gain experience working with mouse cancer models, human cancer datasets, multimodal experimental metadata and cross-species disease positioning. They will also develop transferable skills in benchmarking, reproducible analysis, data governance, scientific communication and collaborative research across software, data management and cancer biology teams.

Funding 

  • stipend at CRUK rate
  • tuition fees at home or international rate
  • consumables funding

For questions regarding the application process, PhD programme/studentships at the CRUK Scotland Institute or any other queries, please contact phdstudentships@crukscotlandinstitute.ac.uk.

Applications are open to all individuals irrespective of nationality or country of residence.

Relevant Publications

  1. Y. Salimi et al. “Evaluating language model embeddings for Parkinson’s disease cohort harmonization using a novel manually curated variable mapping schema” Sci Rep 15, 20210 (2025).
  2. A. Verbitsky et al. “Metadata harmonization from biological datasets with language models” Bioinformatics Advances 5, vbaf241 (2025).