Background
Juvenile idiopathic arthritis (JIA) and uveitis can cause disability and increased comorbidity risk into adulthood if diagnosed late or treated ineffectively [1]. Patients’ day-to-day experiences are varied and cannot be fully captured in clinical trial settings; while real-world data can help answer many research questions, substantial resources and time are needed to collect them. New datasets derived from existing data have many benefits: maximising the availability of larger sample sizes that may not be feasibly collected individually, improving the generalisability and validity of research, and providing multi-disciplinary and multi-centre collaborative opportunities [2].
CLUSTER [3], a UK Research and Innovation Medical Research Council/Versus Arthritis funded multidisciplinary consortium, aims to improve personalised treatments and predict disease outcomes for JIA and JIA-uveitis by bringing together knowledge, studies, and data. It builds on the work of the MRC-funded CHART consortium (Childhood Arthritis Response to Treatment), which explored how to bring together clinical and biological data from four UK observational JIA research cohort studies to create a larger unified dataset for analysis of predictors of treatment response. CLUSTER aims to create a large-scale JIA data resource by harmonising existing data collected in clinical trials and real-world JIA cohort studies. Maximising CLUSTER’s clinical and biological data by successfully harmonising multiple datasets is integral to producing robust analyses with maximal power in this rare disease, and facilitates the goal of defining distinct strata across disease and treatment sub-groups.
Heterogeneous datasets are often collected autonomously, for specific analytical objectives and without coordination, and they are nuanced, requiring prior knowledge of data capture methods and coding. A key challenge is therefore managing and combining disparate, non-standardised datasets. Different systems and data structures, cultural barriers such as apprehension about sharing data, and restrictions related to ethical, legal and consent procedures are also very common [2, 4].
There are many ways to bring together disparate datasets for analysis, such as data pooling or federated data analyses with subsequent meta-analyses of results; all approaches require the critical step of data harmonisation. Local data laws and study-specific governance may dictate to what extent data pooling and linkage can occur, but where pooling is planned, knowledge of the potential for duplicated subjects across datasets is also key. Data linkage, where data are combined from two or more sources with the objective of consolidating facts concerning an individual or event that are not available in any separate record [5], will also enhance the final dataset, although substantial data cleaning, wrangling, and computational resources may be required.
Objective
In 2021, data from four JIA datasets under the CLUSTER umbrella were successfully pooled and made available for analyses. Here, we describe and evaluate the current data harmonisation processes derived as part of CHART and CLUSTER, and highlight how these enable the consortium’s research and data sharing goals.
Discussion
Successes
Data from over 5400 individual patients with JIA were harmonised to create prospective, detailed JIA treatment datasets at a scale rarely seen: the largest contributing cohort study has around 2000 participants across both MTX and etanercept (BSPAR-Et), compared to 2899 in this MTX dataset and 2401 in this TNFi dataset. Many of these studies continue to recruit patients and collect further data; because the processes used to create the existing dataset have been logged, it can be updated at intervals to expand it further. This is invaluable for progressing meaningful JIA research, particularly into personalised treatments and disease outcomes. Integration has added depth, enables big-data approaches such as machine learning, and highlights inconsistencies that would not be apparent in the individual datasets.
Encrypting and matching duplicates led to improvements in identifying erroneous NHS numbers and biological sample labels in the original studies. Using encrypted NHS numbers and pseudonymised study IDs maximised data usage through pooling individual treatment data from multiple sources and time points to create a more complete picture of a patient’s treatment pathway. This process can also bring in further data as it is generated or discovered. With a common unique identifier facilitating data pooling, larger datasets can now be anonymised and shared with external collaborators and third parties.
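The matching step described above can be illustrated with a minimal sketch. The encryption scheme used by CLUSTER is not specified here, so the keyed hash (HMAC-SHA-256), the secret key, the function names, the study IDs, and the identifier values below are all assumptions made purely for illustration.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical secret key; in practice it would be held securely by a
# trusted party and never stored alongside the data.
SECRET_KEY = b"example-key-not-for-real-use"


def encrypt_identifier(nhs_number: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest of a normalised identifier."""
    # Strip spaces and other separators so formatting differences
    # between studies do not hide a true match.
    normalised = "".join(ch for ch in nhs_number if ch.isdigit())
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def find_duplicates(records):
    """Group (study, study_id, nhs_number) rows by encrypted identifier.

    Only groups that span more than one study are returned.
    """
    groups = defaultdict(list)
    for study, study_id, nhs_number in records:
        groups[encrypt_identifier(nhs_number)].append((study, study_id))
    return {
        token: members
        for token, members in groups.items()
        if len({study for study, _ in members}) > 1
    }


# Made-up example rows: the first two refer to the same person.
records = [
    ("CAPS", "C-001", "943 476 5919"),
    ("CHARMS", "H-104", "9434765919"),
    ("BCRD", "B-220", "4010232137"),
]
duplicates = find_duplicates(records)
```

In this sketch the raw identifier never needs to leave the contributing study: only the digest and the pseudonymised study ID are compared. A mistyped identifier in either study would produce a different digest, which is one way a true duplicate pair could be missed.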
Building CLUSTER into a multi-disciplinary community was key in achieving our goals, particularly the early involvement of informatics and data science professionals. These datasets also provide the opportunity to expand our community and link with established consortia such as IMID-BIO-UK [17] to facilitate cross-disease comparison. The additional inclusion of public datasets from the Gene Expression Omnibus (GEO) repository will allow cross-comparison and confirmation in external datasets.
Challenges
Whilst the duplicate patient identification process appears to be accurate, mismatches were identified. As some of these resulted from inaccurate recording of the NHS number in the original study database, it is possible to unknowingly miss duplicate pairs. It is also possible to miss duplicate pairs if a patient is missing an NHS number in one study, though this is a rare occurrence.
Harmonising data is laborious: each study is nuanced, and significant data cleaning is needed to account for this. Losing specificity reduces the detail available for analysis, and broad duplicate removal rules could be disadvantageous. For example, where duplicate records existed, we kept CHARMS data over other studies; because records were retained at the person level and not at the variable level, we automatically lost some pain VAS outcome measures, as these are not collected in CHARMS. The effect of this on our prediction studies has not been fully realised and may be an area for future data science research. We also lost granularity: for example, ethnicity had to be coded in the final CLUSTER dataset as Caucasian/Non-Caucasian, as that was the least granular classification across all studies, although much more detailed ethnicity information is available in some studies.
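The difference between retaining duplicates at the person level and at the variable level can be sketched as follows. This is an illustration only, not the CLUSTER pipeline: the priority list, the placeholder study name STUDY_B, the variable names, and the values are hypothetical.

```python
# Hypothetical priority order: earlier entries win when a patient
# appears in more than one study.
STUDY_PRIORITY = ["CHARMS", "STUDY_B"]


def merge_person_level(records):
    """Keep the whole record from the highest-priority study only."""
    return min(records, key=lambda r: STUDY_PRIORITY.index(r["study"]))


def merge_variable_level(records):
    """Fill each variable from the highest-priority study that recorded it."""
    ordered = sorted(records, key=lambda r: STUDY_PRIORITY.index(r["study"]))
    merged = {}
    for record in ordered:
        for variable, value in record.items():
            # Only fill a variable that is still missing in the merged record.
            if merged.get(variable) is None:
                merged[variable] = value
    return merged


# Made-up duplicate pair: pain VAS is not recorded in the CHARMS row.
duplicate_pair = [
    {"study": "STUDY_B", "active_joints": 4, "pain_vas": 35},
    {"study": "CHARMS", "active_joints": 5, "pain_vas": None},
]

person = merge_person_level(duplicate_pair)
variable = merge_variable_level(duplicate_pair)
```

Under the person-level rule the pain VAS value is dropped along with the rest of the lower-priority record, whereas the variable-level rule recovers it. The variable-level result mixes sources within one row, so a real implementation would probably need to track the provenance of each variable.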
When creating harmonised datasets from existing observational studies, missing data are expected. Our aim was to maximise dataset sizes by not restricting to complete cases only, a restriction that would only be needed for some comprehensive measures of JIA disease activity change. Including those with some missing data retains statistical power and reduces potential biases. However, established and validated JIA disease scores cannot be used in some circumstances if the proportion of missing data is high, though this issue would also exist in the source data. If we choose to apply imputation methods, we can use all available data and make unbiased estimates of expected values, thereby providing more validity than ad hoc approaches to missing data while preserving our sample sizes and power. Imputation methods could also facilitate the inclusion of certain variables within larger analyses that were not collected at all in the source data.
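The trade-off between complete-case restriction and using all available data can be shown with a small sketch. The variable names, the choice of components for the composite measure, and the patient rows are hypothetical and do not reflect the actual CLUSTER variables.

```python
# Hypothetical components required to compute a composite disease score.
COMPOSITE_COMPONENTS = ["active_joints", "physician_global", "patient_global", "esr"]

# Made-up patient rows; None marks a missing value.
patients = [
    {"id": "P1", "active_joints": 3, "physician_global": 2.0, "patient_global": 1.5, "esr": 12},
    {"id": "P2", "active_joints": 0, "physician_global": None, "patient_global": 0.5, "esr": 8},
    {"id": "P3", "active_joints": 6, "physician_global": 4.5, "patient_global": None, "esr": None},
    {"id": "P4", "active_joints": None, "physician_global": 1.0, "patient_global": 2.0, "esr": 20},
]


def has_all(record, variables):
    """True when every listed variable is present in the record."""
    return all(record[v] is not None for v in variables)


# Complete cases: usable for the composite score.
complete_cases = [p["id"] for p in patients if has_all(p, COMPOSITE_COMPONENTS)]

# Available cases: usable for an analysis that needs only one variable.
joint_count_cases = [p["id"] for p in patients if has_all(p, ["active_joints"])]
```

Here only one of four patients could contribute to the composite score, while three could contribute to an analysis of active joint count alone, which is the kind of loss in sample size that avoiding a blanket complete-case restriction is intended to prevent.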
Conclusion
Data pooling and harmonisation are important tools for research, enabling the development of larger, richer datasets which contain detailed treatment response data across patients’ treatment pathways. CLUSTER has succeeded in integrating large, complex JIA datasets and provides a useful reference for similar future projects. Agreeing a framework before integration was essential: focusing on a specific, well-defined research question for each dataset meant that the datasets were manageable and tailored to their intended use, whilst remaining easy to adjust. Additionally, CLUSTER’s collaborative process was pivotal, as data integration on this scale requires a committed, knowledgeable, and diverse community.
However, there are many challenges to consider: time/costs, false linkage, loss of detail, the introduction of errors, systematic biases, and missingness. It is important these limitations are recognised to avoid misinterpretation of findings. Transparent and consistent reporting and appraisal of linked datasets can assist in improving future data collection, coding practices and linkage processes. This again highlights the importance of standardised data collection in the clinical setting.
Ongoing and future studies in JIA should focus on FAIR (findable, accessible, interoperable, reusable) principles [18] to ensure data utility in research outside of initial study plans. One potential solution is to use a consensus-agreed core outcome dataset, such as the one created by CAPTURE-JIA [19], which is then widely implemented in clinical care and captured in electronic patient records that are compatible with fast, efficient data download (with appropriate consent for research).
Acknowledgements
CLUSTER is supported by grants from the Medical Research Council (MRC) [MR/R013926/1], Versus Arthritis [Grant: 22084], Great Ormond Street Hospital Children’s Charity [VS0518], AbbVie, Sobi, and Olivia’s Vision. The CLUSTER Consortium is also supported by in-kind contributions from AbbVie, Pfizer, Sobi, UCB and GSK. This work is supported by the NIHR GOSH Biomedical Research Centre, the NIHR Manchester Biomedical Research Centre, the British Society for Rheumatology (BSR), and the UK’s Experimental Arthritis Treatment Centre for Children, supported by Versus Arthritis (grant: 20621). LW is additionally supported by Versus Arthritis (grant: 21593) at the Centre for Adolescent Rheumatology Versus Arthritis. KLH is additionally supported by the Centre for Epidemiology Versus Arthritis (grant: 21755) at the University of Manchester, UK. This project was enabled through access to the MRC eMedLab Medical Bioinformatics infrastructure supported by the Medical Research Council [grant number MR/L016311/1].
This study acknowledges the use of the following UK JIA cohort collections: the Biologics for Children with Rheumatic Diseases (BCRD) study (funded by Arthritis Research UK, grant: 20747); the British Society for Paediatric and Adolescent Rheumatology Etanercept Cohort Study (BSPAR-ETN) (funded by a research grant from the British Society for Rheumatology (BSR); BSR has previously also received restricted income from Pfizer to fund this project); the Childhood Arthritis Prospective Study (CAPS) (funded by Versus Arthritis UK, grant: 20542); the Childhood Arthritis Response to Medication Study (CHARMS) (funded by Sparks UK, reference 08ICH09; the Medical Research Council, reference MR/M004600/1; Great Ormond Street Children’s Charity (GOSCC); the Big Lottery Fund UK; and NIHR-GOSH-Biomedical Research Centre); and the United Kingdom Juvenile Idiopathic Arthritis Genetics Consortium (UKJIAGC). This study also acknowledges the use of the following two UK-wide JIA-associated uveitis clinical trials: the SYCAMORE Trial (funded by Arthritis Research UK, grant: 19612, and the National Institute for Health Research Health Technology Assessment, grant: 09/51/01); and the APTITUDE Trial (funded by Arthritis Research UK, grant: 20659).
Members of the CLUSTER Consortium are as follows: Prof Lucy R. Wedderburn, Dr Melissa Kartawinata, Ms Zoe Wanstall, Ms Bethany R Jebson, Ms Alyssia McNeece, Ms Elizabeth Ralph, Ms Vasiliki Alexiou, Mr Fatjon Dekaj, Ms Aline Kimonyo, Ms Fatema Merali, Ms Emma Sumner, Ms Emily Robinson, Ms Freya L. Feilding (UCL GOS Institute of Child Health, London); Prof Andrew Dick (UCL Institute of Ophthalmology, London); Prof Michael W. Beresford, Dr Emil Carlsson, Dr Joanna Fairlie, Dr Jenna F. Gritzfeld (University of Liverpool); Prof Athimalaipet Ramanan, Ms Teresa Duerr (University Hospitals Bristol); Prof Michael Barnes, Ms Sandra Ng (Queen Mary University, London); Prof Kimme Hyrich, Prof Stephen Eyre, Prof Soumya Raychaudhuri, Prof Andrew Morris, Dr Annie Yarwood, Dr Samantha Smith, Dr Stevie Shoop-Worrall, Ms Saskia Lawson-Tovey, Dr John Bowes, Dr Paul Martin, Ms Melissa Tordoff, Mr Michael Stadler, Prof Wendy Thomson, Dr Damian Tarasek (University of Manchester); Dr Chris Wallace, Dr Wei-Yu Lin (University of Cambridge); Prof Nophar Geifman (University of Surrey); Dr Sarah Clarke (School of Population Health Sciences and MRC Integrative Epidemiology Unit, University of Bristol); Dr Toby Kent, Dr Thierry Sornasse (AbbVie Inc.); Daniela Dastros-Pitei MD, PhD, Sumanta Mukherjee, PhD (GlaxoSmithKline Research and Development Limited); Jacqui Roberts (Pfizer); Dr Rami Kallala (Swedish Orphan Biovitrum AB (publ) (Sobi)); Dr Helen Neale, Dr John Ioannou, Dr Hussein Al-Mossawi (UCB Biopharma SRL); and the CLUSTER Champions.