Publications

Open Lumbar Spine Image Analysis: A 3D Slicer extension for segmentation, grading, and intervertebral disc height index with multi–data set validation

ISSLS Prize in Clinical Science 2026: Data-driven classification of lumbar spine degeneration trajectories in chronic low back pain

Background: The temporal sequence of lumbar spine degeneration (trajectory) is difficult to characterize due to limited availability of longitudinal imaging data. The aim of this cross-sectional study was to discover degeneration trajectories and their clinical relevance by applying an innovative computational approach to an extensive dataset from chronic low back pain patients. Methods: Clinical MRI exams from 423 patients in the comeBACK study were graded for disc degeneration, facet osteoarthritis, and other pathoanatomical features. We then trained an event-based model, which is specifically designed to infer longitudinal trajectories from cross-sectional data, to model spine degeneration trajectory subtypes. The clinical significance of the identified trajectories was assessed using propensity score matching of trajectory subtypes and subsequent generalized linear mixed-effects modeling. Pain characteristics included the Fear-Avoidance Beliefs Questionnaire, neuropathic pain (painDETECT), pain impact score, and chronic widespread pain (CWP). Results: Two distinct trajectories were identified. A “disc-first” subtype (n=260, 61%) was characterized by a high prevalence of disc herniation and disc degeneration that was more severe than facet osteoarthritis. Conversely, a “facet-first” subtype (n=146, 39%) was characterized by a greater severity of facet osteoarthritis than disc degeneration. The disc-first subtype was associated with more CWP (p = 0.030), while the facet-first subtype had higher neuropathic pain (painDETECT scores) (p = 0.039). Conclusion: We identified two distinct trajectories of lumbar spine degeneration related to differing clinical presentations. After replication with other large datasets, this method could be used to characterize and stage spinal degeneration of individual patients. Ultimately, this new information would help clarify degeneration mechanisms and risk factors and support treatment optimization.

Does combining the STarT Back Tool with a polygenic risk score for chronic low back pain improve prediction of work disability over two years?

Chronic back pain (CBP) is a leading cause of work disability worldwide, yet identifying individuals at risk remains difficult due to its multifactorial etiology. This study investigated whether integrating a polygenic risk score (PRS) for CBP with the STarT Back Tool (SBT)—a widely used psychosocial screening instrument—could improve the prediction of work disability, measured as disability leave days over a two-year follow-up. We analysed data from 1,938 participants in the Northern Finland Birth Cohort 1966 with complete genotyping, SBT responses, and registry-linked disability records. A zero-inflated negative binomial regression model was applied to account for the highly skewed distribution of work disability days. Results showed that both SBT and CBP genetic risk independently predicted the cumulative number of disability leave days. While SBT was also associated with the likelihood of having no disability leave, CBP genetic risk was not, suggesting that polygenic risk contributes specifically to the burden of disability among affected individuals. Participants in the highest CBP genetic risk quartile experienced significantly more work disability days, supporting a dose-response relationship. The two tools captured complementary domains: SBT reflected modifiable biopsychosocial risks, while the PRS represented fixed genetic liability. This distinction supports the value of integrating a CBP PRS into existing screening frameworks, particularly for early CBP management.
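
As a rough illustration of why a zero-inflated negative binomial model suits heavily skewed disability-day counts, the sketch below evaluates its probability mass function with the standard library only. The function names, the NB2 parameterization (mean `mu`, dispersion `alpha`), and the example values are assumptions made for illustration; they are not taken from the study, which does not report its parameterization here.

```python
import math


def nb_pmf(k, mu, alpha):
    """Negative binomial (NB2) pmf with mean mu and variance mu + alpha * mu**2."""
    r = 1.0 / alpha          # "size" parameter
    p = r / (r + mu)         # success probability
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(p) + k * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def zinb_pmf(k, mu, alpha, pi_zero):
    """Zero-inflated NB pmf: with probability pi_zero the count is a structural zero,
    otherwise it is drawn from the negative binomial count component."""
    count_part = (1.0 - pi_zero) * nb_pmf(k, mu, alpha)
    return pi_zero + count_part if k == 0 else count_part


# Hypothetical values: 60% structural zeros, mean of 20 days among the rest.
p_no_leave = zinb_pmf(0, mu=20.0, alpha=2.0, pi_zero=0.6)
```

The two components mirror the distinction drawn in the abstract: a predictor can act on the zero component (the likelihood of having no disability leave), on the count component (the number of days), or on both.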

Robust radiomic signatures of intervertebral disc degeneration from MRI

Study design: A retrospective analysis. Objective: The aim of this study was to identify a robust radiomic signature from deep learning segmentations for intervertebral disc (IVD) degeneration classification. Summary of Data: Low back pain (LBP) is the most common musculoskeletal symptom worldwide and IVD degeneration is an important contributing factor. To improve the quantitative phenotyping of IVD degeneration from T2-weighted magnetic resonance imaging (MRI) and better understand its relationship with LBP, multiple shape and intensity features have been investigated. IVD radiomics have been less studied but could reveal sub-visual imaging characteristics of IVD degeneration. Methods: We used data from Northern Finland Birth Cohort 1966 members who underwent lumbar spine T2-weighted MRI scans at age 45-47 (n=1397). We used a deep learning model to segment the lumbar spine IVDs and extracted 737 radiomic features, as well as calculating IVD height index and peak signal intensity difference. Intraclass correlation coefficients across image and mask perturbations were calculated to identify robust features. Sparse partial least squares discriminant analysis was used to train a Pfirrmann grade classification model. Results: The radiomics model had balanced accuracy of 76.7% (73.1-80.3%) and Cohen’s Kappa of 0.70 (0.67-0.74), compared to 66.0% (62.0-69.9%) and 0.55 (0.51-0.59) for an IVD height index and peak signal intensity model. 2D sphericity and interquartile range emerged as radiomics-based features that were robust and highly correlated to Pfirrmann grade (Spearman’s correlation coefficients of -0.72 and -0.77 respectively). Conclusion: Based on our findings these radiomic signatures could serve as alternatives to the conventional indices, representing a significant advance in the automated quantitative phenotyping of IVD degeneration from standard-of-care MRI.

Semiautomatic assessment of facet tropism from lumbar spine MRI using deep learning: a Northern Finland Birth Cohort study

Study Design. This is a retrospective, cross-sectional, population-based study that automatically measured the facet joint (FJ) angles from T2-weighted axial magnetic resonance images (MRIs) of the lumbar spine using deep learning (DL). Objective. This work aimed to introduce a semiautomatic framework that measures the FJ angles using DL and to study facet tropism (FT) in a large Finnish population-based cohort. Summary of Data. T2-weighted axial MRIs of the lumbar spine (L3/4 through L5/S1) from the NFBC1966 Finnish population-based cohort (n = 1288) were used for this study. Materials and Methods. A DL model was developed and trained on 430 participants’ MRI images. The authors computed FJ angles from the model’s prediction for each level, that is, L3/4 through L5/S1, for the male and female subgroups. Inter-rater and intrarater reliability was analyzed for 60 participants using annotations made by two radiologists and a musculoskeletal researcher. With the developed method, we examined FT in the entire NFBC1966 cohort, adopting the literature definitions of FT thresholds at 7° and 10°. The rater agreement was evaluated both for the annotations and the FJ angles computed based on the annotations. FJ asymmetry (θL - θR) was used to evaluate the agreement and correlation between the raters. Bland-Altman analysis was used to assess the agreement and systematic bias in the FJ asymmetry. The authors used the Dice score as the metric to compare the annotations between the raters. The authors evaluated the model predictions on the independent test set and compared them against the ground truth annotations. Results. This model scored Dice (92.7±0.1) and intersection over union (87.1±0.2) aggregated across all the regions of interest, that is, vertebral body (VB), FJs, and posterior arch (PA). The mean FJ angles measured for the male and female subgroups were in agreement with the literature findings. Intrarater reliability was high, with a Dice score of VB (97.3), FJ (82.5), and PA (90.3). The inter-rater reliability was better between the radiologists with a Dice score of VB (96.4), FJ (75.5), and PA (85.8) than between the radiologists and the musculoskeletal researcher. The prevalence of FT was higher in the male subgroup, with L4/5 found to be the most affected region. Conclusion. The authors developed a DL-based framework that enabled us to study FT in a large cohort. Using the proposed method, the authors present the prevalence of FT in a Finnish population-based cohort.
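
The quantities named in this abstract can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, masks are represented as sets of pixel coordinates for simplicity, and two details are assumptions the abstract does not settle, namely that tropism is judged on the absolute asymmetry and that the threshold is inclusive.

```python
import statistics


def dice_and_iou(mask_a, mask_b):
    """Dice score and intersection over union for two masks given as sets of pixels."""
    inter = len(mask_a & mask_b)
    total = len(mask_a) + len(mask_b)
    union = total - inter
    dice = 2.0 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


def facet_tropism(theta_left, theta_right, threshold_deg):
    """Signed FJ asymmetry (theta_L - theta_R) and a tropism flag.

    Assumption: tropism is flagged when |asymmetry| >= threshold_deg.
    """
    asymmetry = theta_left - theta_right
    return asymmetry, abs(asymmetry) >= threshold_deg


def bland_altman(rater_a, rater_b):
    """Mean difference (bias) and 95% limits of agreement between two raters."""
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical angles in degrees: a 9-degree asymmetry exceeds the 7-degree
# threshold but not the 10-degree one.
asym, tropism_at_7 = facet_tropism(45.0, 36.0, 7.0)
_, tropism_at_10 = facet_tropism(45.0, 36.0, 10.0)
```

Reporting prevalence at both thresholds, as the study does, shows how sensitive the tropism label is to the chosen cut-off.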

Cartilaginous endplates: A comprehensive review on a neglected structure in intervertebral disc research

The cartilaginous endplates (CEP) are key components of the intervertebral disc (IVD) necessary for sustaining the nutrition of the disc while distributing mechanical loads and preventing the disc from bulging into the adjacent vertebral body. The size, shape, and composition of the CEP are essential in maintaining its function, and degeneration of the CEP is considered a contributor to early IVD degeneration. In addition, the CEP is implicated in Modic changes, which are often associated with low back pain. This review aims to summarize the current knowledge of the CEP regarding its structure, composition, permeability, and mechanical role in a healthy disc, how these change with degeneration, and how they connect to IVD degeneration and low back pain. Additionally, the authors suggest a standardized naming convention regarding the CEP and bony endplate and suggest avoiding the term vertebral endplate. Currently, there is limited data on the CEP itself, as reported data is often a combination of CEP and bony endplate, or the CEP is considered as articular cartilage. However, it is clear the CEP is a unique tissue type that differs from articular cartilage, bony endplate, and other IVD tissues. Thus, future research should investigate the CEP separately to fully understand its role in healthy and degenerated IVDs. Further, most IVD regeneration therapies in development fail to address, or even consider, the CEP, despite its key role in nutrition and mechanical stability within the IVD. Thus, the CEP should be considered and potentially targeted for future sustainable treatments.

Are current machine learning applications comparable to radiologist classification of degenerate and herniated discs and Modic change? A systematic review and meta-analysis

Introduction: Low back pain is the leading contributor to disability burden globally. It is commonly due to degeneration of the lumbar intervertebral discs (LDD). Magnetic resonance imaging (MRI) is the current best tool to visualize and diagnose LDD, but places high time demands on clinical radiologists. Automated reading of spine MRIs could improve speed, accuracy, reliability and cost effectiveness in radiology departments. The aim of this review and meta-analysis was to determine if current machine learning algorithms perform well identifying disc degeneration, herniation, bulge and Modic change compared to radiologists. Methods: A PRISMA systematic review protocol was developed and four electronic databases and reference lists were searched. Strict inclusion and exclusion criteria were defined. A PROBAST risk of bias and applicability analysis was performed. Results: 1350 articles were extracted. Duplicates were removed and title and abstract searching identified original research articles that used machine learning (ML) algorithms to identify disc degeneration, herniation, bulge and Modic change from MRIs. 27 studies were included in the review; 25 and 14 studies were included in the multivariate and bivariate meta-analyses, respectively. Studies used machine learning algorithms to assess LDD, disc herniation, bulge and Modic change. Models using deep learning, support vector machine, k-nearest neighbors, random forest and naïve Bayes algorithms were included. Meta-analyses found no differences in algorithm or classification performance. When algorithms were tested in replication or external validation studies, they did not perform as well as when assessed in developmental studies. Data augmentation improved algorithm performance when compared to models used with smaller datasets, but there were no performance differences between augmented data and large datasets. Discussion: This review highlights several shortcomings of current approaches, including few validation attempts and little use of large sample sizes. To the best of the authors’ knowledge, this is the first systematic review to explore this topic. We suggest the utilization of deep learning coupled with semi- or unsupervised learning approaches. Use of all information contained in MRI data will improve accuracy. Clear and complete reporting of study design, statistics and results will improve the reliability and quality of published literature.

A stronger baseline for automatic Pfirrmann grading of lumbar spine MRI using deep learning

This paper addresses the challenge of grading visual features in lumbar spine MRI using deep learning. Such a method is essential for the automatic quantification of structural changes in the spine, which is valuable for understanding low back pain. Multiple recent studies investigated different architecture designs, and the most recent success has been attributed to the use of transformer architectures. In this work, we argue that with a well-tuned three-stage pipeline comprising semantic segmentation, localization, and classification, convolutional networks outperform the state-of-the-art approaches. We conducted an ablation study of the existing methods in a population cohort, and report performance generalization across various subgroups. Our code is publicly available to advance research on disc degeneration and low back pain.

External validation of SpineNet, an open-source deep learning model for grading lumbar disk degeneration MRI features, using the Northern Finland Birth Cohort 1966

Study design: This is a retrospective observational study to externally validate a deep learning image classification model. Objective: Deep learning models such as SpineNet offer the possibility of automating the process of disc degeneration (DD) classification from MRI. External validation is an essential step to their development. The aim of this study was to externally validate SpineNet predictions for DD using Pfirrmann classification and Modic changes (MC) on data from the Northern Finland Birth Cohort 1966 (NFBC1966). Summary of Data: We validated SpineNet using data from 1331 NFBC1966 participants for whom both lumbar spine MRI data and consensus disc degeneration gradings were available. Methods: SpineNet returned Pfirrmann grade and MC presence from T2-weighted sagittal lumbar MRI sequences from NFBC1966, a dataset geographically and temporally separated from its training dataset. A range of agreement and reliability metrics were used to compare predictions with expert radiologists. Subsets of data that match SpineNet training data more closely were also tested. Results: Balanced accuracy for DD was 78% (77-79%) and for MC 86% (85-86%). Inter-rater reliability for Pfirrmann grading was Lin’s CCC = 0.86 (0.85-0.87) and Cohen’s κ = 0.68 (0.67-0.69). In a low back pain subset these reliability metrics remained largely unchanged. In total, 20.83% of discs were rated differently by SpineNet compared to the human raters, but only 0.85% of discs had a grade difference greater than 1. Inter-rater reliability for MC detection was κ = 0.74 (0.72-0.75). In the low back pain subset this metric was almost unchanged at κ = 0.76 (0.73-0.79). Conclusion: In this study, SpineNet has been benchmarked against expert human raters in the research setting. It has matched human reliability and demonstrates robust performance despite the multiple challenges facing model generalizability.
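
For readers unfamiliar with the agreement metrics reported above, the following standard-library sketch shows how balanced accuracy, unweighted Cohen's κ, and the share of discs differing by more than one grade can be computed from paired gradings. The function names and toy grades are hypothetical, and whether the study used a weighted or unweighted κ is not stated in the abstract, so the unweighted form here is an assumption.

```python
from collections import Counter


def balanced_accuracy(truth, pred):
    """Mean of per-class recall, so rare grades count as much as common ones."""
    recalls = []
    for c in sorted(set(truth)):
        idx = [i for i, t in enumerate(truth) if t == c]
        recalls.append(sum(pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_expected = sum(count_a[c] * count_b[c] for c in labels) / n ** 2
    if p_expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def share_off_by_more_than_one(truth, pred):
    """Fraction of discs whose predicted grade differs from the reference by > 1."""
    return sum(abs(t - p) > 1 for t, p in zip(truth, pred)) / len(truth)


# Toy gradings for six discs (hypothetical values, not study data).
reference = [1, 1, 2, 2, 3, 3]
predicted = [1, 2, 2, 2, 3, 2]
bal_acc = balanced_accuracy(reference, predicted)
kappa = cohens_kappa(reference, predicted)
```

On ordinal scales such as Pfirrmann grading, the off-by-more-than-one share complements κ because it separates near misses from large disagreements.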