Discover how AI is revolutionizing TSR scoring, enhancing clinical decisions, and shaping the future of healthcare.
The ARABESC project is the result of a collaborative partnership between WSK Medical and IMP Diagnostics. Since 2023, the two companies have been working together to develop an AI-driven algorithm for tumor-stroma ratio (TSR) scoring.

In the first edition of our interview series, we spoke with two pathologists about the transformative potential of AI in TSR scoring, its impact on clinical decision-making, and the challenges of AI adoption in healthcare. In this second edition, we sat down with Felix Dikland and Cyrine Fekih—two data scientists with extensive experience applying machine learning to healthcare tools and applications.
You May Also Like: Interview with Our Pathologists Diana Montezuma and Domingos Oliveira on the ARABESC Project and the Impact of AI in Healthcare
In this interview, you can expect a candid discussion on:
- Variability and bias control
- Validation and the evolving scientific paradigm
- Bridging the gap between technology and medicine
- Key considerations for implementing AI in clinical practice
VARIABILITY AND BIAS CONTROL
Interviewer (I) – How would you design an AI pipeline to handle colour variations in H&E-stained slides across different laboratories?
Felix Dikland (FD) – Traditionally, colour variations are overcome by preprocessing the image patches, using techniques such as colour normalisation or stain deconvolution. These standardise the model’s input, making it more reliable, but not more robust. To achieve robustness, input augmentations should also be introduced during training.
Cyrine Fekih (CF) – In fact, to handle colour variations in H&E-stained slides across different laboratories, it is important that the AI pipeline begins with a colour normalisation step to standardise staining variations.
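For readers who want to see what this looks like in practice, here is a minimal sketch, assuming PyTorch/torchvision and illustrative parameters rather than the actual ARABESC pipeline: colour augmentation at training time for robustness, plus a simple normalisation transform at inference.

```python
# Minimal sketch (not the ARABESC implementation): colour augmentation during
# training and a simple standardisation step at inference. All parameters are
# illustrative placeholders.
from torchvision import transforms

# Training: randomly perturb hue, saturation, brightness and contrast so the
# model sees a wider range of staining appearances than any single lab produces.
train_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.15, contrast=0.15, saturation=0.2, hue=0.05),
    transforms.ToTensor(),
])

# Inference: map every patch to a common intensity scale. A production pipeline
# would more likely use a stain-specific method (e.g. Macenko or Vahadane
# normalisation) than plain per-channel statistics.
inference_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.80, 0.62, 0.74], std=[0.15, 0.20, 0.16]),
])
```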
I – How would you address inter-slide variability in tissue preparation that could impact TSR quantification accuracy?
FD – From a traditional semantic-segmentation training point of view, appropriate pre-processing and data augmentation, combined with a large, balanced dataset, are the foundation of a good and robust model. Having a “balanced” dataset in this case also entails a proper distribution of institutions, scanners, tumour subtypes, and acquisition methods, such as surgical specimens, pretreatment biopsies and polypectomies.
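As an illustration of how such balance can be enforced rather than just described, the sketch below uses hypothetical metadata columns (not the actual ARABESC dataset) to compute inverse-frequency sampling weights, so that under-represented institutions, scanners and acquisition methods are drawn more often during training.

```python
# Minimal sketch: inverse-frequency sampling weights over slide metadata.
# The metadata file and column names are hypothetical.
import pandas as pd

meta = pd.read_csv("slide_metadata.csv")  # columns: slide_id, institution, scanner, acquisition

# One stratum per combination of the factors we want to balance.
strata = meta[["institution", "scanner", "acquisition"]].astype(str).apply("|".join, axis=1)
counts = strata.value_counts()

# Rare combinations receive proportionally larger weights; a weighted sampler
# (e.g. torch.utils.data.WeightedRandomSampler) then draws them more often.
meta["sampling_weight"] = strata.map(lambda s: 1.0 / counts[s])
```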
I – What quality control measures would you implement for stain normalization across different scanner types?
CF – For quality control of stain normalisation across different scanner types, it is possible to use statistical metrics such as stain-vector similarity or colour-histogram comparisons to evaluate consistency before and after normalisation. Another approach is to incorporate reference slides or colour calibration targets scanned on each device to standardise outputs. Visual inspection by pathology experts can also be applied to a sample of slides to validate the perceived consistency.
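One simple way to turn the colour-histogram comparison into a number, offered here as a sketch rather than a prescribed QC protocol, is to correlate per-channel histograms of a slide before and after normalisation, or against a reference scan:

```python
# Minimal sketch: correlate per-channel colour histograms of two RGB patches
# as a rough consistency score (values near 1.0 indicate similar colour profiles).
import numpy as np

def histogram_similarity(img_a: np.ndarray, img_b: np.ndarray, bins: int = 64) -> float:
    """img_a, img_b: uint8 RGB arrays of shape (H, W, 3)."""
    channel_scores = []
    for c in range(3):
        hist_a, _ = np.histogram(img_a[..., c], bins=bins, range=(0, 255), density=True)
        hist_b, _ = np.histogram(img_b[..., c], bins=bins, range=(0, 255), density=True)
        channel_scores.append(np.corrcoef(hist_a, hist_b)[0, 1])
    return float(np.mean(channel_scores))
```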
I – How would you address potential biases in the algorithm’s performance across diverse patient demographics or tumour subtypes?
CF – In a perfect scenario, the training dataset would cover a wide range of patient demographics – like age, ethnicity, tumour types, and molecular subtypes. But since that’s often hard to get in practice, I’d take a few steps to reduce the risk of bias. I’d validate the model on data from different institutions or patient groups, even if the datasets are small, to see how well it generalises. I’d also use model uncertainty to flag predictions it’s less confident about, which could highlight underrepresented cases. Lastly, I’d make sure to clearly report any limitations in the training data and known biases so users are aware of them.
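To make the subgroup validation idea concrete, a minimal sketch is shown below; the column names and the error metric are assumptions for illustration, not the project’s evaluation protocol.

```python
# Minimal sketch: mean absolute TSR error per patient subgroup, to surface
# groups where the model behaves differently. Column names are illustrative.
import pandas as pd

def per_subgroup_error(results: pd.DataFrame, group_col: str) -> pd.Series:
    """results: one row per case, with 'tsr_pred', 'tsr_ref' and the subgroup column."""
    abs_err = (results["tsr_pred"] - results["tsr_ref"]).abs()
    return abs_err.groupby(results[group_col]).mean().sort_values(ascending=False)

# Example: per_subgroup_error(validation_results, "tumour_subtype") ranks subtypes
# by how far the automated TSR deviates from the reference score.
```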
I – What strategies would you use to enable the tool’s application to other epithelial cancers (e.g., breast, pancreatic)?
FD – The majority of development work lies in creating a solid standard for data annotation, data extraction, data augmentation, a training pipeline and a validation pipeline. These standards can serve as the basis for the equivalent steps when creating a tool for other epithelial cancers. Each clinical site, however, will present unique tissues and thus unique issues.
VALIDATION AND THE EVOLVING SCIENTIFIC PARADIGM
I – How would you ensure the algorithm’s TSR cutoff values align with established prognostic thresholds (e.g., 50% stroma)?
FD – From the literature we know that the TSR is often underestimated by human observers. Necrosis, mucin and the area within the lumen should be visually excluded from the evaluation, yet because of their darker appearance these tissues can mistakenly inflate the estimated tumour area. This causes manual TSR scores to be systematically undervalued. This is crucial knowledge when evaluating the automated score: even if the automated score is a more accurate representation of the TSR, it might not correspond to the clinically validated manual TSR score.
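A small worked example with made-up pixel counts shows why counting necrosis as tumour pulls the stroma percentage down:

```python
# Illustrative pixel counts from a segmented region (not real data).
tumour, stroma, necrosis = 400_000, 350_000, 150_000

# Correct stroma percentage: necrosis is excluded from the evaluated area.
tsr_correct = stroma / (tumour + stroma)            # ~46.7%, near the 50% cutoff

# If necrosis is mistakenly counted as tumour, the denominator grows and the
# stroma percentage is systematically underestimated.
tsr_biased = stroma / (tumour + necrosis + stroma)  # ~38.9%, clearly "stroma-low"
```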
I – What steps would you take to validate the tool’s performance against manual pathologist assessments (in multicentre studies)?
FD – The greatest hurdle in the validation of an automated TSR score is comparing it to the current gold standard: visual eyeballing. The automated method is fully deterministic, producing identical TSR scores regardless of time and user. The manual method is semi-quantitative: it follows a standardised protocol with quantifiable steps, yet leaves plenty of room for subjectivity. When comparing these scores, it is crucial to create custom setups that test every step in the TSR scoring process, to find to what extent deviations arise from pathologists’ subjectivity and in which cases they arise from AI model error.
I – How would you handle discordance between AI-derived TSR and pathologist assessments in borderline cases?
CF – In cases of discordance between the AI-derived TSR and pathologist assessments, a direct comparison should be made between the tool’s output and the expert evaluation, both in terms of tissue identification and TSR calculation. Analysing these differences can help identify the source of disagreement, whether in segmentation accuracy or threshold interpretation. It is also essential to gather feedback from pathologists on such cases, as this input can be used to refine and retrain the model, improving its performance and reliability in handling complex or borderline scenarios over time.
I – How would you standardize the analysed tissue area size (e.g., 1.0 mm vs. 2.0 mm) to ensure consistent prognostic performance?
CF – To ensure consistent TSR quantification, we would first implement a quality-control algorithm to verify that each slide meets pixel-to-micron calibration standards, ensuring accurate and standardised spatial measurements. For the automated pipeline, we would use a fixed circular region of interest with a diameter selected based on established clinical guidelines and supporting literature. In the manual mode, the tool would allow users to select a circular ROI with a diameter between 1.8 mm and 2.2 mm, providing flexibility while maintaining consistency within a clinically validated range.
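As an illustration of the calibration step, converting a clinically specified ROI diameter into pixels only requires the slide’s microns-per-pixel value; the sketch below, with illustrative numbers, also rejects slides whose calibration falls outside an expected range:

```python
# Minimal sketch: pixel-to-micron calibration check plus ROI diameter in pixels.
def roi_diameter_px(mpp: float, diameter_mm: float = 2.0) -> int:
    """mpp: microns per pixel reported by the scanner; diameter_mm: ROI diameter."""
    if not 0.1 <= mpp <= 1.0:  # illustrative acceptance range for the calibration check
        raise ValueError(f"Unexpected calibration: {mpp} microns per pixel")
    return round(diameter_mm * 1000.0 / mpp)

# Example: at 0.25 microns per pixel, a 2.0 mm ROI is 8000 pixels across.
```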
I – What user interface features would clinicians/pathologists need to trust and adopt automated TSR scoring?
FD – In essence, the TSR scoring tool is a tissue identifier and segmenter. This means that the underlying mechanism is a pixel-wise classification of tumour tissue, stroma tissue, and all other tissues identifiable in colorectal carcinoma. Besides supplying the user with a percentage score, it also provides a detailed colourmap as an overlay of the analysed region. The pathologist should judge the reliability of the percentage score by the accuracy of this coloured segmentation map.
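A minimal sketch of how such an overlay can be rendered from a per-pixel class map; the colour legend and class indices are illustrative assumptions:

```python
# Minimal sketch: blend a class-coloured mask over the original patch so the
# pathologist can inspect the segmentation behind the percentage score.
import numpy as np

PALETTE = {
    0: (0, 0, 0),      # background / other tissue
    1: (255, 0, 0),    # tumour
    2: (0, 255, 0),    # stroma (colours and class indices are illustrative)
}

def overlay(rgb: np.ndarray, classes: np.ndarray, alpha: float = 0.4) -> np.ndarray:
    """rgb: (H, W, 3) uint8 patch; classes: (H, W) integer class map."""
    colour = np.zeros_like(rgb)
    for cls, col in PALETTE.items():
        colour[classes == cls] = col
    return ((1 - alpha) * rgb + alpha * colour).astype(np.uint8)
```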
I – What safeguards would you implement to prevent over-reliance on automated TSR scores in clinical decision-making?
FD – It is important to realize that even if the automated score might objectively be a more accurate quantification of the TSR, the semi-quantitative estimation used by pathologists is the only method clinically tested for prognostic value. Until the automated score is verified as an independent prognostic indicator, the pathologist should agree with the score produced by the algorithm. Besides that, even once the automated score is clinically validated, the pathologist should be well instructed to verify the accuracy of the segmentation map before using the produced score as a prognostic indicator.
I – How would you quantify the tool’s impact on reducing interobserver variability in stroma-rich vs. stroma-poor classification?
FD – Researchers have devised a measure that relates the tool’s agreement with observers to the variability of the observers with respect to each other. This score is called the discrepancy ratio. In short, it relies on the fact that individual observers are closer to the ground truth than to each other, given that the errors of individual observers are random and independent. So, if the mean variability of each observer with respect to the tool is lower than the mean variability of the observers with respect to each other, the discrepancy ratio is larger than 1 and the tool reduces the variability in stroma-rich vs. stroma-poor classification.
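In code, the discrepancy ratio described above could be computed as in the sketch below, assuming absolute differences between TSR percentages as the disagreement measure:

```python
# Minimal sketch: discrepancy ratio = mean observer-to-observer disagreement
# divided by mean observer-to-tool disagreement. A value above 1 means the tool
# sits closer to the observers than they sit to each other.
from itertools import combinations
import numpy as np

def discrepancy_ratio(observer_scores: np.ndarray, tool_scores: np.ndarray) -> float:
    """observer_scores: (n_observers, n_cases) TSR percentages; tool_scores: (n_cases,)."""
    obs_vs_obs = [np.mean(np.abs(a - b)) for a, b in combinations(observer_scores, 2)]
    obs_vs_tool = [np.mean(np.abs(obs - tool_scores)) for obs in observer_scores]
    return float(np.mean(obs_vs_obs) / np.mean(obs_vs_tool))
```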
COMBINING THE FIELDS OF TECHNOLOGY AND MEDICINE
I – How could this tool be integrated into existing digital pathology workflows without disrupting diagnostic timelines?
FD – The final product is a pipeline that performs preprocessing, prediction and postprocessing on a snapshot of a whole-slide image. This pipeline is deployed on premises in a hospital or pathology lab, or using cloud services. Many digital pathology slide viewers have their own way of invoking models within their viewer environment. By adjusting the endpoints of the deployment to be compatible with the specific input and output formats of these viewers, the model can be used by pathologists as easily as a right-click action.
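As a sketch of what such an endpoint could look like, assuming a FastAPI deployment and a hypothetical run_tsr_pipeline() helper that wraps the preprocessing, prediction and postprocessing steps:

```python
# Minimal sketch of a deployment endpoint (framework and helper are assumptions,
# not the actual ARABESC deployment).
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

def run_tsr_pipeline(image_bytes: bytes) -> dict:
    # Hypothetical placeholder: decode the snapshot, run the segmentation model,
    # and derive the TSR and an overlay from the class-wise pixel counts.
    return {"tsr_percent": 0.0, "segmentation_overlay": None}

@app.post("/tsr")
async def score_region(snapshot: UploadFile = File(...)) -> dict:
    """Accept a snapshot of a whole-slide-image region and return its TSR score."""
    return run_tsr_pipeline(await snapshot.read())
```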
I – How would you structure collaboration between AI developers, pathologists, and oncologists during tool refinement?
CF – Collaboration can be structured as a continuous feedback loop. Pathologists could provide feedback on the tool’s output and on whether it aligns with therapeutic decision-making needs. In addition, regular review meetings can be held to discuss model performance on edge cases, identify clinical priorities, and guide iterative updates.
I – What training data requirements (e.g., annotated regions, clinical outcomes) would you specify for pathologist collaborators?
FD – Especially with the rise of foundation models, the focus of data requirements has shifted from “big data” to “quality data”. We have found a way to ensure that our training data is of the highest quality by asking our pathologist collaborators to focus on a relatively small area when annotating, but to do so in the highest detail. These annotations are then passed through a set of quality assessment steps and reannotated, after which they are used for training.
CF – High-quality annotations from pathologists, including clearly delineated tumour and stroma regions on representative H&E slides, are a key component of training a model effectively.
I – How would you communicate the tool’s limitations to non-technical stakeholders in clinical settings?
FD – The user needs to understand that the tool is never 100% accurate, even in production. Such errors are normal behaviour of any AI model of the current age. The tool’s score is made as traceable as possible by displaying a full segmentation map. The user should realize that this map is not there to convince them to trust the score, but instead to highlight any errors that the model might have made. At the same time, it is important that the transparency of these errors does not lead the user to disregard the model.
KEY CONSIDERATIONS FOR IMPLEMENTING AI IN CLINICAL PRACTICE
I – What regulatory considerations (e.g., FDA/CE-IVDR compliance) are critical for clinical deployment of this tool?
FD – For CE-IVDR we thoroughly test the functional requirements of the model by validating it against specific benchmarks, as well as performing an extensive literature review to establish the clinical relevance and usability of the TSR itself. Besides this, the safety requirements are tested using a risk analysis that is updated with each step in development, as each added functionality introduces an added risk. To comply with CE-IVDR, each of these risks is met with an appropriate mitigation.
I – What KPIs would you track to demonstrate the tool’s clinical utility beyond technical accuracy (e.g., workflow efficiency gains)?
FD – Important KPIs would be bias KPIs that identify unexplainable model behaviour in specific tumour subtypes, and the adoption rate. Both could be tracked through logging, for example by counting the number of users of the model and the number of uses over time, both overall and per user.
CF – I think it’s important to track KPIs that reflect the tool’s impact on real-world clinical practice, such as the time saved per case or per day when using the tool compared with manual TSR assessment, or the improvement in consistency of stroma-high vs. stroma-low classification among pathologists. A survey on the tool’s perceived usefulness and user trust can also be conducted.
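As a sketch of the lightweight usage logging both answers rely on (the fields and storage format are assumptions for illustration):

```python
# Minimal sketch: append one usage event per analysed case; adoption and
# workflow KPIs are aggregated from this log later.
import json
import time

def log_use(user_id: str, case_id: str, seconds_spent: float,
            path: str = "tsr_usage_log.jsonl") -> None:
    event = {"timestamp": time.time(), "user": user_id,
             "case": case_id, "seconds_spent": seconds_spent}
    with open(path, "a") as log_file:
        log_file.write(json.dumps(event) + "\n")
```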
AI is steadily reshaping how we approach TSR scoring, offering new levels of consistency, transparency, and clinical support. By addressing current challenges with care and collaboration, we move closer to tools that are not only powerful but practical. The future of TSR scoring is not just automated – it’s augmented, collaborative, and clinically meaningful.