Evaluation


For both Task 1 and Task 2, model performance is evaluated separately on isolated structures and contiguous structures.
Segmentation results for dense isolated and sparse isolated structures are assessed with two metrics:

  • Volumetric Dice similarity coefficient (DSC): measures the overlap between prediction and ground truth as twice the shared voxel volume divided by the sum of the predicted and ground-truth voxel volumes, providing a global measure of segmentation accuracy.
  • Panoptic Quality (PQ) [1]: for isolated structures, the primary objective is to identify each individual component accurately. PQ jointly measures recognition quality (how well distinct components are correctly identified) and segmentation quality (how accurately their shapes are delineated), so it assesses both instance segmentation performance and overall semantic structure identification.
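
The two metrics above can be illustrated with a minimal NumPy sketch. It is not the official evaluation code: the function names, the IoU threshold of 0.5 for matching instances, the use of integer instance label maps (0 = background) as input, and the score of 1.0 for empty inputs are all assumptions made for illustration.

```python
import numpy as np


def dice(pred, gt):
    """Volumetric Dice between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(2.0 * np.logical_and(pred, gt).sum() / denom)


def panoptic_quality(pred_inst, gt_inst):
    """PQ for integer instance label maps (0 = background).

    Assumption: a predicted and a reference instance match when IoU > 0.5,
    which makes every match unique.
    """
    pred_inst = np.asarray(pred_inst)
    gt_inst = np.asarray(gt_inst)
    pred_ids = [i for i in np.unique(pred_inst) if i != 0]
    gt_ids = [i for i in np.unique(gt_inst) if i != 0]

    matched_ious = []
    for g in gt_ids:
        g_mask = gt_inst == g
        # only predicted instances that overlap this reference instance can match
        for p in np.unique(pred_inst[g_mask]):
            if p == 0:
                continue
            p_mask = pred_inst == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            iou = inter / union
            if iou > 0.5:
                matched_ious.append(iou)
                break

    tp = len(matched_ious)        # matched pairs
    fp = len(pred_ids) - tp       # unmatched predicted instances
    fn = len(gt_ids) - tp         # unmatched reference instances
    denom = tp + 0.5 * fp + 0.5 * fn
    if denom == 0:
        return 1.0  # assumption: no instances in either map
    # PQ = (sum of matched IoUs) / (TP + FP/2 + FN/2) = SQ * RQ
    return float(sum(matched_ious) / denom)


# Toy 1D example: one matched instance, one missed, one spurious
gt_labels = np.array([1, 1, 1, 1, 0, 0, 2, 2, 0, 0])
pred_labels = np.array([1, 1, 1, 0, 0, 0, 0, 0, 3, 3])
print(dice(pred_labels > 0, gt_labels > 0))
print(panoptic_quality(pred_labels, gt_labels))
```

In the toy example the single match has IoU 3/4, with one false positive and one false negative, so PQ is 0.75 / 2 = 0.375, while DSC on the binarized masks is 6/11.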

Segmentation results for dense contiguous and sparse contiguous structures are assessed with two metrics:

  • Volumetric Dice similarity coefficient (DSC)
  • Centerline Dice similarity coefficient (clDice) [2]: evaluates the overlap between the centerline (skeleton) of each segmentation and the voxel mask of the other, assessing how well the central axis of the predicted structure agrees with that of the ground truth.
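
As a rough illustration of clDice, the sketch below follows the definition in [2]: the harmonic mean of topology precision (fraction of the predicted skeleton inside the ground-truth mask) and topology sensitivity (fraction of the ground-truth skeleton inside the predicted mask). The function name is hypothetical, the skeletons are assumed to be precomputed binary arrays (for example with a skeletonization routine such as scikit-image's `skeletonize`), and returning 0.0 for an empty skeleton is an assumption, not a rule stated by the organizers.

```python
import numpy as np


def cl_dice(pred, gt, pred_skeleton, gt_skeleton):
    """clDice from binary masks and their precomputed binary skeletons."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    pred_skeleton = np.asarray(pred_skeleton, dtype=bool)
    gt_skeleton = np.asarray(gt_skeleton, dtype=bool)

    if pred_skeleton.sum() == 0 or gt_skeleton.sum() == 0:
        return 0.0  # assumption: an empty skeleton scores zero

    # topology precision: share of the predicted centerline inside the reference mask
    t_prec = np.logical_and(pred_skeleton, gt).sum() / pred_skeleton.sum()
    # topology sensitivity: share of the reference centerline inside the predicted mask
    t_sens = np.logical_and(gt_skeleton, pred).sum() / gt_skeleton.sum()

    if t_prec + t_sens == 0:
        return 0.0
    return float(2.0 * t_prec * t_sens / (t_prec + t_sens))


# Toy 2D example: the prediction covers only the left half of a horizontal tube
gt_mask = np.ones((3, 6), dtype=bool)
gt_skel = np.zeros((3, 6), dtype=bool)
gt_skel[1, :] = True              # hand-drawn centerline: middle row
pred_mask = np.zeros((3, 6), dtype=bool)
pred_mask[:, :3] = True
pred_skel = np.zeros((3, 6), dtype=bool)
pred_skel[1, :3] = True
print(cl_dice(pred_mask, gt_mask, pred_skel, gt_skel))
```

Here topology precision is 1 and topology sensitivity is 0.5, giving clDice = 2/3: the prediction is penalized for the missing half of the centerline even though everything it does predict lies inside the ground truth.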

Assessments are provided separately for each task. For every task, there are distinct evaluations for algorithms with and without SSL. The assessments of algorithms without SSL are included only as a reference to measure the improvement achieved through SSL.

[1] F. Kofler et al. Panoptica: Instance-wise evaluation of 3D semantic and instance segmentation maps. arXiv preprint arXiv:2312.02608, 2023.
[2] S. Shit et al. clDice - a novel topology-preserving loss function for tubular structure segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16555-16564.


Ranking


  • The final ranking will be determined based solely on the performance in the final test phase.

  • The ranking of a submitted algorithm is determined through the following process:

    • Compute the metric scores for each test case;
    • Calculate the average of the metric scores across all test cases for each individual metric;
    • Rank the averaged scores for each metric independently based on its specific optimization trend (higher is better for DSC, PQ and clDice);
    • Determine the ranking of the submitted algorithm for each structure type (dense isolated, sparse isolated, dense contiguous and sparse contiguous) by calculating the mean rank across the metrics used for that structure type;
    • Determine the overall ranking of the submitted algorithm by calculating the mean rank across the four structure types;
    • If two or more algorithms have equal final ranks, the prize will be shared equally among them.
  • The top 3 or top 5 algorithms are determined for each task based on the assessment results of the algorithms with SSL.
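
The aggregation steps above could be sketched as follows. This is an illustrative reading of the procedure, not the organizers' implementation: the function names and the nested-dictionary layout are hypothetical, and the tie rule (tied scores share the best rank) is an assumption because the text does not specify one.

```python
import numpy as np


def rank_descending(scores):
    """Rank 1 = best (highest score); tied scores share the same rank (assumption)."""
    scores = np.asarray(scores, dtype=float)
    return np.array([1 + np.sum(scores > s) for s in scores], dtype=float)


def overall_mean_rank(case_scores):
    """case_scores[structure][metric] has shape (n_algorithms, n_test_cases)."""
    structure_ranks = []
    for metrics in case_scores.values():
        metric_ranks = []
        for scores in metrics.values():
            # average each algorithm's scores over all test cases
            mean_scores = np.asarray(scores, dtype=float).mean(axis=1)
            # rank per metric; higher is better for DSC, PQ and clDice
            metric_ranks.append(rank_descending(mean_scores))
        # mean rank across the metrics of this structure type
        structure_ranks.append(np.mean(metric_ranks, axis=0))
    # mean rank across the structure types; lower is better
    return np.mean(structure_ranks, axis=0)


# Toy example: 3 algorithms, 2 test cases, identical scores for every metric
toy = np.array([[0.9, 0.8], [0.7, 0.6], [0.5, 0.4]])
case_scores = {
    "dense_isolated": {"DSC": toy, "PQ": toy},
    "sparse_isolated": {"DSC": toy, "PQ": toy},
    "dense_contiguous": {"DSC": toy, "clDice": toy},
    "sparse_contiguous": {"DSC": toy, "clDice": toy},
}
print(overall_mean_rank(case_scores))
```

The returned mean ranks can be ranked once more in ascending order to obtain final positions; algorithms with equal mean rank would then share a position, in line with the prize-sharing rule.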


Statistical Analyses


We will employ resampling-based statistical analyses, including bootstrapping and leave-one-out (LOO) analyses, to assess the robustness and stability of algorithm performance estimates and rankings, following the recommendations of Maier-Hein et al. [1].

Bootstrapping is used to estimate uncertainty in performance metrics, which is particularly suitable given the heterogeneous nature of our test data. Leave-one-out analyses are applied to evaluate the sensitivity of rankings to individual test cases and to assess ranking stability.

To account for dependencies among patches extracted from the same LSM volume, all resampling (bootstrapping and LOO) is performed at the volume level, rather than at the individual patch level. This ensures that intra-volume correlations are preserved.
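
A minimal sketch of volume-level resampling is shown below. It is only an illustration under stated assumptions: the function names, the percentile confidence interval, the number of bootstrap replicates and the fixed seed are hypothetical choices, and the statistic resampled here is a mean patch score rather than a full ranking.

```python
import numpy as np


def volume_bootstrap_ci(patch_scores, volume_ids, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI of the mean score, resampling whole volumes with replacement."""
    patch_scores = np.asarray(patch_scores, dtype=float)
    volume_ids = np.asarray(volume_ids)
    # group patch scores by source volume so patches of one volume stay together
    groups = [patch_scores[volume_ids == v] for v in np.unique(volume_ids)]
    rng = np.random.default_rng(seed)
    means = np.empty(n_boot)
    for b in range(n_boot):
        picked = rng.integers(0, len(groups), size=len(groups))
        means[b] = np.concatenate([groups[i] for i in picked]).mean()
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


def leave_one_volume_out(patch_scores, volume_ids):
    """Mean score recomputed with all patches of one volume removed in turn."""
    patch_scores = np.asarray(patch_scores, dtype=float)
    volume_ids = np.asarray(volume_ids)
    return {
        v.item(): float(patch_scores[volume_ids != v].mean())
        for v in np.unique(volume_ids)
    }


# Toy example: four patches drawn from two volumes
scores = [1.0, 1.0, 0.0, 0.0]
volumes = ["A", "A", "B", "B"]
ci = volume_bootstrap_ci(scores, volumes)
loo = leave_one_volume_out(scores, volumes)
print(ci, loo)
```

To study ranking stability, the same resampled sets of volumes would be applied to every algorithm and the ranking recomputed for each replicate or left-out volume.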

[1] L. Maier-Hein, M. Eisenmann, A. Reinke, et al. Why rankings of biomedical image analysis competitions should be interpreted with care. Nature Communications, 9: 1-13, Dec. 2018.