CAMELYON16 dataset

torchmil.datasets.CAMELYON16MILDataset

Bases: BinaryClassificationDataset, WSIDataset

CAMELYON16 dataset for Multiple Instance Learning (MIL). Download it from Hugging Face Datasets.

About the Original CAMELYON16 Dataset. The original CAMELYON16 dataset contains whole-slide images (WSIs) of hematoxylin and eosin (H&E) stained lymph node sections. The task is to determine whether each slide contains metastatic tissue and, if so, to localize it. The dataset includes pixel-level annotations marking the metastases.

Dataset Description. We have preprocessed the whole-slide images (WSIs) by extracting relevant patches and computing features for each patch using various feature extractors.

  • A patch is labeled as positive (patch_label=1) if more than 50% of its pixels are annotated as metastatic.
  • A WSI is labeled as positive (label=1) if it contains at least one positive patch.

In other words, a single positive patch is enough to make the whole slide positive.
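The two labeling rules can be sketched in a few lines of plain Python. The function names and the `metastatic_fraction` input are hypothetical and serve only to illustrate the thresholds; they are not part of torchmil, and the preprocessing code that produced the published labels may differ in detail.

```python
def patch_label(metastatic_fraction: float, threshold: float = 0.5) -> int:
    """Return 1 if strictly more than `threshold` of the patch pixels are metastatic."""
    return int(metastatic_fraction > threshold)


def slide_label(patch_labels) -> int:
    """Return 1 if the slide contains at least one positive patch."""
    return int(any(patch_labels))


# Hypothetical fractions of annotated metastatic pixels for four patches.
fractions = [0.0, 0.5, 0.62, 0.1]
patch_labels = [patch_label(f) for f in fractions]
wsi_label = slide_label(patch_labels)
```

Note that a patch with exactly 50% metastatic pixels stays negative here, because the rule requires more than 50%.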

Directory Structure. After extracting the contents of the .tar.gz archives, the following directory structure is expected:

root
├── patches_{patch_size}
│ ├── features
│ │ ├── features_{features_name}
│ │ │ ├── wsi1.npy
│ │ │ ├── wsi2.npy
│ │ │ └── ...
│ ├── labels
│ │ ├── wsi1.npy
│ │ ├── wsi2.npy
│ │ └── ...
│ ├── patch_labels
│ │ ├── wsi1.npy
│ │ ├── wsi2.npy
│ │ └── ...
│ ├── coords
│ │ ├── wsi1.npy
│ │ ├── wsi2.npy
│ │ └── ...
└── splits.csv
Each .npy file corresponds to a single WSI. The splits.csv file defines train/test splits for standardized experimentation.
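As a rough illustration of the layout above, the following sketch builds the paths where the files for one slide would be expected. The helper `expected_paths` is hypothetical, and it assumes that `{features_name}` matches the value passed as `features` (for example `UNI`); torchmil resolves these paths internally.

```python
from pathlib import Path


def expected_paths(root, wsi_name, patch_size=512, features_name="UNI"):
    """Map each kind of array to the .npy file expected for one WSI."""
    base = Path(root) / f"patches_{patch_size}"
    return {
        "features": base / "features" / f"features_{features_name}" / f"{wsi_name}.npy",
        "labels": base / "labels" / f"{wsi_name}.npy",
        "patch_labels": base / "patch_labels" / f"{wsi_name}.npy",
        "coords": base / "coords" / f"{wsi_name}.npy",
    }


paths = expected_paths("root", "wsi1")
# Report which files are missing, e.g. after a partial extraction.
missing = [name for name, path in paths.items() if not path.exists()]
```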

__init__(root, features='UNI', partition='train', bag_keys=['X', 'Y', 'y_inst', 'adj', 'coords'], patch_size=512, adj_with_dist=False, norm_adj=True, load_at_init=True)

Parameters:

  • root (str) –

    Path to the root directory of the dataset.

  • features (str, default: 'UNI' ) –

    Type of features to use. Must be one of ['UNI', 'resnet50_bt'].

  • partition (str, default: 'train' ) –

    Partition of the dataset. Must be one of ['train', 'test'].

  • bag_keys (list, default: ['X', 'Y', 'y_inst', 'adj', 'coords'] ) –

    Keys to include in each bag. Each key must be one of ['X', 'Y', 'y_inst', 'adj', 'coords'].

  • patch_size (int, default: 512 ) –

    Size of the patches. Currently, only 512 is supported.

  • adj_with_dist (bool, default: False ) –

    If True, the adjacency matrix is built using the Euclidean distance between patch features. If False, the adjacency matrix is binary.

  • norm_adj (bool, default: True ) –

    If True, normalize the adjacency matrix.

  • load_at_init (bool, default: True ) –

    If True, load the bags at initialization. If False, load the bags on demand.
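A minimal usage sketch, assuming torchmil is installed and the archives have been extracted under a root directory. The wrapper `build_camelyon16` and the example path are hypothetical; the keyword arguments mirror the documented defaults of `__init__`.

```python
# Keyword arguments mirroring the documented defaults of __init__.
DEFAULT_KWARGS = {
    "features": "UNI",
    "partition": "train",
    "bag_keys": ["X", "Y", "y_inst", "adj", "coords"],
    "patch_size": 512,
    "adj_with_dist": False,
    "norm_adj": True,
    "load_at_init": True,
}


def build_camelyon16(root, **overrides):
    """Instantiate the dataset; needs torchmil and the extracted data under `root`."""
    # Imported lazily so this snippet can be loaded without torchmil installed.
    from torchmil.datasets import CAMELYON16MILDataset

    kwargs = {**DEFAULT_KWARGS, **overrides}
    return CAMELYON16MILDataset(root, **kwargs)


# Example call (hypothetical path), loading test bags on demand:
# test_set = build_camelyon16("/data/camelyon16", partition="test", load_at_init=False)
```

Setting `load_at_init=False` defers reading the .npy files until a bag is requested, which trades slower indexing for a faster start-up.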

__getitem__(index)

Parameters:

  • index (int) –

    Index of the bag to retrieve.

Returns:

  • bag_dict ( TensorDict ) –

    Dictionary containing the keys defined in bag_keys and their corresponding values.

    • X: Features of the bag, of shape (bag_size, ...).
    • Y: Label of the bag.
    • y_inst: Instance labels of the bag, of shape (bag_size, ...).
    • adj: Adjacency matrix of the bag. It is a sparse COO tensor of shape (bag_size, bag_size). If norm_adj=True, the adjacency matrix is normalized.
    • coords: Coordinates of the bag, of shape (bag_size, coords_dim).
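To make the sparse COO layout of `adj` concrete, here is a plain-Python sketch of a binary adjacency (the `adj_with_dist=False` case) stored as row indices, column indices, and values. It rests on several assumptions that this page does not confirm: coordinates are integer grid positions, two patches are neighbors when they touch on the grid (8-neighborhood), and normalization is the symmetric form D^{-1/2} A D^{-1/2}. The rule torchmil actually uses may differ.

```python
import math


def grid_adjacency_coo(coords):
    """Binary adjacency in COO form: (rows, cols, values) for grid neighbors."""
    index = {tuple(c): i for i, c in enumerate(coords)}
    rows, cols, vals = [], [], []
    for (x, y), i in index.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                j = index.get((x + dx, y + dy))
                if j is None or j == i:
                    continue  # no patch at that position, or the patch itself
                rows.append(i)
                cols.append(j)
                vals.append(1.0)
    return rows, cols, vals


def normalize_symmetric(rows, cols, vals, n):
    """Scale each entry (i, j) by 1 / sqrt(deg(i) * deg(j))."""
    deg = [0.0] * n
    for r, v in zip(rows, vals):
        deg[r] += v
    return [v / math.sqrt(deg[r] * deg[c]) for r, c, v in zip(rows, cols, vals)]


# Three touching patches and one isolated patch (hypothetical coordinates).
coords = [(0, 0), (0, 1), (1, 1), (5, 5)]
rows, cols, vals = grid_adjacency_coo(coords)
norm_vals = normalize_symmetric(rows, cols, vals, n=len(coords))
```

Only the nonzero entries are stored, so the isolated patch contributes nothing, even though the matrix shape is still (bag_size, bag_size).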