PANDA dataset
torchmil.datasets.PANDAMILDataset
Bases: BinaryClassificationDataset, WSIDataset
Prostate cANcer graDe Assessment (PANDA) dataset for Multiple Instance Learning (MIL). Download it from Hugging Face Datasets.
About the original PANDA Dataset. The original PANDA dataset contains WSIs of hematoxylin and eosin (H&E) stained prostate biopsy samples. The task is to classify the severity of prostate cancer within each slide, and to localize the cancerous tissue precisely. The dataset includes high-quality pixel-level annotations marking the cancerous tissue.
Dataset Description.
We have preprocessed the whole-slide images (WSIs) by extracting relevant patches and computing features for each patch using various feature extractors.
- A patch is labeled as positive (patch_label=1) if more than 50% of its pixels are annotated as cancerous.
- A WSI is labeled as positive (label=1) if it contains at least one positive patch.
This means a slide is considered positive if there is any evidence of cancerous tissue.
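The two labeling rules above can be sketched in plain Python. This only illustrates the thresholding logic and is not the preprocessing code used to build the dataset. The function names patch_label and wsi_label, and the representation of a patch mask as a nested list of 0/1 values, are assumptions made for this example.

```python
def patch_label(mask):
    """Return 1 if more than 50% of the pixels in a binary mask are cancerous.

    `mask` is a hypothetical nested list of 0/1 pixel annotations.
    """
    pixels = [p for row in mask for p in row]
    if not pixels:
        return 0
    # Strict inequality: exactly 50% cancerous pixels is still negative.
    return int(sum(pixels) / len(pixels) > 0.5)


def wsi_label(patch_labels):
    """Return 1 if the slide contains at least one positive patch."""
    return int(any(label == 1 for label in patch_labels))


masks = [
    [[0, 0], [0, 1]],  # 25% cancerous pixels
    [[1, 1], [0, 1]],  # 75% cancerous pixels
    [[1, 0], [0, 1]],  # exactly 50% cancerous pixels
]
labels = [patch_label(m) for m in masks]
print(labels, wsi_label(labels))
```

Note that a single positive patch is enough to make the whole slide positive, which matches the standard MIL assumption for binary bags.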
Directory Structure.
After extracting the contents of the .tar.gz archives, the following directory structure is expected:
root
├── patches_{patch_size}
│ ├── features
│ │ ├── features_{features_name}
│ │ │ ├── wsi1.npy
│ │ │ ├── wsi2.npy
│ │ │ └── ...
│ ├── labels
│ │ ├── wsi1.npy
│ │ ├── wsi2.npy
│ │ └── ...
│ ├── patch_labels
│ │ ├── wsi1.npy
│ │ ├── wsi2.npy
│ │ └── ...
│ ├── coords
│ │ ├── wsi1.npy
│ │ ├── wsi2.npy
│ │ └── ...
└── splits.csv
Each .npy file corresponds to a single WSI. The splits.csv file defines train/test splits for standardized experimentation.
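As a rough sketch, the layout above can be turned into per-slide file paths with pathlib, which is handy for checking an extraction before loading anything. The helper names wsi_paths and missing_files are hypothetical, and the sketch assumes that features_name takes the same value as the features argument (for example, features_UNI); verify this against your extracted archives.

```python
from pathlib import Path


def wsi_paths(root, wsi_name, patch_size=512, features_name="UNI"):
    """Build the expected .npy paths for one slide, following the tree above.

    Assumption: the features folder is named `features_{features_name}`.
    """
    base = Path(root) / f"patches_{patch_size}"
    fname = f"{wsi_name}.npy"
    return {
        "features": base / "features" / f"features_{features_name}" / fname,
        "labels": base / "labels" / fname,
        "patch_labels": base / "patch_labels" / fname,
        "coords": base / "coords" / fname,
    }


def missing_files(root, wsi_name, **kwargs):
    """List which of the expected files for a slide are absent on disk."""
    paths = wsi_paths(root, wsi_name, **kwargs)
    return [key for key, path in paths.items() if not path.exists()]


paths = wsi_paths("root", "wsi1")
for key, path in paths.items():
    print(f"{key}: {path.as_posix()}")
```

Calling missing_files(root, "wsi1") on a real extraction should return an empty list if all four files for that slide are in place.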
__init__(root, features='UNI', partition='train', bag_keys=['X', 'Y', 'y_inst', 'adj', 'coords'], patch_size=512, adj_with_dist=False, norm_adj=True, load_at_init=True)
Parameters:
- root (str) – Path to the root directory of the dataset.
- features (str, default: 'UNI') – Type of features to use. Must be one of ['UNI', 'resnet50_bt'].
- partition (str, default: 'train') – Partition of the dataset. Must be one of ['train', 'test'].
- bag_keys (list, default: ['X', 'Y', 'y_inst', 'adj', 'coords']) – List of keys to use for the bags. Must be in ['X', 'Y', 'y_inst', 'adj', 'coords'].
- patch_size (int, default: 512) – Size of the patches. Currently, only 512 is supported.
- adj_with_dist (bool, default: False) – If True, the adjacency matrix is built using the Euclidean distance between the patch features. If False, the adjacency matrix is binary.
- norm_adj (bool, default: True) – If True, normalize the adjacency matrix.
- load_at_init (bool, default: True) – If True, load the bags at initialization. If False, load the bags on demand.
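A minimal usage sketch, assuming torchmil is installed and the dataset has been extracted locally. The helpers panda_kwargs and load_panda are hypothetical wrappers written for this example, not part of torchmil. They only check the constraints stated in the parameter list before forwarding the arguments to PANDAMILDataset, and the path /path/to/PANDA is a placeholder.

```python
ALLOWED_FEATURES = ("UNI", "resnet50_bt")
ALLOWED_PARTITIONS = ("train", "test")
ALLOWED_PATCH_SIZES = (512,)


def panda_kwargs(root, features="UNI", partition="train", patch_size=512, **extra):
    """Validate the documented constraints and return constructor kwargs."""
    if features not in ALLOWED_FEATURES:
        raise ValueError(f"features must be one of {ALLOWED_FEATURES}, got {features!r}")
    if partition not in ALLOWED_PARTITIONS:
        raise ValueError(f"partition must be one of {ALLOWED_PARTITIONS}, got {partition!r}")
    if patch_size not in ALLOWED_PATCH_SIZES:
        raise ValueError(f"patch_size must be one of {ALLOWED_PATCH_SIZES}, got {patch_size!r}")
    return dict(root=root, features=features, partition=partition,
                patch_size=patch_size, **extra)


def load_panda(root, **overrides):
    """Instantiate the dataset; requires torchmil to be installed."""
    # Imported lazily so the validation helper works without torchmil.
    from torchmil.datasets import PANDAMILDataset

    return PANDAMILDataset(**panda_kwargs(root, **overrides))


# Build arguments for the test partition, loading bags on demand.
kwargs = panda_kwargs("/path/to/PANDA", partition="test", load_at_init=False)
print(kwargs)
```

Setting load_at_init=False may be preferable when memory is limited, since the bags are then read on demand rather than all at once.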
__getitem__(index)
Parameters:
- index (int) – Index of the bag to retrieve.
Returns:
- bag_dict (TensorDict) – Dictionary containing the keys defined in bag_keys and their corresponding values.
  - X: Features of the bag, of shape (bag_size, ...).
  - Y: Label of the bag.
  - y_inst: Instance labels of the bag, of shape (bag_size, ...).
  - adj: Adjacency matrix of the bag. It is a sparse COO tensor of shape (bag_size, bag_size). If norm_adj=True, the adjacency matrix is normalized.
  - coords: Coordinates of the bag, of shape (bag_size, coords_dim).
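The shapes listed above all share the same leading bag_size, which suggests a quick sanity check on a returned bag. The sketch below is a hypothetical helper, not part of torchmil. It works on any mapping whose values expose a .shape attribute, and it uses stand-in objects instead of real tensors so that it runs without torch installed. The dimensions 7, 1024, and 2 are arbitrary values chosen for illustration.

```python
from types import SimpleNamespace


def bag_size_of(bag):
    """Check that all per-instance entries agree on bag_size and return it.

    `bag` is any mapping with a `.keys()` method whose values have `.shape`.
    """
    sizes = {}
    for key in ("X", "y_inst", "coords"):
        if key in bag.keys():
            sizes[key] = bag[key].shape[0]
    if "adj" in bag.keys():
        rows, cols = bag["adj"].shape
        if rows != cols:
            raise ValueError(f"adj must be square, got {(rows, cols)}")
        sizes["adj"] = rows
    if len(set(sizes.values())) > 1:
        raise ValueError(f"inconsistent bag_size across keys: {sizes}")
    return next(iter(sizes.values()), None)


# Stand-in objects with only a `.shape`, used here instead of real tensors.
fake_bag = {
    "X": SimpleNamespace(shape=(7, 1024)),
    "y_inst": SimpleNamespace(shape=(7,)),
    "adj": SimpleNamespace(shape=(7, 7)),
    "coords": SimpleNamespace(shape=(7, 2)),
}
print(bag_size_of(fake_bag))
```

With a real dataset, the same check could be applied to dataset[0], provided the requested bag_keys include the entries being compared.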