Processed MIL dataset
torchmil.datasets.ProcessedMILDataset
Bases: Dataset
This class represents a general MIL dataset where the bags have been processed and saved as numpy files. It enforces strict data availability for core components, failing fast if expected files are missing.
MIL processing and directory structure.
The dataset expects pre-processed bags saved as individual numpy files.
- A feature file should yield an array of shape (bag_size, ...), where ... represents the shape of the features.
- A label file should yield an array of arbitrary shape, e.g., (1,) for binary classification.
- An instance label file should yield an array of shape (bag_size, ...), where ... represents the shape of the instance labels.
- A coordinates file should yield an array of shape (bag_size, coords_dim), where coords_dim is the dimension of the coordinates.
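As an illustration of these shapes, the following sketch writes two synthetic bags to disk with NumPy. The directory names, bag sizes, and the rule used to derive the bag label are assumptions made for the example; they are not prescribed by the library.

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
root = Path(tempfile.mkdtemp())  # hypothetical location for the processed dataset

# Hypothetical directory names, one per kind of file described above.
dirs = {name: root / name for name in ("features", "labels", "inst_labels", "coords")}
for directory in dirs.values():
    directory.mkdir()

bag_size, feat_dim, coords_dim = 5, 8, 2
for bag_name in ("bag1", "bag2"):
    features = rng.normal(size=(bag_size, feat_dim)).astype(np.float32)  # (bag_size, ...)
    inst_labels = rng.integers(0, 2, size=bag_size)                      # (bag_size,)
    # Assumption for this example: a bag is positive if any instance is positive.
    label = np.array([inst_labels.max()])                                # (1,)
    coords = rng.integers(0, 10, size=(bag_size, coords_dim))            # (bag_size, coords_dim)

    # One .npy file per bag in each directory, all sharing the same bag name.
    np.save(dirs["features"] / f"{bag_name}.npy", features)
    np.save(dirs["labels"] / f"{bag_name}.npy", label)
    np.save(dirs["inst_labels"] / f"{bag_name}.npy", inst_labels)
    np.save(dirs["coords"] / f"{bag_name}.npy", coords)
```

The four directories can then be passed as features_path, labels_path, inst_labels_path and coords_path.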
Bag keys and directory structure. The dataset can be initialized with a list of bag keys, which are used to choose which data to load. This dataset expects the following directory structure:
features_path/ (if "X" in bag_keys)
├── bag1.npy
├── bag2.npy
└── ...
labels_path/ (if "Y" in bag_keys)
├── bag1.npy
├── bag2.npy
└── ...
inst_labels_path/ (if "y_inst" in bag_keys)
├── bag1.npy
├── bag2.npy
└── ...
coords_path/ (if "coords" or "adj" in bag_keys)
├── bag1.npy
├── bag2.npy
└── ...
Adjacency matrix. If the coordinates of the instances are available, the adjacency matrix will be built using the Euclidean distance between the coordinates. Formally, the adjacency matrix \(\mathbf{A} = \left[ A_{ij} \right]\) is defined as:
\begin{equation}
A_{ij} = \begin{cases}
d_{ij}, & \text{if } \left\| \mathbf{c}_i - \mathbf{c}_j \right\| \leq \text{dist_thr}, \\
0, & \text{otherwise},
\end{cases}
\quad
d_{ij} = \begin{cases}
1, & \text{if } \text{adj_with_dist=False}, \\
\exp\left( -\frac{\left\| \mathbf{x}_i - \mathbf{x}_j \right\|}{d} \right), & \text{if } \text{adj_with_dist=True}.
\end{cases}
\end{equation}
where \(\mathbf{c}_i\) and \(\mathbf{c}_j\) are the coordinates of the instances \(i\) and \(j\), respectively, \(\text{dist_thr}\) is a threshold distance, and \(\mathbf{x}_i \in \mathbb{R}^d\) and \(\mathbf{x}_j \in \mathbb{R}^d\) are the features of instances \(i\) and \(j\), respectively.
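To make the definition concrete, here is a minimal dense NumPy sketch of the formula. The function name build_adjacency is hypothetical and this is not the library's _build_adj implementation: it takes the formula literally (so the diagonal is non-zero, since every instance is at distance 0 from itself), assumes features of shape (bag_size, d), and omits both the normalization controlled by norm_adj and the sparse storage.

```python
import numpy as np


def build_adjacency(coords, features, dist_thr=1.5, adj_with_dist=False):
    """Dense adjacency matrix following the formula above (illustrative only)."""
    coords = np.asarray(coords, dtype=float)
    features = np.asarray(features, dtype=float)

    # Pairwise Euclidean distances between instance coordinates.
    coord_dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    connected = coord_dist <= dist_thr

    if adj_with_dist:
        # d is the feature dimension; weights decay with feature distance.
        d = features.shape[1]
        feat_dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
        weights = np.exp(-feat_dist / d)
    else:
        weights = np.ones_like(coord_dist)

    # Keep the weight where instances are close enough, zero elsewhere.
    return np.where(connected, weights, 0.0)


coords = np.array([[0, 0], [1, 0], [3, 0]])
features = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 0.0]])

A_binary = build_adjacency(coords, features)
A_weighted = build_adjacency(coords, features, adj_with_dist=True)
```

In this toy bag only instances 0 and 1 lie within dist_thr of each other, so instance 2 is connected to nothing but itself.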
How bags are built. When the __getitem__ method is called, the bag is built as follows (pseudocode):
1. The __getitem__ method is called with an index.
2. The bag name is retrieved from the list of bag names.
3. The _build_bag method is called with the bag name:
3.1. The _build_bag method loads the bag from disk using the _load_bag method. This method loads the features, labels, instance labels and coordinates from disk using the _load_features, _load_labels, _load_inst_labels and _load_coords methods.
3.2. If the coordinates have been provided, it builds the adjacency matrix using the _build_adj method.
4. The bag is returned as a dictionary containing the keys defined in bag_keys and their corresponding values.
This behaviour can be extended or modified by overriding the corresponding methods.
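The call order above, and the idea of overriding one step, can be sketched with a toy stand-in class. Everything here is an assumption made for illustration: SketchBagDataset is not the library's class, it reads from an in-memory dictionary instead of .npy files, and the method signatures are guesses that only mirror the method names listed in the steps.

```python
import numpy as np


class SketchBagDataset:
    """Toy stand-in mirroring the documented call order; not torchmil's code."""

    def __init__(self, store, bag_names, bag_keys=("X", "Y", "coords", "adj"), dist_thr=1.5):
        self.store = store  # hypothetical in-memory replacement for the .npy files
        self.bag_names = list(bag_names)
        self.bag_keys = list(bag_keys)
        self.dist_thr = dist_thr

    def _load_bag(self, name):
        # Stands in for _load_features, _load_labels, _load_inst_labels and _load_coords.
        return dict(self.store[name])

    def _build_adj(self, bag):
        # Binary adjacency from thresholded Euclidean distances between coordinates.
        coords = bag["coords"].astype(float)
        dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        return (dist <= self.dist_thr).astype(np.float32)

    def _build_bag(self, name):
        bag = self._load_bag(name)                      # step 3.1
        if "adj" in self.bag_keys and "coords" in bag:  # step 3.2
            bag["adj"] = self._build_adj(bag)
        return bag

    def __len__(self):
        return len(self.bag_names)

    def __getitem__(self, index):
        name = self.bag_names[index]                            # step 2
        bag = self._build_bag(name)                             # step 3
        return {k: bag[k] for k in self.bag_keys if k in bag}   # step 4


class CenteredBagDataset(SketchBagDataset):
    """Overrides a single step: features are mean-centered after loading."""

    def _load_bag(self, name):
        bag = super()._load_bag(name)
        bag["X"] = bag["X"] - bag["X"].mean(axis=0, keepdims=True)
        return bag


store = {
    "bag1": {
        "X": np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
        "Y": np.array([1]),
        "coords": np.array([[0, 0], [0, 1], [5, 5]]),
    }
}
dataset = CenteredBagDataset(store, bag_names=["bag1"])
bag = dataset[0]
```

Only _load_bag is overridden here; the rest of the flow (adjacency construction and key selection) is inherited unchanged.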
__init__(features_path=None, labels_path=None, inst_labels_path=None, coords_path=None, bag_names=None, bag_keys=['X', 'Y', 'y_inst', 'adj', 'coords'], dist_thr=1.5, adj_with_dist=False, norm_adj=True, load_at_init=False)
Class constructor.
Parameters:
- features_path (str, default: None) – Path to the directory containing the features.
- labels_path (str, default: None) – Path to the directory containing the bag labels.
- inst_labels_path (str, default: None) – Path to the directory containing the instance labels.
- coords_path (str, default: None) – Path to the directory containing the coordinates.
- bag_keys (list, default: ['X', 'Y', 'y_inst', 'adj', 'coords']) – List of keys to load the bags data. The TensorDict returned by the __getitem__ method will have these keys. Possible keys are:
  - "X": Load the features of the bag.
  - "Y": Load the label of the bag.
  - "y_inst": Load the instance labels of the bag.
  - "adj": Load the adjacency matrix of the bag. It requires the coordinates to be loaded.
  - "coords": Load the coordinates of the bag.
- bag_names (list, default: None) – List of bag names to load. If None, all bags from the features_path are loaded.
- dist_thr (float, default: 1.5) – Distance threshold for building the adjacency matrix.
- adj_with_dist (bool, default: False) – If True, the adjacency matrix is built using the Euclidean distance between the instance features. If False, the adjacency matrix is binary.
- norm_adj (bool, default: True) – If True, normalize the adjacency matrix.
- load_at_init (bool, default: False) – If True, load the bags at initialization. If False, load the bags on demand.
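A possible way to put these parameters together is sketched below. The helper names dataset_kwargs and make_dataset and the directory layout under root are hypothetical; only the class name and the keyword arguments come from the signature above. The import is placed inside make_dataset so that the argument dictionary can be inspected without torchmil installed.

```python
import os


def dataset_kwargs(root):
    """Keyword arguments for a dataset with features, bag labels and coordinates."""
    return {
        "features_path": os.path.join(root, "features"),  # hypothetical layout
        "labels_path": os.path.join(root, "labels"),
        "coords_path": os.path.join(root, "coords"),
        # No instance labels in this example, so "y_inst" is left out
        # and inst_labels_path keeps its default of None.
        "bag_keys": ["X", "Y", "adj", "coords"],
        "dist_thr": 1.5,
        "adj_with_dist": False,
        "norm_adj": True,
        "load_at_init": False,
    }


def make_dataset(root):
    # Requires torchmil; imported here so the rest of the snippet runs without it.
    from torchmil.datasets import ProcessedMILDataset

    return ProcessedMILDataset(**dataset_kwargs(root))


kwargs = dataset_kwargs("data")
```

With torchmil installed and the directories populated, make_dataset("data")[0] would be expected to return a bag with the four requested keys.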
__getitem__(index)
Parameters:
- index (int) – Index of the bag to retrieve.
Returns:
- bag_dict (TensorDict) – Dictionary containing the keys defined in bag_keys and their corresponding values.
  - X: Features of the bag, of shape (bag_size, ...).
  - Y: Label of the bag.
  - y_inst: Instance labels of the bag, of shape (bag_size, ...).
  - adj: Adjacency matrix of the bag. It is a sparse COO tensor of shape (bag_size, bag_size). If norm_adj=True, the adjacency matrix is normalized.
  - coords: Coordinates of the bag, of shape (bag_size, coords_dim).