Tokens-to-Token (T2T)

torchmil.nn.transformers.T2TLayer

Bases: Module

Tokens-to-Token (T2T) Transformer layer from "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet".
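
Conceptually, a T2T step applies attention over the token sequence, reshapes the result back into a 2D grid, and then merges neighboring tokens via a "soft split" (an unfold over the grid). Below is a minimal sketch of the soft split, assuming the tokens form a square grid; t2t_soft_split is a hypothetical helper for illustration, not part of torchmil:

```python
import math

import torch
import torch.nn.functional as F

def t2t_soft_split(X, kernel_size=(3, 3), stride=(1, 1),
                   padding=(2, 2), dilation=(1, 1)):
    # X: (batch_size, seq_len, dim); seq_len is assumed to be a perfect square.
    batch_size, seq_len, dim = X.shape
    side = math.isqrt(seq_len)
    # Reshape the token sequence back into a 2D grid: (batch, dim, side, side).
    grid = X.transpose(1, 2).reshape(batch_size, dim, side, side)
    # Soft split: each output token concatenates a kernel_size neighborhood,
    # so the channel dimension grows to dim * kernel_size[0] * kernel_size[1].
    patches = F.unfold(grid, kernel_size=kernel_size, dilation=dilation,
                       padding=padding, stride=stride)
    return patches.transpose(1, 2)  # (batch, new_seq_len, dim * k0 * k1)

X = torch.randn(2, 64, 32)          # 64 tokens = an 8 x 8 grid
print(t2t_soft_split(X).shape)      # torch.Size([2, 100, 288])
```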

__init__(in_dim, out_dim=None, att_dim=512, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(1, 1), n_heads=4, use_mlp=True, dropout=0.0)

Parameters:

  • in_dim (int) –

    Input dimension.

  • out_dim (int, default: None ) –

Output dimension. If None, the output dimension defaults to kernel_size[0] * kernel_size[1] * att_dim (see the construction sketch after this parameter list).

  • att_dim (int, default: 512 ) –

    Attention dimension.

  • kernel_size (tuple[int, int], default: (3, 3) ) –

Kernel size of the soft split (unfold) operation.

  • stride (tuple[int, int], default: (1, 1) ) –

Stride of the soft split.

  • padding (tuple[int, int], default: (2, 2) ) –

Padding of the soft split.

  • dilation (tuple[int, int], default: (1, 1) ) –

Dilation of the soft split.

  • n_heads (int, default: 4 ) –

Number of attention heads.

  • use_mlp (bool, default: True ) –

Whether to use a feedforward (MLP) layer after attention.

  • dropout (float, default: 0.0 ) –

    Dropout rate.
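
A minimal construction sketch (the in_dim value is a placeholder; the import path follows the module path in the heading above):

```python
from torchmil.nn.transformers import T2TLayer

# With out_dim=None, the output dimension defaults to
# kernel_size[0] * kernel_size[1] * att_dim = 3 * 3 * 512 = 4608.
layer = T2TLayer(in_dim=768, att_dim=512, kernel_size=(3, 3),
                 stride=(1, 1), padding=(2, 2), n_heads=4)
```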

forward(X)

Parameters:

  • X (Tensor) –

    Input tensor of shape (batch_size, seq_len, in_dim).

Returns:

  • Y (Tensor) –

    Output tensor of shape (batch_size, new_seq_len, out_dim). If out_dim is None, out_dim will be kernel_size[0] * kernel_size[1] * att_dim. A shape example follows below.
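
A hedged shape check, assuming the layer arranges the sequence into a square token grid (here 8 x 8 = 64 tokens) and that new_seq_len follows the standard unfold output-size formula:

```python
import torch
from torchmil.nn.transformers import T2TLayer

layer = T2TLayer(in_dim=768)   # out_dim=None -> 3 * 3 * 512 = 4608
X = torch.randn(4, 64, 768)    # (batch_size, seq_len, in_dim); 64 = 8 x 8 grid
Y = layer(X)
# With the defaults, each grid side grows to
# (8 + 2*2 - 1*(3 - 1) - 1) // 1 + 1 = 10, so new_seq_len = 10 * 10 = 100.
print(Y.shape)                 # expected: torch.Size([4, 100, 4608])
```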