Tokens-to-Token (T2T)
torchmil.nn.transformers.T2TLayer
Bases: Module
Tokens-to-Token (T2T) Transformer layer from [Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet](https://arxiv.org/abs/2101.11986).
`__init__(in_dim, out_dim=None, att_dim=512, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(1, 1), n_heads=4, use_mlp=True, dropout=0.0)`
Parameters:

- `in_dim` (`int`) – Input dimension.
- `out_dim` (`int`, default: `None`) – Output dimension. If `None`, the output dimension will be `kernel_size[0] * kernel_size[1] * att_dim`.
- `att_dim` (`int`, default: `512`) – Attention dimension.
- `kernel_size` (`tuple[int, int]`, default: `(3, 3)`) – Kernel size.
- `stride` (`tuple[int, int]`, default: `(1, 1)`) – Stride.
- `padding` (`tuple[int, int]`, default: `(2, 2)`) – Padding.
- `dilation` (`tuple[int, int]`, default: `(1, 1)`) – Dilation.
- `n_heads` (`int`, default: `4`) – Number of attention heads.
- `use_mlp` (`bool`, default: `True`) – Whether to use a feedforward (MLP) layer.
- `dropout` (`float`, default: `0.0`) – Dropout rate.
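A minimal construction sketch, assuming `torchmil` is installed and `T2TLayer` is importable from `torchmil.nn.transformers` as shown above; `in_dim=64` is an arbitrary illustration, while the remaining arguments spell out the documented defaults.

```python
from torchmil.nn.transformers import T2TLayer

# Illustrative configuration: in_dim=64 is arbitrary; every other
# argument is written out at its documented default value.
layer = T2TLayer(
    in_dim=64,
    out_dim=None,  # resolves to kernel_size[0] * kernel_size[1] * att_dim
    att_dim=512,
    kernel_size=(3, 3),
    stride=(1, 1),
    padding=(2, 2),
    dilation=(1, 1),
    n_heads=4,
    use_mlp=True,
    dropout=0.0,
)
```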
`forward(X)`
Parameters:

- `X` (`Tensor`) – Input tensor of shape `(batch_size, seq_len, in_dim)`.
Returns:

- `Y` (`Tensor`) – Output tensor of shape `(batch_size, new_seq_len, out_dim)`. If `out_dim` is `None`, the output dimension will be `kernel_size[0] * kernel_size[1] * att_dim`.
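Continuing the sketch above, a hedged forward pass. The square sequence length (256 = 16 × 16) is an assumption carried over from the T2T-ViT soft split, which unfolds tokens over a 2D grid; the exact `new_seq_len` follows from the `kernel_size`, `stride`, `padding`, and `dilation` settings.

```python
import torch

# Batch of 8 sequences, 256 tokens each, 64 features per token.
# 256 = 16 * 16: a square token grid is assumed, as in T2T-ViT.
X = torch.randn(8, 256, 64)  # (batch_size, seq_len, in_dim)

Y = layer(X)  # (batch_size, new_seq_len, out_dim)

# out_dim was left as None, so it resolves to 3 * 3 * 512 = 4608.
print(Y.shape)
```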