(left) Comparison between the performance of the momentum teacher and the student during training. (right) Comparison between different types of teacher network: the momentum encoder leads to the best performance but is not the only viable option.
Analyzing the training dynamics
A key observation is that this teacher constantly outperforms the student during training, and we observe the same behavior when training with a ResNet-50.
Polyak-Ruppert averaging is often used to simulate model ensembling and improve the performance of a network at the end of training. Our method can be interpreted as applying Polyak-Ruppert averaging during training to constantly build a model ensemble with superior performance. This model ensemble then guides the training of the student network.
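As a rough illustration, this teacher update can be written as an exponential moving average of the student's weights. A minimal PyTorch-style sketch follows; the momentum value and function name are illustrative (DINO ramps the momentum towards 1.0 with a cosine schedule), not the paper's exact code:

```python
import torch

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    """Move each teacher parameter towards the matching student parameter.

    A momentum close to 1.0 makes the teacher a slowly evolving average
    (Polyak-Ruppert style) of past student weights.
    """
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)
```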
We plot the entropy and the KL divergence during training with and without centering and sharpening. If either operation is missing, the KL divergence converges to zero, indicating collapse.
(left) evolution of the teacher's target entropy over training epochs; (right) evolution of the KL divergence between teacher and student outputs.
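For concreteness, here is a minimal sketch of what centering and sharpening look like on the teacher output (PyTorch-style; the temperatures, center momentum, and helper names are illustrative defaults, not taken verbatim from the official implementation). Centering subtracts a running mean of the teacher outputs, which discourages one dimension from dominating; sharpening uses a low softmax temperature, which discourages collapse to the uniform distribution.

```python
import torch
import torch.nn.functional as F

def teacher_probs(teacher_logits, center, teacher_temp=0.04):
    # Centering (subtract the running mean) then sharpening (low temperature).
    return F.softmax((teacher_logits - center) / teacher_temp, dim=-1)

@torch.no_grad()
def update_center(center, teacher_logits, momentum=0.9):
    # EMA update of the center using the batch mean of the teacher outputs.
    batch_mean = teacher_logits.mean(dim=0, keepdim=True)
    return center * momentum + batch_mean * (1.0 - momentum)
```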
4. Compute requirements
We detail the time and GPU memory requirements when running ViT-S/16 DINO models on two 8-GPU machines.
Overall, training DINO with Vision Transformers achieves 76.1% top-1 accuracy using two 8-GPU servers for 3 days.
5. Training with small batches
Note that the experiment with a batch size of 128 runs on only 1 GPU. We have explored training a model with a batch size of 8, reaching 35.2% after 50 epochs, showing the potential for training large models that barely fit an image per GPU.
Summary
Self-supervised pretraining of standard ViT models achieves performance comparable with the best convnets specifically designed for this setting.
We have also seen two properties emerge that can be leveraged in future applications:
the quality of the features in k-NN classification shows potential for image retrieval, where ViTs are already showing promising results;
the presence of information about the scene layout in the features can also benefit weakly supervised image segmentation.
The main result of this paper: self-supervised learning may be the key to developing BERT-like models based on ViT.
Unlike knowledge distillation, we are not given a teacher gθt a priori; instead, we build it from past iterations of the student network. We study different update rules for the teacher in "Building different teachers from the student" and show that, in our framework, freezing the teacher network over an epoch works surprisingly well, whereas copying the student's weights into the teacher fails to converge. Of particular interest, using an exponential moving average (EMA) of the student's weights, i.e. a momentum encoder, is especially well suited to our framework.
Self-training and knowledge distillation: self-training aims to improve the quality of features by propagating a small initial set of annotations to a large set of unlabeled instances. This propagation can be done with hard assignments of labels or with soft assignments. When soft labels are used, the approach is usually referred to as knowledge distillation, which was primarily designed to compress a model by training a small network to mimic the output of a larger one. It has been shown that distillation can propagate soft pseudo-labels to unlabeled data within a self-training pipeline, drawing an essential connection between self-training and knowledge distillation. Our work builds on this relationship and extends knowledge distillation to the case where no labels are available, with the teacher built dynamically during training. In this way, knowledge distillation, rather than being used as a post-processing step of self-supervised pretraining, is cast directly as a self-supervised objective. Finally, our work is also related to codistillation, where the student and teacher share the same architecture and use distillation during training. However, in codistillation the teacher is also distilled from the student, whereas in our work the teacher is updated with an average of the student.
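To illustrate how soft-label distillation turns into a self-supervised objective, here is a minimal sketch of a cross-entropy loss between the teacher's softened output and the student's output (PyTorch-style; the temperatures and tensor names are assumptions for illustration, and the teacher receives no gradient):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    # Soft targets from the teacher; no gradient flows through the teacher.
    t = F.softmax(teacher_logits.detach() / teacher_temp, dim=-1)
    # Student log-probabilities at its own temperature.
    log_s = F.log_softmax(student_logits / student_temp, dim=-1)
    # Cross-entropy between teacher and student distributions.
    return -(t * log_s).sum(dim=-1).mean()
```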