# Masked AutoEncoder

## 1. Pretrain

We provide the bash script `main_pretrain.sh` for pretraining. You can modify the hyperparameters in the script to suit your needs. The `--mask_ratio` flag sets the fraction of image patches hidden from the encoder (see the masking sketch in the appendix at the end of this document).

```Shell
cd Vision-Pretraining-Tutorial/masked_image_modeling/
python main_pretrain.py --cuda \
                        --dataset cifar10 \
                        --model vit_t \
                        --mask_ratio 0.75 \
                        --batch_size 128 \
                        --optimizer adamw \
                        --weight_decay 0.05 \
                        --lr_scheduler cosine \
                        --base_lr 0.00015 \
                        --min_lr 0.0 \
                        --max_epoch 400 \
                        --eval_epoch 20
```

## 2. Finetune

We provide the bash script `main_finetune.sh` for finetuning. You can modify the hyperparameters in the script to suit your needs. Pass the pretrained MAE encoder weights via `--pretrained`.

```Shell
cd Vision-Pretraining-Tutorial/masked_image_modeling/
python main_finetune.py --cuda \
                        --dataset cifar10 \
                        --model vit_t \
                        --batch_size 256 \
                        --optimizer adamw \
                        --weight_decay 0.05 \
                        --base_lr 0.0005 \
                        --min_lr 0.000001 \
                        --max_epoch 100 \
                        --wp_epoch 5 \
                        --eval_epoch 5 \
                        --pretrained path/to/vit_t.pth
```

## 3. Evaluate

- Evaluate the `top1 & top5` accuracy of `ViT-Tiny` on the CIFAR10 dataset:

```Shell
python main_finetune.py --cuda \
                        --dataset cifar10 \
                        -m vit_t \
                        --batch_size 256 \
                        --eval \
                        --resume path/to/vit_t_cifar10.pth
```

## 4. Visualize Image Reconstruction

- Visualize the reconstructions of `ViT-Tiny` pretrained with the MAE framework on the CIFAR10 dataset:

```Shell
python main_pretrain.py --cuda \
                        --dataset cifar10 \
                        -m vit_t \
                        --resume path/to/mae_vit_t_cifar10.pth \
                        --eval \
                        --batch_size 1
```

## 5. Experiments

- On CIFAR10:

| Method | Model | Epoch | Top 1 | Weight | MAE weight |
| :----: | :---: | :---: | :---: | :----: | :--------: |
| MAE    | ViT-T | 100   | 91.2  | [ckpt](https://github.com/yjh0410/MAE/releases/download/checkpoints/ViT-T_Cifar10.pth) | [ckpt](https://github.com/yjh0410/MAE/releases/download/checkpoints/MAE_ViT-T_Cifar10.pth) |

## 6. Acknowledgment

Thanks to **Kaiming He** for the inspiring work on [MAE](http://openaccess.thecvf.com/content/CVPR2022/papers/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.pdf), which clearly articulates the differences between vision and language signals and has informed much subsequent work in vision. We are also grateful for the official [MAE source code](https://github.com/facebookresearch/mae), and to [**IcarusWizard**](https://github.com/IcarusWizard) for the [MAE reproduction](https://github.com/IcarusWizard/MAE).
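
## 7. Appendix: Random Masking Sketch

The `--mask_ratio 0.75` flag used in pretraining means the encoder only sees 25% of the image patches. For reference, below is a minimal PyTorch sketch of per-sample random masking in the spirit of the official MAE implementation's `random_masking`; it is illustrative and may differ from this repository's exact code.

```Python
import torch

def random_masking(x, mask_ratio=0.75):
    """Randomly mask a fraction of patch tokens, independently per sample.

    Illustrative sketch, not this repo's exact implementation.
    x: [N, L, D] batch of patch embeddings (N samples, L patches, D dims).
    Returns the kept tokens, a binary mask (1 = removed) in the original
    patch order, and the indices needed to restore that order.
    """
    N, L, D = x.shape
    len_keep = int(L * (1 - mask_ratio))

    noise = torch.rand(N, L, device=x.device)        # uniform noise per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation of patches
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

    # Keep only the first len_keep patches of the shuffled sequence.
    ids_keep = ids_shuffle[:, :len_keep]
    x_masked = torch.gather(x, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, D))

    # Binary mask in the original patch order: 0 = kept, 1 = masked.
    mask = torch.ones(N, L, device=x.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, dim=1, index=ids_restore)
    return x_masked, mask, ids_restore
```

With `mask_ratio=0.75`, an image split into 64 patches leaves only 16 visible tokens for the encoder, which is what makes MAE pretraining cheap relative to encoding the full patch sequence.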