PH#15: Big Transfer (BiT): General Visual Representation Learning
The paper that showed how to pre-train on extremely large datasets and carry those gains over to downstream tasks.
Haiku:
Pick a huge model (ResNet152x4),
Pre-train it on a huge dataset (JFT-300M),
But with new heuristics.
Take the pre-trained large model,
Fine-tune on a smaller dataset,
= SOTA!
Big datasets?
Don’t use batch norm,
use group norm and weight standardization.
Big datasets?
Don’t use MixUp,
save it for fine-tuning on the smaller downstream tasks.
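
Rough sketch of the recipe from the first stanza, in PyTorch: grab a pre-trained BiT backbone, swap the head, fine-tune briefly with plain SGD. The timm checkpoint name (resnetv2_50x1_bitm), CIFAR-10 as the downstream stand-in, and the resolution/schedule here are illustrative assumptions, not the paper's exact BiT-HyperRule.

```python
import torch
import timm
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Load a BiT-style ResNet-v2 (GroupNorm + Weight Standardization) pre-trained upstream;
# the checkpoint name is an assumption about timm's model zoo.
# num_classes=10 swaps in a fresh head for the downstream task.
model = timm.create_model("resnetv2_50x1_bitm", pretrained=True, num_classes=10)

transform = transforms.Compose([
    transforms.Resize((128, 128)),  # BiT-HyperRule picks resolution by dataset size
    transforms.ToTensor(),
])
train_ds = datasets.CIFAR10(root="data", train=True, download=True, transform=transform)
loader = DataLoader(train_ds, batch_size=128, shuffle=True)

# Plain SGD with momentum for fine-tuning, as in the paper (initial lr 0.003).
opt = torch.optim.SGD(model.parameters(), lr=3e-3, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for step, (x, y) in enumerate(loader):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    if step >= 500:  # short schedule for a small downstream task
        break
```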
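And the "group norm and weight standardization" swap, as a minimal BatchNorm-free block sketch (channel sizes, group count, and the eps are illustrative; BiT's real blocks follow the pre-activation ResNet-v2 layout):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StdConv2d(nn.Conv2d):
    """Conv2d with Weight Standardization: each output filter's weights are
    standardized to zero mean / unit variance before the convolution."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        var = w.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        w = (w - mean) / torch.sqrt(var + 1e-10)
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Pre-activation (ResNet-v2) ordering: GroupNorm -> ReLU -> standardized conv,
# with no BatchNorm anywhere.
block = nn.Sequential(
    nn.GroupNorm(num_groups=32, num_channels=64),
    nn.ReLU(inplace=True),
    StdConv2d(64, 128, kernel_size=3, padding=1, bias=False),
)

x = torch.randn(2, 64, 56, 56)
print(block(x).shape)  # torch.Size([2, 128, 56, 56])
```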
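Finally, MixUp: skipped during the huge upstream pre-training, applied only when fine-tuning downstream (the BiT-HyperRule enables it for the mid-sized and larger downstream tasks, with alpha = 0.1). A standard MixUp step would look like this:

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.1):
    """Mix random pairs of examples; alpha=0.1 is the value BiT uses downstream."""
    lam = float(np.random.beta(alpha, alpha))
    idx = torch.randperm(x.size(0))
    return lam * x + (1.0 - lam) * x[idx], y, y[idx], lam

# Inside the fine-tuning loop above, MixUp would replace the plain loss:
#   x_mix, y_a, y_b, lam = mixup_batch(x, y)
#   logits = model(x_mix)
#   loss = lam * loss_fn(logits, y_a) + (1 - lam) * loss_fn(logits, y_b)
```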