Databricks
Dynamic Masking Rate Scheduling for MLM Pretraining
Pages
11
Time to read
30 mins
Publication
Language
English
This research article proposes dynamically scheduling the masking rate during Masked Language Modeling (MLM) pretraining. MLM pretraining conventionally uses a fixed masking rate of 15%; the authors argue that varying the rate over the course of training can improve model performance. The study shows that linearly decreasing the masking rate improves average GLUE accuracy over fixed-rate baselines for both BERT-base and BERT-large. The experiments also indicate that a scheduled masking rate makes training more efficient, yielding speedups in pretraining time. The findings suggest that scheduling makes better use of different masking rates than a single fixed rate does, improving both model quality on downstream tasks and training efficiency. The work contributes to the understanding of hyperparameter scheduling in deep learning, specifically in the context of language model pretraining.
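To make the idea of a linearly decreasing masking rate concrete, the following is a minimal Python sketch, not the authors' implementation. The function names, the start and end rates (30% and 15%), the placeholder mask token id, and the `-100` ignore label are all illustrative assumptions; the summary above does not state the rates used in the paper. For brevity, every selected token is replaced with the mask token, rather than following BERT's usual mix of mask, random, and unchanged replacements.

```python
import random


def linear_masking_rate(step, total_steps, start_rate=0.30, end_rate=0.15):
    """Linearly interpolate the masking rate from start_rate to end_rate.

    start_rate and end_rate are illustrative assumptions, not values
    taken from the summary above.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps outside the range stay at the endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_rate + (end_rate - start_rate) * progress


def mask_tokens(token_ids, rate, mask_id, rng):
    """Mask each token independently with probability `rate`.

    Returns (masked_ids, labels). Labels hold the original token at masked
    positions and -100 elsewhere (an assumed "ignore" value for the loss).
    Simplification: every selected token becomes mask_id.
    """
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < rate:
            labels[i] = tok
            masked[i] = mask_id
    return masked, labels


if __name__ == "__main__":
    rng = random.Random(0)
    total_steps = 1000
    tokens = list(range(1000, 1020))  # hypothetical token ids
    for step in (0, 500, 1000):
        rate = linear_masking_rate(step, total_steps)
        masked, _ = mask_tokens(tokens, rate, mask_id=103, rng=rng)
        n_masked = sum(1 for t in masked if t == 103)
        print(f"step={step} rate={rate:.3f} masked={n_masked}/{len(tokens)}")
```

In a real training loop, the rate would be recomputed at every step (or every batch) and passed to the data collator, so early batches see heavier masking than later ones.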