Databricks

Dynamic Masking Rate Scheduling for MLM Pretraining

Time to read

30 mins

Publication

09/18/23

Language

English

Summary

This research article proposes scheduling the masking rate dynamically during Masked Language Modeling (MLM) pretraining instead of holding it fixed. MLM pretraining traditionally uses a fixed masking rate of 15%, but the authors argue that varying the rate over the course of training can improve model performance. They show that linearly decreasing the masking rate improves average GLUE accuracy for both BERT-base and BERT-large compared to fixed-rate baselines. Their experiments also show that masking rate scheduling increases training efficiency, yielding speedups in pretraining time. The findings suggest that a schedule lets the model benefit from more than one masking rate during training, which leads to better downstream model quality and more efficient pretraining. The work contributes to the understanding of hyperparameter scheduling in deep learning, specifically for language model training.
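As a rough illustration of what linearly decreasing the masking rate could look like in code, the following is a minimal Python sketch, not the authors' implementation. The function names (`linear_masking_rate`, `mask_tokens`) are hypothetical, and the start and end rates (30% and 15%) are placeholder values chosen for the example; the summary above does not state the rates used in the study.

```python
import random


def linear_masking_rate(step, total_steps, start_rate=0.30, end_rate=0.15):
    """Linearly interpolate the masking rate from start_rate to end_rate.

    start_rate and end_rate are illustrative assumptions, not values
    taken from the summary.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the schedule keep the final rate.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_rate + (end_rate - start_rate) * progress


def mask_tokens(token_ids, mask_id, rate, rng):
    """Mask each token independently with probability `rate`.

    Returns (masked_ids, labels). labels holds the original token at
    masked positions and None elsewhere, so a loss would only be
    computed on masked positions.
    """
    masked_ids, labels = [], []
    for tok in token_ids:
        if rng.random() < rate:
            masked_ids.append(mask_id)
            labels.append(tok)
        else:
            masked_ids.append(tok)
            labels.append(None)
    return masked_ids, labels


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = list(range(100, 120))  # stand-in token ids
    total_steps = 1000              # hypothetical pretraining length
    for step in (0, 500, 1000):
        rate = linear_masking_rate(step, total_steps)
        masked, _ = mask_tokens(tokens, mask_id=0, rate=rate, rng=rng)
        print(step, round(rate, 4), masked.count(0))
```

The sketch recomputes the rate at every step and passes it to the masking routine, which is the only change relative to a fixed-rate setup. It is simplified in one respect: standard BERT masking replaces only about 80% of the selected tokens with the mask token and leaves the rest random or unchanged, which is omitted here.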
