Which layer preserves the best cross-lingual representations in multilingual-BERT?

Small experiments on multilingual-BERT

Posted by Jexus on January 27, 2020



In NLP research, variation across languages is a non-negligible issue that does not arise in other deep learning fields (e.g., computer vision). There are over 7,000 languages in the world, yet most NLP datasets and corpora are in English. Cross-lingual transfer from English to other languages is therefore desirable, especially for languages with little training data.

Before BERT appeared, most research on cross-lingual transfer focused on aligning independently trained monolingual word embeddings of different languages with supervised or adversarial methods (e.g. MUSE). However, after Google released the multilingual version of BERT (Multilingual-BERT), people were surprised to find cross-lingual transferability in BERT for many languages without any supervision or adversarial training (How multilingual is Multilingual BERT?, Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model).

I came across this interesting phenomenon last summer and did some t-SNE visualizations of BERT's contextualized hidden representations (from the 8th layer) using the WMT-17 paired en-zh corpus.

If you understand Chinese, it is striking to see how well the Chinese words match their English counterparts.
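Here is a minimal sketch of how such a plot can be reproduced with the Hugging Face transformers library (assuming a recent version of the library; the sentence pairs and plotting details below are illustrative placeholders, not my exact setup):

```python
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)
model.eval()

# Toy stand-in for the WMT-17 paired en-zh sentences used in this post.
en_sents = ["The cat is sleeping on the sofa.", "I like to read books at night."]
zh_sents = ["猫正在沙发上睡觉。", "我喜欢在晚上看书。"]

LAYER = 8  # the layer that looked best aligned in the visualization


def token_vectors(sentences, layer):
    """Collect the per-token hidden vectors of one layer for a list of sentences."""
    vecs = []
    with torch.no_grad():
        for s in sentences:
            inputs = tokenizer(s, return_tensors="pt")
            # hidden_states[0] is the embedding output, so index `layer`
            # is the output of the `layer`-th Transformer block.
            hidden = model(**inputs).hidden_states[layer][0]  # (seq_len, 768)
            tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
            for tok, vec in zip(tokens, hidden):
                if tok not in ("[CLS]", "[SEP]"):
                    vecs.append(vec.numpy())
    return vecs


en_vecs = token_vectors(en_sents, LAYER)
zh_vecs = token_vectors(zh_sents, LAYER)
points = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(
    np.array(en_vecs + zh_vecs))

# English tokens in blue, Chinese tokens in red.
colors = ["tab:blue"] * len(en_vecs) + ["tab:red"] * len(zh_vecs)
plt.scatter(points[:, 0], points[:, 1], c=colors)
plt.title(f"t-SNE of multilingual-BERT layer {LAYER} token representations")
plt.show()
```

Looping over every entry of `hidden_states` gives one plot per layer, which is how the per-layer comparison in the GIF below can be generated.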

I also visualized every layer of BERT; the best-aligned representations seem to come from layers 7~8. In contrast, the first few layers (1~4) and the last layers (11~12) are not aligned as nicely.

Features from different layers

There are 12 pictures in this GIF, one from each of the 12 BERT layers.

To examine the alignment quality of BERT at different layers, I ran experiments on XNLI, a dataset for testing the cross-lingual transferability of NLI models. For a fair comparison, I did not extract features directly from each layer and feed them into an XNLI classifier, because different layers play different roles in natural language understanding, and lower-level features may not achieve good results on XNLI even if they are aligned correctly.

Instead, I fixed the first N layers of the BERT model and trained the layers after N on the XNLI dataset (so the total model size that the data propagates through stays the same).
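A minimal sketch of this freezing setup is below (assuming the Hugging Face BertForSequenceClassification implementation; N_FIXED, the optimizer settings, and the omitted training loop are illustrative assumptions, not my exact training code):

```python
import torch
from transformers import BertForSequenceClassification

N_FIXED = 8  # freeze the embedding layer and the first 8 Transformer layers

# XNLI is a 3-way NLI task (entailment / neutral / contradiction).
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3)

# Freeze the embeddings and the first N_FIXED encoder layers.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:N_FIXED]:
    for param in layer.parameters():
        param.requires_grad = False

# Only the remaining (unfrozen) parameters are updated during fine-tuning.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)

# ... standard fine-tuning loop on the English NLI training data,
# followed by zero-shot evaluation on the other XNLI languages.
```

Setting the cutoff to -1 (fine-tune everything) or 0 (freeze only the embeddings) gives the baselines that appear in the results figure below.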

Experimental results on XNLI

The results show that performance on XNLI correlates directly with the visualized alignment quality of BERT: the representation from layer 8 is the best for cross-lingual transfer. On the other hand, performance on the English dataset is not significantly affected by the number of layers we fix.

Visualization of the XNLI results: -1 means fine-tuning all layers, 0 means fine-tuning everything except the embedding layer, and 1~11 means the first N layers are fixed.