Pay Less Attention with Lightweight and Dynamic Convolutions

ICLR 2019 (Oral) paper - slides

Posted by Jexus on May 6, 2019


Paper Link

TL;DR

The paper replaces all of the self-attention in the Transformer architecture with convolution layers and finds that performance does not actually suffer; with a few small tricks, the convolutional model even beats the self-attention-based Transformer.

The point of this result is not that convolution layers are particularly strong, but rather that the Transformer's good performance may come largely from the parts of its architecture other than attention (e.g., the FFN blocks between the self-attention modules), which may be the components that really matter.

That said, this paper still uses attention for the connection between the encoder and the decoder; it does not replace everything. So it is somewhat unfair to cite this paper as evidence that attention is overrated; at most it suggests that self-attention is overrated. In any case, the encoder-decoder connection is not easy to replace with a convolution layer.
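
To make the TL;DR concrete, below is a minimal PyTorch sketch of the two convolution variants the paper proposes as self-attention replacements: LightConv, a depthwise convolution whose kernel is softmax-normalized along its temporal positions and shared across heads, and DynamicConv, where the kernel is predicted from the current timestep's input. This is a simplification for illustration only (the class and parameter names are mine, not from fairseq); the paper's actual modules also include input/output projections, GLU gating, and dropout on the normalized weights.

```python
# Minimal sketch of LightConv / DynamicConv as drop-in replacements for
# self-attention over a sequence. Assumes d_model is divisible by num_heads.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LightweightConv(nn.Module):
    """Depthwise convolution with a softmax-normalized kernel,
    shared across groups of channels (heads)."""

    def __init__(self, d_model, kernel_size=7, num_heads=8):
        super().__init__()
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        # One kernel per head, shared by d_model // num_heads channels.
        self.weight = nn.Parameter(torch.randn(num_heads, kernel_size))

    def forward(self, x):
        # x: (batch, time, d_model)
        B, T, C = x.shape
        H, K = self.num_heads, self.kernel_size
        # Normalize the kernel over its temporal positions (softmax).
        w = F.softmax(self.weight, dim=-1)                   # (H, K)
        w = w.repeat_interleave(C // H, dim=0).unsqueeze(1)  # (C, 1, K)
        x = x.transpose(1, 2)                                # (B, C, T)
        # Causal left-padding so position t only sees positions <= t.
        x = F.pad(x, (K - 1, 0))
        out = F.conv1d(x, w, groups=C)                       # depthwise conv
        return out.transpose(1, 2)                           # (B, T, C)


class DynamicConv(nn.Module):
    """Like LightweightConv, but the kernel is predicted from the
    current timestep's input instead of being a fixed parameter."""

    def __init__(self, d_model, kernel_size=7, num_heads=8):
        super().__init__()
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        self.kernel_proj = nn.Linear(d_model, num_heads * kernel_size)

    def forward(self, x):
        # x: (batch, time, d_model)
        B, T, C = x.shape
        H, K = self.num_heads, self.kernel_size
        # Predict a per-position, per-head kernel and normalize it.
        w = F.softmax(self.kernel_proj(x).view(B, T, H, K), dim=-1)
        # Gather the K previous timesteps for every position (causal window).
        x_pad = F.pad(x, (0, 0, K - 1, 0))                   # pad along time
        windows = x_pad.unfold(1, K, 1)                      # (B, T, C, K)
        windows = windows.reshape(B, T, H, C // H, K)
        # Weighted sum over the window; the kernel is shared within each head.
        out = torch.einsum('bthck,bthk->bthc', windows, w)
        return out.reshape(B, T, C)
```

In the paper, such a module sits where the self-attention sublayer would be in each Transformer block, and the kernel width grows with depth (e.g., 3, 7, 15, 31) so deeper layers see wider context; since the kernel only covers a fixed window, that is how the model recovers long-range context without attention.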

https://openreview.net/forum?id=SkVhlh09tX

This work was done by Felix Wu (an NTU CSIE alumnus) during an internship at Facebook AI Research; the code has been released in fairseq.

There is also a Talk Video on ICLR 2019; the relevant part starts around the 55-minute mark.

Slides:

Please allow a moment for the embedded frame to load. It is easier to read on a computer screen.