pointwiseFFN

Why does the Transformer need a pointwiseFFN layer?

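When I first read “Attention is all you need”, this question kept nagging at me. I assumed that for contextual representations, “Attention is all I need”: self-attention lets the model pull in information from the surrounding text directly, and training teaches each token how much attention to spend on which other tokens. So what exactly is the pointwiseFFN supposed to learn? After searching online I found that other people have the same question, so here I collect the explanations I have found so far, along with some of my own thoughts.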
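
For reference, here is a minimal sketch of what the layer computes, following the formula in the paper, FFN(x) = max(0, xW1 + b1)W2 + b2. The toy sizes, the random weights, and the helper names (`matvec`, `ffn`, `pointwise_ffn`) are my own assumptions for illustration. The point is that the same small network is applied to each position separately, so no information moves between tokens inside this layer.

```python
import random

def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn(x, W1, b1, W2, b2):
    """Two-layer network with a ReLU in between, for one token vector."""
    hidden = [max(0.0, h + b) for h, b in zip(matvec(W1, x), b1)]
    return [o + b for o, b in zip(matvec(W2, hidden), b2)]

def pointwise_ffn(tokens, W1, b1, W2, b2):
    """Apply the same FFN to every position independently."""
    return [ffn(x, W1, b1, W2, b2) for x in tokens]

random.seed(0)
d_model, d_ff = 4, 8  # toy sizes; the paper uses 512 and 2048
W1 = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
b1 = [0.0] * d_ff
W2 = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
b2 = [0.0] * d_model

# Three made-up token embeddings
tokens = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(3)]
out = pointwise_ffn(tokens, W1, b1, W2, b2)

# Replace the middle token: the outputs at positions 0 and 2 do not change,
# because the layer never looks at neighbouring positions.
changed = [tokens[0], [0.0] * d_model, tokens[2]]
out_changed = pointwise_ffn(changed, W1, b1, W2, b2)
print(out[0] == out_changed[0], out[2] == out_changed[2])
```

So whatever this layer learns, it has to be something that can be done to one token's vector at a time.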

Language Understanding: Parsing/Composition

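A word's meaning is given by its context. That was a key idea when I was learning word2vec, and I kept using it to understand the transformer. Later I realized that the notion of context should be split further into two elements, parsing and composition, and that the transformer architecture is easier to understand through these two.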

composition (surrounding context)

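As the name suggests, composition means looking at two or more words together. The same word combined with different words can mean different things: in (“river”, “bank”) and (“bank”, “account”), the two occurrences of bank have different meanings.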

parsing (sentence structure)

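When I took Stanford CS224N there was actually a whole lecture on this concept, but I was so focused on getting to BERT that I did not take it to heart. What parsing does is break a sentence down into its hierarchical structure. A classic example: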

Bart watched a squirrel with binoculars

The sentence can be parsed into a structure in which with binoculars modifies watched a squirrel; the two sit under the same level of the tree.

But it can also be drawn as a different structure, in which with binoculars sits at the same level as a squirrel, so that with binoculars now modifies a squirrel.

This second parse is grammatically valid, but it does not make sense semantically.
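
The two readings can be written down as nested tuples. This is only a sketch: the bracketing is simplified to binary trees, and the names `verb_attach`, `noun_attach`, `leaves` and `sibling_of` are made up for this example. Both trees contain the same words in the same order; the only difference is which phrase ends up as the sibling of with binoculars.

```python
# Reading 1: "with binoculars" attaches to the verb phrase "watched a squirrel"
verb_attach = ("Bart", (("watched", ("a", "squirrel")), ("with", "binoculars")))
# Reading 2: "with binoculars" attaches to the noun phrase "a squirrel"
noun_attach = ("Bart", ("watched", (("a", "squirrel"), ("with", "binoculars"))))

def leaves(tree):
    """Flatten a nested-tuple tree back into its list of words."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree for word in leaves(child)]

def sibling_of(tree, target):
    """Return the sibling of the subtree `target` in a binary tree, or None."""
    if isinstance(tree, str):
        return None
    left, right = tree
    if left == target:
        return right
    if right == target:
        return left
    return sibling_of(left, target) or sibling_of(right, target)

pp = ("with", "binoculars")
print(" ".join(leaves(verb_attach)))           # same surface sentence
print(" ".join(leaves(noun_attach)))           # same surface sentence
print(leaves(sibling_of(verb_attach, pp)))     # ['watched', 'a', 'squirrel']
print(leaves(sibling_of(noun_attach, pp)))     # ['a', 'squirrel']
```
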
There are many more examples like this, for instance:

Scientists count whales from space

Here from space is ambiguous in the same way, because you cannot tell whether it modifies count whales or the whales themselves.

Getting the parse right therefore requires getting the composition right, that is, telling the parser that putting (“a squirrel”, “with binoculars”) together does not make sense.

Correct analysis

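Fully analyzing a sentence requires both composition and parsing, and the two depend on each other: good parsing needs composition, and good composition needs parsing. Composition combines words according to the parse, for example (“a”, “squirrel”) and then (“watched”, “a squirrel”). A correct composition in turn helps parsing. Compare (“watched a squirrel”, “with binoculars”) vs. (“a squirrel”, “with binoculars”): if the model gives the first pair, the one containing watched, a meaningful vector and gives the second pair, the one with only squirrel, a very small vector, that helps the parser decide.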

parsing/composition in BERT

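Once you understand composition and parsing, it is not hard to see that the attention mechanism acts like parsing: it copes with long sentences by attending to other tokens at the same level of the hierarchy. The pointwiseFFN then plays the role of composition, taking the embeddings that attention has summed together and interpreting them in an autoencoding fashion.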
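
To make "the embeddings that attention has summed together" concrete, here is a rough sketch of the two steps in order. It is heavily simplified and everything in it is assumed for illustration: a single head, identity projections (Q = K = V = X), three made-up 2-dimensional embeddings, and hand-picked FFN weights. A real BERT layer also has learned projections, multiple heads, residual connections and layer normalization.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head attention with Q = K = V = X (a simplification)."""
    d = len(X[0])
    outputs, weights = [], []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted sum of all token vectors
        outputs.append([sum(wj * v[i] for wj, v in zip(w, X)) for i in range(d)])
    return outputs, weights

def ffn(x):
    """Per-token FFN with hypothetical fixed weights: W2 · ReLU(W1 · x)."""
    W1 = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
    W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy token embeddings
mixed, weights = self_attention(X)          # step 1: decide who to look at, then sum
composed = [ffn(m) for m in mixed]          # step 2: process each summed vector alone

for w, m, c in zip(weights, mixed, composed):
    print([round(v, 3) for v in w], [round(v, 3) for v in m], [round(v, 3) for v in c])
```

In this picture the attention weights decide which tokens get grouped, and the FFN is the only place in the layer where the grouped vector is transformed nonlinearly.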

Nonlinear combination

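Suppose the self-attention layer really does attend with binoculars to a squirrel. Simply taking a linear combination of the token embeddings may then not be enough. After all, watched and a squirrel, as well as with and binoculars, are not at the same level of the hierarchy, so a nonlinear combination may be needed.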
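
A small numeric sketch of why the nonlinearity matters. The weights `W1` and `W2` below are hand-picked for illustration, not taken from any trained model. Without an activation, two stacked linear layers collapse into a single matrix, so stacking adds nothing. With a ReLU in between, the same weights compute |x0 - x1|, which no single linear layer can represent.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(v):
    return [max(0.0, x) for x in v]

# Hypothetical weights chosen by hand
W1 = [[1.0, -1.0], [-1.0, 1.0]]
W2 = [[1.0, 1.0]]

x = [2.0, 1.0]

# Two linear layers in a row equal one linear layer with weights W2 · W1
stacked = matvec(W2, matvec(W1, x))
collapsed = matvec(matmul(W2, W1), x)
print(stacked, collapsed)            # [0.0] [0.0]

# With a ReLU in between, the network computes |x0 - x1|
nonlinear = matvec(W2, relu(matvec(W1, x)))
print(nonlinear)                     # [1.0]

# A linear map is additive, f(a + b) = f(a) + f(b); the ReLU version is not
f = lambda v: matvec(W2, relu(matvec(W1, v)))[0]
a, b = [1.0, 0.0], [0.0, 1.0]
print(f(a) + f(b), f([1.0, 1.0]))    # 2.0 0.0
```
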

Reference

A comment in the Reddit thread mentioned above shares a very informative article, Understanding BERT Transformer: Attention isn’t all you need.