[Paper Review: Survey] Multimodal Learning With Transformers: A Survey

Published: 2023, Citations: 138 (as of 2023-11-11)

Multimodal Learning with Transformers: A Survey

Transformer is a promising neural network learner, and has achieved great success in various machine learning tasks. Thanks to the recent prevalence of multimodal applications and big data, Transformer-based multimodal learning has become a hot topic in AI

arxiv.org

❏ Contents (not the paper's table of contents; only the parts I found important and needed are summarized):

Features and Contributions of the Paper
Transformers: a Brief History and Milestones
Vanilla Transformer, Vision Transformer, and Multimodal Transformer Explained
~~Challenges and Designs (TBD)~~

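❏ 1. Features and Contributions of the Paper

Features
1. Transformers have the advantage of working in a modality-agnostic way. (modality-agnostic: the same architecture can be applied to different modalities)
2. The paper gives a mathematical description of the main Transformer components used in multimodal settings.
3. It explains Transformer-based multimodal learning (MML) from the perspective of self-attention.
Contributions
1. Reviews the Vanilla Transformer, Vision Transformer, and multimodal Transformers.
2. Classifies Transformer-based MML from two perspectives (application and challenge).
3. Discusses bottlenecks, open problems, and future research directions.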

❏ 2. Transformers: a Brief History and Milestones

Vanilla Transformer: uses the self-attention mechanism; it first dominated NLP and is now used across many fields.
Vision Transformer: contributed an end-to-end solution for the image domain;
follow-ups: low-level tasks, recognition, detection, etc.
VideoBERT: the first work to apply the Transformer to multimodal tasks;
follow-ups: ViLBERT, LXMERT, Unicoder-VL, B2T2, etc.
CLIP: uses multimodal pretraining to perform zero-shot recognition;
follow-ups: ALIGN, CLIP-TD, ALBEF, etc.

❏ 3. Vanilla Transformer, Vision Transformer, and Multimodal Transformer Explained

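Vanilla Transformer (Fig 1)
- Characteristics: the input is a sequence of tokens; the encoder and decoder are each built by stacking Transformer layers/blocks;
each block consists of two sub-layers, Multi-Head Self-Attention (MHSA) and a feed-forward network (FFN);
to help gradients back-propagate, both MHSA and FFN are wrapped in a residual connection (eq1)

- Discussion: there is an open question of post-normalization versus pre-normalization
(Explanation: the original Vanilla Transformer uses post-normalization, but from a mathematical point of view (by analogy with the Gram-Schmidt process, where normalization comes before projection) pre-normalization seems more reasonable. More research is needed.)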
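
To make the post- versus pre-normalization distinction concrete, here is a minimal NumPy sketch of the two residual forms. It is only an illustration: `layer_norm` omits the learnable scale and shift, and `sublayer` is a hypothetical stand-in (a single matrix multiply) for MHSA or FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector (last axis) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm(x, sublayer):
    # Post-normalization: normalize after the residual sum.
    return layer_norm(sublayer(x) + x)

def pre_norm(x, sublayer):
    # Pre-normalization: normalize the sub-layer input, keep the skip path untouched.
    return sublayer(layer_norm(x)) + x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # 4 tokens, 8-dim embeddings (toy sizes)
w = rng.normal(size=(8, 8)) * 0.1      # hypothetical sub-layer weights
sublayer = lambda h: h @ w             # stand-in for MHSA or FFN

z_post = post_norm(x, sublayer)
z_pre = pre_norm(x, sublayer)
print(z_post.shape, z_pre.shape)
```

In the pre-normalization form the skip path carries `x` through unchanged, which is one common argument for why gradients flow more easily through deep stacks.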
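1. Input Tokenization
  - Tokenization
    - Definition: the Vanilla Transformer was originally proposed as a seq2seq model for machine translation, so it takes a sequence of vocabulary tokens as input.
    - Note: in the Vanilla Transformer and its variants, each token can be viewed as a node of a graph.
  - Special/Customized Tokens
    - Definition: semantically, these act as placeholders (positions that can be filled with a specific value or data) in the token sequence.
      (e.g., the mask token)
  - Position Embedding
    - Definition: added to the token embeddings so that they carry positional information.
    - Types: sine and cosine functions (used in the Vanilla Transformer); many other variants exist.
  - Discussion
    - Advantages: 1. tokenization is a general approach; 2. it is a flexible way to organize the input information (concatenation/stack, weighted summation, etc.); 3. it is compatible with task-specific customized tokens (e.g., [MASK], [CLASS]).
    - Open questions about position embedding: 1. for point clouds and sketch drawing strokes, the token elements are already coordinates, so position embedding is optional rather than necessary; 2. from a mathematical point of view, position embedding is one kind of additional information, and any other additional information can be added in the same way; 3. in most cases, however, Transformers do need position embedding, because self-attention without it is invariant to token order.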
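
The tokenization and position-embedding steps above can be sketched as follows. This is a toy example under stated assumptions: the vocabulary, the random embedding table, and the embedding size of 8 are all made up for illustration, and the sine/cosine formula is the one from the original Transformer paper.

```python
import numpy as np

def sinusoidal_position_embedding(num_positions, dim):
    # PE[pos, 2i] = sin(pos / 10000^(2i/dim)), PE[pos, 2i+1] = cos(same angle).
    # dim is assumed to be even.
    positions = np.arange(num_positions)[:, None]
    even_idx = np.arange(0, dim, 2)[None, :]
    angles = positions / np.power(10000.0, even_idx / dim)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Hypothetical toy vocabulary that includes two special tokens.
vocab = {"[CLS]": 0, "[MASK]": 1, "a": 2, "cat": 3, "sat": 4}
tokens = ["[CLS]", "a", "[MASK]", "sat"]      # [MASK] is a placeholder token
ids = np.array([vocab[t] for t in tokens])

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))   # learnable in a real model
x = embedding_table[ids]                             # token embeddings, shape (4, 8)
pe = sinusoidal_position_embedding(len(tokens), 8)
z = x + pe                                           # summation variant: z <- x (+) position embedding
print(z.shape)
```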
2. Self-Attention (eq2,3) and Multi-Head Self-Attention
  - Self-Attention (SA)
    - The core component of the Vanilla Transformer is Self-Attention (SA), also called Scaled Dot-Product Attention.
      (As a preprocessing step, the position embedding is combined with the input, either by summation, z ← x ⊕ position embedding, or by concatenation, z ← x concat position embedding.)
    - Definition: given an input sequence, self-attention lets each element attend to all the other elements.
    - Consequently, the Vanilla Transformer can be seen as a fully-connected GNN encoder, and the Transformer family has a non-local, global perception ability.
  - Masked Self-Attention (MSA, eq4)
    - Definition: used so that the Transformer decoder learns contextual dependence while being prevented from attending to subsequent positions.
      In other words, attention is applied only up to the current position.
    - In uni-modal and multimodal practice, specific masks are designed based on domain knowledge and prior knowledge. In essence, MSA is a way to inject additional knowledge into a Transformer model.
  - Multi-Head Self-Attention (MHSA, eq5)
    - Several self-attention sub-layers can be stacked in parallel; their outputs are concatenated and then projected by a matrix W to form Multi-Head Self-Attention.
    - The idea of MHSA comes from ensembling: it lets the model jointly attend to information from multiple representation sub-spaces.
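
As a rough illustration of SA, MSA, and MHSA together, the sketch below implements scaled dot-product attention in NumPy. The sizes, the random weights, and the causal mask are assumptions for the example; blocked positions are set to `-inf` before the softmax, a common convention whose notation may differ from the paper's eq4.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(z, wq, wk, wv, mask=None):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_q)) V
    q, k, v = z @ wq, z @ wk, z @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)   # blocked pairs get zero weight
    weights = softmax(scores)
    return weights @ v, weights

def multi_head_self_attention(z, heads, w_out, mask=None):
    # Run the heads in parallel, concatenate their outputs, then project with w_out.
    outputs = [self_attention(z, wq, wk, wv, mask)[0] for wq, wk, wv in heads]
    return np.concatenate(outputs, axis=-1) @ w_out

rng = np.random.default_rng(0)
n, d, d_head, num_heads = 5, 8, 4, 2               # toy sizes
z = rng.normal(size=(n, d))
heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3))
         for _ in range(num_heads)]
w_out = rng.normal(size=(num_heads * d_head, d))

causal = np.tril(np.ones((n, n), dtype=bool))      # position i sees positions <= i
_, weights = self_attention(z, *heads[0], mask=causal)
out = multi_head_self_attention(z, heads, w_out, mask=causal)
print(out.shape)
```

With `mask=None` every token attends to every other token, which is the fully-connected-graph view mentioned above; the causal mask is just one example of knowledge injected through masking.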
3. Feed-Forward Network (FFN, eq6)
  - The output of the attention sub-layer passes through an FFN made of linear layers with a non-linear activation.
  - Some of the literature refers to the FFN as a Multi-Layer Perceptron (MLP).
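
A minimal sketch of such an FFN, assuming two linear layers with a ReLU in between (the activation choice, the hidden size of 32, and the random weights are illustrative assumptions):

```python
import numpy as np

def feed_forward(z, w1, b1, w2, b2):
    # Two linear layers with a non-linearity in between, applied to each token independently.
    hidden = np.maximum(z @ w1 + b1, 0.0)    # ReLU activation (assumed)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
d, d_hidden = 8, 32                           # toy sizes
z = rng.normal(size=(5, d))                   # 5 tokens
w1, b1 = rng.normal(size=(d, d_hidden)), np.zeros(d_hidden)
w2, b2 = rng.normal(size=(d_hidden, d)), np.zeros(d)

out = feed_forward(z, w1, b1, w2, b2)
print(out.shape)
```

Because the same weights are applied at every position, feeding a single token through the FFN gives the same result as the corresponding row of the full output.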

Vision Transformer (ViT, Fig 2)
- Definition: the Vision Transformer is a pipeline that takes an image as input; the image is split into patches of a fixed size before being fed in.
(e.g., 16 × 16, 32 × 32)
- Each patch goes through a linear embedding layer and position embeddings are added; the whole patch sequence is then encoded by a Transformer encoder.
- An image of size H (height) × W (width) × C (channels) has to be reshaped into N × (P² · C),
where P × P is the patch resolution and N = HW / P².
- In addition, a learnable [CLASS] token is required for classification. (eq7)
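
The reshape from H × W × C to N × (P² · C) can be checked with a short NumPy sketch. The 224 × 224 × 3 image, the embedding size of 32, and the random weights are assumptions chosen for illustration; in a real ViT the projection, [CLASS] token, and position embeddings are learned.

```python
import numpy as np

def patchify(image, patch_size):
    # (H, W, C) -> (N, P*P*C) with N = H*W / P^2, patches in row-major order.
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    patches = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image, 16)                 # N = 224*224 / 16^2 = 196, P^2*C = 768

d_model = 32                                  # small embedding size for the example
w_embed = rng.normal(size=(patches.shape[1], d_model)) * 0.02   # linear patch embedding
cls_token = np.zeros((1, d_model))            # stands in for the learnable [CLASS] token
pos_embed = rng.normal(size=(patches.shape[0] + 1, d_model)) * 0.02

# Prepend [CLASS], then add position embeddings to the whole sequence.
tokens = np.concatenate([cls_token, patches @ w_embed], axis=0) + pos_embed
print(patches.shape, tokens.shape)
```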

Multimodal Transformers
1. Multimodal Input
  - The Transformer family is an architecture that can be formulated as a type of general graph neural network.
  - In particular, self-attention treats every input as a fully-connected graph by attending to the global (non-local) pattern.
  - This property lets Transformers work with, and stay compatible with, a wide range of modalities.
  - Tokenization and Embedding Processing
    - Given an input from an arbitrary modality, two steps are performed:
      (1) tokenize the input (a very important step that can be done in many different ways);
      (2) choose an embedding space to represent the tokens (since they have to be fed into the Transformer).
  - Discussion (see the summary in Table 1)
    - RGB: a neat grid graph in the pixel space
    - Video, Audio: clip/segment-based graphs over a complex space that involves temporal and semantic patterns
    - 2D, 3D: sparse graph
  - Token Embedding Fusion
    - Multiple embeddings can be combined at each token position (a kind of early fusion).
    - Token embedding fusion plays a significant role in multimodal Transformers.
    - The most common fusion is token-wise summing, where several embeddings are summed token by token.
      (e.g., a specific token embedding ⊕ position embedding)
      (ex.1, token-wise weighted summing of RGB and grey-scale images.
      ex.2, VL-BERT (injects global visual context into the linguistic domain) → linguistic token embedding ⊕ full image visual feature embedding
      ex.3, InterBERT (adds location information for ROIs) → ROI embedding ⊕ location embedding
      ex.4, ImageBERT (five kinds of embeddings are fused) → image embedding ⊕ position embedding ⊕ linguistic embedding ⊕ segment embedding ⊕ sequence position embedding)
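
Token-wise summing can be sketched in a few lines. Everything concrete here is hypothetical: the three lookup tables, their sizes, and the id sequences are invented to show the mechanics, and do not reproduce the embedding layout of VL-BERT, InterBERT, or ImageBERT.

```python
import numpy as np

def token_wise_sum(*embeddings):
    # Sum several embeddings position by position; all must share (num_tokens, dim).
    shapes = {e.shape for e in embeddings}
    if len(shapes) != 1:
        raise ValueError("all embeddings must share the same (num_tokens, dim) shape")
    return np.stack(embeddings).sum(axis=0)

rng = np.random.default_rng(0)
dim = 8
token_table = rng.normal(size=(10, dim))      # hypothetical token embedding table
segment_table = rng.normal(size=(2, dim))     # 0 = text segment, 1 = image-region segment
position_table = rng.normal(size=(6, dim))    # learned position embeddings (assumed)

token_ids = np.array([3, 1, 4, 1, 5])
segment_ids = np.array([0, 0, 0, 1, 1])
position_ids = np.arange(5)

# token embedding (+) segment embedding (+) position embedding, per token position
fused = token_wise_sum(token_table[token_ids],
                       segment_table[segment_ids],
                       position_table[position_ids])
print(fused.shape)
```

The fused sequence has the same length and width as each input, which is why this kind of fusion adds no cost to the Transformer layers that follow.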
2. Self-Attention Variants in Multimodal Context (Table 2, Fig2)
  - In multimodal Transformers, cross-modal interactions (e.g., fusion, alignment) are handled by self-attention and its variants. The paper looks at six of them:
    (1) early summation (token-wise, weighted),
    (2) early concatenation,
    (3) hierarchical attention (multi-stream to one-stream),
    (4) hierarchical attention (one-stream to multi-stream),
    (5) cross-attention, and
    (6) cross-attention to concatenation.
    (Reviewer's note: token-wise = per token; weighted = a weight is applied to each token embedding.)
  - (1) Early Summation (eq8)
    - A simple and effective multimodal interaction.
    - Token embeddings from the different modalities are weighted and summed at each token position, then processed by Transformer layers.
    - Pro: does not increase computational complexity.
    - Con: the weights have to be set manually.
  - (2) Early Concatenation (see the formula in Table 2)
    - Definition: the token embedding sequences of the modalities are concatenated and used as the Transformer input.
    - All multimodal token positions can be attended to as one whole sequence, so each modality's positions are encoded well by conditioning on the context of the other modalities.
    - Con: the longer the sequence after concatenation, the higher the computational complexity.
    - Also called all-attention or CoTransformer.
  - (3) Hierarchical Attention (multi-stream to one-stream) (see the formula in Table 2)
    - Transformer layers are combined hierarchically to handle cross-modal interactions.
    - The multimodal inputs are first encoded by independent Transformers, and their outputs are concatenated and passed to another Transformer.
  - (4) Hierarchical Attention (one-stream to multi-stream) (see the formula in Table 2)
    - The concatenated multimodal input is encoded by a single Transformer, whose output is then fed into two separate Transformers.
    - This perceives cross-modal interactions while preserving independent uni-modal representations.
  - (5) Cross-Attention (see the formula in Table 2)
    - Definition: in a two-stream Transformer, the Query embeddings are exchanged/swapped in a cross-stream manner.
      (This is how cross-modal interactions are perceived.)
    - First proposed in ViLBERT.
    - Pro: each modality is attended to conditioned on the other, without raising the computational complexity.
    - Con: cross-modal attention is not performed globally, so the whole context is lost; two-stream cross-attention can learn cross-modal interaction, but there is no self-attention inside each modality.
  - (6) Cross-Attention to Concatenation (see the formula in Table 2)
    - The two cross-attention streams are concatenated and processed by another Transformer, which models the global context.
    - This kind of hierarchical cross-modal interaction alleviates the drawback of cross-attention.
  - Discussion
    - These interactions can be flexibly combined and nested.
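
To compare the variants side by side, the sketch below implements early summation, early concatenation, cross-attention, and cross-attention to concatenation with a single attention head standing in for a whole Transformer layer. This is a simplification under stated assumptions: the sizes and random weights are made up, and `cross_attention` reads "swapped queries" literally (queries from the other stream, keys and values from the own stream); Table 2 of the paper should be consulted for the exact formulas.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_in, kv_in, wq, wk, wv):
    # Queries come from q_in; keys and values come from kv_in.
    q, k, v = q_in @ wq, kv_in @ wk, kv_in @ wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def early_summation(za, zb, alpha, beta, params):
    # Weighted token-wise sum, then ordinary self-attention (sequence length unchanged).
    z = alpha * za + beta * zb
    return attention(z, z, *params)

def early_concatenation(za, zb, params):
    # Concatenate along the sequence axis, then self-attention over the longer sequence.
    z = np.concatenate([za, zb], axis=0)
    return attention(z, z, *params)

def cross_attention(za, zb, params_a, params_b):
    # Two streams with exchanged queries (assumed reading of "swapped").
    out_a = attention(zb, za, *params_a)   # Q from B, K/V from A
    out_b = attention(za, zb, *params_b)   # Q from A, K/V from B
    return out_a, out_b

def cross_attention_to_concatenation(za, zb, params_a, params_b, params_joint):
    # Concatenate the two cross-attention outputs and model global context jointly.
    out_a, out_b = cross_attention(za, zb, params_a, params_b)
    z = np.concatenate([out_a, out_b], axis=0)
    return attention(z, z, *params_joint)

def make_params(rng, d):
    return tuple(rng.normal(size=(d, d)) for _ in range(3))

rng = np.random.default_rng(0)
d = 8
za = rng.normal(size=(4, d))      # modality A, 4 tokens
zb = rng.normal(size=(4, d))      # modality B, aligned with A (needed for summation)
zc = rng.normal(size=(6, d))      # modality B with a different length
p1, p2, p3 = (make_params(rng, d) for _ in range(3))

summed = early_summation(za, zb, 0.5, 0.5, p1)
concat = early_concatenation(za, zc, p1)
cross_a, cross_b = cross_attention(za, zc, p1, p2)
joint = cross_attention_to_concatenation(za, zc, p1, p2, p3)
print(summed.shape, concat.shape, cross_a.shape, cross_b.shape, joint.shape)
```

The shapes make the trade-offs visible: summation keeps the sequence at 4 tokens, while concatenation attends over 4 + 6 = 10 tokens, so its attention matrix grows quadratically with the combined length.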
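3. Network Architectures
  - Essentially, the various multimodal Transformers work through internal multimodal attention, i.e., the self-attention variants described above.
  - As Fig2 shows, these attentions determine the external network structure of the multimodal Transformer in which they are embedded.
  - The structures fall into three groups: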
  (1) early summation and early concatenation work in single-stream,
  (2) cross-attention work in multistreams,
  (3) hierarchical attention and cross-attention to concatenation work in hybrid-streams.

Tokenization and token embedding process for each modality. (Source: https://arxiv.org/abs/2206.06488)

Self-Attention variants for multi-modal interaction/fusion. Table 2.

Transformer-based cross-modal interactions. Fig2.

❏ 4. Challenges and Designs

~~Fusion~~
- In general, MML Transformers fuse information at three levels:
  input (i.e., early fusion), intermediate representation (i.e., middle fusion), and prediction (i.e., late fusion)
- ~~Early-fusion MML Transformers are generally known as one-stream architectures, and they adopt the merits of BERT with minimal architectural modification.~~
- ~~Late fusion is used less in MML Transformers (a direction worth exploring).~~
~~Alignment~~
- ~~Cross-modal alignment is key to real-world multimodal applications.~~
- ~~A representative approach is to map two modalities into a common representation space via contrastive learning over paired samples.~~
- ~~Such alignment models are capable of zero-shot transfer through prompt engineering.~~
