I-BERT Paper Review

by estela19 2022. 3. 19.

Terminology

Quantization: the process of converting real-valued variables into integer variables. It is a model compression technique that assumes the weights and activations lie within a bounded range.

Performance

On GLUE downstream tasks with RoBERTa-Base/Large, I-BERT reached accuracy similar to the full-precision baseline, and inference was 2.4 to 4 times faster.

Background

Models need to be compressed because of limited compute resources and to support real-time inference and edge devices.

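Method

Earlier quantization methods are "fake" (simulated) quantization that quantizes only part of the computation → they cannot be used on devices without floating-point support.
They apply only to linear layers, so they work for CNNs, Batch Norm, and the like, but not for the transformer architecture (GELU, Softmax, LN).
I-BERT compresses the model by using 8-bit integers instead of 32-bit floating point.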

integer-only kernel (GELU, Softmax etc)

Embedding, MatMul : int8 multiplication, int32 accumulation

Non-linear (GELU, Softmax, LN): computed on the int32 accumulation, then requantized to int8
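To make the int8 multiplication / int32 accumulation / requantization pipeline concrete, here is a minimal pure-Python sketch. The helper names (`int8_dot`, `to_dyadic`, `requantize`), the example scales, and the 16-bit shift are assumptions chosen for illustration; they are not taken from the paper or from any library.

```python
def int8_dot(q_a, q_b):
    """Dot product of two int8 vectors; the running sum must fit in int32."""
    acc = 0
    for a, b in zip(q_a, q_b):
        assert -128 <= a <= 127 and -128 <= b <= 127
        acc += a * b
    assert -2**31 <= acc < 2**31
    return acc


def to_dyadic(real_scale, shift=16):
    """Approximate a real rescaling factor as multiplier / 2**shift.
    This is done once, offline, so no floats are needed at inference time."""
    return round(real_scale * (1 << shift)), shift


def requantize(acc, multiplier, shift):
    """Rescale an int32 accumulator to int8 with an integer multiply and a shift."""
    q = (acc * multiplier) >> shift
    return max(-128, min(127, q))


# Hypothetical scales: inputs use S_a and S_b, the output uses S_out.
S_a, S_b, S_out = 0.02, 0.01, 0.005
q_a, q_b = [10, -20, 30], [5, 6, -7]

acc = int8_dot(q_a, q_b)                      # real value is acc * S_a * S_b
multiplier, shift = to_dyadic(S_a * S_b / S_out)
q_out = requantize(acc, multiplier, shift)    # real value is roughly q_out * S_out
```

The shift floors the result, so `q_out * S_out` can differ from the real-valued product by up to one quantization step.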

Model compression research

pruning
knowledge distillation
efficient neural architecture design
hardware-aware NN co-design
quantization
- represents parameters in low-bit precision

Quantization

Only uniform, static quantization is considered (non-uniform quantization can capture the distribution better, but it is hard to deploy on general hardware).
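As a rough illustration of uniform quantization, the sketch below maps real values to int8 with a single scale factor so that x ≈ S·q. It is a simplification: the clipping range here is taken from the data being quantized, whereas a static scheme would fix that range ahead of time from calibration data. The function names are hypothetical.

```python
def quantize(xs, num_bits=8):
    """Symmetric uniform quantization: x ~= scale * q with q a signed integer."""
    q_max = 2 ** (num_bits - 1) - 1            # 127 for int8
    alpha = max(abs(x) for x in xs)            # clipping range is [-alpha, alpha]
    scale = alpha / q_max if alpha > 0 else 1.0
    qs = [max(-q_max, min(q_max, round(x / scale))) for x in xs]
    return qs, scale


def dequantize(qs, scale):
    """Map integers back to approximate real values."""
    return [q * scale for q in qs]


weights = [0.42, -1.27, 0.05, 0.9]
q_weights, s = quantize(weights)
restored = dequantize(q_weights, s)
```

Because the integer grid is evenly spaced, every restored value lies within half a step (`s / 2`) of the original.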

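Non-linear functions

A linear operation commutes with the scale factor $S$, but a non-linear function such as GELU does not:

$$\mathrm{GELU}(Sq) \neq S \cdot \mathrm{GELU}(q)$$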
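A quick numerical check of the inequality above, using only the standard library. The values of `S` and `q` are arbitrary examples.

```python
import math


def gelu(x):
    """Exact GELU using the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


S, q = 0.05, 40                        # quantized input, real value x = S * q = 2.0

# A linear operation (multiplying by 3) commutes with the scale factor.
linear_ok = math.isclose(3.0 * (S * q), S * (3.0 * q))

gelu_real = gelu(S * q)                # GELU of the real value
gelu_naive = S * gelu(q)               # GELU of the integer, rescaled afterwards
gap = abs(gelu_real - gelu_naive)      # non-zero: the scale cannot be factored out
```

This is why integer-only inference needs a dedicated approximation for each non-linear operation rather than simply running it on `q`.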

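Possible solutions

lookup table
- heavy overhead
- creates a bottleneck
dequantization
- not integer-only
- some hardware has no floating-point support
polynomial approximation of the non-linear activation function
- approximate it so that it is expressed with additions and multiplications only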

Polynomial Approximation of Non-linear Functions

To find the best polynomial approximation

Advantages
- The interpolating points $(x_i, f_i)$ can be chosen explicitly.
- The order of the polynomial can be adjusted.
Disadvantages
- High-order polynomials carry computation and memory overhead.
- Low-precision integers can overflow.
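The sketch below shows how a second-order polynomial $a(x+b)^2 + c$ can be evaluated on a quantized input using integer arithmetic only, following the general scheme the paper describes for its polynomial kernels. The function name `int_poly` and the example scale are assumptions; the constants `q_b`, `q_c`, and the output scale depend only on the coefficients and the input scale, so they can be computed ahead of time.

```python
import math


def int_poly(q, scale, a, b, c):
    """Evaluate a*(x + b)**2 + c for x = q * scale using integer ops on q.

    Returns (q_out, scale_out) such that q_out * scale_out ~= a*(x + b)**2 + c.
    """
    # Precomputed constants (floats appear only here, not in the per-input path).
    q_b = math.floor(b / scale)
    q_c = math.floor(c / (a * scale * scale))
    scale_out = a * scale * scale
    # Per-input path: one integer add, one multiply, one add.
    q_out = (q + q_b) ** 2 + q_c
    return q_out, scale_out


# Example with the coefficients used later for the exponential.
a, b, c = 0.3585, 1.353, 0.344
scale = 0.01
q = -35                                   # real value x = -0.35
q_out, scale_out = int_poly(q, scale, a, b, c)
approx = q_out * scale_out
exact = a * (q * scale + b) ** 2 + c
```

Keeping the order at two keeps the intermediate `(q + q_b) ** 2` small enough for int32 when `q` is an int32 accumulator value of moderate size, which is the overflow concern listed above.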

Integer only GELU

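What is GELU? A non-linear activation function used in transformer models:

$$\mathrm{GELU}(x) = x \cdot \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

However, erf (the error function) is hard to compute, so it is usually approximated.

A common approximation uses the sigmoid function, but the sigmoid is itself non-linear, so it is replaced in turn by a piecewise-linear variant.

However, the gap between the two functions is large, so accuracy drops.

I-BERT therefore finds a second-order polynomial approximation of erf by solving an optimization problem.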
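The following floating-point sketch compares exact GELU with the polynomial form of i-GELU. The coefficients `A = -0.2888` and `B = -1.769` are the ones reported in the I-BERT paper; the function names and the test grid are my own. This is a reference for checking the approximation quality, not the integer kernel itself.

```python
import math

A, B = -0.2888, -1.769      # polynomial coefficients reported in the I-BERT paper


def erf_poly(x):
    """Second-order polynomial approximation of erf, with the input clipped to |x| <= -B."""
    sign = (x > 0) - (x < 0)
    clipped = min(abs(x), -B)
    return sign * (A * (clipped + B) ** 2 + 1.0)


def i_gelu(x):
    """GELU with erf replaced by the polynomial approximation."""
    return x * 0.5 * (1.0 + erf_poly(x / math.sqrt(2.0)))


def gelu(x):
    """Exact GELU for comparison."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


xs = [i / 100.0 for i in range(-400, 401)]
max_err = max(abs(i_gelu(x) - gelu(x)) for x in xs)
```

Clipping makes the polynomial saturate at ±1 for large inputs, which matches the behaviour of erf and keeps the error bounded over the whole real line.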

Integer-only Softmax

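Approximating the exponential function well over its whole domain would take a high-order polynomial, but as with GELU a high-order polynomial is not practical.

Restricting the input range solves the problem.

First, the maximum input value is subtracted inside the exponential: $\tilde{x}_i = x_i - x_{\max}$.

This turns the problem into approximating a non-linear function on non-positive inputs only.

The non-positive input is then decomposed as $\tilde{x} = (-\ln 2)\,z + p$.

Here $z$ is a non-negative integer and $p$ is a real number in $(-\ln 2, 0]$.

The power of two is handled with a bit shift: $\exp(\tilde{x}) = 2^{-z}\exp(p) = \exp(p) \gg z$.

cf. A similar scheme was used on the Itanium 2 machine from HP, but it evaluated exp(p) with a lookup table.

exp(p) is approximated with a second-order polynomial, chosen to minimize the L2 distance from exp on $(-\ln 2, 0]$.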
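A floating-point sketch of the decomposition, to show that a second-order polynomial on $(-\ln 2, 0]$ plus a power-of-two scaling is enough to cover all non-positive inputs. The polynomial coefficients (0.3585, 1.353, 0.344) are the ones reported in the I-BERT paper; the function names are hypothetical.

```python
import math

LN2 = math.log(2.0)


def exp_poly(p):
    """Second-order approximation of exp(p), intended for p in (-ln 2, 0]."""
    return 0.3585 * (p + 1.353) ** 2 + 0.344


def approx_exp(x):
    """Approximate exp(x) for x <= 0 via x = (-ln 2) * z + p."""
    assert x <= 0
    z = math.floor(-x / LN2)          # non-negative integer
    p = x + z * LN2                   # remainder in (-ln 2, 0]
    # Multiplying by 2**-z corresponds to a right shift by z bits on integers.
    return math.ldexp(exp_poly(p), -z)


xs = [-i / 100.0 for i in range(0, 1001)]          # 0.0 down to -10.0
max_err = max(abs(approx_exp(x) - math.exp(x)) for x in xs)
```

Since the polynomial is only ever evaluated on a short interval, a low order suffices, and the scaling step shrinks the absolute error for more negative inputs.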

Summary

Integer-only Softmax
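Putting the steps together, an integer-only Softmax can be sketched as below: subtract the maximum, apply the integer exponential (decomposition, integer polynomial, right shift), then normalize with an integer division. This is my own simplified reading of the procedure, not the authors' implementation; the function names, the `out_bits` parameter, and the fixed-point output format are assumptions. Floats are used only for constants that can be precomputed from the input scale.

```python
import math


def int_exp(q, scale):
    """Integer exp for q <= 0. Returns (q_out, scale_out) with
    exp(q * scale) ~= q_out * scale_out."""
    a, b, c = 0.3585, 1.353, 0.344             # coefficients reported in the paper
    # Constants that depend only on the scale (precomputable offline).
    q_ln2 = math.floor(math.log(2.0) / scale)
    assert q_ln2 > 0, "scale must be smaller than ln 2"
    q_b = math.floor(b / scale)
    q_c = math.floor(c / (a * scale * scale))
    # Integer-only path.
    z = (-q) // q_ln2                          # non-negative integer quotient
    q_p = q + z * q_ln2                        # remainder in (-q_ln2, 0]
    q_poly = (q_p + q_b) ** 2 + q_c
    return q_poly >> z, a * scale * scale


def int_softmax(qs, scale, out_bits=8):
    """Softmax over quantized inputs; output is fixed point with scale 1 / 2**out_bits."""
    q_max = max(qs)
    exps = [int_exp(q - q_max, scale)[0] for q in qs]
    total = sum(exps)                          # the shared output scale cancels in the ratio
    probs = [(e << out_bits) // total for e in exps]
    return probs, 1.0 / (1 << out_bits)


scale = 0.01
qs = [100, 200, 300]                           # real inputs 1.0, 2.0, 3.0
q_probs, out_scale = int_softmax(qs, scale)
approx = [p * out_scale for p in q_probs]
```

Note that a fully dominant input would produce the value `2**out_bits`, so a real kernel has to decide how to clamp or represent that case.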

Integer only Layer Norm

layer normalization

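In NLP tasks the input statistics (mean and standard deviation) change rapidly, so they must be computed dynamically at runtime. The mean is straightforward to compute, but the standard deviation requires a square-root function.

The square root is computed as $\left \lfloor \sqrt {n} \right \rfloor$, which can be evaluated with integer arithmetic using Newton's method.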
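A minimal sketch of an integer square root via Newton's method, using only integer addition, division, and shifts. The starting point is a power of two that is guaranteed to be at least $\sqrt{n}$, so the iterates decrease monotonically until they reach $\lfloor \sqrt{n} \rfloor$. The function name is hypothetical.

```python
def int_sqrt(n):
    """Return floor(sqrt(n)) for a non-negative integer n using Newton's method."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    # Initial guess 2**ceil(bits / 2), which is always >= sqrt(n).
    x = 1 << ((n.bit_length() + 1) // 2)
    while True:
        x_next = (x + n // x) // 2     # Newton step for f(x) = x**2 - n, in integers
        if x_next >= x:                # stopped decreasing: x is the answer
            return x
        x = x_next


variance_sum = 1_000_000               # e.g. an integer sum of squared deviations
std_int = int_sqrt(variance_sum)
```

Because the initial guess is within a factor of two of the true root, the loop converges in a handful of iterations even for 32-bit inputs.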

Result

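Conclusion

I-BERT approximates the non-linear operations GELU, Softmax, and LayerNorm with integer arithmetic.
I-BERT was evaluated on RoBERTa-Base/Large, where the GLUE score improved by 0.3/0.5 respectively.
Inference latency improved by up to about 4x.
Future work plans to speed up training as well.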
