Network Pruning (notes on the TA video from Hung-yi Lee's course)


PPT:
  • https://slides.com/arvinliu/model-compression
  • https://slides.com/arvinliu/model-compression/fullscreen?print-pdf=true#/0/1
  • https://slides.com/arvinliu/model-compression/fullscreen?print-pptx=true#/0/2

Video: P50 Network Compression (2_2) - Network Pruning (optional), 39:33

For a dense (fully connected) network with layer widths $a \to b \to c$, pruning one neuron from the middle layer reduces the total weight count (the matrix parameters) from $a \cdot b + b \cdot c = (a + c) \cdot b$ to $(a + c) \cdot (b - 1)$.

image-20230421174655789

For a CNN with channel counts $a \to b \to c$ and kernel size $k \times k$, the weights originally number $(a + c) \cdot b \cdot k^2$; pruning one middle channel reduces this to $(a + c) \cdot (b - 1) \cdot k^2$.

image-20230421174715246
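A quick sanity check of the counts above, using hypothetical layer widths a, b, c and kernel size k (the numbers are illustrative only):

```python
# Hypothetical layer widths / channel counts and kernel size, chosen only to
# illustrate the formulas above.
a, b, c = 100, 50, 10
k = 3

dense_before = a * b + b * c               # = (a + c) * b
dense_after = a * (b - 1) + (b - 1) * c    # = (a + c) * (b - 1)

conv_before = (a + c) * b * k * k
conv_after = (a + c) * (b - 1) * k * k

print(dense_before, dense_after)  # 5500 5390
print(conv_before, conv_after)    # 49500 48510
```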

Network Pruning

Main Question: Prune what? We want to prune whatever is "not important".

Which is most important?

How to evaluate importance?
  • Evaluate by Weight: judge importance by the magnitude of the weights.
  • Evaluate by Activation: judge importance by whether a neuron's output after the activation function is zero.
  • Evaluate by Gradient: judge importance by the gradient; a large gradient on a neuron means it matters more.

After Evaluation: how much should we prune?

  • Sort by importance and prune by rank.
  • Prune by a handcrafted threshold.
  • Prune by a generated (automatically chosen) threshold.

1) Prune the same fixed fraction x% in every layer (not very practical: models are usually pyramid-shaped and the top layers have few parameters, so how much to prune should really differ from layer to layer).

2) Use a different pruning ratio for each layer (e.g., chosen with AutoML); a small sketch contrasting the two strategies follows the figure below.

image-20230421174844154
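As a rough sketch of these two strategies (everything here, including the layer names and importance scores, is made up for illustration), a fixed per-layer ratio and a single global threshold can be compared like this:

```python
import numpy as np

# Hypothetical importance scores per layer (random stand-ins).
rng = np.random.default_rng(0)
layer_scores = {
    "conv1": rng.random(64),
    "conv2": rng.random(256),
    "fc":    rng.random(1024),
}
prune_ratio = 0.3

# Strategy 1: prune a fixed fraction in every layer.
masks_fixed = {
    name: s > np.quantile(s, prune_ratio)   # keep the top (1 - x)% of each layer
    for name, s in layer_scores.items()
}

# Strategy 2: one global threshold over all layers (layers end up pruned unevenly).
all_scores = np.concatenate(list(layer_scores.values()))
global_thr = np.quantile(all_scores, prune_ratio)
masks_global = {name: s > global_thr for name, s in layer_scores.items()}

for name in layer_scores:
    # fraction of units kept in each layer under the two strategies
    print(name, masks_fixed[name].mean(), masks_global[name].mean())
```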

Evaluate Importance (criteria)

  1. Sum of L1 norm (the target here is the weights: we prune weights/filters).

In the convolution, each output neuron/channel is produced by a filter applied to the input feature map with kernel size $k \times k$. For the first (orange) filter we compute the sum of its weights' absolute values, and likewise for the second (blue) filter; we then use the L1 (or L2) norm of each filter to decide which one to prune.

image-20230421174910562
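A minimal PyTorch sketch of this criterion, assuming a standalone Conv2d layer (following layers and BatchNorm are ignored here, so this is not a full pruning pipeline):

```python
import torch
import torch.nn as nn

# Hypothetical conv layer: weight shape is (out_channels, in_channels, k, k).
conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)

# Score each output filter by the sum of absolute kernel weights (L1 norm).
scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))

n_prune = 8                                               # drop the 8 weakest filters
keep_idx = torch.argsort(scores, descending=True)[: conv.out_channels - n_prune]
keep_idx, _ = torch.sort(keep_idx)

# Build a smaller conv layer that keeps only the selected filters.
pruned = nn.Conv2d(16, len(keep_idx), kernel_size=3)
pruned.weight.data = conv.weight.data[keep_idx].clone()
pruned.bias.data = conv.bias.data[keep_idx].clone()
print(pruned.weight.shape)   # torch.Size([24, 16, 3, 3])
```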
  2. Filter Pruning via Geometric Median (FPGM): pruning relative to the geometric median.

The norm criterion looks at the distribution of the filters' norms and prunes the filters at the (small-norm) edge of the distribution.

image-20230422233426616

However, this carries some risks:

  1. When the variance of the norm distribution is small, the norms barely distinguish the filters, so pruning by norm can badly affect accuracy and should not be done lightly.
  2. A filter is only safe to prune when its norm is close to 0; if even the smallest norms are not close to 0, all filters still contribute to the result and cannot be pruned lightly.
image-20230422233448565

The improved method is FPGM. Its starting point: "is a filter with a small norm necessarily useless to the model, and is a filter with a large norm necessarily useful?"

The goal is to find filters that all contribute to the model and are different from one another. If two filters are very similar (for instance their norm values are very close), one of them can be deleted.

In other words, whether a filter's norm is large or small, it may still contribute to the model; it is the redundant (duplicated) filters that contribute nothing and should be pruned.

How do we find the redundant filters? Through the geometric median.

image-20230422233505323

Compute the geometric median of the filters and prune the filters that lie closest to it, since those are the most redundant.

image-20230422233526366
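A hedged sketch of this idea: instead of computing the geometric median exactly, score each filter by its total distance to all the other filters; a small total distance means the filter sits near the geometric median and is the most replaceable. The layer and the number of pruned filters are made up.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3)                    # hypothetical layer
w = conv.weight.detach().flatten(start_dim=1)              # (32 filters, 16*3*3 values)

dist = torch.cdist(w, w)                                   # pairwise filter distances
total_dist = dist.sum(dim=1)                               # small => near the median => redundant

n_prune = 8
prune_idx = torch.argsort(total_dist)[:n_prune]            # the most "replaceable" filters
print(prune_idx)
```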

Other parameters we can use? The methods above prune the conv weights; next, pruning based on other parameters:

  • Network Slimming
    • In a BatchNorm layer, $\gamma$ is a learnable vector, so we can use the magnitude of each element of $\gamma$ to evaluate the importance of the corresponding channel: the more important a channel's weights are, the larger its $\gamma$ tends to be.
    • We can just use this parameter to evaluate importance.
    • colab tutorial (only pruned by gamma)
image-20230422233556827

However, if none of the $\gamma$ values is close to zero, forcibly pruning by $\gamma$ will hurt accuracy. The fix is to add an L1 penalty (sparsity regularization) on $\gamma$ during training, which pushes many $\gamma$ values toward zero.

image-20230422233621537
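A minimal sketch of this sparsity regularization, assuming a toy model and an arbitrary penalty strength lambda_l1; the real task loss is replaced by a placeholder:

```python
import torch
import torch.nn as nn

# Hypothetical small model with BatchNorm layers.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
)
lambda_l1 = 1e-4   # assumed sparsity strength

def bn_l1_penalty(model):
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()   # m.weight is the gamma vector
    return penalty

x = torch.randn(8, 3, 32, 32)
task_loss = model(x).mean()                            # stand-in for the real task loss
loss = task_loss + lambda_l1 * bn_l1_penalty(model)
loss.backward()
```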
  • Average Percentage of Zeros (APoZ)

After the ReLU activation many values are zero, so pruning decisions are made from the proportion of zeros: for each neuron/channel, count how often its activation equals zero over the data (APoZ is a count of zero activations, not a sum of values). A neuron whose zero ratio is high can be pruned more aggressively.

image-20230422233637790

The proportion of zeros is actually quite high:

image-20230422233651602
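A small sketch of how APoZ can be measured per channel, assuming a toy conv + ReLU block and random inputs standing in for real validation data:

```python
import torch
import torch.nn as nn

conv = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())  # hypothetical layer
x = torch.randn(64, 3, 32, 32)                                   # stand-in for real data

with torch.no_grad():
    act = conv(x)                                  # shape (N, C, H, W)
    apoz = (act == 0).float().mean(dim=(0, 2, 3))  # zero ratio per channel C

print(apoz)   # higher APoZ => more prunable channel
```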

More About Lottery Ticket Hypothesis

image-20230422233724553

The authors decide which weights to prune by the L1 norm of each weight (i.e., simply its absolute value).

Question 1: could it be that the weights that end up with large values after training are large simply because their initial values were already large? And if a weight whose initial value was small ends up large after training, would that kind of weight (small at init, large after training) be more meaningful?

image-20230422233812747
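To make Question 1 concrete, here is a hedged comparison of the two scoring rules it contrasts: keep the weights with large final magnitude, versus keep the weights whose magnitude grew the most from initialization. The weights here are synthetic, not from a trained network:

```python
import torch

torch.manual_seed(0)
w_init = torch.randn(1000) * 0.1            # hypothetical initial weights
w_final = w_init + torch.randn(1000) * 0.1  # hypothetical trained weights

keep_ratio = 0.2
k = int(keep_ratio * w_init.numel())

score_large_final = w_final.abs()                 # "large final value" criterion
score_movement = w_final.abs() - w_init.abs()     # "grew the most" criterion

mask_final = score_large_final >= score_large_final.topk(k).values.min()
mask_move = score_movement >= score_movement.topk(k).values.min()

print("overlap of the two masks:", (mask_final & mask_move).sum().item(), "of", k)
```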

Experimental comparison

image-20230422233829593

Experimental results

image-20230422233849055

Question 2: what determines the winning-ticket parameters?

Question: why can winning tickets achieve better accuracy?

Experiment: the following experiments were run (a small sketch of two of the variants appears after the figure below).

  • sign: what sign (+/-) to give each surviving weight
    • rewind init sign: reuse the sign it had at initialization
    • random: assign the signs at random
  • value: what magnitude to give each surviving weight
    • rewind init value: reuse its initialization value
    • reshuffle weights within the same layer
    • constant α (the std of the initializer)
    • random
image-20230422233913423
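A hedged sketch of two of the variants above, assuming synthetic weights and a random winning-ticket mask: keep each surviving weight's initial sign but set its magnitude to the constant α (the initializer's std), versus assigning the signs at random:

```python
import torch

torch.manual_seed(0)
w_init = torch.randn(1000) * 0.05               # hypothetical initial weights
mask = (torch.rand(1000) < 0.2).float()         # hypothetical winning-ticket mask
alpha = w_init.std()                            # constant magnitude = initializer std

w_sign_constant = torch.sign(w_init) * alpha * mask   # init sign, constant value
rand_sign = torch.sign(torch.randn(1000))
w_random_sign = rand_sign * alpha * mask              # random sign, constant value

print(w_sign_constant.abs().unique())   # only 0 and alpha remain
```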

Experimental results:

image-20230422233938836

Conclusion

Experiment 1: Choose Which Mask

  • Usual L1-norm pruning:

Experiment 2: which properties of $w$ to retain

  • Under the same architecture, the sign of the initial weights is what matters: keep the initial signs and do not flip them lightly.

Based on Experiment 1 & 2, we can construct a “supermask”.

image-20230422233958304
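A hedged toy illustration of the supermask idea: apply a binary mask directly to the untrained initial weights and run the network without any training. The mask rule used here (keep the largest |w_init|) is only a stand-in, not the paper's actual criterion:

```python
import torch
import torch.nn as nn

layer = nn.Linear(784, 10)
w_init = layer.weight.detach().clone()

keep = w_init.abs() >= w_init.abs().quantile(0.8)   # hypothetical supermask: keep top 20%
layer.weight.data = w_init * keep                   # masked but never trained

x = torch.randn(4, 784)
print(layer(x).shape)                               # the masked-at-init network still runs
```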

For details, see: Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask (ICML 2019).

Rethink vs Lottery

Recap: "Rethinking the Value of Network Pruning" takes the opposite view from the above. It argues that the inherited weights and neurons are not what matters; what matters is the structure/architecture. Fine-tuning after pruning may be no better than training a randomly initialized network with the same (pruned) architecture.

image-20230422234017228
  • Rethinking the Value: the architecture matters more; training from random re-initialization works well.
  • Lottery Ticket: the (initial) weights matter more; random re-initialization does not work well.
image-20230422234033439

For the Lottery Ticket setting, the learning rate has to be small, so that the weights after training stay close to the initial weights.

image-20230422234049299
  • Rethinking the Value prunes neurons / feature maps (structured pruning).
  • Lottery Ticket prunes individual weights (unstructured pruning).
image-20230423092000001 image-20230423092023886

Paper Reference

  • Network Pruning
    • Pruning Filters for Efficient ConvNets (ICLR 2017)
    • Learning Efficient Convolutional Networks Through Network Slimming (ICCV 2017)
    • Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration (CVPR 2019)
    • Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures (arXiv 2016)
    • The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (ICLR 2019)
    • Rethinking the Value of Network Pruning (ICLR 2019)
    • Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask (ICML 2019)