[Dy2St] Disable `test_bert` on CPU by gouzil · Pull Request #60173 · PaddlePaddle/Paddle

* fix windows bug for common lib (#60308)

* fix windows bug

* fix windows bug

* fix windows bug

* fix windows bug

* fix windows bug

* fix windows bug

* Update inference_lib.cmake

* [Dy2St] Disable `test_bert` on CPU (#60173) (#60324)

Co-authored-by: gouzil <66515297+gouzil@users.noreply.github.com>

* [Cherry-pick] fix weight quant kernel bug when n div 64 != 0 (#60184)

* fix weight-only quant kernel error for n div 64 != 0

* code style fix

* tile (#60261)

* add chunk allocator posix_memalign return value check (#60208) (#60495)

* fix chunk allocator posix_memalign return value check;test=develop

* fix chunk allocator posix_memalign return value check;test=develop

* fix chunk allocator posix_memalign return value check;test=develop

* update 2023 security advisory, test=document_fix (#60532)

* fix fleetutil get_online_pass_interval bug2; test=develop (#60545)

* fix fused_rope diff (#60217) (#60593)

* [cherry-pick]fix fleetutil get_online_pass_interval bug3 (#60620)

* fix fleetutil get_online_pass_interval bug3; test=develop

* fix fleetutil get_online_pass_interval bug3; test=develop

* fix fleetutil get_online_pass_interval bug3; test=develop

* [cherry-pick]update pdsa-2023-019 (#60649)

* update 2023 security advisory, test=document_fix

* update pdsa-2023-019, test=document_fix

* [Dy2St][2.6] Disable `test_grad` on release/2.6 (#60662)

* fix bug of ci (#59926) (#60785)

* [Dy2St][2.6] Disable `test_transformer` on `release/2.6` and update README (#60786)

* [Dy2St][2.6] Disable `test_transformer` on release/2.6 and update README

* [Docs] Update latest release version in README (#60691)

* restore order

* [Dy2St][2.6] Increase `test_transformer` and `test_mobile_net` ut time (#60829) (#60875)

* [Cherry-pick] fix set_value with scalar grad (#60930)

* Fix set value grad (#59034)

* first fix the UT

* fix set value grad

* polish code

* add static mode backward test

* always has input valuetensor

* add dygraph test

* Fix shape error in combined-indexing setitem (#60447)

* add ut

* fix shape error in combine-indexing

* fix ut

* Set value with scalar (#60452)

* set_value with scalar

* fix ut

* remove test_pir

* remove one test since 2.6 does not support uint8 add

* [cherry-pick] This PR enable offset of generator for custom device. (#60616) (#60772)

* fix core dump when fallback gather_nd_grad and MemoryAllocateHost (#61067)

* fix qat tests (#61211) (#61284)

* [Security] fix draw security problem (#61161) (#61338)

* fix draw security problem

* fix _decompress security problem (#61294) (#61337)

* Fix CVE-2024-0521 (#61032) (#61287)

This uses shlex for safe command parsing to fix an arbitrary code injection vulnerability.

Co-authored-by: ndren <andreien@proton.me>
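
The shlex-based fix follows a standard pattern: quote untrusted input before tokenizing, so shell metacharacters are never interpreted. A minimal sketch with a hypothetical helper name (not the PR's actual code):

```python
import shlex

def build_argv(template: str, user_input: str) -> list[str]:
    """Tokenize a command safely: quote the untrusted part, then split.

    Shell metacharacters in user_input survive as literal text in a
    single argv element, so passing the result to subprocess.run with
    shell=False cannot execute injected commands.
    """
    return shlex.split(template + " " + shlex.quote(user_input))

argv = build_argv("git clone", "repo; rm -rf /")
# The injection attempt stays one literal argument: 'repo; rm -rf /'
```

The key design point is that `shlex.quote` plus `shell=False` removes the shell from the picture entirely, rather than trying to blacklist dangerous characters.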

* [Security] fix security problem for prune_by_memory_estimation (#61382)

* OS Command Injection prune_by_memory_estimation fix

* Fix StyleCode

* [Security] fix security problem for run_cmd (#61285) (#61398)

* fix security problem for run_cmd

* [Security] fix download security problem (#61162) (#61388)

* fix download security problem

* check eval for security (#61389)
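
A common way to make an `eval` site safe, sketched below with a hypothetical helper (the exact change in #61389 may differ), is to parse untrusted literals with `ast.literal_eval`, which rejects any executable construct:

```python
import ast

def parse_config_value(text: str):
    """Parse a literal (number, string, list, dict, ...) from untrusted text.

    Unlike eval(), ast.literal_eval() refuses function calls, attribute
    access, and other executable syntax, raising ValueError instead of
    running attacker-controlled code.
    """
    return ast.literal_eval(text)

parse_config_value("[1, 2, 3]")               # returns the list [1, 2, 3]
# parse_config_value("__import__('os')")      # raises ValueError
```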

* [cherry-pick] adapt c_embedding to phi namespace for custom devices (#60774) (#61045)

Co-authored-by: Tian <121000916+SylarTiaNII@users.noreply.github.com>

* [CherryPick] Fix issue 60092 (#61427)

* fix issue 60092

* update

* update

* update

* Fix unique (#60840) (#61044)

* fix unique kernel, row to num_out

* cinn(py-dsl): skip eval string in python-dsl (#61380) (#61586)

* remove _wget (#61356) (#61569)

* remove _wget

* remove _wget

* remove wget test
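
With `_wget` removed, downloads can go through the Python stdlib instead of an external process. A minimal sketch of a stdlib-only fetch (a hypothetical helper, not the PR's actual code; the scheme check is an extra guard assumed here):

```python
import urllib.parse
import urllib.request

def safe_download(url: str, path: str) -> None:
    """Fetch url to path without spawning an external wget process.

    Restricting the scheme rejects file:// and other surprises from
    user-supplied URLs before any I/O happens.
    """
    scheme = urllib.parse.urlparse(url).scheme
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme: {scheme!r}")
    with urllib.request.urlopen(url) as resp, open(path, "wb") as out:
        out.write(resp.read())
```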

* fix layer_norm decompose dtype bugs, polish code (#61631)

* fix doc style (#61688)

* merge (#61866)

* [security] refine _get_program_cache_key (#61827) (#61896)

* security, refine _get_program_cache_key

* repeat_interleave support bf16 dtype (#61854) (#61899)

* repeat_interleave support bf16 dtype

* support bf16 on cpu

* Support Fake GroupWise Quant (#61900)

* fix launch when elastic run (#61847) (#61878)

* [Paddle-TRT] fix solve (#61806)

* [Cherry-Pick] Fix CacheKV Quant Bug (#61966)

* fix cachekv quant problem

* add unittest

* Synchronized the paddle 2.4 adaptation changes

* clear third_party dependencies

* change submodules to right commits

* build pass with cpu only

* build success with maca

* build success with cutlass and fused kernels

* build with flash_attn and mccl

* build with test, fix some bugs

* fix some bugs

* fixed some compilation bugs

* fix bug in previous commit

* fix bug in split when col_size is bigger than 256

* add row_limit to show full kernel name

* add env.sh

Change-Id: I6fded2761a44af952a4599691e19a1976bd9b9d1

* add shape record

Change-Id: I273f5a5e97e2a31c1c8987ee1c3ce44a6acd6738

* modify paddle version

Change-Id: I97384323c38066e22562a6fe8f44b245cbd68f98

* wuzhao optimized the performance of the elementwise kernel.

Change-Id: I607bc990415ab5ff7fb3337f628b3ac765d3186c

* fix split when dtype is fp16

Change-Id: Ia55d31d11e6fa214d555326a553eaee3e928e597

* fix bug in previous commit

Change-Id: I0fa66120160374da5a774ef2c04f133a54517069

* adapt flash_attn to the new C API

Change-Id: Ic669be18daee9cecbc8542a14e02cdc4b8d429ba

* change eigen path

Change-Id: I514c0028e16d19a3084656cc9aa0838a115fc75c

* modify mcname -> replaced_name

Change-Id: Idc520d2db200ed5aa32da9573b19483d81a0fe9e

* fix some build bugs

Change-Id: I50067dfa3fcaa019b5736f4426df6d4e5f64107d

* add PADDLE_ENABLE_SAME_RAND_A100

Change-Id: I2d4ab6ed0b5fac3568562860b0ba1c4f8e346c61

* remove redundant warning, add patch from 2.6.1

Change-Id: I958d5bebdc68eb42fe433c76a3737330e00a72aa

* improve VectorizedBroadcastKernel

(cherry picked from commit 19069b26c0bf05a80cc834162db072f6b8aa2536)
Change-Id: Iaf5719d72ab52adbedc40d4788c52eb1ce4d517c
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* fix bugs

(cherry picked from commit b007853a75dbd5de63028f4af82c15a5d3d81f7c)
Change-Id: Iaec0418c384ad2c81c354ef09d81f3e9dfcf82f1
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* split ElementwiseDivGrad

(cherry picked from commit eb6470406b7d440c135a3f7ff68fbed9494e9c1f)
Change-Id: I60e8912be8f8d40ca83a54af1493adfa2962b2d6
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* VectorizedElementwiseKernel can now use vecSize = 8

(cherry picked from commit a873000a6c3bc9e2540e178d460e74e15a3d4de5)
Change-Id: Ia703b1e9e959558988fcd09182387da839d33922
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve ModulatedDeformableCol2imCoordGpuKernel: 1. block size 512 -> 64; 2. FastDivMod; 3. fix VL1; 4. remove DmcnGetCoordinateWeight divergent branches.

(cherry picked from commit 82c914bdd29f0eef87a52b229ff84bc456a1beeb)
Change-Id: I60b1fa9a9c89ade25e6b057c38e08616a24fa5e3
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Optimize depthwise_conv2d_grad compute (InputGrad):
1. use shared memory to optimize data loads from global memory;
2. different block sizes for different input shapes;
3. FastDivMod for input-shape division, >> and & for stride division.

(cherry picked from commit b34a5634d848f3799f5a8bcf884731dba72d3b20)
Change-Id: I0d8f22f2a2b9d99dc9fbfc1fb69b7bed66010229
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>
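
Several of these kernel commits cite FastDivMod, which replaces integer division by a divisor fixed at kernel launch with a precomputed multiply-and-shift. The CUDA helper does this on device; the same arithmetic can be sketched in Python (a host-side illustration of the idea, not Paddle's implementation):

```python
def fast_divmod_constants(divisor: int) -> tuple[int, int]:
    """Precompute (multiplier, shift) so that n // divisor == (n * m) >> s
    for all 32-bit unsigned n: one runtime division becomes a multiply
    plus a shift, which is much cheaper on GPUs."""
    assert 0 < divisor < (1 << 31)
    shift = 32 + divisor.bit_length()
    multiplier = ((1 << shift) + divisor - 1) // divisor  # ceil(2^s / d)
    return multiplier, shift

def fast_divmod(n: int, multiplier: int, shift: int, divisor: int):
    """Quotient and remainder of n / divisor without a division at use site."""
    q = (n * multiplier) >> shift
    return q, n - q * divisor

m, s = fast_divmod_constants(7)
q, r = fast_divmod(100, m, s, 7)  # q == 14, r == 2
```

The payoff on device is that the divisor (e.g. a tensor stride) is known when the kernel launches, so the constants are computed once on the host and every thread only multiplies and shifts.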

* improve VectorizedBroadcastKernel with LoadType = 2 (kMixed)

(cherry picked from commit 728b9547f65e096b45f39f096783d2bb49e8556f)
Change-Id: I282dd8284a7cde54061780a22b397133303f51e5
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* fix ElementwiseDivGrad

(cherry picked from commit 5f99c31904e94fd073bdd1696c3431cccaa376cb)
Change-Id: I3ae0d6c01eec124d12fa226a002b10d0c40f820c
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Revert "Optimize depthwise_conv2d_grad compute (InputGrad):"

This reverts commit b34a5634d848f3799f5a8bcf884731dba72d3b20.

(cherry picked from commit 398f5cde81e2131ff7014edfe1d7beaaf806adbb)
Change-Id: I637685b91860a7dea6df6cbba0ff2cf31363e766
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve ElementwiseDivGrad and ElementwiseMulGrad

(cherry picked from commit fe32db418d8f075e083f31dca7010398636a6e67)
Change-Id: I4f7e0f2b5afd4e704ffcd7258def63afc43eea9c
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve FilterBBoxes

(cherry picked from commit fe4655e86b92f5053fa886af49bf199307960a05)
Change-Id: I35003420292359f8a41b19b7ca2cbaae17dc5b45
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve deformable_conv_grad op: 1. adaptive block size; 2. FastDivMod; 3. move ldg up.

(cherry picked from commit a7cb0ed275a3488f79445ef31456ab6560e9de43)
Change-Id: Ia89df4e5a26de64baae4152837d2ce3076c56df1
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve ModulatedDeformableIm2colGpuKernel: 1. adaptive block size; 2. FastDivMod; 3. move ldg up.

(cherry picked from commit 4fb857655d09f55783d9445b91a2d953ed14d0b8)
Change-Id: I7df7f3af7b4615e5e96d33b439e5276be6ddb732
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve KeBNBackwardData: replace 1.0/sqrt with rsqrt

(cherry picked from commit 333cba7aca1edf7a0e87623a0e55e230cd1e9451)
Change-Id: Ic808d42003677ed543621eb22a797f0ab7751baa
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Improve the KeBNBackwardData and FilterGradAddupGpuKernel kernels. Improve the nonzero and masked_select (forward only) ops.

(cherry picked from commit c907b40eb3f9ded6ee751e522c2a97a353ac93bd)
Change-Id: I7f4845405e64e7599134a8c497f464ac04dead88
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Optimize depthwise_conv2d:
1. 256 block-size launch for small-shape InputGrad;
2. FastDivMod in InputGrad and FilterGrad;
3. shared memory to hold output_grad_data for small shapes.

(cherry picked from commit f9f29bf7b8d929fb95eb1153a79d8a6b96d5b6d2)
Change-Id: I1a3818201784031dbedc320286ea5f4802dbb6b1
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors.

(cherry picked from commit 3bd200f262271a333b3947326442b86af7fb6da1)
Change-Id: I57c94cc5e709be8926e1b21da14b653cb18eabc3
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Revert "Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors."

This reverts commit 3bd200f262271a333b3947326442b86af7fb6da1.

(cherry picked from commit 86ed8adaa8c20d3c824eecb0ee1e10d365bcea37)
Change-Id: I5b8b7819fdf99255c65fe832d5d77f8e439bdecb
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve ScatterInitCUDAKernel and ScatterCUDAKernel

(cherry picked from commit cddb01a83411c45f68363248291c0c4685e60b24)
Change-Id: Ie106ff8d65c21a8545c40636f021b73f3ad84587
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* fix bugs and make the code easier to read

(cherry picked from commit 07ea3acf347fda434959c8c9cc3533c0686d1836)
Change-Id: Id7a727fd18fac4a662f8af1bf6c6b5ebc6233c9f
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Optimize FilterGrad and InputGradSpL

Use tmp to store ldg data in the loop so computation and ldg latency
can overlap.

(cherry picked from commit 7ddab49d868cdb6deb7c3e17c5ef9bbdbab86c3e)
Change-Id: I46399594d1d7f76b78b9860e483716fdae8fc7d6
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Improve CheckFiniteAndUnscaleKernel by moving address access into shared memory and making each thread do more work.

(cherry picked from commit 631ffdda2847cda9562e591dc87b3f529a51a978)
Change-Id: Ie9ffdd872ab06ff34d4daf3134d6744f5221e41e
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Optimize SwinTransformer

1. LayerNormBackward: remove the if statement; the compiled kernel now
always loops VPT times for ldg128, with a bool flag controlling whether
the write is performed;
2. ContiguousCaseOneFunc: temporarily save division results to do fewer divisions

(cherry picked from commit 422d676507308d26f6107bed924424166aa350d3)
Change-Id: I37aab7e2f97ae6b61c0f50ae4134f5eb1743d429
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Optimize LayerNormBackwardComputeGradInputWithSmallFeatureSize

Set blockDim.z so the block size is always 512; each block can
handle several batches, and all threads loop 4 times for better
performance.

(cherry picked from commit 7550c90ca29758952fde13eeea74857ece41908b)
Change-Id: If24de87a0af19ee07e29ac2e7e237800f0181148
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve KeMatrixTopK: 1. fix private memory; 2. modify max grid size; 3. change it to 64 warp reduce.

(cherry picked from commit a346af182b139dfc7737e5f6473dc394b21635d7)
Change-Id: I6c8d8105fd77947c662e6d22a0d15d7bad076bde
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* Modify LayerNorm optimization

May produce a loss diff versus the old optimization without atomicAdd.

(cherry picked from commit 80b0bcaa9a307c94dbeda658236fd75e104ccccc)
Change-Id: I4a7c4ec2a0e885c2d581dcebc74464830dae7637
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* improve roi_align op: 1. adaptive block size; 2. FastDivMod.

(cherry picked from commit cc421d7861c359740de0d2870abcfde4354d8c71)
Change-Id: I55c049e951f93782af1c374331f44b521ed75dfe
Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>

* add workaround for parameter dislocation when calling BatchedGEMM<float16>.

Change-Id: I5788c73a9c45f65e60ed5a88d16a473bbb888927

* fix McFlashAttn string

Change-Id: I8b34f02958ddccb3467f639daaac8044022f3d34

* [C500-27046] fix wb issue

Change-Id: I77730da567903f43ef7a9992925b90ed4ba179c7

* Support compiling external ops

Change-Id: I1b7eb58e7959daff8660ce7889ba390cdfae0c1a

* support flash attn varlen api and support arm build

Change-Id: I94d422c969bdb83ad74262e03efe38ca85ffa673

* Add a copyright notice

Change-Id: I8ece364d926596a40f42d973190525d9b8224d99

* Modify some third-party dependency addresses to public network addresses

---------

Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>
Co-authored-by: risemeup1 <62429225+risemeup1@users.noreply.github.com>
Co-authored-by: Nyakku Shigure <sigure.qaq@gmail.com>
Co-authored-by: gouzil <66515297+gouzil@users.noreply.github.com>
Co-authored-by: Wang Bojun <105858416+wwbitejotunn@users.noreply.github.com>
Co-authored-by: lizexu123 <39205361+lizexu123@users.noreply.github.com>
Co-authored-by: danleifeng <52735331+danleifeng@users.noreply.github.com>
Co-authored-by: Vigi Zhang <VigiZhang@users.noreply.github.com>
Co-authored-by: tianhaodongbd <137985359+tianhaodongbd@users.noreply.github.com>
Co-authored-by: zyfncg <zhangyunfei07@baidu.com>
Co-authored-by: JYChen <zoooo0820@qq.com>
Co-authored-by: zhaohaixu <49297029+zhaohaixu@users.noreply.github.com>
Co-authored-by: Spelling <33216444+raining-dark@users.noreply.github.com>
Co-authored-by: zhouzj <41366441+zzjjay@users.noreply.github.com>
Co-authored-by: wanghuancoder <wanghuan29@baidu.com>
Co-authored-by: ndren <andreien@proton.me>
Co-authored-by: Nguyen Cong Vinh <80946737+vn-ncvinh@users.noreply.github.com>
Co-authored-by: Ruibin Cheung <beinggod@foxmail.com>
Co-authored-by: Tian <121000916+SylarTiaNII@users.noreply.github.com>
Co-authored-by: Yuanle Liu <yuanlehome@163.com>
Co-authored-by: zhuyipin <yipinzhu@outlook.com>
Co-authored-by: 6clc <chaoliu.lc@foxmail.com>
Co-authored-by: Wenyu <wenyu.lyu@gmail.com>
Co-authored-by: Xianduo Li <30922914+lxd-cumt@users.noreply.github.com>
Co-authored-by: Wang Xin <xinwang614@gmail.com>
Co-authored-by: Chang Xu <molixu7@gmail.com>
Co-authored-by: wentao yu <yuwentao126@126.com>
Co-authored-by: zhink <33270771+zhink@users.noreply.github.com>
Co-authored-by: handiz <35895648+ZhangHandi@users.noreply.github.com>
Co-authored-by: zhimin Pan <zhimin.pan@metax-tech.com>
Co-authored-by: m00891 <Zequn.Yang@metax-tech.com>
Co-authored-by: shuliu <shupeng.liu@metax-tech.com>
Co-authored-by: Yanxin Zhou <yanxin.zhou@metax-tech.com>
Co-authored-by: Zhao Wu <zhao.wu@metax-tech.com>
Co-authored-by: m00932 <xiangrong.yi@metax-tech.com>
Co-authored-by: Fangzhou Feng <fangzhou.feng@metax-tech.com>
Co-authored-by: junwang <jun.wang@metax-tech.com>
Co-authored-by: m01097 <qimeng.du@metax-tech.com>