M1 vs. Kunpeng 920: numerical computing performance comparison

Inspired by this thread, which is also the source of all the data here other than the Kunpeng 920 numbers: https://v2ex.com/t/733777

The comparison is based on NumPy numerical computing (accelerated with Neon SIMD). The test script is:

https://gist.github.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276
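
For anyone who doesn't want to open the gist, here is a minimal sketch of what a benchmark of this kind might look like. It is not the linked script: the names `mean_seconds` and `run_benchmark`, the repeat count, and the way the inputs are generated are assumptions for illustration. Only the five operations and their problem sizes are taken from the results reported in this thread.

```python
from time import perf_counter

import numpy as np


def mean_seconds(op, repeats):
    """Average wall-clock seconds of calling op() `repeats` times."""
    start = perf_counter()
    for _ in range(repeats):
        op()
    return (perf_counter() - start) / repeats


def run_benchmark(size=4096, repeats=3, seed=0):
    """Time five linear-algebra operations; size=4096 gives the sizes used in this thread."""
    half = size // 2
    rng = np.random.default_rng(seed)
    A = rng.random((size, size))
    B = rng.random((size, size))
    v = rng.random(size * 128)           # 524288 elements when size=4096
    w = rng.random(size * 128)
    C = rng.random((half, size // 4))    # 2048x1024 when size=4096
    D = rng.random((half, half))         # 2048x2048 when size=4096
    # Symmetric positive definite matrix, so the Cholesky factorization succeeds.
    spd = D @ D.T + half * np.eye(half)
    return {
        "matmul": mean_seconds(lambda: A @ B, repeats),
        "dot": mean_seconds(lambda: v @ w, repeats),
        "svd": mean_seconds(lambda: np.linalg.svd(C, full_matrices=False), repeats),
        "cholesky": mean_seconds(lambda: np.linalg.cholesky(spd), repeats),
        "eig": mean_seconds(lambda: np.linalg.eig(D), repeats),
    }
```

Calling `run_benchmark()` with the defaults runs the full-size problems and returns the mean time per operation in seconds. Passing a smaller `size` is handy for a quick sanity check.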

Without further ado, here are the results:

Item	M1	Kunpeng 920 (12 cores)	Kunpeng 920 (24 cores)	Core i9
4096x4096 matrix multiplication	0.53 s	1.48 s	0.76 s	0.45 s
524288-element vector dot product	0.25 ms	0.49 ms	0.48 ms	0.05 ms
2048x1024 SVD	0.59 s	1.10 s	0.93 s	0.32 s
2048x2048 Cholesky decomposition	0.08 s	0.14 s	0.13 s	0.08 s
2048x2048 eigendecomposition	4.74 s	8.36 s	7.66 s	3.53 s

Conclusion:

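Because NumPy calls into an underlying acceleration library, it can make effective use of multiple cores for numerical computing. Overall, even the 24-core Kunpeng 920 takes roughly 1.5 times as long as the M1, and almost twice as long on the vector dot product (0.48 ms vs. 0.25 ms). SVD takes about 1.6 times as long with 24 cores (0.93 s vs. 0.59 s) and approaches 2x only with 12 cores (1.10 s).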
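
To make the comparison concrete, the ratios can be computed straight from the table. A small sketch: the dictionary layout and the function name `slowdown_vs` are just for illustration, and the numbers are copied from the table above.

```python
# Timings copied from the results table. Units differ per row (s or ms),
# but they cancel out because only ratios within a row are computed.
results = {
    "matmul 4096x4096": {"M1": 0.53, "Kunpeng 12c": 1.48, "Kunpeng 24c": 0.76, "Core i9": 0.45},
    "dot 524288": {"M1": 0.25, "Kunpeng 12c": 0.49, "Kunpeng 24c": 0.48, "Core i9": 0.05},
    "SVD 2048x1024": {"M1": 0.59, "Kunpeng 12c": 1.10, "Kunpeng 24c": 0.93, "Core i9": 0.32},
    "Cholesky 2048x2048": {"M1": 0.08, "Kunpeng 12c": 0.14, "Kunpeng 24c": 0.13, "Core i9": 0.08},
    "eig 2048x2048": {"M1": 4.74, "Kunpeng 12c": 8.36, "Kunpeng 24c": 7.66, "Core i9": 3.53},
}


def slowdown_vs(baseline, table):
    """For each task, return each chip's time divided by the baseline chip's time."""
    return {
        task: {chip: t / row[baseline] for chip, t in row.items() if chip != baseline}
        for task, row in table.items()
    }


ratios = slowdown_vs("M1", results)
for task, row in ratios.items():
    print(task, {chip: round(r, 2) for chip, r in row.items()})
```

A ratio above 1 means the chip is slower than the M1 on that item, and below 1 means it is faster.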

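The Core i9 numbers were measured by @pb941129 in the original thread on a 16-inch MBP with an i9. Numerical computing is a traditional Intel strength, and with MKL underneath, the i9 leads the M1 (measured by @YUX in the original thread) on every item except Cholesky decomposition, where the two tie at 0.08 s.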

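Notes:

1 The Kunpeng 920 was tested on Huawei Cloud.

2 Except for the Core i9, NumPy was installed via Miniforge in every case. The acceleration library configuration is: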

blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    include_dirs = ['/root/miniforge3/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    include_dirs = ['/root/miniforge3/include']
    language = c

lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    language = f77

lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/root/miniforge3/include']
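
A listing like the one above can be printed from Python with `numpy.show_config()`. This is only a minimal sketch. The layout of the output depends on the installed NumPy version, so it may not match the format shown here.

```python
import numpy as np

# Print the BLAS/LAPACK build configuration NumPy was compiled against.
np.show_config()
```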

23 replies • 2021-04-25 16:28:50 +08:00

felixcode

2021-04-23 17:34:31 +08:00

Was the Huawei Cloud machine a dedicated server?
Did the i9 hit its power and thermal limits?

YRInc

2021-04-23 17:44:00 +08:00

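@felixcode The power limit probably doesn't matter here, because it's only a few seconds of full-load computation. On Huawei Cloud I used a single node and it also ran at full load. I don't know whether it's a dedicated server.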

FurN1

2021-04-23 17:53:04 +08:00 via iPhone

0987363

2021-04-23 18:15:55 +08:00

Multi-core results aren't much use as a reference. Are there any single-core results?

YRInc

2021-04-23 18:18:16 +08:00 via iPhone

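@0987363 Since this tests numerical computing, it is mainly a comparison of multi-core and SIMD performance. General-purpose computing wasn't compared.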

0987363

2021-04-23 18:39:05 +08:00

Dotted two 4096x4096 matrices in 0.35 s.
Dotted two vectors of length 524288 in 0.03 ms.
SVD of a 2048x1024 matrix in 0.30 s.
Cholesky decomposition of a 2048x2048 matrix in 0.05 s.
Eigendecomposition of a 2048x2048 matrix in 3.05 s.

Ran it on a Hackintosh with a 10850K.
It can only use up to 10 threads.

YRInc

2021-04-23 18:42:11 +08:00 via iPhone

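@0987363 Right, it depends on how the acceleration library is configured. By default (I don't know exactly why), the maximum thread count is capped at the number of physical cores and hyper-threading is not used. MATLAB and Mathematica use the same strategy.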
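
If you want to set the thread count yourself instead of relying on the default, a common approach is to set the BLAS threading environment variables before NumPy is imported. A minimal sketch, assuming the backend honors `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, or `MKL_NUM_THREADS` (which one applies depends on the library NumPy is linked against). The helper name `limit_blas_threads` is made up for illustration.

```python
import os


def limit_blas_threads(n):
    """Ask common BLAS backends to use at most n threads.

    Call this before `import numpy`: the libraries typically read these
    variables once, when they are first loaded.
    """
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(n)


limit_blas_threads(1)  # e.g. force a single-threaded run
```

Setting the value to 1 is one way to get the single-core numbers asked about earlier in the thread.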

0987363

2021-04-23 18:44:06 +08:00

@YRInc On Debian it can use hyper-threading. This is a 2630L v4:
Dotted two 4096x4096 matrices in 0.91 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.90 s.
Cholesky decomposition of a 2048x2048 matrix in 0.18 s.
Eigendecomposition of a 2048x2048 matrix in 12.48 s.

YRInc

2021-04-23 18:50:38 +08:00 via iPhone

@0987363 Nice. The Xeon does come out a bit behind by comparison.

YRInc

2021-04-23 18:51:47 +08:00 via iPhone

@0987363 It trades wins with the 24-core Kunpeng.

Deepseafish

2021-04-23 19:23:49 +08:00

These were not run on idle machines, and the first two items fluctuate quite a bit.
E5-2680 v4
Dotted two 4096x4096 matrices in 0.33 s.
Dotted two vectors of length 524288 in 0.03 ms.
SVD of a 2048x1024 matrix in 0.34 s.
Cholesky decomposition of a 2048x2048 matrix in 0.10 s.
Eigendecomposition of a 2048x2048 matrix in 3.82 s.

Xeon(R) Platinum 8170
Dotted two 4096x4096 matrices in 0.77 s.
Dotted two vectors of length 524288 in 0.13 ms.
SVD of a 2048x1024 matrix in 0.48 s.
Cholesky decomposition of a 2048x2048 matrix in 0.30 s.
Eigendecomposition of a 2048x2048 matrix in 5.94 s.

E5-2690 v4
Dotted two 4096x4096 matrices in 0.93 s.
Dotted two vectors of length 524288 in 0.14 ms.
SVD of a 2048x1024 matrix in 1.60 s.
Cholesky decomposition of a 2048x2048 matrix in 0.13 s.
Eigendecomposition of a 2048x2048 matrix in 6.90 s.

yanwen

2021-04-23 19:25:49 +08:00

It was on Huawei Cloud... so the performance takes a hit.

secondwtq

2021-04-23 19:36:23 +08:00 via iPhone

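Judging from the Kunpeng 12-core vs. 24-core results, it looks like the algorithms other than matrix multiplication can't really use multiple cores "effectively".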
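
A quick way to check this is to compute the 12-core to 24-core speedup for each item. A small sketch using the Kunpeng 920 numbers from the results table. The 2.0x reference is simply what perfect scaling would give when the core count doubles.

```python
# Kunpeng 920 timings from the results table: (12 cores, 24 cores).
# Units differ per row (s or ms) but cancel out in the ratio.
kunpeng = {
    "matmul": (1.48, 0.76),
    "dot": (0.49, 0.48),
    "svd": (1.10, 0.93),
    "cholesky": (0.14, 0.13),
    "eig": (8.36, 7.66),
}

# Speedup from doubling the core count; 2.0 would be perfect scaling.
speedups = {task: t12 / t24 for task, (t12, t24) in kunpeng.items()}

for task, s in speedups.items():
    print(f"{task}: {s:.2f}x speedup, {s / 2:.0%} of perfect scaling")
```

Only matrix multiplication gets close to 2x. The other four items stay below 1.2x.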

alphatoad

2021-04-23 19:44:06 +08:00

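The i9 is still very strong. Is it using AVX?
That said, considering the M1 is only a low-power first attempt, I'm very optimistic about the follow-up product line.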

YRInc

2021-04-23 19:52:34 +08:00 via iPhone

@secondwtq Right, probably not every operation can be parallelized. It also depends on how the underlying acceleration library is implemented.

YRInc

2021-04-23 19:54:24 +08:00 via iPhone

@y
@alphatoad Yes, it's AVX, and Arm uses Neon. With the M1 drawing so little power and still performing this well, the future looks really promising.

jr55475f112iz2tu

2021-04-23 23:43:17 +08:00 via Android

Kunpeng is a server CPU.
The M1 is a consumer CPU.
I don't see what there is to compare.

YRInc

2021-04-23 23:53:48 +08:00 via iPhone

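@czfy Well, doesn't that just show that even at server grade, with this many cores and this much power draw, there is still a gap and quite a lot of room for improvement?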

dayeye2006199

2021-04-24 03:55:07 +08:00

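Has there been any technical progress in numerical computing on Arm? Can the lower-level libraries even out the performance differences that come from the instruction set? Would appreciate an explainer.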

YRInc

2021-04-24 15:15:47 +08:00 via iPhone

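@dayeye2006199 All I know is that the next-generation Armv9 updates the SIMD instruction set with SVE2. Numerical computing capability should keep getting stronger from here.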

dabaibai

2021-04-24 16:21:57 +08:00

Was the Huawei Cloud machine a dedicated server? If it was a cloud VM, the numbers have no reference value at all.

datou

2021-04-25 13:40:06 +08:00

Dotted two 4096x4096 matrices in 0.67 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.96 s.
Cholesky decomposition of a 2048x2048 matrix in 0.28 s.
Eigendecomposition of a 2048x2048 matrix in 6.69 s.

3500X win10

neosfung

2021-04-25 16:28:50 +08:00

Tested on a 2019 16-inch MacBook and on a server, for reference.

Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz * 2
Dotted two 4096x4096 matrices in 1.45 s.
Dotted two vectors of length 524288 in 0.17 ms.
SVD of a 2048x1024 matrix in 0.90 s.
Cholesky decomposition of a 2048x2048 matrix in 0.20 s.
Eigendecomposition of a 2048x2048 matrix in 8.83 s.

Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Dotted two 4096x4096 matrices in 0.73 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.52 s.
Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
Eigendecomposition of a 2048x2048 matrix in 4.89 s.
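
For anyone collecting the numbers posted in this thread, the script's output lines are regular enough to parse. A minimal sketch: the regular expression is inferred from the output lines quoted in the replies, not taken from the script itself, and the function name `parse_results` is made up for illustration.

```python
import re

# Matches lines such as "SVD of a 2048x1024 matrix in 0.52 s."
LINE = re.compile(r"^(?P<what>.+?) in (?P<value>\d+(?:\.\d+)?) (?P<unit>ms|s)\.$")


def parse_results(text):
    """Return {description: seconds} for every benchmark line found in text."""
    parsed = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            scale = 1e-3 if m["unit"] == "ms" else 1.0  # normalize ms to seconds
            parsed[m["what"]] = float(m["value"]) * scale
    return parsed


# The i7-9750H result quoted above, used as sample input.
sample = """Dotted two 4096x4096 matrices in 0.73 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.52 s.
Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
Eigendecomposition of a 2048x2048 matrix in 4.89 s."""

for what, seconds in parse_results(sample).items():
    print(f"{what}: {seconds:g} s")
```

Lines that don't match the pattern, such as the CPU model names, are skipped.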