Native numpy and scipy are now available on the M1

YUX · 2020-12-09T07:37:56Z

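First download the matching Miniforge3 installer from https://github.com/conda-forge/miniforge and pick the OS X arm64 (Apple Silicon) build. Once it is installed you have conda, and numpy, scipy and the like installed through conda are all native builds. The performance gain is large, whether compared with Rosetta 2 or with an Intel i9.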
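For anyone who wants to confirm that the interpreter from the new conda install really is a native build, a quick check could look like the sketch below. It only uses the standard library; the helper name `describe_interpreter` is made up for this example. On Apple Silicon a native Python should report `arm64`, while an Intel build running through Rosetta 2 should report `x86_64`.

```python
import platform


def describe_interpreter():
    """Return (machine architecture, Python version) of the running interpreter."""
    return platform.machine(), platform.python_version()


machine, version = describe_interpreter()
# Expectation (not verified here): "arm64" for a native Apple Silicon build,
# "x86_64" for an Intel build translated by Rosetta 2.
print(f"Python {version} on {machine}")
```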

NumPy

scipy

conda

native

42 replies • 2021-04-23 04:02:49 +08:00

1

pb941129

Dec 9, 2020 via iPhone

I'd like to know how much faster it is than the MKL build of numpy on an Intel i9...

2

NoobX

Dec 9, 2020 via iPhone

Still, it tops out at 16 GB...

3

Goldilocks

Dec 9, 2020 via Android

Looking forward to a benchmark; my guess is it gets crushed by AVX-512.

4

felixcode

PRO

Dec 9, 2020 via Android

My VRAM is bigger than your RAM.

5

YUX

OP

PRO

Dec 9, 2020

@pb941129
@NoobX
@Goldilocks
@felixcode

Found a numpy benchmark script and ran it: https://gist.github.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276

```
Dotted two 4096x4096 matrices in 0.53 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 0.59 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 4.74 s.

This was obtained using the following Numpy configuration:
blas_info:
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
language = c
lapack_info:
libraries = ['lapack', 'blas', 'lapack', 'blas']
library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
language = f77
lapack_opt_info:
libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
language = c
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
```

P.S. Python version 3.9.1 (arm64). All background apps were closed during the run.
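For readers who do not want to open the gist, here is a minimal sketch of the same kind of measurement. It is not the linked script: the function names `timed` and `run_benchmark`, the random inputs, and the way the symmetric positive definite matrix is built are all choices made for this example. With `n=4096` the shapes match the ones in the output above; the call at the bottom uses a tiny `n` so it finishes instantly.

```python
import time

import numpy as np


def timed(fn, *args):
    """Run fn(*args) once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def run_benchmark(n=4096, seed=0):
    """Time five linear algebra operations; n=4096 gives the shapes quoted above."""
    rng = np.random.default_rng(seed)
    a = rng.random((n, n))
    b = rng.random((n, n))
    v = rng.random(n * 128)  # 524288 elements when n = 4096
    w = rng.random(n * 128)
    half = n // 2
    tall = rng.random((half, half // 2))  # 2048x1024 when n = 4096
    square = rng.random((half, half))
    # Gram matrix plus a scaled identity is symmetric positive definite,
    # which is what a Cholesky decomposition requires.
    spd = square @ square.T + half * np.eye(half)

    return {
        "matmul": timed(np.dot, a, b),
        "vecdot": timed(np.dot, v, w),
        "svd": timed(np.linalg.svd, tall),
        "cholesky": timed(np.linalg.cholesky, spd),
        "eig": timed(np.linalg.eig, square),
    }


# Smoke test with small matrices; call run_benchmark() for the full sizes.
results = run_benchmark(n=64)
for name, seconds in results.items():
    print(f"{name}: {seconds:.4f} s")
```

Each operation is timed only once here, so the numbers will move around between runs, especially for the very short vector dot product.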

6

pb941129

Dec 9, 2020

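@YUX Thx. These are the results from my 16-inch MBP with the i9. Background apps were not closed. Environment: anaconda 3.8. It still looks a bit faster than the M1. (Otherwise Intel really would cry.)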

```
Dotted two 4096x4096 matrices in 0.45 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.32 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 3.53 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Users/xxx/anaconda/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/xxx/anaconda/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Users/xxx/anaconda/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/xxx/anaconda/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Users/xxx/anaconda/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/xxx/anaconda/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Users/xxx/anaconda/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/xxx/anaconda/include']

```

7

changepc90

Dec 9, 2020

M1: Dotted two vectors of length 524288 in 0.25 ms.
MBP16: Dotted two vectors of length 524288 in 0.05 ms.
That is a big gap on this one item.

8

YUX

OP

PRO

Dec 9, 2020

@pb941129 Nice, the i9 is still stronger 😂 Were all 8 cores / 16 threads maxed out during the run?

9

YUX

OP

PRO

Dec 9, 2020

@changepc90 That is probably down to the difference in instruction sets.

10

Aspector

Dec 9, 2020

i7-8550U in a T480s; the library is mkl_rt

Dotted two 4096x4096 matrices in 1.07 s.
Dotted two vectors of length 524288 in 0.13 ms.
SVD of a 2048x1024 matrix in 0.53 s.
Cholesky decomposition of a 2048x2048 matrix in 0.15 s.
Eigendecomposition of a 2048x2048 matrix in 5.07 s.

HWMonitor shows the 8550U's real-time power draw at roughly 40-45 W; the M1 is probably only about 20 W (sad)

11

YUX

OP

PRO

Dec 9, 2020

Sharing the results from a friend's 16-inch with the 2.6 GHz 6-core Intel Core i7

Dotted two 4096x4096 matrices in 0.49 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.32 s.
Cholesky decomposition of a 2048x2048 matrix in 0.07 s.
Eigendecomposition of a 2048x2048 matrix in 3.16 s.

12

YUX

OP

PRO

Dec 9, 2020

@Aspector The M1 in the Air is limited to 10 W 😂

13

pb941129

Dec 9, 2020 via iPhone

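@YUX I did not check the task monitor, but from what I know of numpy's habits, surely not. Once lightgbm has been ported we could run the CPU version together (a parameter search for a small project once kept an entire 8700K fully loaded for three hours).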

14

rock_cloud

Dec 9, 2020

2017 iMac, 3.4 GHz Intel i5
Dotted two 4096x4096 matrices in 1.04 s.
Dotted two vectors of length 524288 in 0.17 ms.
SVD of a 2048x1024 matrix in 0.58 s.
Cholesky decomposition of a 2048x2048 matrix in 0.12 s.
Eigendecomposition of a 2048x2048 matrix in 5.37 s.
No background apps were closed.

15

YUX

OP

PRO

Dec 9, 2020

@pb941129 Three hours of stress testing? Can I run it inside the fridge 😂 There is no fan, and I am afraid it would get scorched.

16

sxd96

Dec 9, 2020

2018 13-inch MBP, i5-8259U

Dotted two 4096x4096 matrices in 0.80 s.
Dotted two vectors of length 524288 in 0.11 ms.
SVD of a 2048x1024 matrix in 0.35 s.
Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
Eigendecomposition of a 2048x2048 matrix in 3.39 s.

17

sxd96

Dec 9, 2020

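@sxd96 That makes me feel a little better. Again, background apps were left running, with the MKL library. I did notice that the MBP gives off a slight coil whine when the cores are fully loaded. ARM may be a little behind here for now, but in terms of performance per watt it may not be behind at all. I think performance per watt is what matters most for mobile devices.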

18

Gandum

Dec 9, 2020 via iPhone

It is still an early version. It is winter now, though, so there is no rush and the fans are not too loud. I will buy next summer.

19

FurN1

Dec 9, 2020 via iPhone

Haha, I posted about this five months ago: /t/688402

20

rock_cloud

Dec 9, 2020

Intel Xeon Silver 4114, 2.2 GHz
Dotted two 4096x4096 matrices in 0.60 s.
Dotted two vectors of length 524288 in 0.04 ms.
SVD of a 2048x1024 matrix in 0.66 s.
Cholesky decomposition of a 2048x2048 matrix in 0.26 s.
Eigendecomposition of a 2048x2048 matrix in 6.67 s.

21

YUX

OP

PRO

Dec 9, 2020

@IgniteWhite You were way ahead of the curve 😂 It really is a good thing.

22

Tilie

Dec 9, 2020

8th-gen i7 Mac mini
Dotted two 4096x4096 matrices in 0.76 s.
Dotted two vectors of length 524288 in 0.09 ms.
SVD of a 2048x1024 matrix in 0.56 s.
Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
Eigendecomposition of a 2048x2048 matrix in 5.20 s.

23

YUX

OP

PRO

Dec 9, 2020

Google Colab - 2 Intel(R) Xeon(R) CPU @ 2.20GHz

Dotted two 4096x4096 matrices in 4.16 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 1.49 s.
Cholesky decomposition of a 2048x2048 matrix in 0.23 s.
Eigendecomposition of a 2048x2048 matrix in 13.11 s.

24

zr86

Dec 9, 2020

M1 Mac mini

Dotted two 4096x4096 matrices in 0.69 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 0.68 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 4.82 s.

25

kalimpong

Dec 9, 2020

M1 MacBook Pro

Dotted two 4096x4096 matrices in 0.68 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 0.71 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 5.03 s.

Power draw was measured with powermetrics at the same time: about 26 W for the first two items and about 16 W for the last three.

26

lovestudykid

Dec 10, 2020

This test does not separate machines very much.
MF839: only about half as fast as the OP's M1
Dotted two 4096x4096 matrices in 2.33 s.
Dotted two vectors of length 524288 in 0.54 ms.
SVD of a 2048x1024 matrix in 1.05 s.
Cholesky decomposition of a 2048x2048 matrix in 0.20 s.
Eigendecomposition of a 2048x2048 matrix in 8.38 s.

Intel(R) Xeon(R) Gold 6134
Dotted two 4096x4096 matrices in 0.32 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.89 s.
Cholesky decomposition of a 2048x2048 matrix in 0.15 s.
Eigendecomposition of a 2048x2048 matrix in 8.19 s.
The numpy that Anaconda installs by default is not using MKL here, and AVX-512 is not enabled, so this CPU is going to waste.

27

pubby

Dec 10, 2020

3700X Hackintosh

Dotted two 4096x4096 matrices in 0.46 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 7.37 s.
Cholesky decomposition of a 2048x2048 matrix in 0.82 s.
Eigendecomposition of a 2048x2048 matrix in 49.05 s.

This was obtained using the following Numpy configuration:
atlas_threads_info:
NOT AVAILABLE
blas_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
extra_compile_args = ['-msse3', '-I/AppleInternal/BuildRoot/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.Internal.sdk/System/Library/Frameworks/vecLib.framework/Headers']
define_macros = [('NO_ATLAS_INFO', 3)]
atlas_blas_threads_info:
NOT AVAILABLE
openblas_info:
NOT AVAILABLE
lapack_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
extra_compile_args = ['-msse3']
define_macros = [('NO_ATLAS_INFO', 3)]
atlas_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
atlas_blas_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE

I am probably not using it right....

28

bnuliujing

Dec 10, 2020

Results from an i7-6950X

Dotted two 4096x4096 matrices in 0.35 s.
Dotted two vectors of length 524288 in 0.03 ms.
SVD of a 2048x1024 matrix in 0.27 s.
Cholesky decomposition of a 2048x2048 matrix in 0.10 s.
Eigendecomposition of a 2048x2048 matrix in 3.39 s.

29

NoobX

Dec 10, 2020

Results from the i5 Mac mini

Dotted two 4096x4096 matrices in 0.58 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.32 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 3.30 s.

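The M1 results are not all that impressive either...
Still, 16 GB of memory remains a big problem. The system usually takes 4 GB for itself, which leaves only 12 GB of the 16 GB for the dataset. Honestly, that is not enough for me.
A slower processor is not a big deal, but once swap fills up, the speed is a real nightmare.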

30

MisakaTian

Dec 10, 2020

As a data grunt, I will buy one as soon as Anaconda is sorted out.

31

Goldilocks

Dec 10, 2020

Processor Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz, 3600 Mhz, 4 Core

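Dotted two 4096x4096 matrices in 0.33 s, twice as fast as the M1. But the M1 has 8 cores. So at the same frequency and the same core count, Intel is still roughly 3-4 times faster than the M1, and this is a three-year-old product.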

32

YUX

OP

PRO

Dec 10, 2020 via iPhone

@MisakaTian Use mamba.

33

Goldilocks

Dec 10, 2020

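It is 2020. If Intel released a 2-core 3.6 GHz CPU, you would certainly look down on its performance. What you should be comparing against is Intel's 10-core and 20-core parts. AMD is about to release a 64-core desktop CPU, and Apple is still at the 2-core level.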

34

meloyang05

Dec 10, 2020

@Goldilocks

“8th-gen i7 Mac mini
Dotted two 4096x4096 matrices in 0.76 s.
Dotted two vectors of length 524288 in 0.09 ms.
SVD of a 2048x1024 matrix in 0.56 s.
Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
Eigendecomposition of a 2048x2048 matrix in 5.20 s.

M1 Mac mini

Dotted two 4096x4096 matrices in 0.69 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 0.68 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 4.82 s.”

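Are you selectively ignoring the other results? At the millisecond level the error can be large to begin with, and numpy for the M1 may simply have a bug right now. What does singling out the vector result prove?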
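One way to see how noisy a sub-millisecond measurement is on a given machine is to repeat it and look at the spread rather than a single number. The sketch below does that for the vector dot product; the function name `dot_timing_ms` and the choice of 20 repeats are assumptions made for this example, not part of the benchmark script used in the thread.

```python
import statistics
import timeit

import numpy as np


def dot_timing_ms(length=524288, repeats=20, seed=0):
    """Time np.dot on two vectors `repeats` times; return per-run milliseconds."""
    rng = np.random.default_rng(seed)
    a = rng.random(length)
    b = rng.random(length)
    np.dot(a, b)  # warm-up run so one-off start-up costs are not measured
    runs = timeit.repeat(lambda: np.dot(a, b), number=1, repeat=repeats)
    return [seconds * 1e3 for seconds in runs]


samples = dot_timing_ms()
print(f"min    {min(samples):.3f} ms")
print(f"median {statistics.median(samples):.3f} ms")
print(f"max    {max(samples):.3f} ms")
```

If the minimum and maximum differ by a large factor, a single 0.05 ms versus 0.25 ms reading says little on its own; if they are tight, the gap is more likely real.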

35

Goldilocks

Dec 10, 2020

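The error will not be large; it is usually within 1%. That is because matrix multiplication is bound by only two things:

1. CPU FLOPS
2. Memory bandwidth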

36

Goldilocks

Dec 10, 2020

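Numerical computation such as matrix multiplication is a mature field that has been studied thoroughly. See: https://en.wikichip.org/wiki/flops

Assuming memory bandwidth can keep up with the CPU, the only ways to go faster are:
1. More cores
2. Wider SIMD

For example, Skylake can do 64 FLOPs/cycle, while AMD CPUs of the same era manage only 16 FLOPs/cycle. Clock speeds are all about the same, so that factor of 4 accounts for most of the gap. And this kind of gap is very hard to close; you could say there is no hope of closing it in a lifetime.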
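The arithmetic behind this argument can be sketched in a few lines. An n x n by n x n matrix product takes roughly 2·n³ floating-point operations, so a timing converts directly into achieved GFLOP/s, and the usual back-of-the-envelope peak is cores × clock × FLOPs per cycle. The timings below are the 4096x4096 results posted in this thread; the 8 cores at 3.0 GHz in the peak comparison are hypothetical values chosen only to show the factor of 4 from the per-cycle figures quoted above.

```python
def matmul_gflops(n, seconds):
    """Achieved GFLOP/s for an n x n by n x n product (about 2 * n**3 operations)."""
    return 2 * n**3 / seconds / 1e9


def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak GFLOP/s: cores x clock in GHz x FLOPs per cycle per core."""
    return cores * ghz * flops_per_cycle


# 4096x4096 timings reported in this thread.
for label, seconds in [("M1 (OP)", 0.53), ("16-inch MBP i9", 0.45), ("Xeon W-2123", 0.33)]:
    print(f"{label}: {matmul_gflops(4096, seconds):.0f} GFLOP/s")

# Hypothetical chips: same core count and clock, different FLOPs/cycle.
wide = peak_gflops(cores=8, ghz=3.0, flops_per_cycle=64)
narrow = peak_gflops(cores=8, ghz=3.0, flops_per_cycle=16)
print(f"peak ratio: {wide / narrow:.1f}x")
```

Achieved throughput also depends on the BLAS library in use and on sustained clocks under load, so the peak figure is an upper bound rather than a prediction.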

37

Harry1993

Dec 10, 2020

Tried it with Apple's numpy (https://github.com/apple/tensorflow_macos):

Dotted two 4096x4096 matrices in 0.84 s.
Dotted two vectors of length 524288 in 0.11 ms.
SVD of a 2048x1024 matrix in 0.54 s.
Cholesky decomposition of a 2048x2048 matrix in 0.06 s.
Eigendecomposition of a 2048x2048 matrix in 6.29 s.

38

FurN1

Dec 10, 2020

@MisakaTian Isn't Miniforge's package manager just conda? The only difference is that the default channel is conda-forge.

39

lly0514

Dec 11, 2020

@Goldilocks In practice the error is very large. In my own tests, the performance gap between MKL and OpenBLAS was more than a factor of two.

40

Richardyyz

Dec 13, 2020

@Goldilocks Zen 2 is already at 32 FLOPs/cycle. Is your lifetime that short? AVX-512, with its severe downclocking, has no great advantage over Zen 3.

41

YUX

OP

PRO

Jan 24, 2021

Adding one from a Raspberry Pi 😂

Dotted two 4096x4096 matrices in 10.18 s.
Dotted two vectors of length 524288 in 2.27 ms.
SVD of a 2048x1024 matrix in 6.67 s.
Cholesky decomposition of a 2048x2048 matrix in 0.85 s.
Eigendecomposition of a 2048x2048 matrix in 37.83 s.

This was obtained using the following Numpy configuration:
blas_info:
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/mambaforge/envs/maths/lib']
include_dirs = ['/root/mambaforge/envs/maths/include']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/mambaforge/envs/maths/lib']
include_dirs = ['/root/mambaforge/envs/maths/include']
language = c
lapack_info:
libraries = ['lapack', 'blas', 'lapack', 'blas']
library_dirs = ['/root/mambaforge/envs/maths/lib']
language = f77
lapack_opt_info:
libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/mambaforge/envs/maths/lib']
language = c
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
include_dirs = ['/root/mambaforge/envs/maths/include']

42

YRInc

Apr 23, 2021

Here is a Chinese-made one for reference: Kunpeng 920

12-core Kunpeng 920, 24 GB RAM:
-------------------
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 15:45:16)

Dotted two 4096x4096 matrices in 1.48 s.
Dotted two vectors of length 524288 in 0.49 ms.
SVD of a 2048x1024 matrix in 1.10 s.
Cholesky decomposition of a 2048x2048 matrix in 0.14 s.
Eigendecomposition of a 2048x2048 matrix in 8.36 s.
-------------------

24-core Kunpeng 920, 48 GB RAM:
-------------------
Dotted two 4096x4096 matrices in 0.76 s.
Dotted two vectors of length 524288 in 0.48 ms.
SVD of a 2048x1024 matrix in 0.93 s.
Cholesky decomposition of a 2048x2048 matrix in 0.13 s.
Eigendecomposition of a 2048x2048 matrix in 7.66 s.

Same environment as on the M1 Mac, Miniforge3. The relevant acceleration libraries are as follows:
blas_info:
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/miniforge3/lib']
include_dirs = ['/root/miniforge3/include']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/miniforge3/lib']
include_dirs = ['/root/miniforge3/include']
language = c
lapack_info:
libraries = ['lapack', 'blas', 'lapack', 'blas']
library_dirs = ['/root/miniforge3/lib']
language = f77
lapack_opt_info:
libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/miniforge3/lib']
language = c
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
include_dirs = ['/root/miniforge3/include']