/(\w*)*$/.test('aaa#')这个正则导致我们的页面炸了……不同语言居然不一样

V2EX = way to explore

V2EX 是一个关于分享和探索的地方

现在注册

已注册用户请登录

› 正则表达式可视化调试 regex101.com

这是一个创建于 244 天前的主题，其中的信息可能已经有所发展或是发生改变。

-2 本文用到的相关工具的版本

Chrome=122.0.6261.69
Nodejs=v10.19.0
PHP 7.4.3-4ubuntu2.20 (cli) (built: Feb 21 2024 13:54:34) ( NTS )
Python 3.8.10
go version go1.13.8 linux/amd64

(不要吐槽和讨论版本，除非你确定这玩意在新版本上没问题，生产环境随便找台机器测的）

-1 这玩意哪来的？

这玩意是我们前端同学问 GPT ，如何写一个匹配网址的正则问到的。

(/^( https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/).test('https://foo.com/a-long-url-contains-path-and-query-and-anchor/foo/bar/baz?boo=baa#anchor');

*** (这个真的可以执行，建议新窗口 F12 试下） ***

于是，真的匹配一段文本的时候，就导致浏览器卡死了，无法做后续渲染，在 profiling 的时候查到是这个正则的问题。

0 MVP 测一下

function testRegexPerformance(repeatCount) {
    var testString = 'a'.repeat(repeatCount) + '#';
    var regex = /(\w*)*$/;

    var startTime = process.hrtime();
    var result = regex.test(testString);
    var endTime = process.hrtime(startTime);
    var executionTime = endTime[0] * 1000 + endTime[1] / 1000000;

    console.log("Repeat Count:", repeatCount);
    console.log("Execution Time:", executionTime + " milliseconds");
    console.log("-----------------------------------" + result);
}

// 测试从 1 到 50 的重复次数
for (var i = 1; i <= 50; i++) {
    testRegexPerformance(i);
}
Repeat Count: 20
Execution Time: 35.191223 milliseconds
-----------------------------------true
Repeat Count: 21
Execution Time: 71.355698 milliseconds
-----------------------------------true
Repeat Count: 22
Execution Time: 140.852157 milliseconds
-----------------------------------true
Repeat Count: 23
Execution Time: 287.687666 milliseconds
-----------------------------------true
Repeat Count: 24
Execution Time: 577.368917 milliseconds
-----------------------------------true
Repeat Count: 25
Execution Time: 1148.243059 milliseconds
-----------------------------------true
Repeat Count: 26
Execution Time: 2297.804939 milliseconds

在i=25的时候，执行时间就到了秒级，之后都是指数级增长。

1 结果是 true 是符合预期的

*表示 0 个或者多个，没有任何一个\w 也是没问题的

2 Regexp.test vs String.match

# 不匹配
> 'a'.match(/(b)/)
null

# 匹配
> 'a'.match(/(b)/)
null

# 匹配
> 'aa'.match(/(a)/)
[ 'a', 'a', index: 0, input: 'aa', groups: undefined ]

# 不那么匹配
> 'aaa#'.match(/(\w*)*$/)
[ '', undefined, index: 4, input: 'aaa#', groups: undefined ]

# 匹配？
> /(\w*)*$/.test('aaa#')
true
>

起因是我旁边的同学说.net 没有 test ，只有 match ，而且结果是 false

所以，js 里面如果用 match 试下，大概有三种结果：

匹配：test=true
不匹配：test=false
不太匹配：test=true ，但是 match[0]是空，1 是undefined

3 其他语言的表现？

js：匹配，衰减
PHP: 不匹配，不衰减
Python：None(不匹配），衰减
Go: 匹配，不衰减
F#: 匹配，衰减

4 所以是为啥？

二楼放测试程序，不占地儿了

10 条回复 • 2024-03-01 22:31:12 +08:00

phpfpm

244 天前

```
package main

import (
"fmt"
"regexp"
"strings"
"time"
)

func testRegexPerformance(repeatCount int) {
testString := strings.Repeat("a", repeatCount) + "#"
regex := regexp.MustCompile(`(\w*)*$`)

startTime := time.Now()
result := regex.MatchString(testString)
endTime := time.Now()
executionTime := endTime.Sub(startTime).Milliseconds()

fmt.Println("Repeat Count:", repeatCount)
fmt.Println("Execution Time:", executionTime, "milliseconds")
fmt.Println("-----------------------------------")
fmt.Println(result)
}

func main() {
// 测试从 1 到 50 的重复次数
for i := 1; i <= 50; i++ {
testRegexPerformance(i)
}
}
```

```
function testRegexPerformance(repeatCount) {
var testString = 'a'.repeat(repeatCount) + '#';
var regex = /(\w*)*$/;

var startTime = process.hrtime();
var result = regex.test(testString);
var endTime = process.hrtime(startTime);
var executionTime = endTime[0] * 1000 + endTime[1] / 1000000;

console.log("Repeat Count:", repeatCount);
console.log("Execution Time:", executionTime + " milliseconds");
console.log("-----------------------------------" + result);
}

// 测试从 1 到 50 的重复次数
for (var i = 1; i <= 50; i++) {
testRegexPerformance(i);
}
```
<?php

function testRegexPerformance($repeatCount) {
$testString = str_repeat('a', $repeatCount) . '#';
$regex = '/(\w*)*$/';

$startTime = microtime(true);
$result = preg_match($regex, $testString);
$endTime = microtime(true);
$executionTime = ($endTime - $startTime) * 1000;

echo "Repeat Count: $repeatCount\n";
echo "Execution Time: $executionTime milliseconds\n";
echo "-----------------------------------\n";
var_dump($result);
}

// 测试从 1 到 50 的重复次数
for ($i = 1; $i <= 50; $i++) {
testRegexPerformance($i);
}
```

```
import re
import time

def test_regex_performance(repeat_count):
test_string = 'a' * repeat_count + '#'
regex = r'(\w*)*$'

start_time = time.time()
result = re.match(regex, test_string)
end_time = time.time()
execution_time = (end_time - start_time) * 1000

print("Repeat Count:", repeat_count)
print("Execution Time:", execution_time, "milliseconds")
print("-----------------------------------")
print(result)

# 测试从 1 到 50 的重复次数
for i in range(1, 51):
test_regex_performance(i)
```

yumusb

244 天前

确实会卡死，火狐直接 InternalError: too much recursion 。留个言等大佬解释。

URL 的正则可以用这个 https://github.com/gchq/CyberChef/blob/master/src/core/operations/RegularExpression.mjs#L52

phpfpm

244 天前

@yumusb 主要是这个现象无法描述，搜索引擎对于*支持的很差

我试了搜索 star after capture group 也没找到合适的结果

但是至少 ff 给你说 too much 了 hhh

sunfall

244 天前

你搜一下 `regex ReDoS`

ghjh

244 天前 via Android

问题出在后面的两个星号，如楼上所说可以查一下 ReDos
然后我主要是想推荐一下 regex101.com
测试正则很方便

phpfpm

244 天前

@sunfall
@ghjh
感谢，get ，学习了。
例如，正则表达式模式或量词^(a+)+$由以下 NFA 表示

收。

GeruzoniAnsasu

244 天前

在某些正则引擎上存在利用回溯进行 dos 攻击的手段，js 首当其冲（另一个我知道的是 PHP/PCRE ）：

https://www.geeksforgeeks.org/understanding-redos-attack/
https://github.com/owasp-modsecurity/ModSecurity/issues/2072

rrfeng

244 天前 via Android

正则引擎本来就没有严格一致的实现，除非你不同语言引用同一个实现比如 pcre 。
复杂正则的性能一直是个大坑，只能说尽量少用。

ZE3kr

244 天前 via iPhone

楼上说的对。因为大多数编程语言的 Regex 实现时间复杂度可以是指数级的，好处是常熟项小。具体是因为传统 Regex 是用 NFA 实现的

https://swtch.com/~rsc/regexp/regexp1.html Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...)

/(\w*)*$/ 一看就是有问题的，因为它括号里面有一个*，括号外面也有个*，你细想一下排列组合，其实每一个字符串都可能有指数个匹配方式

如果不写出这种有问题的 Regex 那目前 NFA Regex 没啥问题，>90%的 Regex 也不会遇到这种问题

phpfpm

244 天前

@ZE3kr 但是为啥这个例子 php 和 go 没踩坑呢，js ，.net ，python 都挂了
@rrfeng php 石宪的应该是 pcre