A small gotcha: the Python 2.7 str method isalpha does not support Unicode

This topic was created 3358 days ago; the information in it may have evolved or changed since then.

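While writing a search component today, I wanted to choose the field to search based on whether the query consists entirely of letters.
So I wrote the following code: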

if q.isalpha():
    query = query.filter(User.username.ilike(like_str))
else:
    query = query.filter(User.realname.ilike(like_str))

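But I found that isalpha still returns True even when the query contains Chinese characters.
Testing showed that isalpha is unreliable when applied to Unicode strings.
Flask decodes request parameters as UTF-8 by default, so the value has to be re-encoded with encode('utf-8') before isalpha() behaves the way I need.
The tests: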

In [15]: u"张x".isalpha()
Out[15]: True

In [16]: "张x".isalpha()
Out[16]: False

In [17]: "aac".isalpha()
Out[17]: True

In [18]: u"张x".encode('utf-8').isalpha()
Out[18]: False
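
If the goal is specifically "only the 26 English letters", one option is to test against an explicit ASCII set instead of relying on isalpha(). A minimal sketch, assuming Python 3 (where str is Unicode text); `is_ascii_letters` is a hypothetical helper name, not part of Flask or the standard library:

```python
import string

ASCII_LETTERS = frozenset(string.ascii_letters)  # a-z and A-Z only


def is_ascii_letters(text):
    """Return True if text is non-empty and contains only a-z / A-Z.

    Hypothetical helper for illustration. Unlike str.isalpha(), it does
    not count CJK ideographs or accented letters as letters.
    """
    return bool(text) and all(ch in ASCII_LETTERS for ch in text)


print(is_ascii_letters("aac"))
print(is_ascii_letters("张x"))
print(is_ascii_letters(""))
```

With a helper like this, the `if q.isalpha():` branch above could call `is_ascii_letters(q)` without any encode step.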

12 replies • 2015-10-19 17:38:48 +08:00

mulog

2015-10-18 20:42:26 +08:00

isalpha never claims that its job is to check whether a string consists entirely of **English** characters.

banxi1988

2015-10-18 20:52:42 +08:00

@mulog

That is exactly what the documentation says.

> str.isalpha()
Return true if all characters in the string are alphabetic and there is at least one character, false otherwise.

Also, take a look at the output below.

```python
In [23]: "a1".isalpha()
Out[23]: False

In [24]: "ab".isalpha()
Out[24]: True

In [25]: "a?".isalpha()
Out[25]: False
```

toooddchen

2015-10-18 23:18:40 +08:00

Strange question. Why use a str method to check unicode?

Use unicode.isalpha()

toooddchen

2015-10-18 23:22:28 +08:00

@toooddchen Sorry, I only read the title. Everything I said above is wrong.

hahastudio

2015-10-19 10:16:33 +08:00

I'm wondering whether we are really using the same Python 2.7.
Python 2.7.10
In [10]: u'张 x'.isalpha()
Out[10]: False

Also:
https://docs.python.org/2/library/stdtypes.html#str.isalpha
For 8-bit strings, this method is locale-dependent.

Clarencep

2015-10-19 10:25:53 +08:00

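Someone asked this on segmentfault before: http://segmentfault.com/q/1010000000732038/a-1020000000732447
> For a unicode string, string.isalpha decides whether the string is made up entirely of letters by checking whether each character belongs to the LETTER category in Unicode. So a result of True does not necessarily mean the string contains only the 26 English letters.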

banxi1988

2015-10-19 11:48:48 +08:00

@hahastudio
I'm using: `Python 2.7.10 (default, Aug 22 2015, 20:33:39)`

But as @Clarencep pointed out, the issue does exist.

I also tried it in the shell on the official site, https://www.python.org/shell/,
and in Python 3.4 the isalpha() check is still unreliable.

```ipython
In [1]: "\u5f20".isalpha()
Out[1]: True
In [2]: "\u5f20".encode('utf-8').isalpha()
Out[2]: False
```

aro167

2015-10-19 12:29:36 +08:00

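Using translate is reliable (Python 2):
import string
notrans = string.maketrans('', '')  # identity table; translate() is used only to delete characters
def containsAll(astr, strset):
    # True if every character (byte) of strset also occurs in astr
    return not strset.translate(notrans, astr)
containsAll(string.letters, '我是 aro167')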
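
`string.maketrans('', '')` and `string.letters` exist only in Python 2. For readers on Python 3, a rough equivalent of the same delete-and-check idea might look like the sketch below; `only_ascii_letters` is a hypothetical name, and the three-argument form of `str.maketrans` is used to build a table that deletes characters:

```python
import string

# Translation table that deletes every ASCII letter (a-z, A-Z).
DELETE_ASCII_LETTERS = str.maketrans("", "", string.ascii_letters)


def only_ascii_letters(text):
    """True if nothing remains after deleting a-z / A-Z from text."""
    return not text.translate(DELETE_ASCII_LETTERS)


print(only_ascii_letters("aro"))
print(only_ascii_letters("我是 aro167"))
```

Note that, unlike isalpha(), this sketch returns True for an empty string, so a separate emptiness check may be needed.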

Clarencep

2015-10-19 13:27:10 +08:00

@banxi1988
The range of letters that unicode's isalpha recognizes is not limited to [a-zA-Z]. For example:

>>> u'测试'.isalpha()
True

However, fullwidth digits and punctuation are not treated as letters:

>>> u'０１２３４５６７８９'.isalpha()
False
>>> u'，。；‘'.isalpha()
False

So this is probably not a Python bug.

staticor

2015-10-19 13:46:15 +08:00

str.isalpha()

Return true if all characters in the string are alphabetic and there is at least one character, false otherwise. Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the “Alphabetic” property defined in the Unicode Standard.

Same understanding as above: isalpha() != English letters
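
The general categories named in that passage can be inspected with the standard `unicodedata` module. The following is a sketch assuming Python 3; `isalpha_by_category` is a hypothetical re-statement of the documented rule for illustration, not CPython's actual implementation:

```python
import unicodedata

# The "Letter" general categories listed in the documentation quote.
LETTER_CATEGORIES = {"Lm", "Lt", "Lu", "Ll", "Lo"}


def categories(text):
    """Pair each character with its Unicode general category."""
    return [(ch, unicodedata.category(ch)) for ch in text]


def isalpha_by_category(text):
    """Non-empty and every character is in a Letter category."""
    return bool(text) and all(
        unicodedata.category(ch) in LETTER_CATEGORIES for ch in text
    )


print(categories("张x"))   # ideograph and Latin letter
print(categories("０，"))  # fullwidth digit and fullwidth comma
print(isalpha_by_category("张x"), isalpha_by_category("０１２"))
```

Chinese characters fall under “Lo” (Letter, other), which is why they count as alphabetic, while fullwidth digits and punctuation belong to number and punctuation categories.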

mulog

2015-10-19 14:12:57 +08:00

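For unicode, if your string is made up entirely of "letters", isalpha returns True. There is nothing unreliable about that.
Strictly speaking, of course, Chinese characters are not "letters", so "alphabetic" does not really apply to them, but that is a separate matter.
Once you encode, you get a str, and isalpha then inspects each byte of the encoding, which is meaningless.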
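
To make the "each byte" point concrete, here is a small sketch, assuming Python 3, where the result of encode() is a `bytes` object and `bytes.isalpha()` only recognizes ASCII letters:

```python
text = "张x"
encoded = text.encode("utf-8")

# One character can become several bytes: "张" is three bytes, "x" is one.
print(list(encoded))      # integer value of each byte
print(text.isalpha())     # checks Unicode characters
print(encoded.isalpha())  # checks individual bytes
print(b"aac".isalpha())   # pure ASCII letters
```

The byte-level check no longer knows where one character ends and the next begins; it only sees whether each byte value is an ASCII letter.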

banxi1988

2015-10-19 17:38:48 +08:00

@mulog
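It does make sense. Once a Chinese character is UTF-8 encoded,
its first byte is guaranteed to fall outside the basic ASCII character (or letter)
range. So for simply telling letters apart from Chinese characters, it is good enough.

Reference: http://www.unicode.org/charts/unihangridindex.html
Common Chinese characters: U+4E00 through U+9FCC
Extended Chinese characters: U+3400 through U+4DB5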

```ipython
In [24]: u"\u3400".encode('utf-8')
Out[24]: '\xe3\x90\x80'

In [22]: u"\u4300".encode('utf-8')
Out[22]: '\xe4\x8c\x80'

In [23]: 0xe4
Out[23]: 228
```
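
The claim about the first byte can be checked exhaustively for the two ranges listed above. A sketch assuming Python 3; `utf8_lead_byte` and `RANGES` are names introduced here for illustration:

```python
def utf8_lead_byte(code_point):
    """Return the first byte of the UTF-8 encoding of a code point."""
    return chr(code_point).encode("utf-8")[0]


# The two ranges quoted in the post (inclusive).
RANGES = [(0x3400, 0x4DB5), (0x4E00, 0x9FCC)]

lead_bytes = {
    utf8_lead_byte(cp)
    for low, high in RANGES
    for cp in range(low, high + 1)
}

print(sorted(hex(b) for b in lead_bytes))
print(all(b >= 0x80 for b in lead_bytes))  # none is in the ASCII range
```

Every lead byte in these ranges is 0x80 or above, so none can be an ASCII letter. This only shows that the encoded check separates ASCII letters from these ideographs; other non-ASCII letters such as "é" would also fail the byte-level test.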