Handling Chinese Data in MySQL latin1 Tables with Python 2

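When software processes Chinese data, it has to encode and decode that data repeatedly;
Because a latin1 table is so permissive about what it stores, a single latin1 table may hold data in several encodings at once, and developers writing software against it have to allow for that;
The show create table command does not reveal the actual encoding of the Chinese data in a table, and when the terminal encoding differs from the data encoding, queries return garbled text, so users have to keep experimenting to find the right encoding.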
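
To make the garbling concrete, here is a minimal sketch of what happens to UTF-8 bytes on their way through a latin1 connection. It uses no database, the variable names are purely illustrative, and it assumes the driver decodes the stored bytes with Python's latin1 codec, which is the behavior the rest of this article relies on.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

name = u'\u5c0f\u5f3a'  # the two characters 小强, written as escapes

# Bytes a client stores when it writes UTF-8 data into a latin1 column.
stored_bytes = name.encode('utf-8')

# What a latin1 connection hands back (assumption: plain latin1 decoding):
# every byte becomes one code point in U+0000..U+00FF, so the text looks garbled.
as_seen_by_driver = stored_bytes.decode('latin1')

# Encoding with latin1 reverses that mapping exactly, byte for byte.
recovered_bytes = as_seen_by_driver.encode('latin1')
recovered_name = recovered_bytes.decode('utf-8')

print(len(name), len(as_seen_by_driver))  # 2 6
print(recovered_name == name)             # True
```

The round trip is lossless because latin1 maps each of the 256 byte values to exactly one code point, which is why the original bytes can always be restored before decoding them with the real encoding.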

2. Recommendations

Chinese data stored in latin1 tables is usually GBK or UTF-8 encoded. If you cannot avoid working with such data, here are two ways to handle it in Python 2.
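
When you do not know which of the two encodings a value uses, one possible approach is to try UTF-8 first and fall back to GBK. The sketch below is only a heuristic, not a reliable detector: guess_decode is a hypothetical helper introduced here for illustration, and a short GBK string can happen to be valid UTF-8 as well, in which case the guess is wrong.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def guess_decode(raw_bytes, candidates=('utf-8', 'gbk')):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the candidate order is an assumption, chosen
    because GBK bytes are less likely to pass as UTF-8 than the reverse.
    """
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit')


name = u'\u5c0f\u7ea2'  # 小红
# Simulate the values a latin1 connection would return for each encoding.
utf8_cell = name.encode('utf-8').decode('latin1')
gbk_cell = name.encode('gbk').decode('latin1')

for cell in (utf8_cell, gbk_cell):
    text, encoding = guess_decode(cell.encode('latin1'))
    print(encoding, text == name)
```

Treat the result as a hint and confirm it against data whose content you already know before relying on it.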

2.1 Converting to a str in the data's encoding

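Key points:
- Declare the encoding of the Chinese data at the top of the Python script that works with the latin1 table (# -*- coding: utf-8 -*- or # -*- coding: gbk -*-), and save the file in that encoding;
- When the script connects to the DB, set the connection charset to latin1;
- When writing Chinese data to the DB, the Chinese string literals in the script are already byte strings in the encoding declared at the top of the script;
- When reading Chinese data, the driver returns it as unicode; encode that value with latin1 to restore the str in the original encoding (from here you can print it, write it to a file, and so on).
Drawbacks:
- You need to know the encoding of the Chinese data in the latin1 table in advance;
- All the Chinese fields you process in the latin1 table must share one encoding.
An example Python script: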

#!/usr/bin/python
# -*- coding: utf-8 -*-  # Declare the source encoding at the top of the file: gbk or utf-8
import MySQLdb

# For convenience, a few MySQLdb operations are wrapped in this simple DataBase class; you can also use MySQLdb directly
class DataBase(object):

    def __init__(self, host="127.0.0.1", user="root", passwd="123456", db="test", charset="latin1"):
        if not passwd:
            self.db = MySQLdb.connect(host=host, user=user,
                                      db=db, charset=charset)
        else:
            self.db = MySQLdb.connect(host=host, user=user, passwd=passwd,
                                      db=db, charset=charset)
        self.cursor = self.db.cursor()

    def execute(self, sql):
        try:
            self.cursor.execute(sql)
            self.db.commit()
            results = self.cursor.fetchall()
            return results
        except Exception as e:
            print(str(e))

    def executemany(self, sql, param_list):
        try:
            self.cursor.executemany(sql, param_list)
            self.db.commit()
        except Exception as e:
            print(str(e))

db = DataBase()
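# 1. Write data
student_name = '小刚'  # This literal is a UTF-8 encoded byte string (str)
sql = "insert into test1 values (1, '%s', 'male')" %(student_name)
db.execute(sql)
# 2. Read data
sql = "select name from test1"
results = db.execute(sql)
for result in results:
    print(type(result[0]))
    print([result[0]]) # Inspect this output: the value is unicode, but each code point is really one byte of the stored data, so printing it directly shows garbled text or raises an encoding error
    print(result[0].encode('latin1')) # The bytes are UTF-8 again, matching the script encoding, so they print correctly on a UTF-8 terminal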

2.2 Converting to unicode

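Key points:
- Declare the source encoding at the top of the Python script that works with the latin1 table (# -*- coding: utf-8 -*- or # -*- coding: gbk -*-, whichever you normally use);
- When connecting to the DB, set the connection charset to latin1;
- When writing Chinese data, make every Chinese string in the script a unicode object, such as u'小红', and encode it to the target encoding when building the SQL;
- When reading Chinese data, encode the unicode value returned by the DB with latin1 to restore the str in the original encoding, then decode that str into unicode (I prefer unicode because print converts a unicode object to a suitable output encoding automatically).
An example Python script: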

#!/usr/bin/python
# -*- coding: utf-8 -*-  # Declare the source encoding at the top of the file: gbk or utf-8
import MySQLdb

# For convenience, a few MySQLdb operations are wrapped in this simple DataBase class; you can also use MySQLdb directly
class DataBase(object):

    def __init__(self, host="127.0.0.1", user="root", passwd="123456", db="test", charset="latin1"):
        if not passwd:
            self.db = MySQLdb.connect(host=host, user=user,
                                      db=db, charset=charset)
        else:
            self.db = MySQLdb.connect(host=host, user=user, passwd=passwd,
                                      db=db, charset=charset)
        self.cursor = self.db.cursor()

    def execute(self, sql):
        try:
            self.cursor.execute(sql)
            self.db.commit()
            results = self.cursor.fetchall()
            return results
        except Exception as e:
            print(str(e))

    def executemany(self, sql, param_list):
        try:
            self.cursor.executemany(sql, param_list)
            self.db.commit()
        except Exception as e:
            print(str(e))
            
db = DataBase()
# 1. Write data
student_name = u'小红'  # This literal is a unicode object
sql = "insert into test1 values (2, '%s', 'female')" %(student_name.encode('gbk')) # Encode the unicode value to whichever encoding you need; gbk here
db.execute(sql)
# 2. Read data
sql = "select name from test1"
results = db.execute(sql)
for result in results:
    print(type(result[0]))
    print([result[0]]) # Inspect this output: the value is unicode, but each code point is really one byte of the stored data, so printing it directly shows garbled text or raises an encoding error
    print(result[0].encode('latin1').decode('gbk')) # The value is now real unicode, so printing works even when the data encoding differs from the script encoding
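
The read step above can be wrapped in a small helper so that a whole result set is converted at once. This is a sketch under stated assumptions: decode_rows is a hypothetical function, not part of MySQLdb, and fake_rows stands in for the tuple that cursor.fetchall() would return over a latin1 connection.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

TEXT_TYPE = type(u'')  # unicode on Python 2, str on Python 3


def decode_rows(rows, data_encoding):
    """Re-decode every text cell that came back over a latin1 connection.

    Non-text cells (integers, None, ...) are passed through unchanged.
    """
    decoded = []
    for row in rows:
        decoded.append(tuple(
            cell.encode('latin1').decode(data_encoding)
            if isinstance(cell, TEXT_TYPE) else cell
            for cell in row
        ))
    return decoded


# Stand-in for cursor.fetchall(): a GBK name as a latin1 connection returns it.
fake_rows = ((2, u'\u5c0f\u7ea2'.encode('gbk').decode('latin1'), u'female'),)

print(decode_rows(fake_rows, 'gbk') == [(2, u'\u5c0f\u7ea2', u'female')])  # True
```

ASCII-only cells such as 'female' survive unchanged because GBK and UTF-8 both encode ASCII characters as the same single bytes.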

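3. A small experiment

Hands-on practice is the best way to get comfortable with handling Chinese data in MySQL latin1 tables from Python 2. If the latin1 table you have is in production, you probably do not dare to experiment on it, so let's build a test environment step by step and try things out there.

3.1 Running MySQL

On your Linux virtual machine, start a MySQL instance quickly with docker: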

$ docker run -itd --name mysql-test -p 3306:3306 -e MYSQL_ROOT_PASSWORD=123456 mysql:5.7

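A brief explanation of this command: it starts a new container named mysql-test from the mysql:5.7 image, maps the container's port 3306 to port 3306 on the host, and sets the root password inside the mysql-test container to 123456.

Note: installing docker and the common docker commands are not covered here; there are plenty of good resources online.

3.2 Creating the test DB and table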

# Connect to MySQL from the command line
$ mysql -h 127.0.0.1 -u root -p'123456' --default-character-set=latin1

# At the MySQL prompt, run the following three SQL statements
# Create the database
create database test;
use test;
# Create the test1 table with three fields; Chinese data will be inserted into the name field later
create table test1 (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name varchar(1024) NOT NULL,
    sex varchar(1024) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The session looks like this:

$ mysql -h 127.0.0.1 -u root -p'123456' --default-character-set=latin1  # Connect to the newly started DB
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MySQL connection id is 37
Server version: 5.7.26 MySQL Community Server (GPL)

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MySQL [(none)]> create database test;  # Create the test database

MySQL [(none)]> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MySQL [test]> create table test1 (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name varchar(1024) NOT NULL,
    sex varchar(1024) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;  # Create the test1 table

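3.3 Writing and reading Chinese data in the table

Next we demonstrate how to write UTF-8 encoded Chinese data to a latin1 table and read it back.

3.3.1 Creating the test script

In your home directory, create a test Python script named test.py:

$ touch test.py
$ vim test.py

Copy the following into test.py: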

#!/usr/bin/python
# -*- coding: utf-8 -*-  # Must match the encoding the file is actually saved in (gbk or utf-8); otherwise the u'...' literal below is decoded incorrectly
import MySQLdb

# For convenience, a few MySQLdb operations are wrapped in this simple DataBase class; you can also use MySQLdb directly
class DataBase(object):

    def __init__(self, host="127.0.0.1", user="root", passwd="123456", db="test", charset="latin1"):
        if not passwd:
            self.db = MySQLdb.connect(host=host, user=user,
                                      db=db, charset=charset)
        else:
            self.db = MySQLdb.connect(host=host, user=user, passwd=passwd,
                                      db=db, charset=charset)
        self.cursor = self.db.cursor()

    def execute(self, sql):
        try:
            self.cursor.execute(sql)
            self.db.commit()
            results = self.cursor.fetchall()
            return results
        except Exception as e:
            print(str(e))

    def executemany(self, sql, param_list):
        try:
            self.cursor.executemany(sql, param_list)
            self.db.commit()
        except Exception as e:
            print(str(e))
            
db = DataBase()
# 1. Write data
student_name = u'小强'  # This literal is a unicode object
sql = "insert into test1 values (1, '%s', 'male')" %(student_name.encode('utf8')) # Encode the unicode value to whichever encoding you need; utf8 here
db.execute(sql)
# 2. Read data
sql = "select name from test1"
results = db.execute(sql)
for result in results:
    print(type(result[0]))
    print(result[0].encode('latin1').decode('utf8')) # The value is now real unicode, so printing works even when the data encoding differs from the script encoding
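
One way to check what actually landed in the table, independent of any terminal encoding, is to compare raw bytes. The sketch below computes the hex string that the script's UTF-8 bytes correspond to; you can then run select hex(name) from test1; in the mysql client and compare by eye. expected_hex is a hypothetical helper introduced only for this check.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import binascii

name = u'\u5c0f\u5f3a'  # 小强, the value test.py inserts


def expected_hex(text, data_encoding):
    """Uppercase hex of the bytes that `text` becomes in `data_encoding`."""
    return binascii.hexlify(text.encode(data_encoding)).decode('ascii').upper()


print(expected_hex(name, 'utf8'))  # E5B08FE5BCBA
```

If the insert worked as intended, the hex string MySQL reports for the name column should match this value; a different string suggests the data was written in another encoding or converted along the way.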