溫馨提示×

溫馨提示×

您好,登錄后才能下訂單哦!

密碼登錄×
登錄注冊×
其他方式登錄
點(diǎn)擊 登錄注冊 即表示同意《億速云用戶服務(wù)條款》

MySQL 字符集utf8、utf8mb3、utf8mb4

發(fā)布時(shí)間:2020-08-07 04:32:08 來源:ITPUB博客 閱讀:857 作者:perfeader 欄目:MySQL數(shù)據(jù)庫

首先想要了解MySQL的字符集,就需要去官方文檔看看字符集是如何介紹的。英語不錯(cuò)的話,看官方文檔應(yīng)該是沒問題。在搜索框里搜一下就可以找到相關(guān)的解釋。我就在這里整理一下,以便后期查看。字符集在官方文檔下面這一章節(jié):
Chapter 10 Character Sets, Collations, Unicode

https://dev.mysql.com/doc/refman/5.6/en/charset.html

一、字符集設(shè)置


MySQL數(shù)據(jù)庫可以做到:
1、使用多種字符集存儲(chǔ)字符串。
2、使用多種排序規(guī)則比較字符串。
3、在同一服務(wù)器、同一數(shù)據(jù)庫、甚至同一表中混合具有不同字符集或排序規(guī)則的字符串。
4、在任何級別啟用字符集和排序規(guī)則的規(guī)范。

MySQL可以設(shè)置如下40種字符:
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |      1 |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |      2 |
| cp866    | DOS Russian                 | cp866_general_ci    |      1 |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |      1 |
| macce    | Mac Central European        | macce_general_ci    |      1 |
| macroman | Mac West European           | macroman_general_ci |      1 |
| cp852    | DOS Central European        | cp852_general_ci    |      1 |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |      1 |
| utf8mb4  | UTF-8 Unicode               | utf8mb4_general_ci  |      4 |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |      1 |
| utf16    | UTF-16 Unicode              | utf16_general_ci    |      4 |
| utf16le  | UTF-16LE Unicode            | utf16le_general_ci  |      4 |
| cp1256   | Windows Arabic              | cp1256_general_ci   |      1 |
| cp1257   | Windows Baltic              | cp1257_general_ci   |      1 |
| utf32    | UTF-32 Unicode              | utf32_general_ci    |      4 |
| binary   | Binary pseudo charset       | binary              |      1 |
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |      1 |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |      2 |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |      3 |
+----------+-----------------------------+---------------------+--------+
40 rows in set (0.00 sec)


String expressions have a repertoire attribute, which can have two values:

  • ASCII: The expression can contain only characters in the Unicode range U+0000 to U+007F.

  • UNICODE: The expression can contain characters in the Unicode range U+0000 to U+10FFFF. This includes characters in the Basic Multilingual Plane (BMP) range (U+0000 to U+FFFF) and supplementary characters outside the BMP range (U+10000 to U+10FFFF).

這里提到:Basic Multilingual Plane (BMP) 和 supplementary characters
Basic Multilingual Plane (BMP):基本多文種平面
Supplementary Multilingual Plane(SMP):多文種補(bǔ)充平面
BMP就已經(jīng)包含常用字符,而SMP只是一些不常用的字符,代碼點(diǎn)(字符)。如Emoji頭像的符號,撲克牌的符號等等。
關(guān)于BMP與SMP詳細(xì)可以查看wiki上的解釋:https://en.wikipedia.org/wiki/Plane_(Unicode)

系統(tǒng)默認(rèn)設(shè)置元數(shù)據(jù)表的字符集為utf8,是通過參數(shù)character_set_system設(shè)置。character_set_results這個(gè)參數(shù)默認(rèn)是utf8,當(dāng)查詢表數(shù)據(jù)返回給客戶端,這個(gè)參數(shù)是控制返回的結(jié)構(gòu)數(shù)據(jù)的字符集。如果希望服務(wù)器將元數(shù)據(jù)結(jié)果傳遞回不同的字符集,請使用SET NAMES語句強(qiáng)制服務(wù)器執(zhí)行字符集轉(zhuǎn)換??蛻舳顺绦蚩梢栽诮邮盏絹碜苑?wù)器的結(jié)果后執(zhí)行轉(zhuǎn)換??蛻舳藞?zhí)行轉(zhuǎn)換更為有效,但此選項(xiàng)并不總是適用于所有客戶端。

SETNAMES'utf8';

There are default settings for character sets and collations at four levels: server, database, table, and column.


Suffix Meaning
_ai Accent insensitive  重音不敏感
_as Accent sensitive     重音敏感
_ci Case insensitive     不區(qū)分大小寫
_cs case-sensitive        區(qū)分大小寫
_bin Binary                   二進(jìn)制

設(shè)置了_ci顧名思義_ai也是包含的,顯式不區(qū)分大小寫,隱式重音不敏感。
設(shè)置了
_cs,顧名思義_as也是包含的,顯式區(qū)分大小寫,隱式重音敏感。

設(shè)置MySQL server character:

character-set-server

方法一:
          mysqld
   mysqld
--character-set-server=latin1
    mysqld
--character-set-server=latin1 \
        
--collation-server=latin1_swedish_ci

方法二:
         cmake .-DDEFAULT_CHARSET=latin1
         或cmake .-DDEFAULT_CHARSET=latin1 \
          
-DDEFAULT_COLLATION=latin1_german1_ci

The current server character set and collation can be determined from the values of the character_set_server and collation_server system variables. These variables can be changed at runtime.


Database Character Set and Collation

CREATEDATABASEdb_name[[DEFAULT]CHARACTERSETcharset_name][[DEFAULT]COLLATEcollation_name]
ALTERDATABASEdb_name[[DEFAULT]CHARACTERSETcharset_name][[DEFAULT]COLLATEcollation_name]

The keyword SCHEMA can be used instead of DATABASE.

All database options are stored in a text file named db.opt that can be found in the database directory.

The CHARACTER SET and COLLATE clauses make it possible to create databases with different character sets and collations on the same MySQL server.

查看你數(shù)據(jù)庫這兩個(gè)參數(shù)設(shè)置:
USEdb_name;
SELECT@@character_set_database,@@collation_database;


Table Character Set and Collation

 The CREATE TABLE and ALTER TABLE statements have optional clauses for specifying the table character set and collation:
CREATETABLEtbl_name(column_list)[[DEFAULT]CHARACTERSETcharset_name][COLLATEcollation_name]]
ALTERTABLEtbl_name[[DEFAULT]CHARACTERSETcharset_name][COLLATEcollation_name]


Column Character Set and Collation

Every character column (that is, a column of type CHARVARCHAR, or TEXT) has a column character set and a column collation. Column definition syntax for CREATE TABLE and ALTER TABLE has optional clauses for specifying the column character set and collation:

col_name {CHAR|VARCHAR|TEXT} (col_length)[CHARACTERSETcharset_name][COLLATEcollation_name]

col_name {ENUM|SET} (val_list)[CHARACTERSETcharset_name][COLLATEcollation_name]


Character String Literal Character Set and Collation

For the simple statement SELECT 'string', the string has the connection default character set and collation defined by the character_set_connection and collation_connection system variables.

A character string literal may have an optional character set introducer and COLLATE clause, to designate it as a string that uses a particular character set and collation:
[_charset_name]'string'[COLLATEcollation_name]

Examples:
SELECT'abc';
SELECT _latin1'abc';
SELECT _binary'abc';
SELECT _utf8'abc'COLLATE utf8_danish_ci;

The National Character Set

Standard SQL defines NCHAR or NATIONAL CHAR as a way to indicate that a CHAR column should use some predefined character set. MySQL usesutf8 as this predefined character set. For example, these data type declarations are equivalent:

CHAR(10)CHARACTERSET utf8
NATIONALCHARACTER(10)
NCHAR(10)

As are these:
VARCHAR(10)CHARACTERSET utf8
NATIONALVARCHAR(10)
NVARCHAR(10)
NCHARVARCHAR(10)
NATIONALCHARACTERVARYING(10)
NATIONALCHARVARYING(10)


Character Set Introducers

A character string literal, hexadecimal literal, or bit-value literal may have an optional character set introducer and COLLATE clause, to designate it as a string that uses a particular character set and collation:

[_charset_name]literal[COLLATEcollation_name]

Character set introducers and the COLLATE clause are implemented according to standard SQL specifications.

Examples:
SELECT'abc';
SELECT _latin1'abc';
SELECT _binary'abc';
SELECT _utf8'abc'COLLATE utf8_danish_ci;

SELECT _latin1 X'4D7953514C';          --16進(jìn)制
SELECT _utf8 0x4D7953514CCOLLATE utf8_danish_ci;

SELECT _latin1 b'1000001';            --2進(jìn)制
SELECT _utf8 0b1000001COLLATE utf8_danish_ci;


二、utf8(utf8mb3)與utf8mb4的轉(zhuǎn)換


10.9.8 Converting Between 3-Byte and 4-Byte Unicode Character Sets


The utf8mb3 and utf8mb4 character sets differ as follows:


  • utf8mb3 supports only characters in the Basic Multilingual Plane (BMP). utf8mb4 additionally supports supplementary characters that lie outside the BMP.

  • utf8mb3 uses a maximum of three bytes per character. utf8mb4 uses a maximum of four bytes per character.

Note

This discussion refers to the utf8mb3 and utf8mb4 character set names to be explicit about referring to 3-byte and 4-byte UTF-8 character set data. The exception is that in table definitions, utf8 is used because MySQL converts instances of utf8mb3specified in such definitions to utf8, which is an alias for utf8mb3.


utf8mb4與utf8(utf8mb3)轉(zhuǎn)換也是特別好轉(zhuǎn)換的:

1.utf8(utf8mb3)轉(zhuǎn)成utf8mb4可以存儲(chǔ)supplementary characters;
2.utf8(utf8mb3)轉(zhuǎn)成utf8mb4可能會(huì)增加數(shù)據(jù)存儲(chǔ)空間;
3.對于BMP character字符,utf8(utf8mb3)轉(zhuǎn)成utf8mb4相同的代碼值、相同的編碼、相同的長度,不會(huì)有變化。
4.對于supplementary character字符,utf8mb4會(huì)以4字節(jié)存儲(chǔ),由于utf8mb3無法存儲(chǔ)supplementary character字符,因而在字符集轉(zhuǎn)換過程中,不用擔(dān)心字符無法轉(zhuǎn)換的問題。
5.表結(jié)構(gòu)在轉(zhuǎn)換過程中需要調(diào)整:utf8(utf8mb3)字符集可變長度字符數(shù)據(jù)類型(VARCHAR和text類型)設(shè)定的表中列的字段長度,utf8mb4中將會(huì)存儲(chǔ)更少的字符。對于所有字符數(shù)據(jù)類型(CHAR、VARCHAR和文本類型),UTF8Mb4列最多可被索引的字符數(shù)比UTF8Mb3列要少。因此在轉(zhuǎn)換之前,要檢查字段類型。防止轉(zhuǎn)換后表,索引存儲(chǔ)的數(shù)據(jù)超出該字段定義長度,字段類型長度可以存儲(chǔ)的最大字節(jié)數(shù)。innodb索引列:最大索引列長度767 bytes,對于utf8mb3就是可以索引255個(gè)字符,對于utf8mb4就是可以索引191個(gè)字符。在轉(zhuǎn)換后不能滿足那么就需要換一個(gè)列來索引。以下是通過壓縮方式使索引更多的字節(jié)。

Note

For InnoDB tables that use COMPRESSED or DYNAMIC row format, you can enable the innodb_large_prefix option to permit index key prefixes longer than 767 bytes (up to 3072 bytes). Creating such tables also requires the option valuesinnodb_file_format=barracuda and innodb_file_per_table=true.) In this case, enabling the innodb_large_prefixoption enables you to index a maximum of 1024 or 768 characters for utf8mb3 or utf8mb4 columns, respectively. For related information, see Section 14.8.1.7, “Limits on InnoDB Tables”.

The preceding types of changes are most likely to be required only if you have very long columns or indexes. Otherwise, you should be able to convert your tables from utf8mb3 to utf8mb4 without problems, using ALTER TABLE as described previously.

6.應(yīng)用于MySQL server 字符集也需要一一對應(yīng)。
7.master 實(shí)例改變字符集,那么slave也需要相應(yīng)的改變。


向AI問一下細(xì)節(jié)

免責(zé)聲明:本站發(fā)布的內(nèi)容(圖片、視頻和文字)以原創(chuàng)、轉(zhuǎn)載和分享為主,文章觀點(diǎn)不代表本網(wǎng)站立場,如果涉及侵權(quán)請聯(lián)系站長郵箱:is@yisu.com進(jìn)行舉報(bào),并提供相關(guān)證據(jù),一經(jīng)查實(shí),將立刻刪除涉嫌侵權(quán)內(nèi)容。

AI