ZNHOO Whatever you are, be a good one!

Encoding

Terminology

  1. Code point: the raw code defined by people (Unicode, 区位码).
  2. Cdoe unit: code after encoding.
  3. Fixed witdth/length encoding: without transformation. Code points are directly indexable. Finding the Nth code point in a sequence of code points is a constant time operation.

    For example. read two bytes each time.

  4. Variable width/length encoding: MUST has transformation. Efficient space usage .

    Read byte by byte.

  5. Big/Little Endian (BE/LE):
    1. Decides whether the bigger part (MSB, high bytes) or little part (low bytes) is stored first.
    2. Firstly stored bytes actually resides on low disk/memory address.
  6. Byte order mark (BOM). Depending on BE or LE, a byte read may represent different characters. BOM is inserted into the very beginning to differentiate BE and LE.
    1. UTF-8: EF BB FF. BOM is not a must, but help mark it's UTF-8.
    2. UTF-16/UCS-2 BE: FE FF.
    3. UTF-16/UCS-2 LE: FF FE.
    4. UTF-32/UCS-4 BE: 00 00 FE FF  
    5. UTF-32/UCS-4 LE: FF FE 00 00

Unicode/UCS

At the very beginning, there were two independent institutes deciding to desgin a international coding system, namely ISO (ISO 10646 project, UCS) and unicode.org (Unicode). Finally they together merged their desgin and chosen Unicode as the scheme name.

  1. 16 planes: U+0000 ~ U+10FFFF, at most 21 bits.
  2. Basic Multilingual Plane (BMP): U+0000 ~ U+FFFF.
  3. Includes almost all possible characters in the world.

UTF-16 vs. UCS-2

  1. UTF-16 is downward compatibility with UCS-2.
  2. UCS-2 is fixed width (16 bits) encoding scheme while the successor UTF-16 is variable width (sequence of 16-bit words).
  3. UCS-2 covers code points in BMP from U+0000 to U+FFFF with any transformation.
  4. Apart from the BMP, UTF-16 covers other planes with the help of surrogate code points (from U+D800 - U+DFFF) and results in a pair of 16-bit words, together called surrogate pair.
    1. surrogate code point in Unicode has no corresponding characters. They are for furture extension.
    2. So UTF-16 covers all Unicode planes from U+0000 to U+10FFFF with ether one 16-bit word or a pair of 16-bit word.
    3. For BMP, Unicode = UCS-2 = UTF-16 (code point = code unit).
  5. To decode a UTF-16/UCS-2 code unit, read two bytes or four bytes. One byte or three bytes is meaningless.
  6. UCS-2 is obsolete.

UTF-32/UCS-4

  1. UTF-32 and UCS-4 is identical.
  2. Fixed-length encoding with exactly 32 bits. There are many leading zero bits.
  3. Without any transformation. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.

ANSI

  1. You may encounter it in Windows.
  2. It's not a concrete encoding, but scheme used Windows to select the correct encoding which is transparent to users.

    If the Windows system is English, then the encoding behind is codepage cp437, while Chinese corresponds to cp936 (namely GBK).

国家标准信息交换用汉字编码,简称国标码(GB2312-80标准)

所有的汉字编码都兼容ASCII。更多细节参考外码?内码?

  1. 区位码:把汉字列成一94行94列的表,每行称区,每列称位。区位座标定位一个汉字,如“汉”的区位码是十进制2626,十六进制1A1AH,表示“汉”在表格的第26行第26列。
    1. 区位码是编码设计的草稿。默认用十进制表示。
    2. 2字节表示一个汉字。
  2. 国标码:区位码+2020H,得到国家标准编码GB2312(GB表示“国标”).“汉”的GB2312码是1A1AH+2020H=3A3AH.

    国标码是国家标准,不可乱改。注意,这里国标码最高位是0.

  3. GB2312区位码和国标码都是人认识的编码,计算机是不知道的,通常称之为“(机)外码”。

    字面意思是在计算外实用的码。

  4. (机)内码:国标码+8080H或区位码+A0A0H,得到汉字在计算机内存储的编码。“汉”的机内码是3A3AH+8080H=BABAH.

    把国标码的最高位置1,即得到机内码。ASCII字节最高位为0,计算机很容易区分汉字和英语。

  5. 计算机存储的是GB2312的机内码,而不是国标码。这与ASCII、UTF-8等直接存储编码不同。
  6. Unicode和国标码不兼容,具体说是Unicode的code point和GB2312国标码之间没有换算关系。

GBK GB18030

  1. GB2312的扩展,GB2312、GBK(K表示“扩展”)到GB18030向下兼容。
  2. 不同之处:
    1. 后面的包含更多的汉字;
    2. GBK GB18030只要求机内码第一字节开头置1(只有第一字节加80H),但兼容部分国标码两字节都必须是1开头。这不影响对GBK、GB18030的解码,只要遇到开头是1的,就接着都下一字节,合并解码。
  3. 从GBK、GB18030开始,机内码就是国标码,不需要什么区位码、国标码、到机内码转换。

    所以通常叫GB2312为区位码,GBK为机内码,以示这个微妙地区分。

  4. 现在的PC平台必须支持GB18030,对嵌入式产品暂不作要求。所以手机、MP3一般只支持GB2312。
  5. 从编程角度看,GB2312、GBK到GB18030都属于“双字节字符集” (DBCS),并且都是大端先存(BE)。属于Fixed Width Encoding的一种.
  6. 中文Windows的缺省内码是GBK,可以通过GB18030升级包升级到GB18030。不过GB18030相对GBK增加的字符,普通人很难用到,通常我们还是用GBK指代中文Windows内码。