MIME | GB2312 |
---|---|
Alias(es) | csGB2312 |
Standard | GB 2312 (1980) |
Language(s) | zh |
Succeeded by | GBK, GB 18030 |

MIME | GB_2312-80 |
---|---|
Alias(es) | iso-ir-58, chinese, csISO58GB231280 |
Standard | GB 2312 (1980), RFC 1345 |
Language(s) | zh |
Succeeded by | GBK, GB 18030 |
GB2312 is the registered internet name for a key official character set of the People's Republic of China, used for simplified Chinese characters. GB abbreviates Guojia Biaozhun (国家标准), which means national standard in Chinese. GB2312 (1980) has been superseded by GBK and GB18030, which include additional characters, but GB2312 is nonetheless still in widespread use.
While GB2312 covers over 99% of the characters in contemporary use, historical texts and many names fall outside its scope. GB2312 includes 6,763 Chinese characters on two levels (the first arranged by reading, the second by radical and then stroke count), along with symbols and punctuation, Japanese kana, the Greek and Cyrillic alphabets, Zhuyin, and a double-byte set of Pinyin letters with tone marks. As of April 2016, 0.8% of all web pages used GB2312, down from 3.5% in January 2010.
There is an analogous character set, GB/T 12345, which is closely related to GB2312 but replaces simplified forms with traditional character forms and adds 62 supplemental characters. GB-encoded fonts often come in pairs, one with the GB 2312 (simplified) character set and the other with the GB/T 12345 (traditional) character set.
Characters in GB2312 are arranged in a 94x94 grid (as in ISO 2022), and the two-byte codepoint of each character is expressed in the kuten (or quwei) form, which specifies a row (ku or qu) and the position of the character within the row (cell, ten or wei).
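The rows are numbered from 1 to 94. Rows 1-9 contain symbols and non-Chinese scripts, rows 16-55 contain the first level of Chinese characters, and rows 56-87 contain the second level. Rows 10-15 and 88-94 are unassigned.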
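The row and cell addressing described above can be sketched in a few lines of Python. This is only an illustration: the helper names `format_quwei`, `parse_quwei`, and `linear_index` are hypothetical, and the four-digit string form (two digits for the row, two for the cell) is assumed as the usual way of writing a quwei code.

```python
from typing import Tuple

GRID = 94  # rows and cells are both numbered 1..94


def _check(row: int, cell: int) -> None:
    """Reject coordinates outside the 94x94 grid."""
    if not (1 <= row <= GRID and 1 <= cell <= GRID):
        raise ValueError(f"row and cell must be in 1..{GRID}: {row}, {cell}")


def format_quwei(row: int, cell: int) -> str:
    """Write a (row, cell) pair as a four-digit quwei string."""
    _check(row, cell)
    return f"{row:02d}{cell:02d}"


def parse_quwei(code: str) -> Tuple[int, int]:
    """Split a four-digit quwei string back into (row, cell)."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected four digits, got {code!r}")
    row, cell = int(code[:2]), int(code[2:])
    _check(row, cell)
    return row, cell


def linear_index(row: int, cell: int) -> int:
    """Zero-based position of a cell when the grid is read row by row."""
    _check(row, cell)
    return (row - 1) * GRID + (cell - 1)


print(format_quwei(16, 1))    # 1601
print(parse_quwei("5448"))    # (54, 48)
print(linear_index(94, 94))   # 8835, the last of 94 * 94 = 8836 cells
```

The linear index is not part of the standard; it is shown only because a flat table of 8,836 entries is a common way to store per-cell data.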
GB2312-80 contains 682 symbols and 6,763 Chinese characters.
EUC-CN is often used as the character encoding (i.e. for external storage) in programs that deal with GB2312, thus maintaining compatibility with ASCII. Two bytes are used to represent every character not found in ASCII. The first byte ranges from 0xA1 to 0xF7 (161-247) and the second from 0xA1 to 0xFE (161-254). Because both ranges lie above the ASCII range, it is possible, as in UTF-8, to tell whether a byte is part of a multi-byte sequence. Unlike in UTF-8, however, the value of a byte alone does not show whether it is the first or the second byte of a character.
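The following Python sketch illustrates the byte layout. It assumes the usual EUC-CN mapping, in which each byte is the row or cell number plus 0xA0, and it uses Python's built-in `gb2312` codec only to display the result. The function names are hypothetical. Because a single byte does not reveal its position within a character, `split_euc_cn` has to scan from the start of the data.

```python
from typing import List, Tuple

OFFSET = 0xA0  # assumed EUC-CN offset added to both row and cell


def quwei_to_euc_cn(row: int, cell: int) -> bytes:
    """Encode a (row, cell) pair as two EUC-CN bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([row + OFFSET, cell + OFFSET])


def euc_cn_to_quwei(pair: bytes) -> Tuple[int, int]:
    """Recover (row, cell) from a two-byte EUC-CN sequence."""
    if len(pair) != 2:
        raise ValueError("expected exactly two bytes")
    return pair[0] - OFFSET, pair[1] - OFFSET


def split_euc_cn(data: bytes) -> List[bytes]:
    """Split EUC-CN data into one-byte ASCII and two-byte GB2312 units."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII: a single byte.
            units.append(data[i:i + 1])
            i += 1
        elif (0xA1 <= b <= 0xF7 and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            # First byte in range, followed by a valid second byte.
            units.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError(f"invalid EUC-CN byte at offset {i}")
    return units


print(quwei_to_euc_cn(16, 1).hex())             # b0a1
print(quwei_to_euc_cn(16, 1).decode("gb2312"))  # 啊
print(euc_cn_to_quwei("中".encode("gb2312")))    # (54, 48)
print(split_euc_cn("a中b".encode("gb2312")))     # [b'a', b'\xd6\xd0', b'b']
```

Starting the scan in the middle of a buffer could pair the second byte of one character with the first byte of the next, which is why the sketch always walks forward from a known character boundary.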