String and encoding

Character encoding

We have already seen that a string is a data type, but strings are special because they involve an encoding problem.

Because computers can only process numbers, text must be converted to numbers before it can be processed. Early computers were designed around an 8-bit byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255). To represent a larger integer, you must use more bytes. For example, the largest integer that two bytes can represent is 65535, and the largest that 4 bytes can represent is 4294967295.
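The byte limits above follow from a single formula: n bytes hold 8n bits, so the largest unsigned integer is 2^(8n) - 1. A minimal Python sketch (the helper name `max_unsigned` is made up for illustration):

```python
def max_unsigned(n_bytes: int) -> int:
    """Largest unsigned integer that fits in n_bytes bytes of 8 bits each."""
    return 2 ** (8 * n_bytes) - 1


for n in (1, 2, 4):
    # Print the byte count, the decimal maximum, and the same value in binary.
    print(n, max_unsigned(n), bin(max_unsigned(n)))
```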

Since the computer was invented by Americans, the earliest encoding covered only 128 characters (codes 0 to 127): uppercase and lowercase English letters, digits, some symbols, and control codes. This code table is called ASCII. For example, the code for the uppercase letter A is 65, and the code for the lowercase letter z is 122.
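These codes can be checked with Python's built-in `ord()` and `chr()`, which convert between a character and its numeric code. A short sketch; the variable names are only for illustration:

```python
code_upper_a = ord("A")   # character -> numeric code
code_lower_z = ord("z")
char_from_65 = chr(65)    # numeric code -> character

print(code_upper_a, code_lower_z, char_from_65)

# Every character of a plain English sentence has a code below 128.
sample = "Hello, world!"
all_ascii = all(ord(c) < 128 for c in sample)
print(all_ascii)
```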

But one byte is obviously not enough for Chinese; at least two bytes are needed, and the result must not conflict with ASCII. China therefore formulated the GB2312 encoding to cover Chinese characters.

As you can imagine, there are hundreds of languages in the world. Japan encoded Japanese in Shift_JIS, Korea encoded Korean in EUC-KR, and each country has its own standard, so conflicts are inevitable. As a result, text that mixes several languages is displayed as garbled characters.

[Figure: character encoding problem]
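The garbling is easy to reproduce. The sketch below assumes a program that stores Chinese text as GB2312 bytes and a second program that wrongly interprets those bytes as Shift_JIS. The sample string is an arbitrary choice, and `errors="replace"` is used only so that undecodable bytes do not raise an exception:

```python
text = "中文"

# Bytes as a GB2312-aware program would store them.
raw = text.encode("gb2312")

# The same bytes misread by a program that assumes Shift_JIS.
garbled = raw.decode("shift_jis", errors="replace")

# Decoding with the encoding that was actually used recovers the text.
recovered = raw.decode("gb2312")

print(raw)
print(garbled)
print(recovered)
```

The bytes themselves never change; only the table used to interpret them does, which is why the reader must know which encoding the writer used.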

Unicode was created to solve this. It unifies all languages into a single character set, so the garbling problem goes away.

The Unicode standard is still evolving. The most common fixed-width form uses two bytes per character (very rare characters need 4 bytes). Modern operating systems and most programming languages support Unicode directly.

Now let's look at the difference between ASCII and Unicode: an ASCII code takes 1 byte, while a Unicode code usually takes 2 bytes.

The letter A in ASCII is decimal 65, binary 01000001;

The character 0 in ASCII is decimal 48, binary 00110000. Note that the character '0' and the integer 0 are different;

The Chinese character 中 is beyond the range of ASCII; its Unicode code is decimal 20013, binary 01001110 00101101.

As you might guess, to express the ASCII code of A in Unicode you only need to pad it with zeros in front. The Unicode code of A is therefore 00000000 01000001.
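The codes and bit patterns above can be reproduced in Python. This sketch formats each code as 16 bits, which matches the two-byte form described here. The helper `bits16` is a made-up name, and 中 is the character whose code is 20013:

```python
def bits16(ch: str) -> str:
    """Return the code of a single character as a zero-padded 16-bit string."""
    return format(ord(ch), "016b")


for ch in ("A", "0", "中"):
    print(ch, ord(ch), bits16(ch))

# The character '0' is not the integer 0: its code is 48.
print(ord("0") == 0)
```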

This raises a new problem: unifying everything into Unicode removes the garbling, but if your text is almost entirely English, Unicode takes twice the storage space of ASCII, which is wasteful for storage and transmission.
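The doubling can be measured directly. The sketch below uses Python's `utf-16-be` codec as a stand-in for the two-byte Unicode form described here. That choice is an assumption made for illustration, and the sample sentence is arbitrary:

```python
text = "Hello, world!"

ascii_bytes = text.encode("ascii")         # one byte per character
two_byte_form = text.encode("utf-16-be")   # two bytes per character for this text, no byte-order mark

print(len(text), len(ascii_bytes), len(two_byte_form))
```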

So, in the spirit of economy, UTF-8 appeared: a variable-length encoding of Unicode. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its code (the original design allowed up to 6 bytes, but the current standard stops at 4). Common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very rare characters take 4 bytes. If the text you transmit contains a lot of English characters, UTF-8 saves space:

Character | ASCII | Unicode | UTF-8
A | 01000001 | 00000000 01000001 | 01000001
中 | (none) | 01001110 00101101 | 11100100 10111000 10101101
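The UTF-8 column of the table can be regenerated with `str.encode`. In this sketch, `utf8_bits` is a hypothetical helper, 中 is the character with code 20013, and the emoji is added only as an example of a character that needs 4 bytes:

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated 8-bit groups."""
    return " ".join(format(b, "08b") for b in ch.encode("utf-8"))


for ch in ("A", "中", "😀"):
    # Character, number of UTF-8 bytes, and the bytes in binary.
    print(ch, len(ch.encode("utf-8")), utf8_bits(ch))
```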

The table also shows an additional advantage of UTF-8: ASCII can be regarded as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can continue to work with UTF-8.
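This compatibility can be checked in a few lines. For text that contains only ASCII characters, encoding as ASCII and encoding as UTF-8 produce identical bytes, so an ASCII-only reader can decode UTF-8 output unchanged. The sample string is arbitrary:

```python
text = "legacy ASCII text"

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")

print(as_ascii == as_utf8)

# A decoder that only knows ASCII can still read these UTF-8 bytes.
decoded_by_ascii_reader = as_utf8.decode("ascii")
print(decoded_by_ascii_reader)
```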

Now that the relationship between ASCII, Unicode and UTF-8 is clear, we can summarize how character encoding generally works in today's computer systems:

In memory, text is uniformly held as Unicode; when it needs to be saved to disk or transmitted, it is converted to UTF-8.
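In Python terms, this split roughly corresponds to `str` (Unicode text in memory) and `bytes` (encoded data for storage or transfer). A minimal sketch of the round trip, using an arbitrary sample string:

```python
in_memory = "中文 and English"          # str: a sequence of Unicode characters

on_the_wire = in_memory.encode("utf-8")  # bytes: what gets stored or transmitted
restored = on_the_wire.decode("utf-8")   # back to str on the receiving side

# The character count and the byte count differ because 中 and 文 take 3 bytes each.
print(len(in_memory), len(on_the_wire))
print(restored == in_memory)
```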

When you edit with Notepad, the UTF-8 characters read from the file are converted to Unicode in memory. When you save after editing, the Unicode is converted back to UTF-8 and written to the file:

[Figure: reading and writing a UTF-8 file]
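The same save-and-load cycle can be sketched in Python by passing `encoding="utf-8"` to `open()`. The file name `note.txt` and the temporary directory are assumptions made so the example cleans up after itself:

```python
import os
import tempfile

text = "Unicode in memory: 中文"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")  # hypothetical file name

    # Saving: the Unicode text is converted to UTF-8 bytes on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Inspect the raw bytes that were actually written.
    with open(path, "rb") as f:
        raw = f.read()

    # Loading: the UTF-8 bytes are converted back to Unicode text.
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read()

print(raw)
print(loaded)
```

Naming the encoding explicitly avoids depending on the platform default, which may not be UTF-8.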

When you browse the web, the server converts dynamically generated Unicode content to UTF-8 before sending it to the browser:

[Figure: web page served as UTF-8]

That is why the source of many web pages contains a declaration such as <meta charset="UTF-8" />, which indicates that the page is encoded in UTF-8.
