A detailed look at encoding in software development

Foreword
For software developers, “encoding” is a familiar concept, something we run into all the time. When writing code, the “encoding problem” is a headache that programmers cannot avoid.

The “encoding problem” is not hard to solve, but I suspect many programmers have only a hazy grasp of the principle behind it, so let us work through it together.

Computer encoding
Computer encoding refers to the way data that represents letters or digits is recorded inside a computer.

Why did encoding come about? Data in a computer is stored in electronic components, and because of the limits of manufacturing technology these components can reliably record only two stable states, “on” and “off”, which are represented by the digits 0 and 1. In essence, then, a computer can only record the two digits 0 and 1. Each 0 or 1 is called a bit, which is the smallest unit of data in a computer. A number written with only 0s and 1s is called a binary number.

But we obviously need to record far more than that, and two digits alone are not enough. Grouping three bits into one digit gives the octal system; grouping four bits into one digit gives the hexadecimal system.
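
To make the three number systems concrete, here is a minimal Java sketch that writes the same value in binary, octal and hexadecimal. The class name BaseDemo and its helper methods are made up for illustration; the conversions themselves are standard methods of java.lang.Integer.

```java
class BaseDemo {
    // Write the same value with 1, 3 and 4 bits per digit respectively.
    static String binary(int n) { return Integer.toBinaryString(n); }
    static String octal(int n)  { return Integer.toOctalString(n); }
    static String hex(int n)    { return Integer.toHexString(n); }

    public static void main(String[] args) {
        int n = 97;
        System.out.println(binary(n)); // 1100001
        System.out.println(octal(n));  // 141
        System.out.println(hex(n));    // 61
    }
}
```

Reading the binary form from the right in groups of three bits (1 100 001) gives the octal digits 1 4 1, and in groups of four bits (110 0001) gives the hexadecimal digits 6 1.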

That takes care of numbers, but a character such as ‘a’ still cannot be stored directly. The solution people came up with was to give every commonly used character a number. For example, the number of ‘a’ is 97, so when we need to store ‘a’ we do not store ‘a’ itself but the number 97. When we read it back, we turn 97 into ‘a’ again, which solves the problem neatly.
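
The round trip between a character and its number can be tried directly in Java, where a char can be cast to an int and back. This is only a small sketch; the class name CharNumberDemo and the two helper methods are invented for this example.

```java
class CharNumberDemo {
    // Look up the number that stands for a character.
    static int numberOf(char c) { return (int) c; }

    // Turn a stored number back into its character.
    static char charOf(int n) { return (char) n; }

    public static void main(String[] args) {
        System.out.println(numberOf('a')); // 97
        System.out.println(charOf(97));    // a
    }
}
```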

What we usually call “encoding” is this numbering of characters.

The table that maps every character to its number is called an “encoding table”.

Common encoding tables: ASCII, GB2312 (Simplified Chinese), GBK, BIG5 (Traditional Chinese), UTF-8, and so on.

ASCII encoding:
When computers were first created, they were mainly used in the Western, English-speaking world. The characters needed there amount to little more than the 26 English letters plus some symbols; even counting upper and lower case separately, the total never exceeds 128. One byte per character is more than enough. This scheme of using one byte to represent one character is the earliest encoding: ASCII.

Note 1: A byte is a basic unit made up of 8 bits. Read as a signed value, as Java's byte type does, its range is -128 to 127.

Note 2: Encoding numbers are never negative.

Note 3: ASCII does not support Chinese.
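
The notes above can be checked with a short Java sketch. The class AsciiDemo and its toAscii helper are hypothetical names; String.getBytes(Charset) and StandardCharsets.US_ASCII are part of the standard library. The Chinese character is written as a Unicode escape so that the example does not depend on how the source file itself is encoded.

```java
import java.nio.charset.StandardCharsets;

class AsciiDemo {
    // Encode a string with ASCII, one byte per character.
    static byte[] toAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Range of a signed Java byte.
        System.out.println(Byte.MIN_VALUE + " to " + Byte.MAX_VALUE); // -128 to 127

        // 'a' fits in a single byte holding its number.
        System.out.println(toAscii("a")[0]); // 97

        // ASCII has no number for a Chinese character, so getBytes
        // substitutes the replacement byte for '?'.
        System.out.println(toAscii("\u4e2d")[0]); // 63
    }
}
```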

GBK encoding:
Later, as computers spread, the whole world needed to use them to store data, and ASCII was no longer enough. It cannot store characters outside the Latin alphabet, and it stipulates that one byte represents one character, which clearly cannot cover the whole world (Chinese alone has at least several thousand characters). So countries extended ASCII on their own, moving from one byte per character to multiple bytes per character.

For example, China's two encodings GB2312 and GBK use two bytes to represent one Chinese character. Two bytes cover a much larger range, enough to include almost all commonly used Chinese characters.

Note: In all of these encodings, the characters numbered 0-127 are exactly the same as in ASCII.
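
A quick way to see both points, two bytes per Chinese character and an unchanged ASCII range, is the following Java sketch. It assumes the Java runtime in use ships the GBK charset (full JDK installations normally do); the class name GbkDemo and its helper are made up for this example.

```java
import java.nio.charset.Charset;

class GbkDemo {
    // Assumption: the runtime provides a charset registered as "GBK".
    static final Charset GBK = Charset.forName("GBK");

    // Encode a string with GBK.
    static byte[] toGbk(String s) {
        return s.getBytes(GBK);
    }

    public static void main(String[] args) {
        // One Chinese character (U+4E2D) takes two bytes.
        System.out.println(toGbk("\u4e2d").length); // 2

        // An ASCII character keeps its single byte and its number.
        System.out.println(toGbk("a").length); // 1
        System.out.println(toGbk("a")[0]);     // 97
    }
}
```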

UTF-8 encoding:
Of course, if every country uses its own encoding, communication between countries becomes troublesome: when the encodings differ, text parsed on the other side comes out garbled, which is no good. To solve this, an organization called the Unicode Consortium drew up a single set of encoding rules, Unicode. It is said to support more than 650 of the world's languages and serves as a universal character standard.

UTF-8 is an international encoding table built on this rule. It supports Chinese, and a Chinese character generally occupies 3 bytes in it.

Note 1: Unicode itself is not a byte-level encoding table but an encoding rule: it assigns a number to every character. UTF-8 is an encoding defined according to this rule, and it determines the bytes that are actually stored.

Note 2: Not every Chinese character takes 3 bytes in UTF-8. Most do, but some special, very rare characters occupy 4 bytes, which is the most that standard UTF-8 uses for any character.
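
The byte counts can be measured with a small Java sketch. The class name Utf8LengthDemo is invented here, and U+20000 is simply one example picked from the rarely used CJK Extension B block; in this sketch it comes out at 4 bytes.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of bytes a string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String latin = "a";                                    // ASCII range
        String common = "\u4e2d";                              // common Chinese character, U+4E2D
        String rare = new String(Character.toChars(0x20000));  // rare character, U+20000

        System.out.println(utf8Length(latin));  // 1
        System.out.println(utf8Length(common)); // 3
        System.out.println(utf8Length(rare));   // 4
    }
}
```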

Encoding issues:
In development, the so-called “encoding problem” usually means garbled Chinese characters. Why does this happen?

Chinese users generally run Chinese-language operating systems, whose default encoding is GBK. Internationally, UTF-8 is generally used so that text can be understood everywhere (international websites generally use UTF-8).

In GBK, a Chinese character generally occupies 2 bytes.

In UTF-8, a Chinese character generally occupies 3 bytes.

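If it is UTF-8 -> GBK (UTF-8 bytes parsed as GBK), each 3-byte character gets cut into 2-byte groups, so different characters come out.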

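If it is GBK -> UTF-8 (GBK bytes parsed as UTF-8), the 2-byte characters are regrouped by UTF-8's rules, and bytes that fit no valid UTF-8 sequence show up as replacement marks.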
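
Both directions can be reproduced with a short Java sketch. It assumes the runtime provides the GBK charset; the class MojibakeDemo and its misread helper are hypothetical names. The sample text is the two characters U+4F60 U+597D, written as Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Assumption: the runtime provides a charset registered as "GBK".
    static final Charset GBK = Charset.forName("GBK");

    // Encode text with its real charset, then decode it with the wrong one.
    static String misread(String text, Charset actual, Charset assumed) {
        byte[] bytes = text.getBytes(actual);
        return new String(bytes, assumed);
    }

    public static void main(String[] args) {
        String original = "\u4f60\u597d"; // two Chinese characters

        // UTF-8 -> GBK: 6 bytes are regrouped in pairs into 3 different characters.
        String utf8AsGbk = misread(original, StandardCharsets.UTF_8, GBK);
        System.out.println(utf8AsGbk.length());         // 3
        System.out.println(utf8AsGbk.equals(original)); // false

        // GBK -> UTF-8: the 4 bytes are not valid UTF-8, so the decoder
        // inserts the replacement character U+FFFD.
        String gbkAsUtf8 = misread(original, GBK, StandardCharsets.UTF_8);
        System.out.println(gbkAsUtf8.indexOf('\uFFFD') >= 0); // true
    }
}
```

The second direction is the more damaging one: once a byte has been swapped for the replacement character, the original byte can no longer be recovered from the garbled string.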

Solving the encoding problem:
The “encoding problem” comes down to nothing more than using the wrong encoding when we parse Chinese text that someone else gave us. If we receive GBK, we parse it as GBK; if we receive UTF-8, we parse it as UTF-8. That is all it takes to solve it.

So, if you run into garbled Chinese in a string:

1. Break the garbled string back down into bytes, using the encoding it was wrongly parsed with.

2. Use the String constructor String(byte[] bytes, String charsetName) to reassemble the string with the correct encoding.

For example, with UTF-8 parsing:
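
The two repair steps can be sketched in Java roughly as follows. The class GarbledRepairDemo and the methods garble and repair are invented for illustration; only String.getBytes(String) and the constructor String(byte[] bytes, String charsetName) come from the standard library. The sketch assumes the runtime provides the GBK charset, and that the wrong parsing did not lose any bytes, which holds for this sample but not for every garbled string.

```java
import java.io.UnsupportedEncodingException;

class GarbledRepairDemo {
    // Simulate the mistake: bytes in one encoding parsed with another.
    static String garble(String text, String actualCharset, String wrongCharset) {
        try {
            return new String(text.getBytes(actualCharset), wrongCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset", e);
        }
    }

    // Step 1: break the garbled string back into bytes with the wrong charset.
    // Step 2: reassemble the bytes with the charset they were really written in.
    static String repair(String garbled, String wrongCharset, String rightCharset) {
        try {
            byte[] bytes = garbled.getBytes(wrongCharset);
            return new String(bytes, rightCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset", e);
        }
    }

    public static void main(String[] args) {
        String original = "\u4f60\u597d";                  // two Chinese characters
        String garbled = garble(original, "UTF-8", "GBK"); // UTF-8 bytes parsed as GBK
        String repaired = repair(garbled, "GBK", "UTF-8");

        System.out.println(garbled.equals(original));  // false
        System.out.println(repaired.equals(original)); // true
    }
}
```

If the original bytes are still available (for example from a file or a network stream), it is simpler and safer to skip step 1 and parse those bytes with the correct encoding directly.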

Note: Garbled Chinese only means that we assembled the bytes incorrectly when parsing them, much like putting building blocks together in the wrong places. The underlying bytes themselves have not changed.
