
https://docs.oracle.com/en/java/javase/18/docs/api/java.base/java/lang/Character.html

Correspondence between Java version and Unicode version

Java release    Unicode version
Java SE 15      Unicode 13.0
Java SE 13      Unicode 12.1
Java SE 12      Unicode 11.0
Java SE 11      Unicode 10.0
Java SE 9       Unicode 8.0
Java SE 8       Unicode 6.2
Java SE 7       Unicode 6.0
Java SE 5.0     Unicode 4.0
Java SE 1.4     Unicode 3.0
JDK 1.1         Unicode 2.0
JDK 1.0.2       Unicode 1.1.5

Unicode Conformance

The char type and the fields and methods of its wrapper class java.lang.Character are defined in terms of character information from the Unicode Standard, specifically the UnicodeData file that is part of the Unicode Character Database. This file specifies properties, including name and category, for every assigned Unicode code point or character range. The file is available from the Unicode Consortium at http://www.unicode.org.

Unicode character representation

The char data type (and therefore the value that a Character object encapsulates) is based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar values. (See the definition of the U+n notation in the Unicode Standard.)

The set of characters from U+0000 to U+FFFF is sometimes called the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, a supplementary character is represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF) and the second from the low-surrogates range (\uDC00-\uDFFF).
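The surrogate-pair encoding described above can be sketched in a few lines. This is only an illustration: the class name SurrogatePairDemo and its two helper methods are made up for this example, while Character.toChars and Character.toCodePoint are the standard library calls. The code point 0x2F81A is the same CJK ideograph that the documentation uses as an example.

```java
// Sketch: how a supplementary code point maps to a UTF-16 surrogate pair.
// SurrogatePairDemo and its helpers are hypothetical names for illustration.
class SurrogatePairDemo {

    // High (leading) surrogate: 0xD800 plus the top 10 bits of (codePoint - 0x10000).
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >>> 10));
    }

    // Low (trailing) surrogate: 0xDC00 plus the bottom 10 bits of (codePoint - 0x10000).
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x2F81A; // a supplementary character (above U+FFFF)

        // The library performs the same split.
        char[] units = Character.toChars(cp);
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) units[0], (int) units[1]);
        // prints: U+2F81A -> D87E DC1A

        // The manual arithmetic agrees with the library result.
        System.out.println(units[0] == highSurrogate(cp) && units[1] == lowSurrogate(cp)); // true

        // Recombining the pair yields the original code point.
        System.out.println(Character.toCodePoint(units[0], units[1]) == cp); // true
    }
}
```

The helpers assume their argument is a supplementary code point (U+10000 to U+10FFFF); for BMP values the subtraction would go negative, so real code would more likely call Character.highSurrogate and Character.lowSurrogate after checking Character.isSupplementaryCodePoint.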

A char value therefore represents a Basic Multilingual Plane (BMP) code point, including the surrogate code points, or a code unit of the UTF-16 encoding. An int value can represent any Unicode code point, including supplementary code points. The lower (least significant) 21 bits of the int hold the code point, and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:

  • Methods that accept only a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value, if followed by any low-surrogate value in a string, would represent a letter.
  • Methods that accept int values support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (CJK ideograph).
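The difference between the char and int overloads can be checked directly. The following is a minimal sketch; the class name CharVersusIntDemo and the helper upperElevenBitsZero are invented for this example, and everything else is the standard Character and String API.

```java
// Sketch: char-based versus int-based Character methods.
// CharVersusIntDemo and upperElevenBitsZero are hypothetical names for illustration.
class CharVersusIntDemo {

    // True when the upper 11 bits of the int are zero.
    // This is necessary for a legal code point, but not sufficient:
    // the value must also be at most 0x10FFFF.
    static boolean upperElevenBitsZero(int value) {
        return (value >>> 21) == 0;
    }

    public static void main(String[] args) {
        // char overload: a lone high surrogate is treated as an undefined character.
        System.out.println(Character.isLetter('\uD840')); // false

        // int overload: the full code point is visible, so the CJK ideograph is a letter.
        System.out.println(Character.isLetter(0x2F81A)); // true

        // The same supplementary character stored in a String (two char values).
        String s = new String(Character.toChars(0x2F81A));
        System.out.println(Character.isLetter(s.charAt(0)));      // false: first code unit only
        System.out.println(Character.isLetter(s.codePointAt(0))); // true: whole code point

        // Range checks on int values.
        System.out.println(Character.isValidCodePoint(0x10FFFF)); // true
        System.out.println(Character.isValidCodePoint(0x110000)); // false
        System.out.println(upperElevenBitsZero(0x1FFFFF));        // true, yet not a valid code point
    }
}
```

In practice this suggests reading text with codePointAt (or similar int-based methods) whenever supplementary characters may appear, and reserving charAt for cases where individual UTF-16 code units are really what is wanted.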

In the Java SE API documentation, the term Unicode code point is used for character values in the range U+0000 to U+10FFFF, and the term Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding. For more information on Unicode terminology, see the Unicode Glossary.
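To make the two terms concrete, here is a small sketch that counts code units and code points in the same string. The class name CodeUnitVersusCodePointDemo and its two helpers are assumptions made for this example; String.length, String.codePointCount, and CharSequence.codePoints (Java 8 and later) are standard.

```java
// Sketch: code units (char values) versus code points in a String.
// CodeUnitVersusCodePointDemo and its helpers are hypothetical names for illustration.
class CodeUnitVersusCodePointDemo {

    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnitCount(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // One BMP character followed by one supplementary character.
        String s = "A" + new String(Character.toChars(0x2F81A));

        System.out.println(codeUnitCount(s));  // 3
        System.out.println(codePointCount(s)); // 2

        // Iterate by code point rather than by char.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        // prints U+0041, then U+2F81A
    }
}
```

So a single user-visible character outside the BMP occupies two char positions, which is why a char alone cannot hold every Unicode character.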

Summary

After reading the documentation above, can you answer the question in the title?
Feel free to write your answer in the comments.


Yujiaao