UTF-8 Validation
Problem link: https://leetcode.com/problems...
The key to this problem is understanding what the statement is asking.
UTF-8
1 byte: characters from 0 to 127 == ASCII
2 bytes: characters from 128 to 2047
3 bytes: characters from 2048 to 65535
4 bytes: characters from 65536 to 1114111 (0x10FFFF)
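The ranges above can be turned into a small lookup. This is only a sketch: the class name `Utf8Length` and the method `utf8Length` are made up for illustration, and surrogate code points (0xD800 to 0xDFFF), which cannot be encoded on their own, are not treated specially here.

```java
final class Utf8Length {
    // Hypothetical helper: how many bytes UTF-8 needs for one code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        if (codePoint <= 0x7F) return 1;    // 0 to 127 (ASCII)
        if (codePoint <= 0x7FF) return 2;   // 128 to 2047
        if (codePoint <= 0xFFFF) return 3;  // 2048 to 65535
        return 4;                           // 65536 to 1114111
    }

    public static void main(String[] args) {
        System.out.println(utf8Length('A'));     // ASCII letter
        System.out.println(utf8Length(0x4E2D));  // CJK character U+4E2D
        System.out.println(utf8Length(0x1F600)); // emoji U+1F600
    }
}
```

Each boundary is the largest value that fits in the payload bits of that length: 7, 11, 16, and 21 bits.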
The leading bits of the first byte tell how many bytes the character uses:
1 byte: the 1st bit is 0
-
2 bytes:
1st byte: start with "110"
2nd byte: start with "10"
-
3 bytes:
1st byte: start with "1110"
2nd byte: start with "10"
3rd byte: start with "10"
-
4 bytes:
1st byte: start with "11110"
2nd byte: start with "10"
3rd byte: start with "10"
4th byte: start with "10"
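The prefixes listed above map directly onto bit masks. The sketch below classifies a single byte. `Utf8ByteKind` and `kind` are hypothetical names, and the return codes (0 for a continuation byte, -1 for no match) are an arbitrary choice for this example.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ByteKind {
    // Hypothetical helper: returns 1..4 for a first byte of a sequence of that
    // length, 0 for a continuation byte, and -1 if no pattern matches.
    static int kind(int b) {
        b &= 0xFF;                         // only the low 8 bits matter
        if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return 0;  // 10xxxxxx
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        return -1;                         // 11111xxx fits no pattern
    }

    public static void main(String[] args) {
        // U+4E2D is a 3-byte character: one 1110xxxx byte, then two 10xxxxxx bytes.
        for (byte b : "\u4E2D".getBytes(StandardCharsets.UTF_8)) {
            String bits = String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
            System.out.println(bits + " -> " + kind(b));
        }
    }
}
```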
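Approach and Code
Once the statement is understood, the problem is straightforward.
Use one loop with three steps per iteration. The loop invariant is that data[i] is always the first byte of a new character.
Count the leading 1s in the low 8 bits of data[i] and store the count in the variable ones. The only valid values of ones are 0, 2, 3, 4.
Starting from data[i+1], check that the top two bits of the low 8 bits are '10'. Check ones - 1 bytes in total.
Update i to i + ones (or i + 1 when ones is 0, since a 1-byte character still occupies one byte).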
public class Solution {
    public boolean validUtf8(int[] data) {
        /* 1. count the leading '1's of data[i] = ones
         * 2. check data[i + 1 .. i + ones - 1] for the '10' prefix
         * 3. update i to the first byte of the next character
         * valid ones: 0, 2, 3, 4
         */
        int i = 0;
        while (i < data.length) {
            // 1. find ones (bounded at 8 so the shift amount never goes negative)
            int ones = 0;
            while (ones < 8 && ((data[i] >> (7 - ones)) & 1) == 1) {
                ones++;
            }
            // invalid ones: 1 is a stray continuation byte, more than 4 is too long
            if (ones == 1 || ones > 4) return false;
            // 2. check the continuation bytes
            i++;
            while (ones-- > 1) {
                if (i >= data.length || ((data[i] >> 6) & 3) != 2) return false;
                // 3. update i
                i++;
            }
        }
        return true;
    }
}
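For comparison, the same check can be written as a single pass that tracks how many continuation bytes are still expected. This is an alternative sketch, not the solution above: `Utf8Check` and `pending` are names chosen for illustration. Like the solution above, it only checks the bit patterns and does not reject overlong encodings or values above 0x10FFFF.

```java
final class Utf8Check {
    // Alternative sketch: one pass, counting continuation bytes still owed.
    static boolean validUtf8(int[] data) {
        int pending = 0; // continuation bytes still expected for the current character
        for (int value : data) {
            int b = value & 0xFF; // only the low 8 bits carry data
            if (pending > 0) {
                if ((b & 0xC0) != 0x80) return false; // must be 10xxxxxx
                pending--;
            } else if ((b & 0x80) == 0x00) {
                pending = 0;                           // 0xxxxxxx, 1-byte character
            } else if ((b & 0xE0) == 0xC0) {
                pending = 1;                           // 110xxxxx
            } else if ((b & 0xF0) == 0xE0) {
                pending = 2;                           // 1110xxxx
            } else if ((b & 0xF8) == 0xF0) {
                pending = 3;                           // 11110xxx
            } else {
                return false;                          // stray 10xxxxxx or 11111xxx
            }
        }
        return pending == 0; // a truncated final character is invalid
    }

    public static void main(String[] args) {
        System.out.println(validUtf8(new int[] {197, 130, 1})); // true
        System.out.println(validUtf8(new int[] {235, 140, 4})); // false
    }
}
```

In the first input, 197 (11000101) opens a 2-byte character, 130 (10000010) continues it, and 1 is a 1-byte character. In the second, 235 (11101011) expects two continuation bytes but 4 (00000100) is not one.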
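Advantages of UTF-8
implements Unicode: can encode symbols from any script (Chinese, ...)
widely used: web pages, XML, and JSON are commonly encoded in UTF-8
byte-oriented: the encoded data is a plain sequence of 8-bit bytes
endianness independent: there is no multi-byte unit whose byte order could differ between machines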
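A quick way to see the endianness point is to encode the same character in UTF-8 and in the two UTF-16 byte orders. This is a small sketch using the standard `String.getBytes(Charset)` call; the class `EndiannessDemo` and its `hex` helper are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EndiannessDemo {
    // Hypothetical helper: encode s with the given charset, return space-separated hex.
    static String hex(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4E2D"; // the character U+4E2D
        System.out.println(hex(s, StandardCharsets.UTF_8));    // E4 B8 AD
        System.out.println(hex(s, StandardCharsets.UTF_16BE)); // 4E 2D
        System.out.println(hex(s, StandardCharsets.UTF_16LE)); // 2D 4E
    }
}
```

UTF-16 needs to know (or be told) the byte order of each 16-bit unit, while UTF-8 has a single byte sequence for every character.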
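Disadvantages of UTF-8
space: non-ASCII characters take 2 to 4 bytes, so some text is larger than in other encodings (most Chinese characters take 3 bytes in UTF-8 but 2 in UTF-16)
time: the encoding is variable-width, so decoding, counting characters, or finding the n-th character requires computation over the bytes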
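Both costs can be observed with a few lines of Java. The sketch below compares encoded sizes and counts characters by scanning the bytes. `Utf8Cost`, `encodedLength`, and `countCodePoints` are hypothetical names, and the counting assumes the input is already valid UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8Cost {
    // Space: number of bytes a string occupies in the given encoding.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    // Time: counting characters means visiting every byte and skipping
    // the 10xxxxxx continuation bytes (assumes valid UTF-8 input).
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String chinese = "\u4E2D\u6587"; // two Chinese characters
        System.out.println(encodedLength(ascii, StandardCharsets.UTF_8));      // 5
        System.out.println(encodedLength(ascii, StandardCharsets.UTF_16BE));   // 10
        System.out.println(encodedLength(chinese, StandardCharsets.UTF_8));    // 6
        System.out.println(encodedLength(chinese, StandardCharsets.UTF_16BE)); // 4
        System.out.println(countCodePoints(chinese.getBytes(StandardCharsets.UTF_8))); // 2
    }
}
```

The space cost depends on the text: ASCII-heavy text is smaller in UTF-8, while text made mostly of Chinese characters is larger than in UTF-16.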