[译]C ++中的UTF-8实用程序功能(平台独立代码)

By robot-v1.0

本文链接 https://www.kyfws.com/applications/utf-8-utility-functions-in-c-platform-independent-zh/

01月01日, 0001 - 4 分钟阅读 - 1882 个词 阅读量 0

[译]C ++中的UTF-8实用程序功能(平台独立代码)

原文地址：https://www.codeproject.com/Articles/18297/UTF-8-UTILITY-FUNCTIONS-IN-C-Platform-Independent

原文作者：Boby Thomas P

译文由本站 robot-v1.0 翻译

前言

This article describes the basics of UTF-8 and provides some utility functions for handling UTF-8. The code can be compiled for Windows as well as Linux.

本文介绍了UTF-8的基础知识,并提供了一些用于处理UTF-8的实用程序功能.该代码可以针对Windows和Linux进行编译.

介绍(Introduction)

最近,我需要验证外部客户提供的一些UTF-8格式的文件.我用Google搜索了一些示例应用程序和代码,但是找不到.然后我决定自己写一个.(Recently I needed to validate some files provided by an external customer for UTF-8 format. I Googled for some sample applications and code, but I couldn’t find one. Then I decided to write one myself.)

UTF-8格式(UTF-8 Format)

顾名思义,UTF-8(Unicode转换格式-8)是Unicode的可变长度编码格式. Unicode包含代表几乎所有已知语言所需的字符.这包括世界上大多数语言,包括大多数印度语言,例如马拉雅拉姆语,孟加拉语,古吉拉特语,奥里亚语,泰米尔语,泰卢固语和卡纳达语.(UTF-8 (Unicode Transformation Format -8) as the name suggests, is a variable length-encoding format for Unicode. Unicode contains the characters required to represent practically all known languages. This includes most of the languages in the world including most of the Indian languages like Malayalam, Bengali, Gujarati, Oriya, Tamil, Telugu and Kannada.)

Unicode将整数定义为字符.但是没有定义如何存储/编码.这已经定义为许多编码格式,例如UCS-2,UTF-7,UTF-8,UTF-16等.(Unicode defines integer numbers to characters. But how this has to be stored/ encoded is not defined. This has been defined in many of the encoding formats like UCS-2, UTF-7, UTF-8, UTF-16 etc.)

与其他编码相比,UTF-8更具吸引力的是,所有标准ASCII字符在UTF格式中也将继续保持相同.这意味着为处理ASCII字符而编写的代码将保持原样.(What makes UTF-8 attractive compared to other encoding is the fact that all the standard ASCII characters will continue to be the same in UTF format also. That means code written to handle ASCII characters will remain as it is.)

我想在这里强调几点.(Few points I would like to highlight here.)

普通的ASCII文件(从0x00到0x7f的二进制数据)是有效的UTF-8文件,因为对于0x00到0x7f范围内的字符,UTF-8编码的字符串保持不变.(A plain ASCII file (Binary data from 0x00 to 0x7f) is a valid UTF-8 file because of the fact that the UTF-8 encoded string remains the same for characters in the range 0x00 to 0x7f.)
0xEE和0xFF是两个字符,在UTF-8文件中根本不可能.(0xEE and 0xFF are two characters, which are not possible at all in a UTF-8 file.)
从UTF编码字符的第一个字节,我们可以找到UTF-8字符的字节总数.(From the first byte of a UTF-encoded character, we can find out the total number of bytes for the UTF-8 character.)
可以对所有2个进行编码(It is possible to encode all the 2)31(31)UCS字符转换为UTF.(UCS characters to UTF.)
非ASCII字符(> 0x007f)的第一个字节将在0xC0至0xFD的范围内.(First byte of a non-ASCII character (>0x007f) will be in the range of 0xC0 to 0xFD.)
非ASCII字符序列中的所有字节都将大于0x80.这意味着在任何多字节编码的UTF序列中都不会有任何ASCII字符字节.(All the bytes in the sequence for a non-ASCII character will be above 0x80. That means, there won’t be any ASCII character byte in any of the multi byte encoded UTF sequence.)
字节流以大端格式存储.(Byte streams are stored in big endian format.) 如果要在UTF-8和ASCII之间转换,请使用此应用程序.(Use this application if you want to convert between UTF-8 and ASCII.)

字符的字节序列计算(Calculation of Byte Sequence for a Character)

U-00000000 – U-0000007F(U-00000000 – U-0000007F)	0(0)*xxxxxxx(xxxxxxx)*
U-00000080 – U-000007FF(U-00000080 – U-000007FF)	110xxxxx10xxxxxx
U-00000800 – U-0000FFFF(U-00000800 – U-0000FFFF)	1110xxxx10xxxxxx10xxxxxx
U-00010000 – U-001FFFFF(U-00010000 – U-001FFFFF)	11110xxx10xxxxxx10xxxxxx10xxxxxx
U-00200000 – U-03FFFFFF(U-00200000 – U-03FFFFFF)	111110xx10xxxxxx10xxxxxx10xxxxxx10xxxxxx
U-04000000 – U-7FFFFFFF(U-04000000 – U-7FFFFFFF)	1111110x10xxxxxx10xxxxxx10xxxxxx10xxxxxx10xxxxxx

超长序列(Overlong Sequences)

输入序列过长时必须小心. UTF-8解码器不得接受字符编码,其字节数超过必要.(Care must be taken when you get an input with overlong sequence. UTF-8 decoder must not accept character coded with more bytes than necessary.)

例如,字符" A"(0x41)本身应编码为0x41.其他长期的可能性是:(For example, character ‘A’ (0x41) should be encoded to 0x41 itself. Other long run possibilities are:)

0xC1 0x81

0xE0 0x81 0x81

0xF0 0x80 0x81 0x81

0xF8 0x80 0x80 0x81 0x81

0xFC 0x80 0x80 0x80 0x81 0x81

这些序列使用普通解码器会将其自身解码为0x41.但是这些在UTF-8中是不允许的,应被视为无效的UTF字符序列.(These sequences, with a normal decoder will decode it to 0x41 itself. But these are not permitted in UTF-8 and should be considered as invalid UTF character sequences.)

使用代码(Using the Code)

以下是可用的功能:(Following are the functions available:)

结论(Conclusion)

上面的文章对UTF-8进行了基本介绍,并提供了一些实用程序功能.谷歌获取有关UTF-8的更多详细信息.请将您的宝贵建议和意见发送给我,地址为(The above article gives a basic introduction to UTF-8 and provides some utility functions. Google for more details about UTF-8. Please send me your valuable suggestions and comments at)bobypt@gmail.com(bobypt@gmail.com).(.)

历史(History)

6(6)日(th)2007年4月:初始职位(April, 2007: Initial post)

许可

本文以及所有相关的源代码和文件均已获得The Code Project Open License (CPOL)的许可。

C++ VC6 Windows Visual-Studio Dev 新闻翻译