UTF-8

8-bit U(niversal Character Set) T(ransformation) F(ormat)

This encoding of the Unicode character set is backward-compatible with ASCII and, because it is byte-oriented, avoids problems with endianness.

Encoding

Unicode character (code point)	UTF-8 (encoding)
0x00 .. 0x7F	0b0xxxxxxx
0x80 .. 0x7FF	0b110xxxxx 0b10xxxxxx
0x800 .. 0xFFFF	0b1110xxxx 0b10xxxxxx 0b10xxxxxx
0x10000 .. 0x1FFFFF	0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx
0x200000 .. 0x3FFFFFF	0b111110xx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx
0x4000000 .. 0x7FFFFFFF	0b1111110x 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx

Although the scheme above extends to 6 bytes, the UTF-8 encoding is restricted to at most 4 bytes per character so that it covers the same range of code points as UTF-16 (0 .. 0x10FFFF).
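The upper bound of each row in the table follows from the number of payload bits (the x positions) in the corresponding bit pattern. The following is a minimal sketch of that arithmetic; the helper names payloadbits_sketch and maxcodepoint_sketch are hypothetical and not part of utf8.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of utf8.h): number of payload bits, the 'x'
 * positions in the table, carried by a sequence of nbytes bytes (1..6). */
static unsigned payloadbits_sketch(unsigned nbytes)
{
   assert(nbytes >= 1 && nbytes <= 6);
   if (nbytes == 1) return 7;                 /* 0b0xxxxxxx */
   /* The first byte carries 7-nbytes bits, every following byte carries 6. */
   return (7 - nbytes) + 6 * (nbytes - 1);
}

/* Largest code point that fits into a sequence of nbytes bytes. */
static uint32_t maxcodepoint_sketch(unsigned nbytes)
{
   return (UINT32_C(1) << payloadbits_sketch(nbytes)) - 1;
}
```

For 4 bytes this gives 21 payload bits, which is more than enough for the limit 0x10FFFF.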

Summary

UTF-8	8-bit U(niversal Character Set) T(ransformation) F(ormat)
Copyright	This program is free software.
Files
C-kern/api/string/utf8.h	Header file UTF-8.
C-kern/string/utf8.c	Implementation file UTF-8 impl.
Types
struct utf8validator_t	Export utf8validator_t into global namespace.
Variables
g_utf8_bytesperchar	Stores the length in bytes of an encoded utf8 character indexed by the first encoded byte.
Functions
test
unittest_string_utf8	Test <escape_char>.
utf8
query
maxchar_utf8	Returns the maximum character value (Unicode code point) that can be encoded in UTF-8.
maxsize_utf8	Returns the maximum size in bytes of a UTF-8 encoded multibyte sequence.
isvalidfirstbyte_utf8	Returns true if the byte is a legal first byte of a UTF-8 encoded multibyte sequence.
isfirstbyte_utf8	Returns true if this byte is a possible first (start) byte of a UTF-8 encoded multibyte sequence.
sizefromfirstbyte_utf8	Returns the size in bytes of a correctly encoded multibyte sequence, determined from its first byte.
sizechar_utf8	Returns the size in bytes of uchar as an encoded multibyte sequence.
length_utf8	Returns the number of UTF-8 characters encoded in the string buffer.
encode-decode
decodechar_utf8	Decodes UTF-8 encoded bytes beginning at strstart and returns the character in uchar.
encodechar_utf8	Encodes uchar as UTF-8 into the buffer of size strsize starting at strstart.
skipchar_utf8	Skips the next UTF-8 encoded character.
utf8validator_t	Validates a stream of bytes that arrives in blocks.
lifetime
utf8validator_INIT	Static initializer.
init_utf8validator	Same as assigning utf8validator_INIT.
free_utf8validator	Clears data members and checks that there is no internal prefix stored.
query
sizeprefix_utf8validator	Returns a value != 0 if the last multibyte sequence was not fully contained in the last validated buffer.
validate
validate_utf8validator	Validates a data block of length size in bytes.
stringstream_t
read-utf8
nextutf8_stringstream	Reads next utf-8 encoded character from strstream.
peekutf8_stringstream	Same as nextutf8_stringstream except the strstream is not changed.
skiputf8_stringstream	Skips next utf-8 encoded character from strstream.
skipillegalutf8_strstream	Skips bytes until the end of the stream or the beginning of a valid UTF-8 encoding is found.
find-utf8
findutf8_stringstream	Searches for unicode character in utf8 encoded stringstream.
inline implementation
utf8_t
maxchar_utf8	Implements utf8.maxchar_utf8.
isfirstbyte_utf8	Implements utf8.isfirstbyte_utf8.
isvalidfirstbyte_utf8	Implements utf8.isvalidfirstbyte_utf8.
maxsize_utf8	Implements utf8.maxsize_utf8.
sizefromfirstbyte_utf8	Implements utf8.sizefromfirstbyte_utf8.
sizechar_utf8	Implements utf8.sizechar_utf8.
skipchar_utf8	Implements utf8.skipchar_utf8.
utf8validator_t
init_utf8validator	Implements utf8validator_t.init_utf8validator.
free_utf8validator	Implements utf8validator_t.free_utf8validator.
sizeprefix_utf8validator	Implements utf8validator_t.sizeprefix_utf8validator.
stringstream_t
peekutf8_stringstream	Implements stringstream_t.peekutf8_stringstream.
skiputf8_stringstream	Implements stringstream_t.skiputf8_stringstream.

Copyright

This program is free software. You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Author

Files

C-kern/api/string/utf8.h

Header file UTF-8.

C-kern/string/utf8.c

Implementation file UTF-8 impl.

Types

struct utf8validator_t

typedef struct utf8validator_t utf8validator_t

Export utf8validator_t into global namespace.

Variables

g_utf8_bytesperchar

extern uint8_t g_utf8_bytesperchar[256]

Stores the length in bytes of an encoded utf8 character indexed by the first encoded byte.
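A table of this kind could be filled as sketched below. This is only an illustration: the array s_bytesperchar_sketch is a hypothetical stand-in for g_utf8_bytesperchar, and storing 0 for bytes that cannot start a sequence is an assumption based on the description of sizefromfirstbyte_utf8.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for g_utf8_bytesperchar. The value 0 for illegal
 * first bytes is an assumption, not taken from the real table. */
static uint8_t s_bytesperchar_sketch[256];

static void init_bytesperchar_sketch(void)
{
   for (unsigned b = 0; b < 256; ++b) {
      uint8_t n;
      if      (b < 0x80) n = 1;   /* 0b0xxxxxxx */
      else if (b < 0xC0) n = 0;   /* 0b10xxxxxx: continuation byte */
      else if (b < 0xE0) n = 2;   /* 0b110xxxxx */
      else if (b < 0xF0) n = 3;   /* 0b1110xxxx */
      else if (b < 0xF8) n = 4;   /* 0b11110xxx */
      else               n = 0;   /* 5- and 6-byte forms are excluded */
      s_bytesperchar_sketch[b] = n;
   }
}

/* Spot checks of the filled table. */
static void check_bytesperchar_sketch(void)
{
   init_bytesperchar_sketch();
   assert(s_bytesperchar_sketch['A']  == 1);
   assert(s_bytesperchar_sketch[0xC3] == 2);
   assert(s_bytesperchar_sketch[0xE2] == 3);
   assert(s_bytesperchar_sketch[0xF0] == 4);
   assert(s_bytesperchar_sketch[0x80] == 0);
   assert(s_bytesperchar_sketch[0xFF] == 0);
}
```

Indexing by the first byte turns the size lookup into a single array access instead of a chain of comparisons.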

Functions

Summary

test
unittest_string_utf8	Test <escape_char>.

test

unittest_string_utf8

int unittest_string_utf8( void )

Test <escape_char>.

utf8

Summary

query
maxchar_utf8	Returns the maximum character value (Unicode code point) that can be encoded in UTF-8.
maxsize_utf8	Returns the maximum size in bytes of a UTF-8 encoded multibyte sequence.
isvalidfirstbyte_utf8	Returns true if the byte is a legal first byte of a UTF-8 encoded multibyte sequence.
isfirstbyte_utf8	Returns true if this byte is a possible first (start) byte of a UTF-8 encoded multibyte sequence.
sizefromfirstbyte_utf8	Returns the size in bytes of a correctly encoded multibyte sequence, determined from its first byte.
sizechar_utf8	Returns the size in bytes of uchar as an encoded multibyte sequence.
length_utf8	Returns the number of UTF-8 characters encoded in the string buffer.
encode-decode
decodechar_utf8	Decodes UTF-8 encoded bytes beginning at strstart and returns the character in uchar.
encodechar_utf8	Encodes uchar as UTF-8 into the buffer of size strsize starting at strstart.
skipchar_utf8	Skips the next UTF-8 encoded character.

query

maxchar_utf8

char32_t maxchar_utf8( void )

Returns the maximum character value (Unicode code point) that can be encoded in UTF-8. The minimum Unicode code point is 0. The returned value is 0x10FFFF.

maxsize_utf8

uint8_t maxsize_utf8( void )

Returns the maximum size in bytes of a UTF-8 encoded multibyte sequence. The returned value is 4.

isvalidfirstbyte_utf8

bool isvalidfirstbyte_utf8( const uint8_t firstbyte )

Returns true if the byte is a legal first byte of a UTF-8 encoded multibyte sequence.

isfirstbyte_utf8

bool isfirstbyte_utf8( const uint8_t firstbyte )

Returns true if this byte is a possible first (start) byte of a UTF-8 encoded multibyte sequence. This function assumes correct encoding; it is therefore possible that isfirstbyte_utf8 returns true while isvalidfirstbyte_utf8 returns false.
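The difference between the two predicates can be illustrated as follows. This is a sketch under stated assumptions, not the library's implementation: the names are hypothetical, and "valid" is judged here by bit pattern only (first bytes of the 1- to 4-byte forms), which may be looser or stricter than what isvalidfirstbyte_utf8 actually checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Possible start byte: anything that is not a continuation byte 0b10xxxxxx. */
static bool isfirstbyte_sketch(uint8_t b)
{
   return (b & 0xC0) != 0x80;
}

/* Legal start byte under the 4-byte restriction.
 * Assumption: only the bit pattern is checked. */
static bool isvalidfirstbyte_sketch(uint8_t b)
{
   return b < 0x80 || (b >= 0xC0 && b < 0xF8);
}

/* Every valid first byte is also a possible first byte, but not vice versa. */
static void check_firstbyte_sketch(void)
{
   for (unsigned b = 0; b < 256; ++b) {
      assert(!isvalidfirstbyte_sketch((uint8_t) b) || isfirstbyte_sketch((uint8_t) b));
   }
   /* 0xFF is not a continuation byte, yet it starts no legal sequence. */
   assert(isfirstbyte_sketch(0xFF) && !isvalidfirstbyte_sketch(0xFF));
}
```

The cheaper test is sufficient when the input is already known to be well-formed, for example when searching for the next character boundary.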

sizefromfirstbyte_utf8

uint8_t sizefromfirstbyte_utf8( const uint8_t firstbyte )

Returns the size in bytes of a correctly encoded multibyte sequence, determined from firstbyte, the first byte of the encoded byte sequence. The returned values are in the range 0..4 (0..<maxsize_utf8>). A return value between 1 and 4 describes a valid first byte. A value of 0 indicates that firstbyte is not a valid first byte of a UTF-8 encoded byte sequence.

sizechar_utf8

uint8_t sizechar_utf8( char32_t uchar )

Returns the size in bytes of uchar as an encoded multibyte sequence. The returned values are in the range 1..<maxsize_utf8>. If uchar is bigger than maxchar_utf8, no error is reported and the function returns maxsize_utf8.
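The size follows directly from the code point ranges of the encoding table. A minimal sketch, using the hypothetical name sizechar_sketch and uint32_t in place of char32_t:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical re-implementation for illustration only. */
static uint8_t sizechar_sketch(uint32_t uchar)
{
   if (uchar < 0x80)    return 1;
   if (uchar < 0x800)   return 2;
   if (uchar < 0x10000) return 3;
   return 4;   /* also returned for values above 0x10FFFF, no error reported */
}

/* The size changes exactly at the range boundaries of the encoding table. */
static void check_sizechar_sketch(void)
{
   assert(sizechar_sketch(0x7F)   == 1 && sizechar_sketch(0x80)    == 2);
   assert(sizechar_sketch(0x7FF)  == 2 && sizechar_sketch(0x800)   == 3);
   assert(sizechar_sketch(0xFFFF) == 3 && sizechar_sketch(0x10000) == 4);
   assert(sizechar_sketch(0x7FFFFFFF) == 4);   /* out of range, still 4 */
}
```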

length_utf8

size_t length_utf8( const uint8_t * strstart,
const uint8_t * strend )

Returns the number of UTF-8 characters encoded in the string buffer. The first byte of a multibyte sequence determines its size. This function assumes that the UTF-8 encoding is correct and does not check the bytes following the first. Illegally encoded start bytes are skipped and not counted. The last multibyte sequence is counted as one character even if one or more of its bytes are missing.

Parameter

strstart	Start of string buffer (lowest address)
strend	Highest memory address of byte after last byte in the string buffer. If strend <= strstart then the string is considered the empty string. Set this value to strstart + length_of_string_in_bytes.
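The counting rules described above can be sketched as follows. The function names are hypothetical and the helper only inspects bit patterns; the real length_utf8 may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sequence size from the first byte, 0 if illegal. */
static unsigned sizefromfirstbyte_sketch(uint8_t b)
{
   if (b < 0x80)           return 1;
   if ((b & 0xE0) == 0xC0) return 2;
   if ((b & 0xF0) == 0xE0) return 3;
   if ((b & 0xF8) == 0xF0) return 4;
   return 0;
}

/* Counts characters in [strstart, strend) by looking at first bytes only. */
static size_t length_sketch(const uint8_t *strstart, const uint8_t *strend)
{
   size_t len = 0;
   const uint8_t *p = strstart;

   assert(strstart != 0 && strend != 0);
   while (p < strend) {
      unsigned n = sizefromfirstbyte_sketch(*p);
      if (n == 0) {           /* illegal start byte: skipped, not counted */
         ++p;
         continue;
      }
      ++len;                  /* a truncated last sequence still counts */
      if ((size_t) (strend - p) < n) break;
      p += n;
   }
   return len;
}
```

Because only first bytes are examined, the result is meaningful only for input that has been validated beforehand.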

encode-decode

decodechar_utf8

uint8_t decodechar_utf8(
const uint8_t strstart[/*maxsize_utf8() or big enough*/],
/*out*/char32_t * uchar
)

Decodes UTF-8 encoded bytes beginning at strstart and returns the character in uchar. The buffer must be large enough to hold the whole sequence but never needs to be larger than maxsize_utf8 bytes. Use sizefromfirstbyte_utf8 to determine the required size if strstart contains fewer than maxsize_utf8 bytes.

The number of decoded bytes is returned.

A return value of 0 indicates an invalid first byte of the multibyte sequence (EILSEQ). The function assumes that all bytes after the first are encoded correctly. Use utf8validator_t to make sure that a string contains only correctly encoded UTF-8.

Example

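const uint8_t * str    = &strbuffer[0] ;
const uint8_t * strend = strbuffer + sizeof(strbuffer) ;
while (str < strend) {
   /* fewer than maxsize_utf8() bytes left: check that the last character is complete */
   if ((size_t)(strend-str) < maxsize_utf8()) {
      if (sizefromfirstbyte_utf8(str[0]) > (size_t)(strend-str)) {
         /* ... not enough data for last character ... */
         break ;
      }
   }
   char32_t uchar ;
   uint8_t  len = decodechar_utf8(str, &uchar) ;
   if (len == 0) {
      /* ... invalid first byte (EILSEQ): without this check the loop would never end ... */
      break ;
   }
   str += len ;
   /* ... do something with uchar ... */
}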
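The decoding step itself can be sketched as below. This is an illustration under assumptions, not the code of decodechar_utf8: the name decode_sketch is hypothetical, uint32_t stands in for char32_t, and, as described above, only the first byte is checked.

```c
#include <assert.h>
#include <stdint.h>

/* Decodes one sequence starting at str and returns the number of bytes
 * consumed, or 0 if the first byte is invalid. Continuation bytes are not
 * checked; the caller must ensure enough bytes are readable. */
static uint8_t decode_sketch(const uint8_t *str, uint32_t *uchar)
{
   assert(str != 0 && uchar != 0);
   uint8_t b = str[0];

   if (b < 0x80) {                       /* 0b0xxxxxxx */
      *uchar = b;
      return 1;
   }
   if ((b & 0xE0) == 0xC0) {             /* 0b110xxxxx */
      *uchar = ((uint32_t) (b & 0x1F) << 6)
             |  (uint32_t) (str[1] & 0x3F);
      return 2;
   }
   if ((b & 0xF0) == 0xE0) {             /* 0b1110xxxx */
      *uchar = ((uint32_t) (b & 0x0F) << 12)
             | ((uint32_t) (str[1] & 0x3F) << 6)
             |  (uint32_t) (str[2] & 0x3F);
      return 3;
   }
   if ((b & 0xF8) == 0xF0) {             /* 0b11110xxx */
      *uchar = ((uint32_t) (b & 0x07) << 18)
             | ((uint32_t) (str[1] & 0x3F) << 12)
             | ((uint32_t) (str[2] & 0x3F) << 6)
             |  (uint32_t) (str[3] & 0x3F);
      return 4;
   }
   return 0;                             /* continuation byte or 0xF8..0xFF */
}
```

Each continuation byte contributes its low 6 bits, so the code point is assembled by shifting in 6 bits per following byte.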

encodechar_utf8

uint8_t encodechar_utf8( size_t strsize,
/*out*/uint8_t strstart[strsize],
char32_t uchar )

Encodes uchar as UTF-8 into the buffer of size strsize starting at strstart. The number of written bytes is returned. The maximum return value is maxsize_utf8. A return value of 0 indicates an error: either uchar is greater than maxchar_utf8 or strsize is not big enough.
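The encoding direction, including the two error cases, might look like the following sketch. The name encode_sketch is hypothetical and uint32_t stands in for char32_t; the real function may check additional conditions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes uchar as UTF-8 to strstart and returns the number of bytes written,
 * or 0 if uchar > 0x10FFFF or the buffer is too small. */
static uint8_t encode_sketch(size_t strsize, uint8_t *strstart, uint32_t uchar)
{
   assert(strstart != 0);
   uint8_t n = uchar < 0x80 ? 1 : uchar < 0x800 ? 2 : uchar < 0x10000 ? 3 : 4;

   if (uchar > 0x10FFFF || strsize < n) return 0;

   switch (n) {
   case 1:
      strstart[0] = (uint8_t) uchar;
      break;
   case 2:
      strstart[0] = (uint8_t) (0xC0 | (uchar >> 6));
      strstart[1] = (uint8_t) (0x80 | (uchar & 0x3F));
      break;
   case 3:
      strstart[0] = (uint8_t) (0xE0 | (uchar >> 12));
      strstart[1] = (uint8_t) (0x80 | ((uchar >> 6) & 0x3F));
      strstart[2] = (uint8_t) (0x80 | (uchar & 0x3F));
      break;
   default:
      strstart[0] = (uint8_t) (0xF0 | (uchar >> 18));
      strstart[1] = (uint8_t) (0x80 | ((uchar >> 12) & 0x3F));
      strstart[2] = (uint8_t) (0x80 | ((uchar >> 6) & 0x3F));
      strstart[3] = (uint8_t) (0x80 | (uchar & 0x3F));
      break;
   }
   return n;
}
```

Nothing is written in the error cases, so a caller can retry with a larger buffer.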

skipchar_utf8

uint8_t skipchar_utf8( const uint8_t strstart[/*maxsize_utf8() or big enough*/] )

Skips the next utf-8 encoded character. The encoded byte sequence is not checked for correctness. The number of skipped bytes is returned. The maximum return value is maxsize_utf8. A return value of 0 indicates an error, i.e. the first byte of the multibyte sequence is invalid (EILSEQ).

utf8validator_t

struct utf8validator_t

Validates a stream of bytes that arrives in blocks. If a multibyte sequence crosses the boundary between two data blocks, its first part is stored internally as prefix data for the next block.

Summary

lifetime
utf8validator_INIT	Static initializer.
init_utf8validator	Same as assigning utf8validator_INIT.
free_utf8validator	Clears data members and checks that there is no internal prefix stored.
query
sizeprefix_utf8validator	Returns a value != 0 if the last multibyte sequence was not fully contained in the last validated buffer.
validate
validate_utf8validator	Validates a data block of length size in bytes.

lifetime

utf8validator_INIT

#define utf8validator_INIT { 0, { 0, 0, 0, 0} }

Static initializer.

init_utf8validator

void init_utf8validator( /*out*/utf8validator_t * utf8validator )

Same as assigning utf8validator_INIT.

free_utf8validator

int free_utf8validator( utf8validator_t * utf8validator )

Clears data members and checks that there is no internal prefix stored.

Returns

0	All multi-byte character sequences fit into last buffer.
EILSEQ	Last multi-byte character sequence was incomplete. Need more data.

query

sizeprefix_utf8validator

uint8_t sizeprefix_utf8validator( const utf8validator_t * utf8validator )

Returns a value != 0 if the last multibyte sequence was not fully contained in the last validated buffer.

validate

validate_utf8validator

int validate_utf8validator( utf8validator_t * utf8validator,
size_t size,
const uint8_t data[size],
/*err*/size_t * erroffset )

Validates a data block of length size in bytes. If the last multibyte sequence is not fully contained in the data block but is a valid prefix, it is stored internally. When this function is called again, the stored prefix is prepended to the new data block. If an error occurs, EILSEQ is returned and the parameter erroffset is set to the offset of the byte which is not encoded correctly.
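The prefix mechanism can be sketched as follows. Everything here is illustrative: the type and function names are hypothetical, the array member name prefix is an assumption (only size_of_prefix appears in the inline macros), and the sketch checks bit patterns only, so it does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
   uint8_t size_of_prefix;   /* bytes of an unfinished sequence */
   uint8_t prefix[4];        /* assumption: name of the storage array */
} validator_sketch_t;

static unsigned seqsize_sketch(uint8_t b)
{
   if (b < 0x80)           return 1;
   if ((b & 0xE0) == 0xC0) return 2;
   if ((b & 0xF0) == 0xE0) return 3;
   if ((b & 0xF8) == 0xF0) return 4;
   return 0;
}

static int validate_sketch(validator_sketch_t *v, size_t size,
                           const uint8_t *data, size_t *erroffset)
{
   size_t i = 0;

   assert(v != 0 && erroffset != 0);
   /* First complete a sequence left over from the previous block. */
   if (v->size_of_prefix) {
      unsigned need = seqsize_sketch(v->prefix[0]);
      while (v->size_of_prefix < need) {
         if (i == size) return 0;          /* still incomplete */
         if ((data[i] & 0xC0) != 0x80) { *erroffset = i; return EILSEQ; }
         v->prefix[v->size_of_prefix++] = data[i++];
      }
      v->size_of_prefix = 0;
   }
   while (i < size) {
      unsigned n = seqsize_sketch(data[i]);
      size_t   k = 1;
      if (n == 0) { *erroffset = i; return EILSEQ; }
      for (; k < n && i + k < size; ++k) {
         if ((data[i + k] & 0xC0) != 0x80) { *erroffset = i + k; return EILSEQ; }
      }
      if (k < n) {   /* block ended inside the sequence: keep the valid prefix */
         memcpy(v->prefix, data + i, k);
         v->size_of_prefix = (uint8_t) k;
         return 0;
      }
      i += n;
   }
   return 0;
}
```

After the last block, a nonzero size_of_prefix means the stream ended in the middle of a character, which is the condition free_utf8validator reports as EILSEQ.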

stringstream_t

struct stringstream_t

Summary

read-utf8
nextutf8_stringstream	Reads next utf-8 encoded character from strstream.
peekutf8_stringstream	Same as nextutf8_stringstream except the strstream is not changed.
skiputf8_stringstream	Skips next utf-8 encoded character from strstream.
skipillegalutf8_strstream	Skips bytes until the end of the stream or the beginning of a valid UTF-8 encoding is found.
find-utf8
findutf8_stringstream	Searches for unicode character in utf8 encoded stringstream.

read-utf8

nextutf8_stringstream

int nextutf8_stringstream( struct stringstream_t * strstream,
/*out*/char32_t * uchar )

Reads the next UTF-8 encoded character from strstream. The character is returned as a Unicode character (code point) in uchar. The next pointer of strstream is advanced by the number of decoded bytes.

Returns

0	UTF8 character decoded and returned in uchar and memory pointer is moved to next character.
ENODATA	strstream is empty.
ENOTEMPTY	The string is not empty but another character could not be decoded because there are not enough bytes left in the string.
EILSEQ	The next multibyte sequence is not encoded in a correct way. strstream is not changed. Use skipillegalutf8_strstream to skip all illegal bytes.
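The four return codes can be illustrated with a small stand-in for the stream type. The names stream_sketch_t and nextutf8_sketch are hypothetical; the members next and end are taken from the inline macros, while the order of the checks and the fallback value for ENODATA are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef ENODATA
#define ENODATA 61   /* assumption: fallback where errno.h does not define it */
#endif

typedef struct {
   const uint8_t *next;   /* next unread byte */
   const uint8_t *end;    /* one past the last byte */
} stream_sketch_t;

static int nextutf8_sketch(stream_sketch_t *s, uint32_t *uchar)
{
   assert(s != 0 && uchar != 0);
   if (s->next >= s->end) return ENODATA;

   const uint8_t *p = s->next;
   uint8_t  b = p[0];
   unsigned n = b < 0x80 ? 1u
              : (b & 0xE0) == 0xC0 ? 2u
              : (b & 0xF0) == 0xE0 ? 3u
              : (b & 0xF8) == 0xF0 ? 4u : 0u;

   if (n == 0) return EILSEQ;                        /* stream unchanged */
   if ((size_t) (s->end - p) < n) return ENOTEMPTY;  /* character truncated */

   /* Payload bits of the first byte: 7, 5, 4 or 3 of them. */
   uint32_t c = n == 1 ? b : (uint32_t) (b & (0xFF >> (n + 1)));
   for (unsigned i = 1; i < n; ++i) {
      if ((p[i] & 0xC0) != 0x80) return EILSEQ;      /* stream unchanged */
      c = (c << 6) | (uint32_t) (p[i] & 0x3F);
   }
   *uchar  = c;
   s->next = p + n;
   return 0;
}
```

Because the stream is only advanced on success, a peek operation can be built by calling the same function on a copy of the stream.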

peekutf8_stringstream

int peekutf8_stringstream( const struct stringstream_t * strstream,
/*out*/char32_t * uchar )

Same as nextutf8_stringstream except that strstream is not changed. Calling this function more than once always returns the same value in uchar.

skiputf8_stringstream

int skiputf8_stringstream( struct stringstream_t * strstream )

Skips the next UTF-8 encoded character in strstream. The next pointer of strstream is advanced by the size of the next character.

Returns

0	Memory pointer is moved to next character.
ENODATA	strstream is empty.
ENOTEMPTY	The string is not empty but another character could not be decoded because there are not enough bytes left in the string.
EILSEQ	The next multibyte sequence is not encoded in a correct way. strstream is not changed. Use skipillegalutf8_strstream to skip all illegal bytes.

skipillegalutf8_strstream

void skipillegalutf8_strstream( struct stringstream_t * strstream )

Skips bytes until the end of the stream or the beginning of a valid UTF-8 encoding is found.

find-utf8

findutf8_stringstream

const uint8_t * findutf8_stringstream( const struct stringstream_t * strstream,
char32_t uchar )

Searches for a Unicode character in a UTF-8 encoded stringstream. The returned value points to the start address of the multibyte sequence in the unread buffer. A return value of 0 indicates that strstream does not contain the multibyte sequence or that uchar is bigger than maxchar_utf8 and therefore invalid.
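One way to implement such a search is to encode uchar once and then look for the resulting byte pattern. The sketch below takes the unread range as two pointers instead of a stringstream_t; the name findutf8_sketch is hypothetical and the real function may work differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the address of the first occurrence of uchar in [next, end),
 * or 0 if it is not found or uchar > 0x10FFFF. */
static const uint8_t *findutf8_sketch(const uint8_t *next, const uint8_t *end,
                                      uint32_t uchar)
{
   uint8_t pat[4];
   size_t  n;

   assert(next != 0 && end != 0);
   if (uchar > 0x10FFFF || next >= end) return 0;

   if (uchar < 0x80) {
      pat[0] = (uint8_t) uchar;
      n = 1;
   } else if (uchar < 0x800) {
      pat[0] = (uint8_t) (0xC0 | (uchar >> 6));
      pat[1] = (uint8_t) (0x80 | (uchar & 0x3F));
      n = 2;
   } else if (uchar < 0x10000) {
      pat[0] = (uint8_t) (0xE0 | (uchar >> 12));
      pat[1] = (uint8_t) (0x80 | ((uchar >> 6) & 0x3F));
      pat[2] = (uint8_t) (0x80 | (uchar & 0x3F));
      n = 3;
   } else {
      pat[0] = (uint8_t) (0xF0 | (uchar >> 18));
      pat[1] = (uint8_t) (0x80 | ((uchar >> 12) & 0x3F));
      pat[2] = (uint8_t) (0x80 | ((uchar >> 6) & 0x3F));
      pat[3] = (uint8_t) (0x80 | (uchar & 0x3F));
      n = 4;
   }

   /* Plain byte search; stops when fewer than n bytes remain. */
   for (const uint8_t *p = next; (size_t) (end - p) >= n; ++p) {
      if (memcmp(p, pat, n) == 0) return p;
   }
   return 0;
}
```

In correctly encoded text a byte-wise search cannot match in the middle of another character, because the pattern begins with a start byte and start bytes never occur as continuation bytes.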

inline implementation

Summary

utf8_t
maxchar_utf8	Implements utf8.maxchar_utf8.
isfirstbyte_utf8	Implements utf8.isfirstbyte_utf8.
isvalidfirstbyte_utf8	Implements utf8.isvalidfirstbyte_utf8.
maxsize_utf8	Implements utf8.maxsize_utf8.
sizefromfirstbyte_utf8	Implements utf8.sizefromfirstbyte_utf8.
sizechar_utf8	Implements utf8.sizechar_utf8.
skipchar_utf8	Implements utf8.skipchar_utf8.
utf8validator_t
init_utf8validator	Implements utf8validator_t.init_utf8validator.
free_utf8validator	Implements utf8validator_t.free_utf8validator.
sizeprefix_utf8validator	Implements utf8validator_t.sizeprefix_utf8validator.
stringstream_t
peekutf8_stringstream	Implements stringstream_t.peekutf8_stringstream.
skiputf8_stringstream	Implements stringstream_t.skiputf8_stringstream.

utf8_t

utf8validator_t

init_utf8validator

#define init_utf8validator(utf8validator) \
   ((void)(*(utf8validator) = (utf8validator_t) utf8validator_INIT))

Implements utf8validator_t.init_utf8validator.

free_utf8validator

#define free_utf8validator(utf8validator) \
   ( __extension__ ({ int _err ; utf8validator_t * _v ; _v = (utf8validator) ; \
      _err = _v->size_of_prefix ? EILSEQ : 0 ; _v->size_of_prefix = 0 ; _err ; }))

Implements utf8validator_t.free_utf8validator.

sizeprefix_utf8validator

#define sizeprefix_utf8validator(utf8validator) \
   ((utf8validator)->size_of_prefix)

Implements utf8validator_t.sizeprefix_utf8validator.

stringstream_t

peekutf8_stringstream

#define peekutf8_stringstream(strstream, uchar) \
   ( __extension__ ({ stringstream_t * _strstr = (strstream) ; \
      nextutf8_stringstream( &(stringstream_t) stringstream_INIT( _strstr->next, _strstr->end), uchar) ; }))

Implements stringstream_t.peekutf8_stringstream.

skiputf8_stringstream

#define skiputf8_stringstream(strstream) \
   ( __extension__ ({ char32_t _uchar ; nextutf8_stringstream( (strstream), &_uchar ) ; }))

Implements stringstream_t.skiputf8_stringstream.