UTF-8

8-bit U(niversal Character Set) T(ransformation) F(ormat)

This encoding of the Unicode character set is backward-compatible with ASCII and, because it is byte-oriented, avoids problems with endianness.

Encoding

Unicode character (codepoint)    UTF-8 (encoding)
0x00      ..  0x7F               0b0xxxxxxx
0x80      ..  0x7FF              0b110xxxxx 0b10xxxxxx
0x800     ..  0xFFFF             0b1110xxxx 0b10xxxxxx 0b10xxxxxx
0x10000   ..  0x1FFFFF           0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx
0x200000  ..  0x3FFFFFF          0b111110xx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx
0x4000000 ..  0x7FFFFFFF         0b1111110x 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx

The UTF-8 encoding is restricted to a maximum of 4 bytes per character to be compatible with UTF-16 (0 .. 0x10FFFF).  The 5-byte and 6-byte forms are therefore never produced.
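
To make the bit patterns listed above concrete, here is a minimal sketch of an encoder restricted to 4 bytes.  The function name sketch_encode_utf8 is hypothetical and not part of utf8.h; it only illustrates the layout and, as an assumption of this sketch, does not reject surrogate code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of utf8.h): writes the UTF-8 encoding of cp
 * into out and returns the number of bytes written, or 0 if cp > 0x10FFFF.
 * Assumption: surrogates (0xD800..0xDFFF) are not rejected in this sketch. */
size_t sketch_encode_utf8(uint32_t cp, uint8_t out[4])
{
   if (cp <= 0x7F) {            /* 0b0xxxxxxx */
      out[0] = (uint8_t) cp;
      return 1;
   }
   if (cp <= 0x7FF) {           /* 0b110xxxxx 0b10xxxxxx */
      out[0] = (uint8_t) (0xC0 | (cp >> 6));
      out[1] = (uint8_t) (0x80 | (cp & 0x3F));
      return 2;
   }
   if (cp <= 0xFFFF) {          /* 0b1110xxxx 0b10xxxxxx 0b10xxxxxx */
      out[0] = (uint8_t) (0xE0 | (cp >> 12));
      out[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (uint8_t) (0x80 | (cp & 0x3F));
      return 3;
   }
   if (cp <= 0x10FFFF) {        /* 0b11110xxx followed by three 0b10xxxxxx */
      out[0] = (uint8_t) (0xF0 | (cp >> 18));
      out[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (uint8_t) (0x80 | (cp & 0x3F));
      return 4;
   }
   return 0;                    /* outside the range shared with UTF-16 */
}
```

For example, the code point 0x20AC falls into the third row and is encoded as the three bytes 0xE2 0x82 0xAC.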

Summary
UTF-8: 8-bit U(niversal Character Set) T(ransformation) F(ormat)
Copyright: This program is free software.
Files
C-kern/api/string/utf8.h: Header file UTF-8.
C-kern/string/utf8.c: Implementation file UTF-8 impl.
Types
struct utf8validator_t: Export utf8validator_t into global namespace.
Variables
g_utf8_bytesperchar: Stores the length in bytes of an encoded utf8 character indexed by the first encoded byte.
Functions
test
unittest_string_utf8: Test <escape_char>.
utf8
query
maxchar_utf8: Returns the maximum character value (Unicode code point) which can be encoded into UTF-8.
maxsize_utf8: Returns the maximum size in bytes of a UTF-8 encoded multibyte sequence.
isvalidfirstbyte_utf8: Returns true if the byte is a legal first byte of a UTF-8 encoded multibyte sequence.
isfirstbyte_utf8: Returns true if this byte is a possible first (start) byte of a UTF-8 encoded multibyte sequence.
sizefromfirstbyte_utf8: Returns the size in bytes of a correctly encoded multibyte sequence by means of the value of its first byte.
sizechar_utf8: Returns the size in bytes of uchar as encoded multibyte sequence.
length_utf8: Returns number of UTF-8 characters encoded in string buffer.
encode-decode
decodechar_utf8: Decodes UTF-8 encoded bytes beginning from strstart and returns character in uchar.
encodechar_utf8: Encodes uchar into UTF-8 encoded string of size strsize starting at strstart.
skipchar_utf8: Skips the next UTF-8 encoded character.
utf8validator_t: Validates a stream of bytes that is delivered in blocks.
lifetime
utf8validator_INIT: Static initializer.
init_utf8validator: Same as assigning utf8validator_INIT.
free_utf8validator: Clears data members and checks that there is no internal prefix stored.
query
sizeprefix_utf8validator: Returns a value != 0 if the last multibyte sequence was not fully contained in the last validated buffer.
validate
validate_utf8validator: Validates a data block of length size in bytes.
stringstream_t
read-utf8
nextutf8_stringstream: Reads next UTF-8 encoded character from strstream.
peekutf8_stringstream: Same as nextutf8_stringstream except the strstream is not changed.
skiputf8_stringstream: Skips next UTF-8 encoded character from strstream.
skipillegalutf8_strstream: Skips bytes until end of stream or the begin of a valid UTF-8 encoding is found.
find-utf8
findutf8_stringstream: Searches for Unicode character in UTF-8 encoded stringstream.
inline implementation
utf8_t
maxchar_utf8: Implements utf8.maxchar_utf8.
isfirstbyte_utf8: Implements utf8.isfirstbyte_utf8.
isvalidfirstbyte_utf8: Implements utf8.isvalidfirstbyte_utf8.
maxsize_utf8: Implements utf8.maxsize_utf8.
sizefromfirstbyte_utf8: Implements utf8.sizefromfirstbyte_utf8.
sizechar_utf8: Implements utf8.sizechar_utf8.
skipchar_utf8: Implements utf8.skipchar_utf8.
utf8validator_t
init_utf8validator: Implements utf8validator_t.init_utf8validator.
free_utf8validator: Implements utf8validator_t.free_utf8validator.
sizeprefix_utf8validator: Implements utf8validator_t.sizeprefix_utf8validator.
stringstream_t
peekutf8_stringstream: Implements stringstream_t.peekutf8_stringstream.
skiputf8_stringstream: Implements stringstream_t.skiputf8_stringstream.

Copyright

This program is free software.  You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.

Author

© 2013 Jörg Seebohn

Files

C-kern/api/string/utf8.h

Header file UTF-8.

C-kern/string/utf8.c

Implementation file UTF-8 impl.

Types

struct utf8validator_t

typedef struct utf8validator_t utf8validator_t

Export utf8validator_t into global namespace.

Variables

g_utf8_bytesperchar

extern uint8_t g_utf8_bytesperchar[256]

Stores the length in bytes of an encoded utf8 character indexed by the first encoded byte.

Functions

test

unittest_string_utf8

int unittest_string_utf8(void)

Test <escape_char>.

utf8

query

maxchar_utf8

char32_t maxchar_utf8(void)

Returns the maximum character value (Unicode code point) which can be encoded into UTF-8.  The minimum Unicode code point is 0.  The returned value is 0x10FFFF.

maxsize_utf8

uint8_t maxsize_utf8(void)

Returns the maximum size in bytes of an utf-8 encoded multibyte sequence.

isvalidfirstbyte_utf8

bool isvalidfirstbyte_utf8(const uint8_t firstbyte)

Returns true if the byte is a legal first byte of an utf8 encoded multibyte sequence.

isfirstbyte_utf8

bool isfirstbyte_utf8(const uint8_t firstbyte)

Returns true if this byte is a possible first (start) byte of a UTF-8 encoded multibyte sequence.  This function assumes correct encoding; it is therefore possible that isfirstbyte_utf8 returns true while isvalidfirstbyte_utf8 returns false for the same byte.

sizefromfirstbyte_utf8

uint8_t sizefromfirstbyte_utf8(const uint8_t firstbyte)

Returns the size in bytes of a correctly encoded multibyte sequence, calculated from firstbyte, the first byte of the encoded byte sequence.  The returned values are in the range 0..4 (0..<maxsize_utf8>).  A return value between 1 and 4 describes a valid first byte.  A value of 0 indicates that firstbyte is not a valid first byte of a UTF-8 encoded byte sequence.
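
The relationship between a first byte and the sequence length can be sketched as follows.  Both function names are hypothetical stand-ins, not the library's implementation (which, judging by g_utf8_bytesperchar, presumably uses a lookup table).  It is an assumption of this sketch that the bytes 0xC0, 0xC1 and 0xF5..0xF7 are accepted as start bytes; the real functions may treat them differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: any byte that is not a continuation byte (0b10xxxxxx)
 * could start a sequence if the input is assumed to be well-formed. */
bool sketch_isfirstbyte(uint8_t b)
{
   return (b & 0xC0) != 0x80;
}

/* Hypothetical: sequence length implied by the first byte, or 0 if the
 * byte cannot start a sequence of at most 4 bytes. */
uint8_t sketch_sizefromfirstbyte(uint8_t b)
{
   if (b < 0x80)           return 1;   /* 0b0xxxxxxx */
   if ((b & 0xE0) == 0xC0) return 2;   /* 0b110xxxxx */
   if ((b & 0xF0) == 0xE0) return 3;   /* 0b1110xxxx */
   if ((b & 0xF8) == 0xF0) return 4;   /* 0b11110xxx */
   return 0;                           /* continuation byte or 5/6-byte lead */
}
```

In this sketch the byte 0xF8 is not a continuation byte, so sketch_isfirstbyte accepts it, yet sketch_sizefromfirstbyte returns 0 for it.  This is the kind of disagreement between a permissive and a strict check that the documentation describes for isfirstbyte_utf8 and isvalidfirstbyte_utf8.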

sizechar_utf8

uint8_t sizechar_utf8(char32_t uchar)

Returns the size in bytes of uchar as encoded multibyte sequence.  The returned values are in the range 1..<maxsize_utf8>.  If uchar is bigger than maxchar_utf8 no error is reported and the function returns maxsize_utf8.
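
The size follows directly from the code point ranges of the encoding table.  A minimal sketch, with the hypothetical name sketch_sizechar, that also mirrors the documented behaviour of returning the maximum size for out-of-range values:

```c
#include <stdint.h>

/* Hypothetical: number of bytes needed to encode cp.  Each threshold that
 * cp exceeds adds one byte.  Values above 0x10FFFF yield 4, matching the
 * documented "no error is reported" behaviour. */
uint8_t sketch_sizechar(uint32_t cp)
{
   return (uint8_t) (1 + (cp > 0x7F) + (cp > 0x7FF) + (cp > 0xFFFF));
}
```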

length_utf8

size_t length_utf8(const uint8_t *strstart,
const uint8_t *strend)

Returns the number of UTF-8 characters encoded in the string buffer.  The first byte of a multibyte sequence determines its size.  This function assumes that the UTF-8 encodings are correct and does not check the encoding of the bytes following the first.  Illegally encoded start bytes are not counted but skipped.  The last multibyte sequence is counted as one character even if one or more bytes are missing.

Parameter

strstart   Start of string buffer (lowest address).
strend     Highest memory address of byte after last byte in the string buffer.  If strend <= strstart then the string is considered the empty string.  Set this value to strstart + length_of_string_in_bytes.
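
The counting rules described above (size taken from the first byte, illegal start bytes skipped, truncated last sequence still counted) could be implemented roughly like this.  The function sketch_length_utf8 is a hypothetical illustration under those stated rules, not the code from utf8.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical re-implementation of the documented counting rules. */
size_t sketch_length_utf8(const uint8_t *strstart, const uint8_t *strend)
{
   size_t count = 0;
   while (strstart < strend) {
      uint8_t b    = *strstart;
      size_t  size = b < 0x80           ? 1
                   : (b & 0xE0) == 0xC0 ? 2
                   : (b & 0xF0) == 0xE0 ? 3
                   : (b & 0xF8) == 0xF0 ? 4 : 0;
      if (size == 0) {
         ++strstart;          /* illegal start byte: skipped, not counted */
         continue;
      }
      ++count;                /* following bytes are not inspected */
      if (size >= (size_t) (strend - strstart)) {
         break;               /* last sequence, possibly truncated */
      }
      strstart += size;
   }
   return count;
}
```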

encode-decode

decodechar_utf8

uint8_t decodechar_utf8(
   const uint8_t strstart[/*maxsize_utf8() or big enough*/],
   /*out*/char32_t *uchar
)

Decodes UTF-8 encoded bytes beginning from strstart and returns the character in uchar.  The string must be big enough but never needs to be larger than maxsize_utf8.  Use sizefromfirstbyte_utf8 to determine the size if strstart contains fewer than maxsize_utf8 bytes.

The number of decoded bytes is returned.

A return value of 0 indicates an invalid first byte of the multibyte sequence (EILSEQ).  The function assumes that all bytes except the first are encoded correctly.  Use utf8validator_t to make sure that a string contains only validly encoded UTF-8.

Example

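uint8_t * str    = &strbuffer[0] ;
uint8_t * strend = strbuffer + sizeof(strbuffer) ;
while (str < strend) {
   if (strend-str < maxsize_utf8()) {
      /* fewer than maxsize_utf8() bytes left: check that the sequence is complete */
      if (sizefromfirstbyte_utf8(str[0]) > (strend-str)) {
         /* not enough data for last character */
         break ;
      }
   }
   char32_t uchar ;
   uint8_t  len = decodechar_utf8(str, &uchar) ;
   if (len == 0) {
      /* invalid first byte (EILSEQ): stop instead of looping forever */
      break ;
   }
   str += len ;
   /* ... do something with uchar ... */
}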
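
What the decoding step does with the payload bits can be sketched independently of the library.  The function sketch_decodechar below is hypothetical.  Like the documented function it trusts the continuation bytes and only rejects an invalid first byte, but it is not the implementation from utf8.c and it uses uint32_t instead of char32_t to stay self-contained.

```c
#include <stdint.h>

/* Hypothetical decoder: returns the number of bytes consumed, or 0 if the
 * first byte cannot start a sequence.  Continuation bytes are not checked. */
uint8_t sketch_decodechar(const uint8_t *str, uint32_t *cp)
{
   uint8_t b = str[0];
   if (b < 0x80) {
      *cp = b;
      return 1;
   }
   if ((b & 0xE0) == 0xC0) {
      *cp = ((uint32_t) (b & 0x1F) << 6) | (uint32_t) (str[1] & 0x3F);
      return 2;
   }
   if ((b & 0xF0) == 0xE0) {
      *cp = ((uint32_t) (b & 0x0F) << 12)
          | ((uint32_t) (str[1] & 0x3F) << 6)
          |  (uint32_t) (str[2] & 0x3F);
      return 3;
   }
   if ((b & 0xF8) == 0xF0) {
      *cp = ((uint32_t) (b & 0x07) << 18)
          | ((uint32_t) (str[1] & 0x3F) << 12)
          | ((uint32_t) (str[2] & 0x3F) << 6)
          |  (uint32_t) (str[3] & 0x3F);
      return 4;
   }
   return 0;   /* continuation byte or 5/6-byte lead byte */
}
```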

encodechar_utf8

uint8_t encodechar_utf8(size_t strsize,
/*out*/uint8_t strstart[strsize],
char32_t uchar)

Encodes uchar into a UTF-8 encoded string of size strsize starting at strstart.  The number of written bytes is returned.  The maximum return value is maxsize_utf8.  A return value of 0 indicates an error: either uchar is greater than maxchar_utf8 or strsize is not big enough.

skipchar_utf8

uint8_t skipchar_utf8(const uint8_t strstart[/*maxsize_utf8() or big enough*/])

Skips the next utf-8 encoded character.  The encoded byte sequence is not checked for correctness.  The number of skipped bytes is returned.  The maximum return value is maxsize_utf8.  A return value of 0 indicates an error, i.e. the first byte of the multibyte sequence is invalid (EILSEQ).

utf8validator_t

struct utf8validator_t

Validates a stream of bytes that is delivered in blocks.  If a multibyte sequence spans two data blocks, the first part of it is stored internally as prefix data for the next block.

lifetime

utf8validator_INIT

#define utf8validator_INIT { 0, { 0, 0, 0, 0} }

Static initializer.

init_utf8validator

void init_utf8validator(/*out*/utf8validator_t *utf8validator)

Same as assigning utf8validator_INIT.

free_utf8validator

int free_utf8validator(utf8validator_t *utf8validator)

Clears data members and checks that there is no internal prefix stored.

Returns

0        All multi-byte character sequences fit into last buffer.
EILSEQ   Last multi-byte character sequence was incomplete.  Need more data.

query

sizeprefix_utf8validator

uint8_t sizeprefix_utf8validator(const utf8validator_t *utf8validator)

Returns a value != 0 if the last multibyte sequence was not fully contained in the last validated buffer.

validate

validate_utf8validator

int validate_utf8validator(utf8validator_t *utf8validator,
size_t size,
const uint8_t data[size],
/*err*/size_t *erroffset)

Validates a data block of length size in bytes.  If the last multibyte sequence is not fully contained in the data block but is a valid prefix, it is stored internally as prefix.  If this function is called another time, the internal prefix is prepended to the data block.  If an error occurs, EILSEQ is returned and the parameter erroffset is set to the offset of the byte which is not encoded correctly.
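
The idea of carrying an incomplete sequence across block boundaries can be illustrated with a much simplified validator.  Everything in this sketch (sketch_validator_t, its pending member, sketch_validate) is hypothetical.  Instead of storing the prefix bytes as the real utf8validator_t does, it only remembers how many continuation bytes are still expected, and it does not reject overlong forms, surrogates or values above 0x10FFFF.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state: continuation bytes still expected from the next block. */
typedef struct {
   uint8_t pending;
} sketch_validator_t;

/* Returns 0 if the block is structurally valid so far, else EILSEQ with
 * *erroffset set to the offset of the offending byte within this block. */
int sketch_validate(sketch_validator_t *v, size_t size,
                    const uint8_t *data, size_t *erroffset)
{
   for (size_t i = 0; i < size; ++i) {
      uint8_t b = data[i];
      if (v->pending) {
         if ((b & 0xC0) != 0x80) {     /* expected 0b10xxxxxx */
            *erroffset = i;
            return EILSEQ;
         }
         --v->pending;
      } else if (b < 0x80) {
         /* single byte character, nothing to remember */
      } else if ((b & 0xE0) == 0xC0) {
         v->pending = 1;
      } else if ((b & 0xF0) == 0xE0) {
         v->pending = 2;
      } else if ((b & 0xF8) == 0xF0) {
         v->pending = 3;
      } else {                          /* stray continuation or bad lead byte */
         *erroffset = i;
         return EILSEQ;
      }
   }
   return 0;   /* v->pending != 0 means the last sequence is incomplete */
}
```

After the last block, a non-zero pending count plays the role that a non-zero sizeprefix_utf8validator plays in the real interface: the stream ended in the middle of a character.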

stringstream_t

struct stringstream_t

read-utf8

nextutf8_stringstream

int nextutf8_stringstream(struct stringstream_t *strstream,
/*out*/char32_t *uchar)

Reads next utf-8 encoded character from strstream.  The character is returned as unicode character (codepoint) in uchar.  The next pointer of strstream is incremented with the number of decoded bytes.

Returns

0           UTF-8 character decoded and returned in uchar; the memory pointer is moved to the next character.
ENODATA     strstream is empty.
ENOTEMPTY   The string is not empty but another character could not be decoded because there are not enough bytes left in the string.
EILSEQ      The next multibyte sequence is not encoded correctly.  strstream is not changed.  Use skipillegalutf8_strstream to skip all illegal bytes.

peekutf8_stringstream

int peekutf8_stringstream(const struct stringstream_t *strstream,
/*out*/char32_t *uchar)

Same as nextutf8_stringstream except the strstream is not changed.  Calling this function more than once always returns the same value in uchar.

skiputf8_stringstream

int skiputf8_stringstream(struct stringstream_t *strstream)

Skips next utf-8 encoded character from strstream.  The next pointer of strstream is incremented with the size of the next character.

Returns

0           Memory pointer is moved to the next character.
ENODATA     strstream is empty.
ENOTEMPTY   The string is not empty but another character could not be decoded because there are not enough bytes left in the string.
EILSEQ      The next multibyte sequence is not encoded correctly.  strstream is not changed.  Use skipillegalutf8_strstream to skip all illegal bytes.

skipillegalutf8_strstream

void skipillegalutf8_strstream(struct stringstream_t *strstream)

Skips bytes until end of stream or the begin of a valid utf-8 encoding is found.

find-utf8

findutf8_stringstream

const uint8_t * findutf8_stringstream(const struct stringstream_t *strstream,
char32_t uchar)

Searches for a Unicode character in a UTF-8 encoded stringstream.  The returned value points to the start address of the multibyte sequence in the unread buffer.  A return value of 0 indicates that strstream does not contain the multibyte sequence or that uchar is bigger than maxchar_utf8 and therefore invalid.
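
One plausible way to implement such a search is to encode the character once and then look for its byte sequence.  In well-formed UTF-8 the encoding of one character never occurs inside the encoding of another, so a plain byte comparison is sufficient.  The function sketch_findutf8 is hypothetical and works on a raw [next, end) byte range instead of a stringstream_t; whether utf8.c uses this approach is not stated in the documentation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: returns a pointer to the first occurrence of the encoded
 * code point cp in [next, end), or NULL if absent or cp > 0x10FFFF. */
const uint8_t *sketch_findutf8(const uint8_t *next, const uint8_t *end,
                               uint32_t cp)
{
   uint8_t needle[4];
   size_t  n;
   if (cp <= 0x7F) {
      needle[0] = (uint8_t) cp;
      n = 1;
   } else if (cp <= 0x7FF) {
      needle[0] = (uint8_t) (0xC0 | (cp >> 6));
      needle[1] = (uint8_t) (0x80 | (cp & 0x3F));
      n = 2;
   } else if (cp <= 0xFFFF) {
      needle[0] = (uint8_t) (0xE0 | (cp >> 12));
      needle[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
      needle[2] = (uint8_t) (0x80 | (cp & 0x3F));
      n = 3;
   } else if (cp <= 0x10FFFF) {
      needle[0] = (uint8_t) (0xF0 | (cp >> 18));
      needle[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
      needle[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
      needle[3] = (uint8_t) (0x80 | (cp & 0x3F));
      n = 4;
   } else {
      return NULL;                 /* not encodable */
   }
   /* compare at every position that still has n bytes left */
   for (; next < end && (size_t) (end - next) >= n; ++next) {
      if (0 == memcmp(next, needle, n)) {
         return next;
      }
   }
   return NULL;
}
```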

utf8_t

maxchar_utf8

Implements utf8.maxchar_utf8.

isfirstbyte_utf8

isvalidfirstbyte_utf8

maxsize_utf8

Implements utf8.maxsize_utf8.

sizefromfirstbyte_utf8

sizechar_utf8

Implements utf8.sizechar_utf8.

skipchar_utf8

Implements utf8.skipchar_utf8.

utf8validator_t

init_utf8validator

#define init_utf8validator(utf8validator) \
   ((void)(*(utf8validator) = (utf8validator_t) utf8validator_INIT))

Implements utf8validator_t.init_utf8validator.

free_utf8validator

#define free_utf8validator(utf8validator) \
   ( __extension__ ({ int _err ; utf8validator_t * _v ; _v = (utf8validator) ; _err = _v->size_of_prefix ? EILSEQ : 0 ; _v->size_of_prefix = 0 ; _err ; }))

Implements utf8validator_t.free_utf8validator.

sizeprefix_utf8validator

#define sizeprefix_utf8validator(utf8validator) \
   ((utf8validator)->size_of_prefix)

Implements utf8validator_t.sizeprefix_utf8validator.

stringstream_t

peekutf8_stringstream

#define peekutf8_stringstream(strstream, uchar) \
   ( __extension__ ({ stringstream_t * _strstr = (strstream) ; nextutf8_stringstream( &(stringstream_t) stringstream_INIT( _strstr->next, _strstr->end), uchar) ; }))

Implements stringstream_t.peekutf8_stringstream.

skiputf8_stringstream

#define skiputf8_stringstream(strstream) \
   ( __extension__ ({ char32_t _uchar ; nextutf8_stringstream( (strstream), &_uchar ) ; }))

Implements stringstream_t.skiputf8_stringstream.
