UTF8-Scanner

Exports utf8scanner_t, which breaks a text file into separate strings. The file is read with the help of filereader_t. The parts common to every text scanner are implemented in this type.

Summary

UTF8-Scanner	Exports utf8scanner_t, which breaks a text file into separate strings.
Copyright	This program is free software.
Files
C-kern/api/io/reader/utf8scanner.h	Header file UTF8-Scanner.
C-kern/io/reader/utf8scanner.c	Implementation file UTF8-Scanner impl.
Types
struct utf8scanner_t	Export utf8scanner_t into global namespace.
Functions
test
unittest_io_reader_utf8scanner	Test utf8scanner_t functionality.
utf8scanner_t	Handles the data buffers returned from filereader_t and initializes a token of type splitstring_t.
next	Points to the next byte returned from nextbyte_utf8scanner.
end	As long as next is lower than end there are more bytes to read.
scanned_token	Stores the begin and length of a string of a recognized token.
lifetime
utf8scanner_FREE	Static initializer.
init_utf8scanner	Sets all data members to 0.
free_utf8scanner	Sets all data members to 0 and releases any acquired buffers from frd.
query
isfree_utf8scanner	Returns true if scan is initialized with utf8scanner_FREE.
isnext_utf8scanner	Returns true if the buffer contains at least one more byte.
sizeunread_utf8scanner	The number of bytes which are not read from the current buffer.
scannedtoken_utf8scanner	Returns the address of an internally stored splitstring_t.
read
nextbyte_utf8scanner	Reads the next byte from the buffer and increments the reading position.
peekbyte_utf8scanner	Returns any byte from the buffer without changing the read pointer.
skipbytes_utf8scanner	Increments the read pointer by nrbytes without reading the bytes.
nextchar_utf8scanner	Decodes the next utf8 character and increments the reading position.
skipuntilafter_utf8scanner	Skips characters until the last skipped character equals uchar.
buffer I/O
cleartoken_utf8scanner	Clears the current token string.
readbuffer_utf8scanner	Acquires the next buffer from filereader_t if isnext_utf8scanner returns false.
unread_utf8scanner	Decrements the reading position until the last nrofchars characters are unread.
inline implementation
Functions
isnext_utf8scanner	Implements utf8scanner_t.isnext_utf8scanner.
nextbyte_utf8scanner	Implements utf8scanner_t.nextbyte_utf8scanner.
peekbyte_utf8scanner	Implements utf8scanner_t.peekbyte_utf8scanner.
skipbytes_utf8scanner	Implements utf8scanner_t.skipbytes_utf8scanner.
sizeunread_utf8scanner	Implements utf8scanner_t.sizeunread_utf8scanner.

Copyright

This program is free software. You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Author

Files

C-kern/api/io/reader/utf8scanner.h

Header file UTF8-Scanner.

C-kern/io/reader/utf8scanner.c

Implementation file UTF8-Scanner impl.

Types

struct utf8scanner_t

typedef struct utf8scanner_t utf8scanner_t

Export utf8scanner_t into global namespace.

Functions

Summary

test
unittest_io_reader_utf8scanner	Test utf8scanner_t functionality.

test

unittest_io_reader_utf8scanner

int unittest_io_reader_utf8scanner( void )

Test utf8scanner_t functionality.

utf8scanner_t

struct utf8scanner_t

Handles the data buffers returned from filereader_t and initializes a token of type splitstring_t.

Protocol

The token begins with the first byte or character read and can span two buffers. Call nextbyte_utf8scanner and nextchar_utf8scanner to read the buffer content until you have found a valid token. Call unread_utf8scanner if you want to remove one or more of the last characters added to the token. A call to scannedtoken_utf8scanner returns the scanned token. Once the token is processed, call cleartoken_utf8scanner to clear the token and free any unused buffers. Clearing a token sets the starting point of the new token.

Use the filereader_t given as parameter to determine whether a read error has occurred.

If the buffer is empty use readbuffer_utf8scanner to read the next buffer of the input data. The function nextchar_utf8scanner calls readbuffer_utf8scanner automatically if the buffer is empty.
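The protocol above can be sketched on a simplified model. The type sketch_scan_t and all sketch_ functions are hypothetical stand-ins, not part of the library. The model assumes a single in-memory buffer with ASCII content, so no filereader_t is involved and a token never spans two buffers.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-buffer model of the scanner (ASCII input only). */
typedef struct sketch_scan_t {
   const uint8_t * next;    /* reading position */
   const uint8_t * end;     /* end of the buffer */
   const uint8_t * token;   /* start of the current token */
} sketch_scan_t;

/* Models cleartoken_utf8scanner: the new token starts with the next read byte. */
static void sketch_cleartoken(sketch_scan_t * scan)
{
   scan->token = scan->next;
}

/* Models scannedtoken_utf8scanner: the length is derived from the reading position. */
static size_t sketch_tokensize(const sketch_scan_t * scan)
{
   assert(scan->token <= scan->next);
   return (size_t) (scan->next - scan->token);
}

/* Scans one word of alphanumeric bytes. Returns false if the input is exhausted. */
static bool sketch_scanword(sketch_scan_t * scan, const uint8_t ** word, size_t * size)
{
   /* Skip separators so that they never become part of a token. */
   while (scan->next < scan->end && !isalnum(*scan->next)) ++scan->next;
   if (scan->next == scan->end) return false;
   sketch_cleartoken(scan);
   while (scan->next < scan->end) {
      uint8_t byte = *scan->next++;   /* models nextbyte_utf8scanner */
      if (!isalnum(byte)) {
         --scan->next;                /* models unread_utf8scanner with nrofchars == 1 */
         break;
      }
   }
   *word = scan->token;
   *size = sketch_tokensize(scan);
   return true;
}
```

Calling sketch_scanword repeatedly yields one word per call. The byte that terminated a word is unread, so it stays available for the next token.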

Summary

next	Points to the next byte returned from nextbyte_utf8scanner.
end	As long as next is lower than end there are more bytes to read.
scanned_token	Stores the begin and length of a string of a recognized token.
lifetime
utf8scanner_FREE	Static initializer.
init_utf8scanner	Sets all data members to 0.
free_utf8scanner	Sets all data members to 0 and releases any acquired buffers from frd.
query
isfree_utf8scanner	Returns true if scan is initialized with utf8scanner_FREE.
isnext_utf8scanner	Returns true if the buffer contains at least one more byte.
sizeunread_utf8scanner	The number of bytes which are not read from the current buffer.
scannedtoken_utf8scanner	Returns the address of an internally stored splitstring_t.
read
nextbyte_utf8scanner	Reads the next byte from the buffer and increments the reading position.
peekbyte_utf8scanner	Returns any byte from the buffer without changing the read pointer.
skipbytes_utf8scanner	Increments the read pointer by nrbytes without reading the bytes.
nextchar_utf8scanner	Decodes the next utf8 character and increments the reading position.
skipuntilafter_utf8scanner	Skips characters until the last skipped character equals uchar.
buffer I/O
cleartoken_utf8scanner	Clears the current token string.
readbuffer_utf8scanner	Acquires the next buffer from filereader_t if isnext_utf8scanner returns false.
unread_utf8scanner	Decrements the reading position until the last nrofchars characters are unread.

next

const uint8_t * next

Points to the next byte returned from nextbyte_utf8scanner.

end

const uint8_t * end

As long as next is lower than end there are more bytes to read.

scanned_token

splitstring_t scanned_token

Stores the begin and length of a string of a recognized token. The token string can be scattered across two buffers.

lifetime

utf8scanner_FREE

#define utf8scanner_FREE { 0, 0, splitstring_FREE }

Static initializer.

init_utf8scanner

int init_utf8scanner( /*out*/utf8scanner_t * scan )

Sets all data members to 0. No data is read.

free_utf8scanner

int free_utf8scanner( utf8scanner_t * scan,
struct filereader_t * frd )

Sets all data members to 0 and releases any acquired buffers from frd.

query

isfree_utf8scanner

bool isfree_utf8scanner( const utf8scanner_t * scan )

Returns true if scan is initialized with utf8scanner_FREE.
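A minimal sketch of the static initializer idiom. The type sketch_scanner_t, the macro sketch_scanner_FREE and the function sketch_isfree are hypothetical stand-ins which model only the two read pointers and leave out the splitstring_t member.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in holding only the two read pointers. */
typedef struct sketch_scanner_t {
   const uint8_t * next;
   const uint8_t * end;
} sketch_scanner_t;

/* Mirrors the shape of utf8scanner_FREE without the splitstring_t member. */
#define sketch_scanner_FREE { 0, 0 }

/* True if every member still has the value set by the static initializer. */
static bool sketch_isfree(const sketch_scanner_t * scan)
{
   assert(scan != 0);
   return scan->next == 0 && scan->end == 0;
}
```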

isnext_utf8scanner

bool isnext_utf8scanner( const utf8scanner_t * scan )

Returns true if the buffer contains at least one more byte. If false is returned, do not call nextbyte_utf8scanner or any other function which accesses the buffer. Instead, call readbuffer_utf8scanner, which acquires the next buffer from filereader_t.

sizeunread_utf8scanner

size_t sizeunread_utf8scanner( const utf8scanner_t * scan )

The number of bytes which are not read from the current buffer. If this function returns 0 then isnext_utf8scanner returns false. Call readbuffer_utf8scanner in this case.
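Both queries reduce to a comparison of the members next and end. The following sketch assumes this; sketch_scanner_t, sketch_isnext and sketch_sizeunread are hypothetical names and the real inline implementation may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in holding only the two read pointers. */
typedef struct sketch_scanner_t {
   const uint8_t * next;   /* next byte to be read */
   const uint8_t * end;    /* one past the last byte of the current buffer */
} sketch_scanner_t;

/* True as long as next is lower than end. */
static inline bool sketch_isnext(const sketch_scanner_t * scan)
{
   return scan->next < scan->end;
}

/* Number of bytes which are not read from the current buffer. */
static inline size_t sketch_sizeunread(const sketch_scanner_t * scan)
{
   assert(scan->next <= scan->end);
   return (size_t) (scan->end - scan->next);
}
```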

scannedtoken_utf8scanner

const splitstring_t * scannedtoken_utf8scanner( utf8scanner_t * scan )

Returns the address of an internally stored splitstring_t. Before the token string is returned, the current reading position in the stream is used to calculate the length of the token. The returned string is valid as long as no function other than a query function is called. If you call reading functions, you need to call scannedtoken_utf8scanner again to adapt the token string to the new length. To clear the token string, call cleartoken_utf8scanner.

read

nextbyte_utf8scanner

uint8_t nextbyte_utf8scanner( utf8scanner_t * scan )

Reads the next byte from the buffer and increments the reading position. Call this function only if isnext_utf8scanner returned true; otherwise the behaviour is undefined.

peekbyte_utf8scanner

uint8_t peekbyte_utf8scanner( utf8scanner_t * scan,
size_t offset )

Returns any byte from the buffer without changing the read pointer. The parameter offset must be smaller than sizeunread_utf8scanner; otherwise the behaviour is undefined.
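A sketch of the byte-level read functions with their preconditions turned into assertions. The names sketch_scanner_t, sketch_nextbyte and sketch_peekbyte are hypothetical. The library functions leave these preconditions unchecked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in holding only the two read pointers. */
typedef struct sketch_scanner_t {
   const uint8_t * next;
   const uint8_t * end;
} sketch_scanner_t;

/* Returns the byte at the reading position and advances by one byte. */
static inline uint8_t sketch_nextbyte(sketch_scanner_t * scan)
{
   assert(scan->next < scan->end);   /* corresponds to isnext returning true */
   return *scan->next++;
}

/* Returns the byte at next + offset and leaves the reading position unchanged. */
static inline uint8_t sketch_peekbyte(const sketch_scanner_t * scan, size_t offset)
{
   assert(offset < (size_t) (scan->end - scan->next));   /* offset < sizeunread */
   return scan->next[offset];
}
```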

skipbytes_utf8scanner

void skipbytes_utf8scanner( utf8scanner_t * scan,
size_t nrbytes )

Increments the read pointer by nrbytes without reading the bytes.

(Unchecked) Preconditions

Make sure nrbytes <= sizeunread_utf8scanner; otherwise the behaviour is undefined.
Skip only whole characters (if a UTF-8 character is encoded in 4 bytes, skip all 4 bytes).
The function assumes utf8 encodings are correct.
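One way to satisfy both preconditions is to compute nrbytes from the first byte of every character before skipping. The helpers sketch_sizefromfirstbyte and sketch_sizeofchars below are hypothetical and not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: size in bytes of the sequence introduced by firstbyte.
 * Only the bit pattern is checked. Returns 0 for a continuation byte
 * (10xxxxxx) or an illegal first byte. */
static unsigned sketch_sizefromfirstbyte(uint8_t firstbyte)
{
   if (firstbyte < 0x80)           return 1;   /* 0xxxxxxx */
   if ((firstbyte & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
   if ((firstbyte & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
   if ((firstbyte & 0xF8) == 0xF0) return 4;   /* 11110xxx */
   return 0;
}

/* Returns the number of bytes occupied by the first nrchars characters in buf.
 * Returns 0 if the size unread bytes do not hold nrchars whole characters. */
static size_t sketch_sizeofchars(const uint8_t * buf, size_t size, size_t nrchars)
{
   assert(buf != 0);
   size_t nrbytes = 0;
   for (size_t i = 0; i < nrchars; ++i) {
      if (nrbytes == size) return 0;
      unsigned len = sketch_sizefromfirstbyte(buf[nrbytes]);
      if (len == 0 || len > size - nrbytes) return 0;
      nrbytes += len;
   }
   return nrbytes;
}
```

A result other than 0 is never larger than size and always ends on a character boundary, so it could be passed as nrbytes.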

nextchar_utf8scanner

int nextchar_utf8scanner( utf8scanner_t * scan,
struct filereader_t * frd,
/*out*/char32_t * uchar )

Decodes the next utf8 character and increments the reading position. This function differs from the other reading functions in that it calls readbuffer_utf8scanner if the buffer is empty. It also handles the case where a multibyte character sequence is split across two buffers. The function assumes that the data contains only valid utf8 encoded characters. However, an illegally encoded first byte of a multibyte sequence returns error EILSEQ and the byte is skipped. An incomplete sequence at the end of the file is also recognized and EILSEQ is returned. The incomplete sequence is skipped as well, so the next call returns ENODATA. For any other returned error value see readbuffer_utf8scanner.
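The error behaviour can be sketched on a single buffer. The function sketch_nextchar is hypothetical. It never refills the buffer, treats the end of the buffer as the end of the file, and uses uint32_t in place of char32_t. Continuation bytes are not validated, matching the stated assumption of valid input.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one character from [*next, end) and advances *next.
 * Returns 0, EILSEQ (illegal first byte or incomplete sequence) or ENODATA. */
static int sketch_nextchar(const uint8_t ** next, const uint8_t * end, uint32_t * uchar)
{
   assert(next != 0 && uchar != 0);
   const uint8_t * pos = *next;
   if (pos >= end) return ENODATA;

   uint8_t  first = *pos++;
   unsigned len;
   uint32_t value;
   if (first < 0x80)                { len = 1; value = first; }
   else if ((first & 0xE0) == 0xC0) { len = 2; value = first & 0x1Fu; }
   else if ((first & 0xF0) == 0xE0) { len = 3; value = first & 0x0Fu; }
   else if ((first & 0xF8) == 0xF0) { len = 4; value = first & 0x07u; }
   else {
      *next = pos;    /* skip the illegal first byte */
      return EILSEQ;
   }

   if ((size_t) (end - pos) < len - 1) {
      *next = end;    /* skip the incomplete sequence */
      return EILSEQ;
   }

   /* Every continuation byte contributes its lower 6 bits. */
   for (unsigned i = 1; i < len; ++i) {
      value = (value << 6) | (uint32_t) (*pos++ & 0x3F);
   }

   *next  = pos;
   *uchar = value;
   return 0;
}
```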

skipuntilafter_utf8scanner

int skipuntilafter_utf8scanner( utf8scanner_t * scan,
struct filereader_t * frd,
char32_t uchar )

Skips characters until the last skipped character equals uchar. Returns ENODATA or ENOBUFS if uchar is not found. ENOBUFS means the scanned token is too long. The function assumes that the data contains only valid utf8 character encodings.

buffer I/O

cleartoken_utf8scanner

int cleartoken_utf8scanner( utf8scanner_t * scan,
struct filereader_t * frd )

Clears the current token string. All buffers are released which are no longer referenced by the cleared token. The new token starts with the next read character.

readbuffer_utf8scanner

int readbuffer_utf8scanner( utf8scanner_t * scan,
struct filereader_t * frd )

Acquires the next buffer from filereader_t if isnext_utf8scanner returns false.

Returns

0	If isnext_utf8scanner returned false before this call then readnext_filereader was called and the read pointer points to the new buffer content. If isnext_utf8scanner returned true before this call nothing was done.
EIO ...	The same error as returned from ioerror_filereader (see also readnext_filereader).
ENODATA	There is no more data. Also iseof_filereader returns true.
ENOBUFS	The scanned token spans already two buffers and no more than 2 buffers per token are supported. In other words: the token is too long.

unread_utf8scanner

int unread_utf8scanner( utf8scanner_t * scan,
struct filereader_t * frd,
uint8_t nrofchars )

Decrements the reading position until the last nrofchars characters are unread. For the last nrofchars characters their size in bytes is summed up into the value nrofbytes and the reading position is decremented by nrofbytes. This works only if the token (returned from scannedtoken_utf8scanner) contains at least nrofbytes bytes. In case the token is shorter EINVAL is returned and nothing is done. After successful return the scanned token’s length is decremented by nrofbytes which corresponds to nrofchars characters.
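The value nrofbytes can be computed by walking backwards over continuation bytes. The helper sketch_sizelastchars is hypothetical and assumes the token is stored in one contiguous buffer and contains only valid utf8 encodings.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Sums up the size in bytes of the last nrofchars characters of the token
 * [token, token + size) into *nrofbytes.
 * Returns EINVAL and leaves *nrofbytes unchanged if the token is shorter. */
static int sketch_sizelastchars(const uint8_t * token, size_t size,
                                uint8_t nrofchars, size_t * nrofbytes)
{
   assert(token != 0 && nrofbytes != 0);
   size_t pos = size;
   for (uint8_t i = 0; i < nrofchars; ++i) {
      if (pos == 0) return EINVAL;
      --pos;
      /* Step back over continuation bytes (10xxxxxx) to the first byte. */
      while (pos > 0 && (token[pos] & 0xC0) == 0x80) --pos;
   }
   *nrofbytes = size - pos;
   return 0;
}
```

After a successful return the reading position and the token length would both be decremented by *nrofbytes.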

inline implementation

Summary

Functions
isnext_utf8scanner	Implements utf8scanner_t.isnext_utf8scanner.
nextbyte_utf8scanner	Implements utf8scanner_t.nextbyte_utf8scanner.
peekbyte_utf8scanner	Implements utf8scanner_t.peekbyte_utf8scanner.
skipbytes_utf8scanner	Implements utf8scanner_t.skipbytes_utf8scanner.
sizeunread_utf8scanner	Implements utf8scanner_t.sizeunread_utf8scanner.

Functions