Exports utf8scanner_t, which breaks a text file into separate strings. The file is read with the help of filereader_t. The parts common to every text scanner are implemented in this type.
UTF8-Scanner | Exports utf8scanner_t, which breaks a text file into separate strings. |
Copyright | This program is free software. |
Files | |
C-kern/ | Header file UTF8-Scanner. |
C-kern/ | Implementation file UTF8-Scanner impl. |
Types | |
struct utf8scanner_t | Export utf8scanner_t into global namespace. |
Functions | |
test | |
unittest_io_reader_utf8scanner | Test utf8scanner_t functionality. |
utf8scanner_t | Handles the data buffers returned from filereader_t and initializes a token of type splitstring_t. |
next | Points to the next byte returned from nextbyte_utf8scanner. |
end | As long as next is lower than end there are more bytes to read. |
scanned_token | Stores the begin and length of a string of a recognized token. |
lifetime | |
utf8scanner_FREE | Static initializer. |
init_utf8scanner | Sets all data members to 0. |
free_utf8scanner | Sets all data members to 0 and releases any acquired buffers from frd. |
query | |
isfree_utf8scanner | Returns true if scan is initialized with utf8scanner_FREE. |
isnext_utf8scanner | Returns true if the buffer contains at least one more byte. |
sizeunread_utf8scanner | The number of bytes which are not read from the current buffer. |
scannedtoken_utf8scanner | Returns the address to an internally stored splitstring_t. |
read | |
nextbyte_utf8scanner | Reads the next byte from the buffer and increments the reading position. |
peekbyte_utf8scanner | Returns any byte from the buffer without changing the read pointer. |
skipbytes_utf8scanner | Increments the read pointer by nrbytes without reading the bytes. |
nextchar_utf8scanner | Decodes the next utf8 character and increments the reading position. |
skipuntilafter_utf8scanner | Skips characters until the last skipped character equals uchar. |
buffer I/ | |
cleartoken_utf8scanner | Clears the current token string. |
readbuffer_utf8scanner | Acquires the next buffer from filereader_t if isnext_utf8scanner returns false. |
unread_utf8scanner | Decrements the reading position until the last nrofchars characters are unread. |
inline implementation | |
Functions | |
isnext_utf8scanner | Implements utf8scanner_t.isnext_utf8scanner. |
nextbyte_utf8scanner | Implements utf8scanner_t.nextbyte_utf8scanner. |
peekbyte_utf8scanner | Implements utf8scanner_t.peekbyte_utf8scanner. |
skipbytes_utf8scanner | Implements utf8scanner_t.skipbytes_utf8scanner. |
sizeunread_utf8scanner | Implements utf8scanner_t.sizeunread_utf8scanner. |
This program is free software. You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
© 2013 Jörg Seebohn
Header file UTF8-Scanner.
Implementation file UTF8-Scanner impl.
typedef struct utf8scanner_t utf8scanner_t
Export utf8scanner_t into global namespace.
int unittest_io_reader_utf8scanner( void )
Test utf8scanner_t functionality.
struct utf8scanner_t
Handles the data buffers returned from filereader_t and initializes a token of type splitstring_t.
The token begins with the first byte or character read and can span two buffers. Call nextbyte_utf8scanner and nextchar_utf8scanner to read the buffer content until you have found a valid token. Call unread_utf8scanner if you want to remove one or more of the last characters added to the token. A call to scannedtoken_utf8scanner returns the scanned token. Once the token has been processed, call cleartoken_utf8scanner to clear the token and free any unused buffers. Clearing a token sets the starting point of the new token.
To determine whether a read error has occurred, query the filereader_t that is passed as a parameter.
If the buffer is empty use readbuffer_utf8scanner to read the next buffer of the input data. The function nextchar_utf8scanner calls readbuffer_utf8scanner automatically if the buffer is empty.
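The token life cycle described above (read, unread, query, clear) can be illustrated with a minimal sketch. It does not use the real utf8scanner_t or filereader_t: demo_tokscan_t and demo_nextword are hypothetical names, the input is a single in-memory buffer, and a token never spans two buffers. The comments note which documented call each step stands in for.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: read pointer, buffer end, and start of the current token. */
typedef struct demo_tokscan_t {
    const uint8_t *next;
    const uint8_t *end;
    const uint8_t *token;
} demo_tokscan_t;

/* Scans the next space-delimited word.
 * Returns 1 and sets *str / *len if a word was found, else returns 0. */
static int demo_nextword(demo_tokscan_t *scan, const uint8_t **str, size_t *len)
{
    /* Skip separators, then restart the token at the current read position
     * (the role cleartoken_utf8scanner plays in the documented API). */
    while (scan->next < scan->end && *scan->next == ' ') ++scan->next;
    scan->token = scan->next;

    while (scan->next < scan->end) {
        uint8_t b = *scan->next++;       /* role of nextbyte_utf8scanner */
        if (b == ' ') {
            --scan->next;                /* role of unread_utf8scanner: drop the space */
            break;
        }
    }

    /* Role of scannedtoken_utf8scanner: the length is derived from the read position. */
    *str = scan->token;
    *len = (size_t)(scan->next - scan->token);
    return *len != 0;
}
```

Calling demo_nextword repeatedly on the six bytes of "let  x" yields the word "let", then "x", and then reports that no further token exists.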
const uint8_t * next
Points to the next byte returned from nextbyte_utf8scanner.
bool isfree_utf8scanner( const utf8scanner_t * scan )
Returns true if scan is initialized with utf8scanner_FREE.
bool isnext_utf8scanner( const utf8scanner_t * scan )
Returns true if the buffer contains at least one more byte. If false is returned, do not call nextbyte_utf8scanner or any other function which accesses the buffer. Instead, call readbuffer_utf8scanner, which acquires the next buffer from filereader_t.
size_t sizeunread_utf8scanner( const utf8scanner_t * scan )
Returns the number of bytes which have not yet been read from the current buffer. If this function returns 0 then isnext_utf8scanner returns false. Call readbuffer_utf8scanner in this case.
const splitstring_t * scannedtoken_utf8scanner( utf8scanner_t * scan )
Returns the address of an internally stored splitstring_t. Before the token string is returned, the current reading position in the stream is used to calculate the length of the token. The returned string is valid as long as no function other than a query function is called. If you call reading functions, you need to call scannedtoken_utf8scanner again to adapt the token string to the new length. To clear the token string, call cleartoken_utf8scanner.
uint8_t nextbyte_utf8scanner( utf8scanner_t * scan )
Reads the next byte from the buffer and increments the reading position. Call this function only if isnext_utf8scanner returned true; otherwise the behaviour is undefined.
uint8_t peekbyte_utf8scanner( utf8scanner_t * scan, size_t offset )
Returns any byte from the buffer without changing the read pointer. The parameter offset must be smaller than the value returned by sizeunread_utf8scanner; otherwise the behaviour is undefined.
void skipbytes_utf8scanner( utf8scanner_t * scan, size_t nrbytes )
Increments the read pointer by nrbytes without reading the bytes.
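The byte-level functions documented above all operate on the two pointers next and end. The following sketch shows one way their documented behaviour could be expressed. It is not the library's inline implementation: demo_scanner_t and the demo_ functions are hypothetical stand-ins, and the assert calls make explicit the preconditions the documentation describes as undefined behaviour (the bound checked in demo_skipbytes is an assumption, since the documentation states no precondition for skipbytes_utf8scanner).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two read-pointer members of utf8scanner_t. */
typedef struct demo_scanner_t {
    const uint8_t *next; /* next byte to be returned */
    const uint8_t *end;  /* one past the last byte of the current buffer */
} demo_scanner_t;

/* True if at least one unread byte remains. */
static inline bool demo_isnext(const demo_scanner_t *scan)
{
    return scan->next < scan->end;
}

/* Number of bytes between the read pointer and the end of the buffer. */
static inline size_t demo_sizeunread(const demo_scanner_t *scan)
{
    return (size_t)(scan->end - scan->next);
}

/* Returns the next byte and advances the read pointer by one. */
static inline uint8_t demo_nextbyte(demo_scanner_t *scan)
{
    assert(demo_isnext(scan));
    return *scan->next++;
}

/* Returns the byte at offset without moving the read pointer. */
static inline uint8_t demo_peekbyte(const demo_scanner_t *scan, size_t offset)
{
    assert(offset < demo_sizeunread(scan));
    return scan->next[offset];
}

/* Advances the read pointer by nrbytes (bound check is an assumption). */
static inline void demo_skipbytes(demo_scanner_t *scan, size_t nrbytes)
{
    assert(nrbytes <= demo_sizeunread(scan));
    scan->next += nrbytes;
}
```

With a four-byte buffer "abcd", peeking at offset 2 returns 'c' and leaves four bytes unread; reading one byte and skipping two more leaves 'd' as the last byte to read.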
int nextchar_utf8scanner( utf8scanner_t * scan, struct filereader_t * frd, /*out*/char32_t * uchar )
Decodes the next utf8 character and increments the reading position. Unlike the other reading functions, this function calls readbuffer_utf8scanner if the buffer is empty. It also handles the case where a multibyte character sequence is split across two buffers. The function assumes that the data contains only valid utf8 encoded characters. However, an illegally encoded first byte of a multibyte sequence returns the error EILSEQ and the byte is skipped. An incomplete sequence at the end of the file is also recognized: EILSEQ is returned and the sequence is skipped, so the next call returns ENODATA. For any other returned error value see readbuffer_utf8scanner.
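The error behaviour described for nextchar_utf8scanner can be sketched on a single in-memory buffer. demo_nextchar is a hypothetical helper, not the library function: it takes no filereader_t, never refills a buffer, and treats the end of the buffer as the end of the file. Like the documented function, it checks only the first byte and the sequence length and does not validate continuation bytes.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 character from [*next, end) and advances *next.
 * Returns 0, ENODATA (no bytes left), or EILSEQ (bad first byte or truncated sequence). */
static int demo_nextchar(const uint8_t **next, const uint8_t *end, uint32_t *uchar)
{
    if (*next >= end) return ENODATA;

    const uint8_t first = **next;
    size_t   len;
    uint32_t value;

    /* The first byte determines the sequence length and the leading payload bits. */
    if (first < 0x80)                { len = 1; value = first; }
    else if ((first & 0xE0) == 0xC0) { len = 2; value = first & 0x1Fu; }
    else if ((first & 0xF0) == 0xE0) { len = 3; value = first & 0x0Fu; }
    else if ((first & 0xF8) == 0xF0) { len = 4; value = first & 0x07u; }
    else {
        *next += 1;          /* illegal first byte: skip it */
        return EILSEQ;
    }

    if ((size_t)(end - *next) < len) {
        *next = end;         /* incomplete sequence at the end: skip it */
        return EILSEQ;
    }

    /* Each continuation byte contributes its low six bits. */
    for (size_t i = 1; i < len; ++i) {
        value = (value << 6) | ((*next)[i] & 0x3Fu);
    }
    *next += len;
    *uchar = value;
    return 0;
}
```

In this sketch a stray continuation byte such as 0x80 is skipped with EILSEQ, and a truncated four-byte sequence at the end of the data returns EILSEQ once and ENODATA on the following call.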
int skipuntilafter_utf8scanner( utf8scanner_t * scan, struct filereader_t * frd, char32_t uchar )
Skips characters until the last skipped character equals uchar. Returns ENODATA or ENOBUFS if uchar is not found. ENOBUFS means the scanned token is too long. The function assumes that the data contains only valid utf8 character encodings.
int readbuffer_utf8scanner( utf8scanner_t * scan, struct filereader_t * frd )
Acquires the next buffer from filereader_t if isnext_utf8scanner returns false.
0 | If isnext_utf8scanner returned false before this call, readnext_filereader was called and the read pointer points to the new buffer content. If isnext_utf8scanner returned true before this call, nothing was done. |
EIO ... | The same error as returned from ioerror_filereader (see also readnext_filereader). |
ENODATA | There is no more data. In this case iseof_filereader also returns true. |
ENOBUFS | The scanned token already spans two buffers, and no more than two buffers per token are supported. In other words: the token is too long. |
int unread_utf8scanner( utf8scanner_t * scan, struct filereader_t * frd, uint8_t nrofchars )
Decrements the reading position until the last nrofchars characters are unread. The sizes in bytes of the last nrofchars characters are summed up into the value nrofbytes, and the reading position is decremented by nrofbytes. This works only if the token (returned from scannedtoken_utf8scanner) contains at least nrofbytes bytes. If the token is shorter, EINVAL is returned and nothing is done. After a successful return, the scanned token's length is decremented by nrofbytes, which corresponds to nrofchars characters.
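The step from nrofchars to nrofbytes relies on the fact that utf8 continuation bytes have the bit pattern 10xxxxxx, so the start of a character can be found by walking backwards. The sketch below shows this on one contiguous token. demo_unread is a hypothetical helper: it takes the token start and read pointer directly instead of a utf8scanner_t, and it does not handle a token that spans two buffers.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Moves *next back by nrofchars UTF-8 characters within [token_start, *next).
 * Returns EINVAL and changes nothing if the token holds fewer characters. */
static int demo_unread(const uint8_t *token_start, const uint8_t **next, uint8_t nrofchars)
{
    const uint8_t *pos = *next;

    for (uint8_t i = 0; i < nrofchars; ++i) {
        if (pos == token_start) return EINVAL;   /* token too short */
        --pos;
        /* Step over continuation bytes (10xxxxxx) back to the first byte. */
        while (pos > token_start && (*pos & 0xC0) == 0x80) --pos;
    }

    *next = pos;   /* the difference to the old value corresponds to nrofbytes */
    return 0;
}
```

For a token holding a one-byte, a two-byte, and a three-byte character (six bytes in total), unreading one character moves the read pointer back by three bytes, and a following request to unread three characters fails with EINVAL because only two remain.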
const uint8_t * end
Stores the begin and length of a string of a recognized token.
splitstring_t scanned_token
Static initializer.
#define utf8scanner_FREE { 0, 0, splitstring_FREE }
Sets all data members to 0.
int init_utf8scanner( /*out*/utf8scanner_t * scan )
Sets all data members to 0 and releases any acquired buffers from frd.
int free_utf8scanner( utf8scanner_t * scan, struct filereader_t * frd )
Clears the current token string.
int cleartoken_utf8scanner( utf8scanner_t * scan, struct filereader_t * frd )
Implements filereader_t.ioerror_filereader.
#define ioerror_filereader( frd ) ((frd)->ioerror)
Implements filereader_t.iseof_filereader.
#define iseof_filereader( frd ) ( __extension__ ({ const filereader_t * _f ; _f = (frd) ; (_f->unreadsize == 0 && _f->fileoffset == _f->filesize) ; }))