UTF-8 Processing Utilities

This module contains pure business logic for UTF-8 character processing, shared between synchronous and asynchronous implementations.

No I/O operations are performed here - only data validation and parsing.

Types

Utf8ValidationResult = object isValid*: bool expectedBytes*: int errorMessage*: string: Result of UTF-8 validation

Procs

proc buildUtf8String(firstByte: byte; continuationBytes: openArray[byte]): string {. ...raises: [], tags: [], forbids: [].}: Build a UTF-8 string from first byte and continuation bytes

This is a pure function that doesn't perform I/O - it just constructs the string from the provided bytes.

Example:

let s = buildUtf8String(0xC3.byte, [0xA9.byte]) # é assert s == "é"
proc isUtf8ContinuationByte(b: byte): bool {.inline, ...raises: [], tags: [], forbids: [].}: Check if a byte is a valid UTF-8 continuation byte (10xxxxxx)

Example:

assert isUtf8ContinuationByte(0x80) # true assert isUtf8ContinuationByte(0xBF) # true assert not isUtf8ContinuationByte(0xC0) # false
proc truncateUtf8(s: string; maxBytes: int): string {....raises: [], tags: [], forbids: [].}: Truncate a UTF-8 string to at most maxBytes, ensuring valid UTF-8

This will not split multi-byte characters - it truncates at character boundaries.

Example:

assert truncateUtf8("hello", 3) == "hel" assert truncateUtf8("こんにちは", 7) == "こん" # 6 bytes (2 chars)
proc utf8ByteLength(firstByte: byte): int {....raises: [], tags: [], forbids: [].}: Determine the number of bytes in a UTF-8 character from its first byte Returns 1 for ASCII, 2-4 for multi-byte characters, 0 for invalid

Example:

assert utf8ByteLength(0x41) == 1 # 'A' (ASCII) assert utf8ByteLength(0xC3) == 2 # Start of 2-byte UTF-8 assert utf8ByteLength(0xE0) == 3 # Start of 3-byte UTF-8 assert utf8ByteLength(0xF0) == 4 # Start of 4-byte UTF-8 assert utf8ByteLength(0xFF) == 0 # Invalid
proc utf8CharLength(s: string): int {....raises: [], tags: [], forbids: [].}: Get the number of UTF-8 characters in a string (not bytes)

Example:

assert utf8CharLength("hello") == 5 assert utf8CharLength("こんにちは") == 5 # 5 characters, 15 bytes
proc validateUtf8Sequence(bytes: openArray[byte]): Utf8ValidationResult {. ...raises: [], tags: [], forbids: [].}: Validate a UTF-8 byte sequence

Returns validation result with detailed error information

celina/core/utf8_utils

Types

Procs