
Validate That A Stream Of Bytes Is Valid UTF-8 (or Other Encoding) Without Copy

This is perhaps a micro-optimization, but I would like to check that a stream of given bytes is valid UTF-8 as it passes through my application, but I don't want to keep the result

Solution 1:

You can use the incremental decoder provided by the codecs module:

import codecs

utf8_decoder = codecs.getincrementaldecoder('utf8')()

This is an IncrementalDecoder instance. You can then feed this decoder data in order and validate the stream:

# for each chunk of data, in stream order:
try:
    utf8_decoder.decode(chunk)
except UnicodeDecodeError:
    # invalid data; re-raise or handle it here
    raise

The decoder returns the data decoded so far, minus any partial multi-byte sequence at the end of the chunk, which is kept as state for the next decode() call. Those small strings are cheap to create and discard; you are never building one large string.
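Putting the pieces above together, a minimal sketch of a validating pass might look like the following. The function name is_valid_stream and the chunks iterable are hypothetical names chosen for illustration, not part of the codecs module. The one step not shown earlier is the closing decode(b'', final=True) call: without it, a stream that ends in the middle of a multi-byte sequence would go unnoticed, because the decoder would still be waiting for the remaining bytes.

```python
import codecs


def is_valid_stream(chunks, encoding='utf8'):
    """Return True if the concatenated byte chunks decode cleanly.

    `chunks` is any iterable of bytes objects, in stream order
    (a hypothetical input; adapt it to however your data arrives).
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        for chunk in chunks:
            # The decoded text is discarded right away; only errors matter.
            decoder.decode(chunk)
        # final=True turns a dangling partial sequence into an error.
        decoder.decode(b'', final=True)
    except UnicodeDecodeError:
        return False
    return True
```

Because the encoding is passed to codecs.getincrementaldecoder(), the same sketch should work for other codecs that provide an incremental decoder, for example is_valid_stream(chunks, 'utf-16').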

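You can't start feeding the above loop from the middle of a stream, because UTF-8 encodes each code point with a variable number of bytes; a chunk taken from mid-stream may begin partway through a multi-byte sequence, which the decoder will report as invalid data.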

If you can't validate from the start, then your first chunk may start with up to three continuation bytes. You could just remove those first:

first_chunk = b'....'
for _ in range(3):
    # guard against an empty chunk before indexing into it
    if first_chunk and first_chunk[0] & 0xc0 == 0x80:
        # remove continuation byte
        first_chunk = first_chunk[1:]
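The same trimming step can be wrapped in a small helper that slices once instead of copying the chunk for every byte removed. This is only a sketch: the name strip_leading_continuation and its limit parameter are made up for illustration, and it assumes Python 3, where indexing a bytes object yields an integer.

```python
def strip_leading_continuation(chunk, limit=3):
    """Drop up to `limit` leading UTF-8 continuation bytes (0b10xxxxxx)."""
    start = 0
    # Stop at the first non-continuation byte, at `limit`, or at the end.
    while start < min(limit, len(chunk)) and chunk[start] & 0xC0 == 0x80:
        start += 1
    return chunk[start:]
```

For example, a chunk that begins with the last two bytes of the three-byte sequence for '€' (b'\xe2\x82\xac') is trimmed to the ASCII text that follows it, and that remainder can then be fed to the incremental decoder as the first chunk.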

Now, UTF-8 is structured enough that you could also validate the stream entirely in Python code using more such bit tests, but you are simply not going to match the speed of the built-in decoder.
