[plum.str] String Tutorial: Create Custom String Type
This tutorial shows how to create custom string subclasses that define the encoding, error handling, size and padding.
Encoding & Error Handling
The plum.str module provides string types for two common encodings, ascii and utf-8. Subclass the Str class to create a new string type with a different encoding. Use the encoding argument to pass any valid standard encoding name from the codecs module:
>>> from plum import pack
>>>
>>> from plum.str import Str
>>>
>>> class Utf16Str(Str, encoding='utf-16'):
...
...     """Interpret bytes as UTF-16 encoded string."""
...
>>> pack(Utf16Str, 'Ahoj světe')
bytearray(b'\xff\xfeA\x00h\x00o\x00j\x00 \x00s\x00v\x00\x1b\x01t\x00e\x00')
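The encoding name can be checked with the standard library before it is used in a subclass. The sketch below does not use plum. It assumes only that the name ends up in Python's codec registry, and it shows why the packed bytes above begin with \xff\xfe: the generic utf-16 codec prepends a byte order mark, while utf-16-le does not. The helper is_known_encoding is a hypothetical name introduced for illustration.

```python
import codecs

# codecs.lookup() raises LookupError for unknown encoding names.
info = codecs.lookup('utf-16')
canonical_name = info.name  # normalized codec name


def is_known_encoding(name):
    """Return True if `name` is a registered codec name (hypothetical helper)."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# The explicit little-endian codec emits no byte order mark (BOM), so
# prepending the little-endian BOM reproduces the bytes shown above.
text = 'Ahoj světe'
without_bom = text.encode('utf-16-le')
with_le_bom = codecs.BOM_UTF16_LE + without_bom
```

If the byte order mark is unwanted in the packed data, passing an explicit variant such as utf-16-le as the encoding name is the usual way to avoid it.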
By default, the Str class uses 'strict' error handling, meaning that encoding errors raise a UnicodeError. The errors argument controls the error handling behavior and accepts the other standard values, 'ignore' and 'replace', or any other name registered with codecs.register_error():
>>> class FlexibleAsciiStr(Str, encoding='ascii', errors='ignore'):
...
...     """Interpret bytes as ASCII encoded string, ignoring encoding errors."""
...
>>> pack(FlexibleAsciiStr, 'Ahoj světe')
bytearray(b'Ahoj svte')
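The error handler names have the same meaning as in Python's own str.encode(). As a rough illustration that does not involve plum, the sketch below compares the three standard handlers and registers a custom one. The handler underscore_handler and its registered name 'underscore' are made up for this example.

```python
import codecs

text = 'Ahoj světe'

# 'strict' raises UnicodeEncodeError, which is a UnicodeError subclass.
try:
    text.encode('ascii', errors='strict')
    strict_failed = False
except UnicodeError:
    strict_failed = True

ignored = text.encode('ascii', errors='ignore')    # drops the unencodable character
replaced = text.encode('ascii', errors='replace')  # substitutes '?'


def underscore_handler(error):
    """Hypothetical handler: replace each unencodable character with '_'."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # Return the replacement text and the position at which to resume encoding.
    return ('_' * (error.end - error.start), error.end)


# Register the handler under an illustrative name, then use it like a built-in one.
codecs.register_error('underscore', underscore_handler)
custom = text.encode('ascii', errors='underscore')
```

Once registered, a name such as 'underscore' is the kind of value the errors argument is described as accepting.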
Size & Padding
String types provided by the plum.str module are greedy. Greedy string types consume all bytes or accommodate all characters given to them. For applications where the size of the string does not vary, Str supports specifying an nbytes argument when subclassing. It controls the number of bytes the string type consumes from a byte stream during unpack operations. During packing, it controls the number of bytes packed: the string type raises an exception if the string does not fit and pads the string if it is too short.
>>> from plum import pack
>>>
>>> from plum.str import Str
>>>
>>> class AsciiStr16(Str, encoding='ascii', nbytes=16):
...
...     """Interpret bytes as 16 byte ASCII encoded string."""
...
>>> pack(AsciiStr16, 'Hello World!')
bytearray(b'Hello World!\x00\x00\x00\x00')
To control the pad byte, use the pad argument when subclassing (be sure to specify a bytes instance):
>>> class AsciiStrSpacePad(Str, encoding='ascii', nbytes=16, pad=b' '):
...
...     """Interpret bytes as 16 byte ASCII encoded string padded with spaces."""
...
>>> pack(AsciiStrSpacePad, 'Hello World!')
bytearray(b'Hello World!    ')
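The fixed-size behavior described above can be approximated in plain Python. This is a minimal sketch and not plum's implementation: the function names pack_fixed and unpack_fixed are hypothetical, and stripping trailing pad bytes on unpack is an assumption of the sketch.

```python
def pack_fixed(text, encoding='ascii', nbytes=16, pad=b'\x00'):
    """Encode `text` and pad it with `pad` to exactly `nbytes` bytes."""
    if len(pad) != 1:
        raise ValueError('pad must be a single byte')
    encoded = text.encode(encoding)
    if len(encoded) > nbytes:
        # Mirrors the documented behavior: strings that do not fit are rejected.
        raise ValueError(f'{len(encoded)} bytes do not fit in {nbytes}')
    return encoded.ljust(nbytes, pad)


def unpack_fixed(buffer, encoding='ascii', nbytes=16, pad=b'\x00'):
    """Consume exactly `nbytes` bytes and strip trailing pad bytes (assumed)."""
    if len(buffer) < nbytes:
        raise ValueError('buffer shorter than nbytes')
    return bytes(buffer[:nbytes]).rstrip(pad).decode(encoding)


null_padded = pack_fixed('Hello World!')
space_padded = pack_fixed('Hello World!', pad=b' ')
round_trip = unpack_fixed(space_padded, pad=b' ')
```

The pad value has to be a bytes object of length one here because bytes.ljust() accepts only a single fill byte, which is in line with the advice above to pass a bytes instance.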
Zero Terminated
Set the zero_termination argument to True when subclassing Str to create a string type with null byte termination. When unpacking, the string type stops consuming bytes at the null byte. When packing, the string type adds a null byte at the end:
>>> from plum import pack, unpack_and_dump
>>>
>>> from plum.str import Str
>>>
>>> class AsciiZeroTermStr(Str, encoding='ascii', zero_termination=True):
...
...     """Interpret bytes as ASCII encoded zero terminated string."""
...
>>> pack(AsciiZeroTermStr, 'Hello World!')
bytearray(b'Hello World!\x00')
>>>
>>> s, dump = unpack_and_dump(AsciiZeroTermStr, b'Hello World!\x00')
>>> s
'Hello World!'
>>> print(dump)
+--------+-----------------+----------------+-------------------------------------+------------------+
| Offset | Access          | Value          | Bytes                               | Type             |
+--------+-----------------+----------------+-------------------------------------+------------------+
|        |                 |                |                                     | AsciiZeroTermStr |
| 0      | [0:12]          | 'Hello World!' | 48 65 6c 6c 6f 20 57 6f 72 6c 64 21 |                  |
| 12     | --termination-- |                | 00                                  |                  |
+--------+-----------------+----------------+-------------------------------------+------------------+
The nbytes and pad subclassing arguments also work with zero terminated string types:
>>> class AsciiZeroTermStr(Str, encoding='ascii', zero_termination=True, nbytes=10, pad=b'\xff'):
...
...     """Interpret bytes as 10 byte ASCII encoded zero terminated string."""
...
>>> pack(AsciiZeroTermStr, 'Hello!')
bytearray(b'Hello!\x00\xff\xff\xff')
>>>
>>> s, dump = unpack_and_dump(AsciiZeroTermStr, b'Hello!\x00\x00\x00\x00')
>>> s
'Hello!'
>>> print(dump)
+--------+-----------------+----------+-------------------+------------------+
| Offset | Access          | Value    | Bytes             | Type             |
+--------+-----------------+----------+-------------------+------------------+
|        |                 |          |                   | AsciiZeroTermStr |
| 0      | [0:6]           | 'Hello!' | 48 65 6c 6c 6f 21 |                  |
| 6      | --termination-- |          | 00                |                  |
| 7      | --pad--         |          | 00 00 00          |                  |
+--------+-----------------+----------+-------------------+------------------+
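To make the interaction between termination, nbytes, and pad concrete, here is a small plain-Python sketch of the same layout. It is not plum's implementation. The function names are hypothetical, and the sketch assumes a single-byte encoding such as ASCII, where a null byte can only be the terminator.

```python
def pack_zero_terminated(text, encoding='ascii', nbytes=None, pad=b'\x00'):
    """Encode `text`, append a null terminator, then pad to `nbytes` if given."""
    encoded = text.encode(encoding) + b'\x00'
    if nbytes is None:
        return encoded
    if len(encoded) > nbytes:
        raise ValueError('string and terminator do not fit in nbytes')
    return encoded.ljust(nbytes, pad)


def unpack_zero_terminated(buffer, encoding='ascii', nbytes=None):
    """Return (text, bytes consumed), stopping the text at the first null byte."""
    data = bytes(buffer)
    if nbytes is not None:
        if len(data) < nbytes:
            raise ValueError('buffer shorter than nbytes')
        data = data[:nbytes]
    end = data.find(b'\x00')
    if end < 0:
        raise ValueError('no null terminator found')
    # Without nbytes, only the text and its terminator are consumed;
    # with nbytes, the pad bytes after the terminator are consumed as well.
    consumed = end + 1 if nbytes is None else nbytes
    return data[:end].decode(encoding), consumed


greedy = pack_zero_terminated('Hello World!')
sized = pack_zero_terminated('Hello!', nbytes=10, pad=b'\xff')
text, consumed = unpack_zero_terminated(b'Hello!\x00\x00\x00\x00', nbytes=10)
```

The consumed count corresponds to the dump above: six text bytes, one termination byte, and three pad bytes add up to the ten bytes set by nbytes.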