[plum.str] String Tutorial: Create Custom String Type

This tutorial shows how to create custom string subclasses that define the encoding, error handling, size and padding.

Encoding & Error Handling

The plum.str module provides string types for two common encodings, ASCII and UTF-8. To create a string type with a different encoding, subclass the Str class and pass the encoding argument any valid name from the codecs module's standard encodings:

>>> from plum import pack
>>>
>>> from plum.str import Str
>>>
>>> class Utf16Str(Str, encoding='utf-16'):
...
...     """Interpret bytes as UTF-16 encoded string."""
...
>>> pack(Utf16Str, 'Ahoj světe')
bytearray(b'\xff\xfeA\x00h\x00o\x00j\x00 \x00s\x00v\x00\x1b\x01t\x00e\x00')
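
Before subclassing, it can help to confirm that an encoding name is one the codecs module recognizes. The sketch below uses only the standard library; the helper name is_known_encoding is hypothetical and is not part of plum:

```python
import codecs


def is_known_encoding(name):
    """Return True if the codecs module can resolve the encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        # codecs.lookup raises LookupError for unknown encoding names.
        return False
    return True


# Aliases such as "utf16" resolve to a canonical codec name.
canonical = codecs.lookup("utf16").name

print(is_known_encoding("utf-16"))
print(is_known_encoding("klingon-8"))
print(canonical)
```

Any name for which the lookup succeeds should be a reasonable candidate for the encoding argument.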

The Str class defaults to 'strict' error handling, meaning that encoding errors raise a UnicodeError. The errors argument controls the error handling behavior and accepts other values such as 'ignore' and 'replace', or any other name registered with codecs.register_error():

>>> class FlexibleAsciiStr(Str, encoding='ascii', errors='ignore'):
...
...     """Interpret bytes as ASCII encoded string."""
...
>>> pack(FlexibleAsciiStr, 'Ahoj světe')
bytearray(b'Ahoj svte')
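
The error handler names carry the same meaning as in str.encode, so their effect can be previewed with the standard library alone. The following sketch compares the built-in handlers and registers a custom one; the handler name "underscore" and the function underscore_handler are illustrative choices, not part of plum:

```python
import codecs

text = "Ahoj světe"

# Built-in handlers: drop or substitute the unencodable character.
ignored = text.encode("ascii", errors="ignore")
replaced = text.encode("ascii", errors="replace")

# 'strict' raises UnicodeEncodeError, a subclass of UnicodeError.
try:
    text.encode("ascii", errors="strict")
    strict_failed = False
except UnicodeError:
    strict_failed = True


def underscore_handler(exc):
    """Replace each unencodable character with an underscore."""
    if isinstance(exc, UnicodeEncodeError):
        # Return the replacement text and the position to resume encoding.
        return ("_" * (exc.end - exc.start), exc.end)
    raise exc


# Register the handler under a hypothetical name.
codecs.register_error("underscore", underscore_handler)
underscored = text.encode("ascii", errors="underscore")

print(ignored, replaced, strict_failed, underscored)
```

Since Str accepts names registered with codecs.register_error(), a name registered this way could presumably be passed as the errors argument when subclassing.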

Size & Padding

String types provided by the plum.str module are greedy. Greedy string types consume all bytes or accommodate all characters given to them. For applications where the size of the string does not vary, Str supports specifying an nbytes argument when subclassing. It controls the number of bytes the string type consumes from a byte stream during unpack operations. During packing, it controls the number of bytes produced: packing raises an exception if the encoded string does not fit and pads the output if the string is too short.

>>> from plum import pack
>>>
>>> from plum.str import Str
>>>
>>> class AsciiStr16(Str, encoding='ascii', nbytes=16):
...
...     """Interpret bytes as fixed size ASCII encoded string."""
...
>>> pack(AsciiStr16, 'Hello World!')
bytearray(b'Hello World!\x00\x00\x00\x00')

To control the pad byte, use the pad argument when subclassing (be sure to specify a bytes instance):

>>> class AsciiStrSpacePad(Str, encoding='ascii', nbytes=16, pad=b' '):
...
...     """Interpret bytes as fixed size, space padded ASCII encoded string."""
...
>>> pack(AsciiStrSpacePad, 'Hello World!')
bytearray(b'Hello World!    ')
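
The packing rule described above (encode, reject strings that are too long, pad strings that are too short) can be sketched in plain Python. This is an approximation of the described behavior, not plum's implementation; the function name pack_fixed and the choice of ValueError are assumptions:

```python
def pack_fixed(text, encoding="ascii", nbytes=16, pad=b"\x00"):
    """Encode text and pad it to exactly nbytes (hypothetical helper)."""
    if not isinstance(pad, bytes) or len(pad) != 1:
        raise ValueError("pad must be a bytes instance of length 1")
    data = text.encode(encoding)
    if len(data) > nbytes:
        # The exception type plum raises is not stated; ValueError is a guess.
        raise ValueError(f"{len(data)} bytes do not fit in {nbytes}")
    # bytes.ljust appends the pad byte until the total length is nbytes.
    return data.ljust(nbytes, pad)


print(pack_fixed("Hello World!"))
print(pack_fixed("Hello World!", pad=b" "))
```

Note that the limit applies to encoded bytes, not characters, so a multi-byte encoding can overflow nbytes with fewer characters than expected.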

Zero Terminated

Set the zero_termination argument to True when subclassing Str to create a string type with a null byte termination. When unpacking a string, the string type stops consuming bytes at the null byte. When packing, the string type adds a null byte at the end:

>>> from plum import pack, unpack_and_dump
>>>
>>> from plum.str import Str
>>>
>>> class AsciiZeroTermStr(Str, encoding='ascii', zero_termination=True):
...
...     """Interpret bytes as ASCII encoded zero terminated string."""
...
>>> pack(AsciiZeroTermStr, 'Hello World!')
bytearray(b'Hello World!\x00')
>>>
>>> s, dump = unpack_and_dump(AsciiZeroTermStr, b'Hello World!\x00')
>>> s
'Hello World!'
>>> print(dump)
+--------+-----------------+----------------+-------------------------------------+------------------+
| Offset | Access          | Value          | Bytes                               | Type             |
+--------+-----------------+----------------+-------------------------------------+------------------+
|        |                 |                |                                     | AsciiZeroTermStr |
|  0     | [0:12]          | 'Hello World!' | 48 65 6c 6c 6f 20 57 6f 72 6c 64 21 |                  |
| 12     | --termination-- |                | 00                                  |                  |
+--------+-----------------+----------------+-------------------------------------+------------------+

The nbytes and pad subclassing arguments also work with zero terminated string types:

>>> class AsciiZeroTermStr(Str, encoding='ascii', zero_termination=True, nbytes=10, pad=b'\xff'):
...
...     """Interpret bytes as ASCII encoded zero terminated string."""
...
>>> pack(AsciiZeroTermStr, 'Hello!')
bytearray(b'Hello!\x00\xff\xff\xff')
>>>
>>> s, dump = unpack_and_dump(AsciiZeroTermStr, b'Hello!\x00\x00\x00\x00')
>>> s
'Hello!'
>>> print(dump)
+--------+-----------------+----------+-------------------+------------------+
| Offset | Access          | Value    | Bytes             | Type             |
+--------+-----------------+----------+-------------------+------------------+
|        |                 |          |                   | AsciiZeroTermStr |
|  0     | [0:6]           | 'Hello!' | 48 65 6c 6c 6f 21 |                  |
|  6     | --termination-- |          | 00                |                  |
|  7     | --pad--         |          | 00 00 00          |                  |
+--------+-----------------+----------+-------------------+------------------+
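
The zero termination behavior shown in the dumps above can be approximated in plain Python. This is a minimal sketch assuming a single-byte encoding such as ASCII; the function names and the use of ValueError are hypothetical and do not reflect plum's internals:

```python
def pack_zero_terminated(text, encoding="ascii", nbytes=None, pad=b"\x00"):
    """Encode text, append a null terminator, then pad to nbytes if given."""
    data = text.encode(encoding) + b"\x00"
    if nbytes is not None:
        if len(data) > nbytes:
            raise ValueError(f"{len(data)} bytes do not fit in {nbytes}")
        data = data.ljust(nbytes, pad)
    return data


def unpack_zero_terminated(buffer, encoding="ascii", nbytes=None):
    """Decode the bytes that precede the first null byte."""
    if nbytes is not None:
        # A fixed size type only looks at its own nbytes of the buffer.
        buffer = buffer[:nbytes]
    end = buffer.find(b"\x00")
    if end < 0:
        raise ValueError("no null terminator found")
    # Everything after the terminator is padding and is discarded.
    return buffer[:end].decode(encoding)


print(pack_zero_terminated("Hello!", nbytes=10, pad=b"\xff"))
print(unpack_zero_terminated(b"Hello!\x00\x00\x00\x00", nbytes=10))
```

Searching for a single null byte would not be correct for encodings such as UTF-16, where a zero byte can appear inside an ordinary character.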