Standard Compression Scheme for Unicode
The Standard Compression Scheme for Unicode (“SCSU”) is a text encoding defined in Unicode Technical Standard #6.
There was no Python text codec support for SCSU, so I decided to write a module myself.
Installation
To install my SCSU module, install the package or clone the Git repository.
Usage
Import the module and use it like you would any other text codec.
import scsu
s = "Está es texto en español. これは日本語です。"
b = s.encode("SCSU")
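Because the codec plugs into Python's ordinary encode/decode machinery, a round trip makes a quick sanity check. The sketch below assumes that importing scsu registers the "SCSU" codec name for decoding as well as encoding. The roundtrips helper is hypothetical, and the UTF-8 fallback exists only so the snippet still runs when the package is not installed.

```python
def roundtrips(text: str, encoding: str) -> bool:
    """Return True if text survives an encode/decode cycle in the given codec."""
    return text.encode(encoding).decode(encoding) == text


try:
    import scsu  # noqa: F401  (assumption: importing registers the "SCSU" codec)
    codec = "SCSU"
except ImportError:
    codec = "utf-8"  # fallback so the sketch runs without the package

sample = "Está es texto en español. これは日本語です。"
print(codec, roundtrips(sample, codec))
```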
Command line
Input files can be specified on the command line, piped in, or omitted completely to read from stdin.
- To see all the options, use the -h option.
- To see the module version, use the -v option.
Output is always written to stdout.
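The input rule above (named files if given, otherwise stdin) can be sketched as follows. This is not the module's actual code: read_input is a hypothetical helper, and concatenating multiple files in order is an assumption about how several inputs would be combined.

```python
import sys
from typing import BinaryIO, Iterable, Optional


def read_input(paths: Iterable[str], stdin: Optional[BinaryIO] = None) -> bytes:
    """Return the bytes of the named files, or of stdin when no paths are given.

    Hypothetical helper: joining several files in order is an assumption.
    """
    paths = list(paths)
    if not paths:
        # No files named: fall back to the binary stdin stream.
        stream = stdin if stdin is not None else sys.stdin.buffer
        return stream.read()
    chunks = []
    for path in paths:
        with open(path, "rb") as f:
            chunks.append(f.read())
    return b"".join(chunks)
```

Reading in binary mode keeps the decoding step separate, so the source encoding can be chosen afterwards.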
Encoding
Use the encode subcommand to transcode encoded text to SCSU.
python3 -m scsu encode -e UTF-8 -s utf8.txt > scsu.txt
- The -e option specifies the source encoding. By default, this is the codec returned by the locale.getpreferredencoding() function.
- The -s option adds a signature byte string to the output. This is the byte order mark (U+FEFF), encoded as 0x0E 0xFE 0xFF in SCSU.
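The two options above can be illustrated in a few lines. This is only a sketch of the described behaviour, not the module's implementation; add_signature and default_source_encoding are hypothetical names.

```python
import locale

# SCSU signature: the byte order mark U+FEFF, written as 0x0E 0xFE 0xFF.
SCSU_SIGNATURE = bytes([0x0E, 0xFE, 0xFF])


def add_signature(scsu_bytes: bytes) -> bytes:
    """Prepend the SCSU signature to already-encoded SCSU data."""
    return SCSU_SIGNATURE + scsu_bytes


def default_source_encoding() -> str:
    """Return the locale's preferred encoding, the documented default for -e."""
    return locale.getpreferredencoding()


print(default_source_encoding())
print(add_signature(b"abc").hex(" "))
```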
Decoding
Use the decode subcommand to transcode encoded text from SCSU.
python3 -m scsu decode -e UTF-8 -s scsu.txt > utf8.txt
- The -e option specifies the destination encoding. By default, this is the codec returned by the locale.getpreferredencoding() function.
- The -s option removes a signature byte string from the input. If no signature is found, this option does nothing.
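The signature removal described above amounts to a prefix check. A minimal sketch, assuming the signature is the three bytes 0x0E 0xFE 0xFF at the very start of the input; strip_signature is a hypothetical helper, not part of the module's documented API.

```python
# SCSU signature: the byte order mark U+FEFF, written as 0x0E 0xFE 0xFF.
SCSU_SIGNATURE = bytes([0x0E, 0xFE, 0xFF])


def strip_signature(data: bytes) -> bytes:
    """Drop a leading SCSU signature; return the data unchanged if there is none."""
    if data.startswith(SCSU_SIGNATURE):
        return data[len(SCSU_SIGNATURE):]
    return data
```

Input without a signature passes through untouched, which matches the "does nothing" behaviour described for the option.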