Standard Compression Scheme for Unicode

The Standard Compression Scheme for Unicode (“SCSU”) is a text encoding defined in Unicode Technical Standard #6.

Python had no text codec for SCSU, so I wrote this module.

Installation

To install my SCSU module, install the package or clone the Git repository.

Usage

Import the module and use it like you would any other text codec.

import scsu

s = "Esto es texto en español. これは日本語です。"
b = s.encode("SCSU")

Command line

Input files can be given as arguments on the command line. If no files are given, input is read from stdin, so text can also be piped in.

  • To see all the options, use the -h option.
  • To see the module version, use the -v option.

Output is always written to stdout.
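
The input and output behavior described above can be sketched as follows. This is only an illustration of the pattern, not the module's actual code: the helper names read_input and write_output are hypothetical, and joining several input files into one byte string is an assumption.

```python
import sys


def read_input(paths, stdin=None):
    """Return the bytes of every file in `paths`, or of stdin if `paths` is empty.

    Hypothetical helper: concatenating multiple files is an assumption.
    """
    if not paths:
        # No files named on the command line: fall back to binary stdin.
        stream = stdin if stdin is not None else sys.stdin.buffer
        return stream.read()
    chunks = []
    for path in paths:
        with open(path, "rb") as f:
            chunks.append(f.read())
    return b"".join(chunks)


def write_output(data, stdout=None):
    """Write `data` to binary stdout (hypothetical helper)."""
    stream = stdout if stdout is not None else sys.stdout.buffer
    stream.write(data)
    stream.flush()
```

Reading and writing in binary mode keeps the byte stream intact, which matters because SCSU output is not valid text in the terminal's encoding.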

Encoding

Use the encode subcommand to transcode text from another encoding to SCSU.

python3 -m scsu encode -e UTF-8 -s utf8.txt > scsu.txt

  • The -e option specifies the source encoding. By default, this is the encoding returned by locale.getpreferredencoding().
  • The -s option adds a signature byte string to the output. This is the byte order mark (U+FEFF), which SCSU encodes as 0x0E 0xFE 0xFF.
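
To make the two options concrete, here is a minimal sketch of where the signature bytes come from and how the default encoding is looked up. The names with_signature and default_source_encoding are hypothetical and are not part of the module. The 0x0E byte is the SCSU "quote Unicode" tag, which is followed by the character as two big-endian bytes.

```python
import locale

SQU = 0x0E      # SCSU single-byte-mode tag: quote one 16-bit code unit
BOM = "\ufeff"  # byte order mark

# 0x0E followed by U+FEFF as big-endian UTF-16 gives 0x0E 0xFE 0xFF.
SCSU_SIGNATURE = bytes([SQU]) + BOM.encode("utf-16-be")


def with_signature(scsu_bytes):
    """Prepend the SCSU signature to already-encoded SCSU data (hypothetical helper)."""
    return SCSU_SIGNATURE + scsu_bytes


def default_source_encoding():
    """Return the encoding name used when -e is not given, per the description above."""
    return locale.getpreferredencoding()
```

Because the quote tag affects only the one character that follows it, prepending these three bytes does not change how the rest of the stream is decoded.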

Decoding

Use the decode subcommand to transcode SCSU-encoded text to another encoding.

python3 -m scsu decode -e UTF-8 -s scsu.txt > utf8.txt

  • The -e option specifies the destination encoding. By default, this is the encoding returned by locale.getpreferredencoding().
  • The -s option removes a signature byte string from the input. If no signature is found, this option does nothing.
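
The signature handling on the decoding side can be sketched like this. It is a simplified illustration that works on raw bytes before decoding, and strip_signature is a hypothetical name; the module may instead drop the decoded U+FEFF character.

```python
SCSU_SIGNATURE = b"\x0e\xfe\xff"  # U+FEFF as encoded at the start of an SCSU stream


def strip_signature(scsu_bytes):
    """Remove a leading SCSU signature if present; otherwise return the input unchanged."""
    if scsu_bytes.startswith(SCSU_SIGNATURE):
        return scsu_bytes[len(SCSU_SIGNATURE):]
    return scsu_bytes
```

Only a signature at the very start of the input is removed; the same byte sequence later in the stream is left alone.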