Compressed Binary Transmission Format, Eight Bits (CBTF-8)

© Copyright 2010, Color My Data, All Rights Reserved


PROBLEM STATEMENT

Data management systems must accomodate a wide variety of data types. The exchange of information between a remote client and the DMS typically has only one data type: string. When large volume data are acquired from data acquisition systems, string formatting can be costly in limited bandwidth environments and lossy especially for real numbers. For example, infinities, quiet and signalling NaNs may have no string representation in the DMS. To cover the wide range of values, precisions and dynamic ranges a typical DMS differentiates column types by sizes which may or may not be platform or application independent. As a result, data about the source is lost and requires added complexity to handle. Even the transfer of string data has unwanted complexity. To delimit the boundaries of a string most languages enclose strings in quotes. If the quote character must be embedded within the string it must be differentiated using special parsing rules such as an escape sequence. In languages such as PHP two different types of quotes are used and the semantics change depending on the type of quotes used. Unicode requires multi-byte data spanning nearly 221 code points. Some UTF-8 sequences are invalid and require safeguards to prevent exploitation. Moreover, South and East Asian languages require more than two bytes for every character, an inconvenience. All of these factors combine to limit functionality and interoperability and increase required bandwidth while adding unwanted size and complexity.

INNOVATION

Our innovation is to embed data typing and usage indicators in a self-synchronizing byte stream. Data are exchanged without loss in a compressed, byte-oriented, platform-independent binary format (CBTF-8). The encoded length of unsigned and twos-complement integers adapts to the size of the value. The sign, if any, is recovered when the value is widened to the maximum register width of the arithmetic logic unit. Real formats have adaptive precision and dynamic range for lossless data compression yet widen to IEEE754 standard formats. Infinities and both quiet and signalling NaNs are fully supported. Adaptive data compression means column types no longer depend on the storage size, thus reducing complexity and increasing interoperability. Our unicode encoding scheme offers a byte oriented encoding alternative to UTF-8 that conforms to the full range of unicode code points. Unicode strings require no quotes, no escape sequences and no null terminators. Unicode compression supports language localization on the fly so that most characters require no more than two encoded bytes. Nevertheless, UTF-8 is fully supported for backwards compatibility with existing standards.

SOLUTION

The solution requires separation of data content from data about the data (meta-data). The encoding scheme satisfies this requirement by limiting encoded bytes to a set of reserved characters. The reserved character set is a subset of the graphic characters from the 7-bit ASCII character set. Thus, encoded data can be printed to facilitate data evaluation using human readable text.

The reserved character set groups characters by function into sextets, delimiters, usage indicators and subrange indicators.

ReservedCharacter:
	Sextet
	Delimiter
	UsageIndicator
	SubrangeIndicator

Sextets

A sextet is a one-to-one mapping of a six bit value onto a reserved character. The sextet is the payload for information transfer. Sextets contain no information about usage or semantics.
Sextet: in order of appearance, one of
	0 1 2 3 4 5 6 7 8 9 A B C D E F
	G H I J K L M N O P Q R S T U V
	W X Y Z ^ _ a b c d e f g h i j
	k l m n o p q r s t u v w x y z

Delimiters

Three reserved characters delimit the responses to a database query.

Delimiter: one of
	{	Content aggregator
	}	Content terminator
	]	Record terminator

The content aggregator { signals that in a recordset zero or more records will follow.

RecordsetContent:
	{
	RecordsetContent Record

The content terminator } follows the last record to close the recordset.

Recordset:
	RecordsetContent }

Each record comprises one or more fields in a field list terminated with the record terminator.

FieldList:
	Field
	FieldList Field

Record:
	FieldList ]

Each field is one of five basic types.

Field:
	WholeNumberField
	IntegerField
	RealNumberField
	CharacterOrStringField
	BooleanOrSetField

Usage Indicators and Components

Multiple sextets can be concatenated to form components which are integral multiples of 6 bits in length. Each component begins with a usage indicator and ends with the usage indicator of the successor component or the record terminator. The length of a component in sextets is the difference between successive usage indicators less one. Uninitialized components have zero length.
Component:
	UsageIndicator
	Component Sextet

The usage indicator differentiates components by type and by usage. It is also a component delimiter and a point of synchronization. This is why unicode strings require no quotes, no escape sequences and no null terminators. Although many components are fields, the unicode bias component and the array dimension component are not fields.

UsageIndicator: one of the following values from the 7-bit ASCII character set
	+	a whole number (non-negative integer) type
	-	a twos complement integer type
	#	a real number type
	&	a boolean type or a set of boolean values
	'	a unicode character type or a string of unicode characters
	=	unicode bias (whole number type)
	[	array dimension (whole number type)

Whole Numbers

Whole numbers are non-negative values. The encoding scheme adapts the length of the whole number to the value being encoded by requiring the sextet following the usage indicator to be greater than 0 except when it is the last and only sextet. When decoded, the value extracted from the most significant sextet is zero extended to the maximum width of the register. There is only one whole number type but three use cases.

Whole Number Fields

The character + indicates the type is a whole number and the value is either a field or an array element. Whole number fields always use encoded sextets. A zero length field is uninitialized.

WholeNumberField:
	+
	WholeNumberField Sextet

Array Dimensions

The character [ indicates the type is a whole number and the value is the dimension of an array. Array dimensions are always initialized and must contain at least one sextet.

ArrayDimension:
	[ sextet
	ArrayDimension Sextet

Array elements are always unencoded octets (eight bit values).

Whole Number Arrays

A whole number array has two dimensions. The first dimension is the number of array elements. The second dimension is the number of octets per element. There is no compelling reason to limit the number of octets to a power of 2. Instead, there should be a range constraint applied to all elements of the array and the number of octets should be the smallest number that will satisfy the range constraint. Values must be validated against the range constraints before assignment to the array. Values extracted from the array are then known to satisfy the range constraint.

WholeNumberArray:
	ArrayDimension ArrayDimension +
	WholeNumberArray Octet

Unicode Bias

The character = indicates the type is a whole number and the value is a unicode bias. The unicode bias allows any subrange of 128 consecutive code points between U+0080 to U+10FFFF to be addressed with two sextets. The unicode bias is range constrained to values between U+0080 and U+10FF8F and therefore must be followed by a minimum of two sextets.

UnicodeBias:
	= Sextet Sextet
	UnicodeBias Sextet

Twos Complement Integers

The character - indicates the type is a twos complement integer and the value is either a field or an array element. Twos complement integers negate the weight of the most significant bit. A six-bit value between 0 and 31 (encoded 0 to V) is therefore a whole number. A six-bit value between 32 and 63 (encoded W to z) counts up from -32 to yield a value between -32 and -1. Data compression is achieved by conditionally discarding leading sextets containing 0 or z. The condition for discarding the 0 after the usage indicator is that the successor sextet is between 0 and V inclusive. The condition for discarding the z after the usage indicator is that the successor sextet is between W and z inclusive. When decoded, the most significant sextet is sign extended by subtracting 64 from decoded whole numbers between 32 and 63. The result is a value between -32 and 31 widened to the maximum register width.

Integer Fields

Integer fields are encoded using sextets. An integer field with no sextets is uninitialized.

IntegerField:
	-
	IntegerField Sextet

Integer Arrays

When the lower bound on the range constraint of an array of integer values ≥ 0, the whole number array is preferred over the integer array because twice as many positive values can be stored in the same number of octets. For example, with two octets per value an integer array can store values between 0 and 32767; but a whole number array can store values between 0 and 65535.

On the other hand, when the lower bound on the range constraint of an array of integers < 0, the array of integers must be used. As in the case of whole number arrays the first dimension is the number of elements and the second dimension is the number of octets per element. Also, the number of octets per element must satisfy both bounds of the array element range constraint.

IntegerArray:
	ArrayDimension ArrayDimension -
	IntegerArray Octet

Real Numbers

The character # indicates the type is a real number type and the value is either a field or an array element. All real numbers have a sign, a biased exponent and a significand and can be widened to conform with standard IEEE-754 formats.

Real Number Fields

Encoded real numbers have six-bit boundaries and may be much narrower than the IEEE-754 formatted values from which they were encoded. The process for narrowing encoded real numbers adapts the dynamic range and precision to the value and is lossless.

RealNumberField:
	# Sextet Sextet
	RealNumberField Sextet

To encode IEEE754 formatted data the raw bits of the binary format must first be obtained. For example, in Java the method Double.doubleToRawLongBits(double value) yields a 64 bit long value corresponding to the IEEE-754 binary64 format. Shift and mask operations can then be applied to the result to create a sequence of six-bit values which are then encoded as sextets. Since the binary formats are powers of 2 and the encoded values are on six bit boundaries, the least significant byte of data is padded with either 2 or 4 bits depending on the IEEE754 format of the source as shown in the following table.

IEEE754 FormatSextetsPadding Bits
binary1632
binary3264
binary64112
binary128224

If the least significant sextet is 0, the number is a candidate for adaptive narrowing. The values +/- infinity, +/- zero and quiet NaN satisfy this condition and can always be narrowed to two sextets. Signalling NaNs can also be narrowed to two sextets provided all the signals are in the 5 highest bits of the significand. Narrowing is limited by the dynamic range of the exponent and the number of trailing zeros in the significand. The dynamic range is a function of the number of bits in the exponent. As seen in the following table, the number of bits in the exponent and consequently the bias applied to the exponent is itself a function of the length of the component in sextets or the element storage size of the array in octets.

BitsBiasOctetsSextets
51522-3
63134
812745-6
1110235-87-11
15163839-1612-22

Trailing zeros in the significand can be discarded with no loss of precision provided that the dynamic range does not change. If narrowing the dynamic range would result in underflow, overflow or subnormal values, no further narrowing is permitted. Otherwise, the narrowing operation continues over the reduced dynamic range.

For example, let 0.0625 be a double precision value (binary64 format). Nominally, double precision values are padded to binary66 and encode as eleven sextets. However, 0.0625 is also a power of 2. (2-4). This means all bits in the significand are 0. For a bias of 15, the biased exponent is +11 which means underflow or subnormal values are no factor. As a consequence, this value can be stored in binary12 format and take only two sextets.

When decoded, the bits from the sextets are concatenated to form an integral value. Shift and mask operations extract the sign, exponent and significand from this value and the value is widened to the desired IEEE754 binary format. The raw bits from the widening operation are converted to a real number with a reciprocal operation such as Java's Double.longBitsToDouble(long value)

Real Arrays

The real array has two dimensions. The first is the number of values; the second is the number of octets per element. Carefully choose the number of octets per element because the fixed number of octets per element subject the stored values to loss of both dynamic range and precision

RealArray:
	ArrayDimension ArrayDimension #
	RealArray Octet

To set dynamic range compare the largest and smallest power of 2 expected with the bias in table 2. Allow 10 units of bias for every multiple of 1000. For example, the ratio of one second to one nanosecond is a billion to one. A billion has nine zeros (3 multiples of three zeros) so the bias should be a minimum of 30 (3 x 10). Five bits of dynamic range is too small, six bits is borderline and eight bits is sufficient.

Next we need to check precision. To calculate precision we subtract the number of bits of exponent from the product of 8 and the number of octets. Note that the sign bit cancels the implicit bit of the significand. For example, if we want one nanosecond precision per second of time, the significand needs to be a minimum of 30 bits.

Three octets gives us 24-6 or 18 bits of precision (17 + 1 implicit bit) = not even close.
Four octets gives us 32-8 or 24 bits of precision (23 + 1 implicit bit) = still inadequate .
Five octets gives us 40-11 or 29 bits of precision (28 + 1 implicit bit) = not quite enough.
Six octets gives us 48-11 or 37 bits of precision (36 + 1 implicit bit) = we are there.

We therefore conclude that we should be using double precision (IEEE754 binary64) for calculations but can narrow the stored value from eight octets to six octets and safely meet the requirement of one nanosecond precision per second of time.

Characters or Strings

The character ' indicates the type is a character or string and the value is either an encoded character field, an encoded string field or an array of octets encoded as UTF-8. A character field has exactly one character following the usage indicator whereas a string field has 0 or more characters following the usage indicator. There is no difference between a character field and a string field containing exactly one character.

CharacterField:
	' Character

String Field:
	'
	StringField Character

Subrange Indicators

Subrange indicators address the requirement to span 221 unicode code points with an unambiguous, byte-oriented encoding scheme using only reserved characters. The digits 0 to 9 the letters A to Z and a to z, and the characters ^ and _ do not change when encoded. The unicode code point is the same as the encoded value. All other unicode code points require multiple reserved characters to encode. The first reserved character is the subrange indicator. It indicates the number of sextets needed for encoding and the code point offset. Subrange indicators can be used for synchronization on multi-byte characters but are NOT delimiters.

Character:
	Sextet				CodePoint = Sextet
	! Sextet			CodePoint = OtherAsciiCharacter
	< Sextet			CodePoint = UnicodeBias + 6-bit value
	> Sextet			CodePoint = UnicodeBias + 6-bit value + 64
	" Sextet Sextet			CodePoint = U+0080 + 12-bit value
	$ Sextet Sextet Sextet		CodePoint = U+1080 + 18-bit value
	% Sextet Sextet Sextet Sextet	CodePoint = U+41080 + 24-bit value	
	

OtherAsciiCharacter: in order of appearance, one of
	NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
	DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
	sp  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
	:   ;   <   =   >   ?   @   [   \   ]   `   {   |   }   ~   DEL

Unicode Data Compression

An issue for some South and East Asian languages is the number of bytes required to represent a single character. To address this issue a data compression scheme offsets a preferred range of 128 consecutive code points with a configurable bias. The unicode bias configures the stream for a given locale and language. Once configured, sextets beginning with the subrange indicators < and > index either the lower or upper 64 code points of the range. The default bias is 128 so that the range U+0080 to U+00FF can be represented with one additional sextet and not two.

Variable Character Elements and Arrays

The Unicode Transfer Format UTF-8 uses all eight bits of the octet. Since encoded values are restricted to reserved characters, UTF-8 cannot be used in encoded fields. However, UTF-8 octets are used in variable character elements. The length which dimensions the variable character element is the number of octets. For example, the string hello takes six octets h e l l o NUL. The corresponding variable character element is [ 6 ' h e l l o NUL. Thus, if the posted data from a client is in UTF-8 format, it can be immediately stored in that same format. In response to a client query, the UTF-8 data is encoded using sextets as a character field or a string field (e.g. ' h e l l o). When the client parses the recordset the character or string field is decoded and reverts to UTF-8 for compatibility with the client's browser.

VarCharElement:
	ArrayDimension '
	VarCharElement octet

Variable character elements may be combined to yield an array of variable character elements. Since a two dimensional array requires all elements to have the same number of octets, curly braces { } are used to mark the start and end of the variable length character array content.

VarCharArrayContent:
	{
	VarCharArrayContent VarCharElement

VarCharArray:
	VarCharArrayContent }

Booleans or Boolean Sets

The character & indicates the type is a Boolean Set and the value is either an encoded Boolean value an encoded Boolean Set or an array of octets organized as a packed array of Boolean values or a packed array of Boolean Sets. Each bit of a sextet or an octet represents a boolean value. Thus, a set of up to six boolean values can be packed into a sextet and eight boolean values can be packed into an octet. The boolean values are ordered from most significant bit to least significant bit. Thus, a single boolean value uses the most significant bit.

Boolean or Boolean Set Field

The Boolean field holds a single boolean value. The Boolean set field holds a set of one or more Boolean values. Both are encoded with sextets.

BooleanField:
	& Sextet	

BooleanSetField:
	& Sextet
	BooleanSetField Sextet

To test whether a boolean value is set, the sextet containing the bit is extracted as a whole number and left shifted 2 + index mod 6 bits. If the resulting byte is negative, the value true is returned; otherwise false is returned. To set, clear or toggle a boolean value, the sextet is first converted to a whole number. The constant 32 is right shifted index mod 6 bits and subtracted from 63 to create a mask. To clear the bit and the whole number with the mask and reencode the result as a sextet. To set the bit or the whole number with the right shifted index mod 6 and reencode the result as a sextet. To toggle the bit exclusive or the whole number with the right shifted index mod 6 and reencode the result as a sextet.

Boolean Arrays

A Boolean field uses only one of 6 bits to hold a single Boolean value. Eight fields would be needed for eight values in a column. However, using a Boolean array for a column of Boolean values uses all eight bits of an octet thereby achieving an 8 to 1 compression of data. The number of octets for the array is the dimension's whole number + 7 right shift 3. Thus, 1-8 values store in one octet; 9-16 values store in two octets, etc.

BooleanArray:
	ArrayDimension &
	BooleanArray octet

Boolean Set Arrays

With a two dimensional array of octets, the first dimension declares how many sets there are and the second dimension is the number of elements per set. Since eight elements can be packed into an octet, the number of octets per set is the second dimension + 7 right shift 3. The product of this result and the first dimension's whole number is the total number of octets in the array.

BooleanSetArray:
	ArrayDimension ArrayDimension &
	BooleanSetArray octet

Arrays

Wheras records are ideal for working with heterogeneous data, arrays are better suited to homogeneous data. In a relational database first normal form can be enforced by restricting each column to a single usage indicator. This means the elements of a column can be implemented as an array. Accessing an entire column of data is highly desirable in signal processing applications such as digital filters or Fourier Transforms where time dependent data can span a large number of records.

For example, a column of a table can be any of the following arrays.

Column:
	WholeNumberArray Enumerationopt
	IntegerArray
	VarCharArray
	RealArray
	BooleanArray
	BooleanSetArray

Enumeration:
	VarCharArray
	Collation

Collation:
	WholeNumberArray
	

Summary

The following table illustrates how usage and subrange indicators affect the succeeding sextet.

Sextet	+ [ =	-	'	'!	'"	'$	'%	#	&
0	0	0	0	NUL	U+0080	U+1080	U+41080	+00000	FFFFFF
1	1	1	1	SOH	U+00C0	U+2080	U+81080	+00001	FFFFFT
2	2	2	2	STX	U+0100	U+3080	U+C1080	+00010	FFFFTF	
3	3	3	3	ETX	U+0140	U+4080 U+101080	+00011	FFFFTT
4	4	4	4	EOT	U+0180	U+5080	Illegal	+00100	FFFTFF
5	5	5	5	ENQ	U+01C0  U+6080	Illegal	+00101	FFFTFT
6	6	6	6	ACK	U+0200	U+7080	Illegal	+00110	FFFTTF
7	7	7	7	BEL	U+0240	U+8080	Illegal	+00111	FFFTTT
8	8	8	8	BS	U+0280	U+9080	Illegal	+01000	FFTFFF
9	9	9	9	HT	U+02C0	U+A080	Illegal	+01001	FFTFFT
A	10	10	A	LF	U+0300	U+B080	Illegal	+01010	FFTFTF
B	11	11	B	VT	U+0340	U+C080	Illegal	+01011	FFTFTT
C	12	12	C	FF	U+0380	U+D080	Illegal	+01100	FFTTFF
D	13	13	D	CR	U+03C0	U+E080	Illegal	+01101	FFTTFT
E	14	14	E	SO	U+0400	U+F080	Illegal	+01110	FFTTTF
F	15	15	F	SI	U+0440	U+10080	Illegal	+01111	FFTTTT
G	16	16	G	DLE	U+0480	U+11080	Illegal	+10000	FTFFFF
H	17	17	H	DC1	U+04C0	U+12080	Illegal	+10001	FTFFFT
I	18	18	I	DC2	U+0500	U+13080	Illegal	+10010	FTFFTF
J	19	19	J	DC3	U+0540	U+14080	Illegal	+10011	FTFFTT
K	20	20	K	DC4	U+0580	U+15080	Illegal	+10100	FTFTFF
L	21	21	L	NAK	U+05C0	U+16080	Illegal	+10101	FTFTFT
M	22	22	M	SYN	U+0600	U+17080	Illegal	+10110	FTFTTF
N	23	23	N	ETB	U+0640	U+18080	Illegal	+10111	FTFTTT
O	24	24	O	CAN	U+0680	U+19080	Illegal	+11000	FTTFFF
P	25	25	P	EM	U+06C0	U+1A080	Illegal	+11001	FTTFFT
Q	26	26	Q	SUB	U+0700	U+1B080	Illegal	+11010	FTTFTF
R	27	27	R	ESC	U+0740	U+1C080	Illegal	+11011	FTTFTT
S	28	28	S	FS	U+0780	U+1D080	Illegal	+11100	FTTTFF
T	29	29	T	GS	U+07C0	U+1E080	Illegal	+11101	FTTTFT
U	30	30	U	RS	U+0800	U+1F080	Illegal	+11110	FTTTTF
V	31	31	V	US	U+0840	U+20080	Illegal	+11111	FTTTTT
W	32	-32	W	space	U+0880	U+21080	Illegal	-00000	TFFFFF
X	33	-31	X	!	U+08C0	U+22080	Illegal	-00001	TFFFFT
Y	34	-30	Y	"	U+0900	U+23080	Illegal	-00010	TFFFTF
Z	35	-29	Z	#	U+0940	U+24080 Illegal	-00011	TFFFTT
^	36	-28	^	$	U+0980	U+25080	Illegal	-00100	TFFTFF
_	37	-27	_	%	U+09C0	U+26080	Illegal	-00101	TFFTFT
a	38	-26	a	&	U+0A00	U+27080	Illegal	-00110	TFFTTF
b	39	-25	b	'	U+0A40	U+28080	Illegal	-00111	TFFTTT
c	40	-24	c	(	U+0A80	U+29080	Illegal	-01000	TFTFFF
d	41	-23	d	)	U+0AC0	U+2A080	Illegal	-01001	TFTFFT
e	42	-22	e	*	U+0B00	U+2B080 Illegal	-01010	TFTFTF
f	43	-21	f	+	U+0B40	U+2C080	Illegal	-01011	TFTFTT
g	44	-20	g	,	U+0B80	U+2D080	Illegal	-01100	TFTTFF
h	45	-19	h	-	U+0BC0	U+2E080	Illegal	-01101	TFTTFT
i	46	-18	i	.	U+0C00	U+2F080	Illegal	-01110	TFTTTF
j	47	-17	j	/	U+0C40	U+30080	Illegal	-01111	TFTTTT
k	48	-16	k	:	U+0C80	U+31080	Illegal	-10000	TTFFFF
l	49	-15	l	;	U+0CC0	U+32080	Illegal	-10001	TFFFFT
m	50	-14	m	<	U+0D00	U+33080	Illegal	-10010	TTFFTF
n	51	-13	n	=	U+0D40	U+34080	Illegal	-10011	TTFFTT
o	52	-12	o	>	U+0D80	U+35080	Illegal	-10100	TTFTFF
p	53	-11	p	?	U+0DC0	U+36080	Illegal	-10101	TTFTFT
q	54	-10	q	@	U+0E00	U+37080	Illegal	-10110	TTFTTF
r	55	-9	r	[	U+0E40	U+38080	Illegal	-10111	TTFTTT
s	56	-8	s	\	U+0E80	U+39080	Illegal	-11000	TTTFFF
t	57	-7	t	]	U+0EC0	U+3A080	Illegal	-11001	TTTFFT
u	58	-6	u	`	U+F000	U+3B080	Illegal	-11010	TTTFTF
v	59	-5	v	{	U+0F40	U+3C080	Illegal	-11011	TTTFTT
w	60	-4	w	|	U+0F80	U+3D080	Illegal	-11100	TTTTFF
x	61	-3	x	}	U+0FC0	U+3E080	Illegal	-11101	TTTTFT
y	62	-2	y	~	U+1000	U+3F080 Illegal	-11110	TTTTTF
z	63	-1	z	del	U+1040	U+40080 Illegal	-11111	TTTTTT

Color My Data Dynamic Data Model with CBTF-8