© Copyright 2010, Color My Data, All Rights Reserved
Data management systems must accomodate a wide variety of data types. The exchange of information between a remote client and the DMS typically has only one data type: string. When large volume data are acquired from data acquisition systems, string formatting can be costly in limited bandwidth environments and lossy especially for real numbers. For example, infinities, quiet and signalling NaNs may have no string representation in the DMS. To cover the wide range of values, precisions and dynamic ranges a typical DMS differentiates column types by sizes which may or may not be platform or application independent. As a result, data about the source is lost and requires added complexity to handle. Even the transfer of string data has unwanted complexity. To delimit the boundaries of a string most languages enclose strings in quotes. If the quote character must be embedded within the string it must be differentiated using special parsing rules such as an escape sequence. In languages such as PHP two different types of quotes are used and the semantics change depending on the type of quotes used. Unicode requires multi-byte data spanning nearly 221 code points. Some UTF-8 sequences are invalid and require safeguards to prevent exploitation. Moreover, South and East Asian languages require more than two bytes for every character, an inconvenience. All of these factors combine to limit functionality and interoperability and increase required bandwidth while adding unwanted size and complexity.
Our innovation is to embed data typing and usage indicators in a self-synchronizing byte stream. Data are exchanged without loss in a compressed, byte-oriented, platform-independent binary format (CBTF-8). The encoded length of unsigned and twos-complement integers adapts to the size of the value. The sign, if any, is recovered when the value is widened to the maximum register width of the arithmetic logic unit. Real formats have adaptive precision and dynamic range for lossless data compression yet widen to IEEE754 standard formats. Infinities and both quiet and signalling NaNs are fully supported. Adaptive data compression means column types no longer depend on the storage size, thus reducing complexity and increasing interoperability. Our unicode encoding scheme offers a byte oriented encoding alternative to UTF-8 that conforms to the full range of unicode code points. Unicode strings require no quotes, no escape sequences and no null terminators. Unicode compression supports language localization on the fly so that most characters require no more than two encoded bytes. Nevertheless, UTF-8 is fully supported for backwards compatibility with existing standards.
The solution requires separation of data content from data about the data (meta-data). The encoding scheme satisfies this requirement by limiting encoded bytes to a set of reserved characters. The reserved character set is a subset of the graphic characters from the 7-bit ASCII character set. Thus, encoded data can be printed to facilitate data evaluation using human readable text.
The reserved character set groups characters by function into sextets, delimiters, usage indicators and subrange indicators.
ReservedCharacter: Sextet Delimiter UsageIndicator SubrangeIndicator
Sextet: in order of appearance, one of 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ^ _ a b c d e f g h i j k l m n o p q r s t u v w x y z
Three reserved characters delimit the responses to a database query.
Delimiter: one of
{ Content aggregator
} Content terminator
] Record terminator
The content aggregator { signals that in a recordset zero or more records will follow.
RecordsetContent:
{
RecordsetContent Record
The content terminator } follows the last record to close the recordset.
Recordset: RecordsetContent }
Each record comprises one or more fields in a field list terminated with the record terminator.
FieldList: Field FieldList Field Record: FieldList ]
Each field is one of five basic types.
Field: WholeNumberField IntegerField RealNumberField CharacterOrStringField BooleanOrSetField
Component: UsageIndicator Component Sextet
The usage indicator differentiates components by type and by usage. It is also a component delimiter and a point of synchronization. This is why unicode strings require no quotes, no escape sequences and no null terminators. Although many components are fields, the unicode bias component and the array dimension component are not fields.
UsageIndicator: one of the following values from the 7-bit ASCII character set + a whole number (non-negative integer) type - a twos complement integer type # a real number type & a boolean type or a set of boolean values ' a unicode character type or a string of unicode characters = unicode bias (whole number type) [ array dimension (whole number type)
Whole numbers are non-negative values. The encoding scheme adapts the length of the whole number to the value being encoded by requiring the sextet following the usage indicator to be greater than 0 except when it is the last and only sextet. When decoded, the value extracted from the most significant sextet is zero extended to the maximum width of the register. There is only one whole number type but three use cases.
The character + indicates the type is a whole number and the value is either a field or an array element. Whole number fields always use encoded sextets. A zero length field is uninitialized.
WholeNumberField: + WholeNumberField Sextet
The character [ indicates the type is a whole number and the value is the dimension of an array. Array dimensions are always initialized and must contain at least one sextet.
ArrayDimension: [ sextet ArrayDimension Sextet
Array elements are always unencoded octets (eight bit values).
WholeNumberArray: ArrayDimension ArrayDimension + WholeNumberArray Octet
The character = indicates the type is a whole number and the value is a unicode bias. The unicode bias allows any subrange of 128 consecutive code points between U+0080 to U+10FFFF to be addressed with two sextets. The unicode bias is range constrained to values between U+0080 and U+10FF8F and therefore must be followed by a minimum of two sextets.
UnicodeBias: = Sextet Sextet UnicodeBias Sextet
The character - indicates the type is a twos complement integer and the value is either a field or an array element. Twos complement integers negate the weight of the most significant bit. A six-bit value between 0 and 31 (encoded 0 to V) is therefore a whole number. A six-bit value between 32 and 63 (encoded W to z) counts up from -32 to yield a value between -32 and -1. Data compression is achieved by conditionally discarding leading sextets containing 0 or z. The condition for discarding the 0 after the usage indicator is that the successor sextet is between 0 and V inclusive. The condition for discarding the z after the usage indicator is that the successor sextet is between W and z inclusive. When decoded, the most significant sextet is sign extended by subtracting 64 from decoded whole numbers between 32 and 63. The result is a value between -32 and 31 widened to the maximum register width.
Integer fields are encoded using sextets. An integer field with no sextets is uninitialized.
IntegerField: - IntegerField Sextet
When the lower bound on the range constraint of an array of integer values ≥ 0, the whole number array is preferred over the integer array because twice as many positive values can be stored in the same number of octets. For example, with two octets per value an integer array can store values between 0 and 32767; but a whole number array can store values between 0 and 65535.
On the other hand, when the lower bound on the range constraint of an array of integers < 0, the array of integers must be used. As in the case of whole number arrays the first dimension is the number of elements and the second dimension is the number of octets per element. Also, the number of octets per element must satisfy both bounds of the array element range constraint.
IntegerArray: ArrayDimension ArrayDimension - IntegerArray Octet
Encoded real numbers have six-bit boundaries and may be much narrower than the IEEE-754 formatted values from which they were encoded. The process for narrowing encoded real numbers adapts the dynamic range and precision to the value and is lossless.
RealNumberField: # Sextet Sextet RealNumberField Sextet
To encode IEEE754 formatted data the raw bits of the binary format must first be obtained.
For example, in Java the method Double.doubleToRawLongBits(double value) yields a 64 bit long value corresponding to the IEEE-754 binary64 format.
Shift and mask operations can then be applied to the result to create a sequence of six-bit values which are then encoded as sextets.
Since the binary formats are powers of 2 and the encoded values are on six bit boundaries, the least significant byte of data is padded with either 2 or 4 bits
depending on the IEEE754 format of the source as shown in the following table.
| IEEE754 Format | Sextets | Padding Bits |
|---|---|---|
| binary16 | 3 | 2 |
| binary32 | 6 | 4 |
| binary64 | 11 | 2 |
| binary128 | 22 | 4 |
If the least significant sextet is 0, the number is a candidate for adaptive narrowing. The values +/- infinity, +/- zero and quiet NaN satisfy this condition and can always be narrowed to two sextets. Signalling NaNs can also be narrowed to two sextets provided all the signals are in the 5 highest bits of the significand. Narrowing is limited by the dynamic range of the exponent and the number of trailing zeros in the significand. The dynamic range is a function of the number of bits in the exponent. As seen in the following table, the number of bits in the exponent and consequently the bias applied to the exponent is itself a function of the length of the component in sextets or the element storage size of the array in octets.
| Bits | Bias | Octets | Sextets |
|---|---|---|---|
| 5 | 15 | 2 | 2-3 |
| 6 | 31 | 3 | 4 |
| 8 | 127 | 4 | 5-6 |
| 11 | 1023 | 5-8 | 7-11 |
| 15 | 16383 | 9-16 | 12-22 |
Trailing zeros in the significand can be discarded with no loss of precision provided that the dynamic range does not change. If narrowing the dynamic range would result in underflow, overflow or subnormal values, no further narrowing is permitted. Otherwise, the narrowing operation continues over the reduced dynamic range.
For example, let 0.0625 be a double precision value (binary64 format). Nominally, double precision values are padded to binary66 and encode as eleven sextets. However, 0.0625 is also a power of 2. (2-4). This means all bits in the significand are 0. For a bias of 15, the biased exponent is +11 which means underflow or subnormal values are no factor. As a consequence, this value can be stored in binary12 format and take only two sextets.
When decoded, the bits from the sextets are concatenated to form an integral value.
Shift and mask operations extract the sign, exponent and significand from this value and the value is widened to the desired IEEE754 binary format.
The raw bits from the widening operation are converted to a real number with a reciprocal operation such as Java's Double.longBitsToDouble(long value)
The real array has two dimensions. The first is the number of values; the second is the number of octets per element. Carefully choose the number of octets per element because the fixed number of octets per element subject the stored values to loss of both dynamic range and precision
RealArray: ArrayDimension ArrayDimension # RealArray Octet
To set dynamic range compare the largest and smallest power of 2 expected with the bias in table 2. Allow 10 units of bias for every multiple of 1000. For example, the ratio of one second to one nanosecond is a billion to one. A billion has nine zeros (3 multiples of three zeros) so the bias should be a minimum of 30 (3 x 10). Five bits of dynamic range is too small, six bits is borderline and eight bits is sufficient.
Next we need to check precision. To calculate precision we subtract the number of bits of exponent from the product of 8 and the number of octets. Note that the sign bit cancels the implicit bit of the significand. For example, if we want one nanosecond precision per second of time, the significand needs to be a minimum of 30 bits.
Three octets gives us 24-6 or 18 bits of precision (17 + 1 implicit bit) = not even close.
Four octets gives us 32-8 or 24 bits of precision (23 + 1 implicit bit) = still inadequate .
Five octets gives us 40-11 or 29 bits of precision (28 + 1 implicit bit) = not quite enough.
Six octets gives us 48-11 or 37 bits of precision (36 + 1 implicit bit) = we are there.
The character ' indicates the type is a character or string and the value is either an encoded character field, an encoded string field or an array of octets encoded as UTF-8. A character field has exactly one character following the usage indicator whereas a string field has 0 or more characters following the usage indicator. There is no difference between a character field and a string field containing exactly one character.
CharacterField: ' Character String Field: ' StringField Character
Subrange indicators address the requirement to span 221 unicode code points with an unambiguous, byte-oriented encoding scheme using only reserved characters. The digits 0 to 9 the letters A to Z and a to z, and the characters ^ and _ do not change when encoded. The unicode code point is the same as the encoded value. All other unicode code points require multiple reserved characters to encode. The first reserved character is the subrange indicator. It indicates the number of sextets needed for encoding and the code point offset. Subrange indicators can be used for synchronization on multi-byte characters but are NOT delimiters.
Character:
Sextet CodePoint = Sextet
! Sextet CodePoint = OtherAsciiCharacter
< Sextet CodePoint = UnicodeBias + 6-bit value
> Sextet CodePoint = UnicodeBias + 6-bit value + 64
" Sextet Sextet CodePoint = U+0080 + 12-bit value
$ Sextet Sextet Sextet CodePoint = U+1080 + 18-bit value
% Sextet Sextet Sextet Sextet CodePoint = U+41080 + 24-bit value
OtherAsciiCharacter: in order of appearance, one of
NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
sp ! " # $ % & ' ( ) * + , - . /
: ; < = > ? @ [ \ ] ` { | } ~ DEL
An issue for some South and East Asian languages is the number of bytes required to represent a single character. To address this issue a data compression scheme offsets a preferred range of 128 consecutive code points with a configurable bias. The unicode bias configures the stream for a given locale and language. Once configured, sextets beginning with the subrange indicators < and > index either the lower or upper 64 code points of the range. The default bias is 128 so that the range U+0080 to U+00FF can be represented with one additional sextet and not two.
The Unicode Transfer Format UTF-8 uses all eight bits of the octet.
Since encoded values are restricted to reserved characters, UTF-8 cannot be used in encoded fields.
However, UTF-8 octets are used in variable character elements.
The length which dimensions the variable character element is the number of octets.
For example, the string hello takes six octets h e l l o NUL.
The corresponding variable character element is [ 6 ' h e l l o NUL.
Thus, if the posted data from a client is in UTF-8 format, it can be immediately stored in that same format.
In response to a client query, the UTF-8 data is encoded using sextets as a character field or a string field (e.g. ' h e l l o).
When the client parses the recordset the character or string field is decoded and reverts to UTF-8 for compatibility with the client's browser.
VarCharElement: ArrayDimension ' VarCharElement octet
Variable character elements may be combined to yield an array of variable character elements. Since a two dimensional array requires all elements to have the same number of octets, curly braces { } are used to mark the start and end of the variable length character array content.
VarCharArrayContent:
{
VarCharArrayContent VarCharElement
VarCharArray:
VarCharArrayContent }
The character & indicates the type is a Boolean Set and the value is either an encoded Boolean value an encoded Boolean Set or an array of octets organized as a packed array of Boolean values or a packed array of Boolean Sets. Each bit of a sextet or an octet represents a boolean value. Thus, a set of up to six boolean values can be packed into a sextet and eight boolean values can be packed into an octet. The boolean values are ordered from most significant bit to least significant bit. Thus, a single boolean value uses the most significant bit.
The Boolean field holds a single boolean value. The Boolean set field holds a set of one or more Boolean values. Both are encoded with sextets.
BooleanField: & Sextet BooleanSetField: & Sextet BooleanSetField Sextet
To test whether a boolean value is set, the sextet containing the bit is extracted as a whole number and left shifted 2 + index mod 6 bits.
If the resulting byte is negative, the value true is returned; otherwise false is returned.
To set, clear or toggle a boolean value, the sextet is first converted to a whole number.
The constant 32 is right shifted index mod 6 bits and subtracted from 63 to create a mask.
To clear the bit and the whole number with the mask and reencode the result as a sextet.
To set the bit or the whole number with the right shifted index mod 6 and reencode the result as a sextet.
To toggle the bit exclusive or the whole number with the right shifted index mod 6 and reencode the result as a sextet.
A Boolean field uses only one of 6 bits to hold a single Boolean value. Eight fields would be needed for eight values in a column. However, using a Boolean array for a column of Boolean values uses all eight bits of an octet thereby achieving an 8 to 1 compression of data. The number of octets for the array is the dimension's whole number + 7 right shift 3. Thus, 1-8 values store in one octet; 9-16 values store in two octets, etc.
BooleanArray: ArrayDimension & BooleanArray octet
With a two dimensional array of octets, the first dimension declares how many sets there are and the second dimension is the number of elements per set. Since eight elements can be packed into an octet, the number of octets per set is the second dimension + 7 right shift 3. The product of this result and the first dimension's whole number is the total number of octets in the array.
BooleanSetArray: ArrayDimension ArrayDimension & BooleanSetArray octet
Wheras records are ideal for working with heterogeneous data, arrays are better suited to homogeneous data. In a relational database first normal form can be enforced by restricting each column to a single usage indicator. This means the elements of a column can be implemented as an array. Accessing an entire column of data is highly desirable in signal processing applications such as digital filters or Fourier Transforms where time dependent data can span a large number of records.
For example, a column of a table can be any of the following arrays.
Column: WholeNumberArray Enumerationopt IntegerArray VarCharArray RealArray BooleanArray BooleanSetArray Enumeration: VarCharArray Collation Collation: WholeNumberArray
The following table illustrates how usage and subrange indicators affect the succeeding sextet.
Sextet + [ = - ' '! '" '$ '% # &
0 0 0 0 NUL U+0080 U+1080 U+41080 +00000 FFFFFF
1 1 1 1 SOH U+00C0 U+2080 U+81080 +00001 FFFFFT
2 2 2 2 STX U+0100 U+3080 U+C1080 +00010 FFFFTF
3 3 3 3 ETX U+0140 U+4080 U+101080 +00011 FFFFTT
4 4 4 4 EOT U+0180 U+5080 Illegal +00100 FFFTFF
5 5 5 5 ENQ U+01C0 U+6080 Illegal +00101 FFFTFT
6 6 6 6 ACK U+0200 U+7080 Illegal +00110 FFFTTF
7 7 7 7 BEL U+0240 U+8080 Illegal +00111 FFFTTT
8 8 8 8 BS U+0280 U+9080 Illegal +01000 FFTFFF
9 9 9 9 HT U+02C0 U+A080 Illegal +01001 FFTFFT
A 10 10 A LF U+0300 U+B080 Illegal +01010 FFTFTF
B 11 11 B VT U+0340 U+C080 Illegal +01011 FFTFTT
C 12 12 C FF U+0380 U+D080 Illegal +01100 FFTTFF
D 13 13 D CR U+03C0 U+E080 Illegal +01101 FFTTFT
E 14 14 E SO U+0400 U+F080 Illegal +01110 FFTTTF
F 15 15 F SI U+0440 U+10080 Illegal +01111 FFTTTT
G 16 16 G DLE U+0480 U+11080 Illegal +10000 FTFFFF
H 17 17 H DC1 U+04C0 U+12080 Illegal +10001 FTFFFT
I 18 18 I DC2 U+0500 U+13080 Illegal +10010 FTFFTF
J 19 19 J DC3 U+0540 U+14080 Illegal +10011 FTFFTT
K 20 20 K DC4 U+0580 U+15080 Illegal +10100 FTFTFF
L 21 21 L NAK U+05C0 U+16080 Illegal +10101 FTFTFT
M 22 22 M SYN U+0600 U+17080 Illegal +10110 FTFTTF
N 23 23 N ETB U+0640 U+18080 Illegal +10111 FTFTTT
O 24 24 O CAN U+0680 U+19080 Illegal +11000 FTTFFF
P 25 25 P EM U+06C0 U+1A080 Illegal +11001 FTTFFT
Q 26 26 Q SUB U+0700 U+1B080 Illegal +11010 FTTFTF
R 27 27 R ESC U+0740 U+1C080 Illegal +11011 FTTFTT
S 28 28 S FS U+0780 U+1D080 Illegal +11100 FTTTFF
T 29 29 T GS U+07C0 U+1E080 Illegal +11101 FTTTFT
U 30 30 U RS U+0800 U+1F080 Illegal +11110 FTTTTF
V 31 31 V US U+0840 U+20080 Illegal +11111 FTTTTT
W 32 -32 W space U+0880 U+21080 Illegal -00000 TFFFFF
X 33 -31 X ! U+08C0 U+22080 Illegal -00001 TFFFFT
Y 34 -30 Y " U+0900 U+23080 Illegal -00010 TFFFTF
Z 35 -29 Z # U+0940 U+24080 Illegal -00011 TFFFTT
^ 36 -28 ^ $ U+0980 U+25080 Illegal -00100 TFFTFF
_ 37 -27 _ % U+09C0 U+26080 Illegal -00101 TFFTFT
a 38 -26 a & U+0A00 U+27080 Illegal -00110 TFFTTF
b 39 -25 b ' U+0A40 U+28080 Illegal -00111 TFFTTT
c 40 -24 c ( U+0A80 U+29080 Illegal -01000 TFTFFF
d 41 -23 d ) U+0AC0 U+2A080 Illegal -01001 TFTFFT
e 42 -22 e * U+0B00 U+2B080 Illegal -01010 TFTFTF
f 43 -21 f + U+0B40 U+2C080 Illegal -01011 TFTFTT
g 44 -20 g , U+0B80 U+2D080 Illegal -01100 TFTTFF
h 45 -19 h - U+0BC0 U+2E080 Illegal -01101 TFTTFT
i 46 -18 i . U+0C00 U+2F080 Illegal -01110 TFTTTF
j 47 -17 j / U+0C40 U+30080 Illegal -01111 TFTTTT
k 48 -16 k : U+0C80 U+31080 Illegal -10000 TTFFFF
l 49 -15 l ; U+0CC0 U+32080 Illegal -10001 TFFFFT
m 50 -14 m < U+0D00 U+33080 Illegal -10010 TTFFTF
n 51 -13 n = U+0D40 U+34080 Illegal -10011 TTFFTT
o 52 -12 o > U+0D80 U+35080 Illegal -10100 TTFTFF
p 53 -11 p ? U+0DC0 U+36080 Illegal -10101 TTFTFT
q 54 -10 q @ U+0E00 U+37080 Illegal -10110 TTFTTF
r 55 -9 r [ U+0E40 U+38080 Illegal -10111 TTFTTT
s 56 -8 s \ U+0E80 U+39080 Illegal -11000 TTTFFF
t 57 -7 t ] U+0EC0 U+3A080 Illegal -11001 TTTFFT
u 58 -6 u ` U+F000 U+3B080 Illegal -11010 TTTFTF
v 59 -5 v { U+0F40 U+3C080 Illegal -11011 TTTFTT
w 60 -4 w | U+0F80 U+3D080 Illegal -11100 TTTTFF
x 61 -3 x } U+0FC0 U+3E080 Illegal -11101 TTTTFT
y 62 -2 y ~ U+1000 U+3F080 Illegal -11110 TTTTTF
z 63 -1 z del U+1040 U+40080 Illegal -11111 TTTTTT