Encoding Specification
This document contains the full specification of the typed binary format TIER. This specification contains the purpose of the format, precise binary layout and example encodings.
The TIER format is split into two different domains: structural data and metatype data. The 'structural data' encoding represents an application value encoded using a set of common serial data structures. In this document this 'structural data' can be referred to as encoded value, structural data or simply data. The 'metatype data' encoding is a typing representation that describes which serial data structures are used and can contain additional semantic information about these data structures. In this document this typing information can be referred to as metadata, metatype or typing information.
The 'metatype data' encoding is composed of a stream of special tags, each of which is a small numeric value that uniquely represents a specific encoding. These can be primitive types (numbers, strings, booleans, etc.), data structures (tuples, lists, maps, unions), or types that carry additional semantic or behavioral information. Each possible structural data structure has exactly one valid metatype representation. There are two primary reasons for this:
- Make it easy to check if two encoded values are structurally equivalent and can be treated as the same or automatically converted from one to another.
- Be able to dynamically generate an optimized encoding implementation for each new type (metadata) found, and reuse it for other structurally compatible types.
In the remainder of this text we describe each 'metadata' tag defined and how the 'structural' data (serial data structure) it describes is encoded and should be interpreted.
Tag values are represented in capital letters, like VOID, BIT, VARINT. However, each of them is only a byte containing a small number. Other values are represented in the form type<value>, where 'type' indicates the type of the value and 'value' is a textual representation of the value. For example:
123 encoded as a Varint: varint<123>
"text" encoded as String: string<"text">
true encoded as Boolean: boolean<true>
This first set of tags is mainly used to describe the structure of the encoded data; however, these tags can also imply default semantics for the data. Such default semantics should be considered carefully when defining how values described by these tags are mapped to the concepts of the application programming language. After the description of each tag we present a schema indicating the use of the tag and how the values encoded with it are structured in the encoded stream.
This tag indicates that the data encoded does not occupy any space, that is, the encoding is empty. Usually this tag is used to describe data that has only one possible value, like 'nil' in Lua.
Tag: VOID == 0x00
Metadata: VOID
Encoded: nothing
This tag indicates that the data is a number encoded as defined by Google's Protocol Buffer Unsigned VarInt. The VARINT tag has an implicit alignment of ALIGN 0x00. (see the ALIGN tag for details)
Tag: VARINT == 0x02
Metadata: VARINT
Encoded: varint<value>
Value = 1
Metadata = VARINT
Encoded = 0x01
Value = 127
Metadata = VARINT
Encoded = 0x7F
Value = 128
Metadata = VARINT
Encoded = 0x80 0x01
Value = 255
Metadata = VARINT
Encoded = 0xFF 0x01
Value = 256
Metadata = VARINT
Encoded = 0x80 0x02
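The varint examples above can be reproduced with a short Python sketch. The function names `encode_varint` and `decode_varint` are hypothetical and not part of TIER; the logic follows the unsigned varint scheme of Protocol Buffers (seven payload bits per byte, least significant group first, high bit set when more bytes follow).

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint."""
    if value < 0:
        raise ValueError("varint values must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low seven bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next_position) for the varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


for number in (1, 127, 128, 255, 256):
    print(number, encode_varint(number).hex(" "))
```

Returning the next position from the decoder makes it easy to chain decoders when several values follow each other in a stream.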
This tag indicates that the data encoded represents an unsigned numeric integer value. The UINT tag is directly followed by the number of bits used by the numeric value. Applications are required to support all bit sizes up to 64 bits.
Tag: UINT == 0x09
Metadata: UINT varint<bitsize>
Encoded: uint(bitsize)<value>
Value = 15
Metadata = UINT 0x08 -- unsigned 8-bit integer [0 .. 0xFF]
Encoded = 0x0F
Value = 24
Metadata = UINT 0x10 -- unsigned 16-bit integer [0 .. 0xFFFF]
Encoded = 0x18 0x00
Value = 10
Metadata = UINT 0x04 -- unsigned 4-bit integer [0 .. 0xF]
Encoded = 0xA -- only four bits have been written
This tag indicates that the data encoded represents a signed numeric integer value. The SINT tag is directly followed by the number of bits used by the numeric value. Signed numbers are stored in two's complement format.
--Tag value
Tag: SINT == 0x0a
Metadata: SINT varint<bitsize>
Encoded: sint(bitsize)<value>
Value = -1
Metadata = SINT 0x08 -- signed 8-bit integer [-128, 127]
Encoded = 0xFF
Value = 10
Metadata = SINT 0x08 -- signed 8-bit integer [-128, 127]
Encoded = 0x0A
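As a minimal sketch of the two's complement rule, the following Python function encodes a signed value for bit sizes that are a multiple of eight. The name `encode_sint` is hypothetical, and the little-endian byte order is an assumption taken from the 16-bit UINT example in this document; sub-byte bit sizes are not handled.

```python
def encode_sint(value: int, bitsize: int) -> bytes:
    """Encode a signed integer in two's complement (sketch, whole bytes only)."""
    if bitsize <= 0 or bitsize % 8 != 0:
        raise ValueError("this sketch only supports bit sizes that are multiples of 8")
    lowest = -(1 << (bitsize - 1))
    highest = (1 << (bitsize - 1)) - 1
    if not lowest <= value <= highest:
        raise OverflowError(f"{value} does not fit in {bitsize} bits")
    # Masking maps negative values onto their two's complement bit pattern.
    pattern = value & ((1 << bitsize) - 1)
    # Assumption: little-endian byte order.
    return pattern.to_bytes(bitsize // 8, "little")


print(encode_sint(10, 8).hex())
print(encode_sint(-128, 8).hex())
```

Masking with `(1 << bitsize) - 1` is what turns a negative Python integer into its fixed-width bit pattern.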
This tag indicates that the data is encoded with an alignment. The alignment is relative to the start of the stream, not to an absolute address, so it is not guaranteed that the alignment specified with this tag is the absolute alignment. Applications must keep this in mind when creating TIER streams with alignment. Padding bytes are not required to have any particular value.
Tag: ALIGN == 0x11
Metadata: ALIGN varint<alignment> type<T>
Encoded: stream<padding> T<value>
The ALIGN tag has a special relationship to tags that work on the bit level. These tags can potentially make the stream fall outside of byte alignment. For example, UINT 0x06 will only use 6 out of the 8 bits that a byte contains. If this is the case, implementations must skip the two unused bits that are left in the byte and align from the next byte position in the stream. That is, the ALIGN tag ensures that the stream is both bit aligned to at least 8 and byte aligned to the specified alignment. Thus ALIGN 0x00 type<T> ensures that we are aligned to a byte boundary.
Original Stream: 0x00
Value = 15
Metadata = ALIGN 0x00 VARINT
Encoded = 0x00 -- Original stream
0x0F -- value 15 aligned to 0x00
Value = 15
Metadata = ALIGN 0x04 VARINT
Encoded = 0x00 -- Original stream
0x?? -- This can be any value
0x??
0x??
0x0F -- value 15 aligned to 0x04 relative to the stream
Value = 15 13
Metadata = UINT 0x04 ALIGN 0x00 UINT 0x04
Encoded = 0x0F -- uint4<15>
0x0D -- ALIGN 0x00 increments stream byte position
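The byte-level part of the alignment rule can be sketched in Python as follows. The helper names `padding_for` and `write_aligned` are hypothetical, and the sketch assumes the stream is already on a byte boundary (bit-level skipping is not modeled). The padding byte defaults to 0x00 only for illustration, since the specification allows any value.

```python
def padding_for(position: int, alignment: int) -> int:
    """Number of padding bytes needed so that `position` becomes a multiple of `alignment`.

    `position` is the byte offset from the start of the stream.
    """
    if alignment <= 1:
        return 0  # ALIGN 0x00 and ALIGN 0x01 need no byte padding
    return -position % alignment


def write_aligned(stream: bytearray, alignment: int, payload: bytes, pad_byte: int = 0x00) -> None:
    """Append padding (arbitrary value, 0x00 here) and then the payload."""
    stream.extend(bytes([pad_byte]) * padding_for(len(stream), alignment))
    stream.extend(payload)


stream = bytearray(b"\x00")           # original stream of one byte
write_aligned(stream, 4, b"\x0f")     # varint<15> aligned to 4
print(stream.hex(" "))
```

With a one-byte original stream and an alignment of 4, three padding bytes are written before the value, matching the second example above.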
This tag indicates that the data is encoded as a sequence of a variable number of values of the same type, preceded by a numeric length value. The number of bits of the numeric length value is included in the metadata. If the number of bits to use is 0x00, then a varint value is used to store the length.
Tag: LIST == 0x0e
Metadata: LIST varint<bitsize> type<T>
Encoded : varint<N> T<item1> ... T<itemN>
Value = {1}
Metadata = LIST 0x00 VARINT
Encoded = 0x01 -- varint<N=1>
0x01 -- varint<1>
Value = {1}
Metadata = LIST 0x08 VARINT
Encoded = 0x01 -- uint8<N=1>
0x01 -- varint<1>
Value = {1,127,128,255,256}
Metadata = LIST 0x00 VARINT
Encoded = 0x05 -- varint<N=5>
0x01 -- varint<1>
0x7F -- varint<127>
0x80 0x01 -- varint<128>
0xFF 0x01 -- varint<255>
0x80 0x02 -- varint<256>
Value = {1, 0, 1, 0, 1}
Metadata = LIST 0x00 UINT 0x01
Encoded = 0x05 -- varint<N=5>
0x0A -- uint1<1> uint1<0> uint1<1> uint1<0> uint1<1>
Value = {1, 0, 1, 0, 1}
Metadata = LIST 0x10 UINT 0x01
Encoded = 0x05 0x00 -- uint16<N=5>
0x0A -- uint1<1> uint1<0> uint1<1> uint1<0> uint1<1>
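A minimal Python sketch of the LIST encoding for varint items is shown below. The names `encode_varint` and `encode_varint_list` are hypothetical; only the length bit sizes 0x00 (varint length) and 0x08 (one-byte length) are handled, and bit-packed item types such as UINT 0x01 are left out.

```python
def encode_varint(value: int) -> bytes:
    """Unsigned varint, seven bits per byte, least significant group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_varint_list(items: list[int], length_bitsize: int = 0) -> bytes:
    """Encode items as 'LIST <length_bitsize> VARINT' data (sketch)."""
    if length_bitsize == 0:
        header = encode_varint(len(items))      # varint<N>
    elif length_bitsize == 8:
        header = len(items).to_bytes(1, "little")  # uint8<N>, fails if N > 255
    else:
        raise NotImplementedError("only bit sizes 0x00 and 0x08 are sketched here")
    return header + b"".join(encode_varint(item) for item in items)


print(encode_varint_list([1, 127, 128, 255, 256]).hex(" "))
```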
This tag indicates that the data is encoded as a sequence of a predefined and fixed number of values of the same type. This fixed number of values is encoded as a varint just after the tag. The type of these values must be described immediately after the varint.
Tag: ARRAY == 0x0b
Metadata: ARRAY varint<N> type<T>
Encoded : T<item1> ... T<itemN>
Value = {1}
Metadata = ARRAY 01 VARINT
Encoded = 01 -- varint<1>
Value = {1,127,128,255,256}
Metadata = ARRAY 05 VARINT
Encoded = 01 -- varint<1>
7F -- varint<127>
80 01 -- varint<128>
FF 01 -- varint<255>
80 02 -- varint<256>
This tag indicates that the data is encoded as a sequence of values, each of a predefined type. The number of these predefined types must be described immediately after the tag as a varint. Immediately after that must come the descriptions of these types, in the order in which the data is encoded.
Tag: TUPLE == 0x0c
Metadata: TUPLE varint<N> type<T1> ... type<TN>
Encoded : T1<item1> ... TN<itemN>
Value = (12, 23, 127)
Metadata = TUPLE 03 UINT 0x08 UINT 0x10 VARINT
Encoded = 0x0C -- uint-8<12>
0x17 0x00 -- uint-16<23>
7F -- varint<127>
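Because a tuple carries no length or separators in the encoded data, encoding it amounts to concatenating the encodings of its members in declaration order. The Python sketch below illustrates this for a tuple of UINT 0x08, UINT 0x10 and VARINT. All function names are hypothetical, and the little-endian byte order for the 16-bit member is an assumption based on the examples in this document.

```python
def encode_varint(value: int) -> bytes:
    """Unsigned varint, seven bits per byte, least significant group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_uint8(value: int) -> bytes:
    return value.to_bytes(1, "little")


def encode_uint16(value: int) -> bytes:
    # Assumption: little-endian byte order.
    return value.to_bytes(2, "little")


def encode_tuple(values, encoders) -> bytes:
    """Concatenate each member encoded with the encoder of its declared type."""
    if len(values) != len(encoders):
        raise ValueError("a tuple must have exactly one value per declared type")
    return b"".join(encode(value) for encode, value in zip(encoders, values))


# TUPLE 03 UINT 0x08 UINT 0x10 VARINT
print(encode_tuple((12, 23, 127), (encode_uint8, encode_uint16, encode_varint)).hex(" "))
```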
This tag indicates that the data is encoded as a value from a list of predefined types. The encoded value is preceded by a numeric value indicating which type from the list it was encoded with. The kind of numeric value used depends on a bitsize parameter that comes directly after the UNION tag. This is followed by the number of distinct types in the list, and then by the predefined types themselves. Unions are not required to be lexicographically ordered, so unions that encode the same set of types can differ from each other and are then not compatible. Lexicographically ordered unions might become a requirement in a future version of TIER, so implementers are recommended to order their unions lexicographically where possible.
Tag: UNION == 0x0d
Metadata: UNION varint<bitsize> varint<N> type<T1> ... type<TN>
Encoded : varint<X> TX<value> or uint_bitsize<X> TX<value> # X = [0..N-1]
Value = 23 (as uint8)
Metadata = UNION 0x00 02 UINT 0x08 VARINT
Encoded = 00 -- varint<0>
17 -- uint8<23>
Value = 127 (as varint)
Metadata = UNION 0x00 02 SINT 0x08 VARINT
Encoded = 01 -- varint<1>
7F -- varint<127>
Value = empty (void)
Metadata = UNION 0x00 02 VOID VARINT
Encoded = 00 -- varint<0>
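Decoding a union can be sketched as reading the selector and then dispatching to the decoder of the selected type. The Python code below assumes a bitsize of 0x00, so the selector is a varint, as in the examples above. The function names are hypothetical, and each decoder returns a `(value, next_position)` pair.

```python
def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next_position) for the varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_uint8(data: bytes, pos: int = 0):
    return data[pos], pos + 1


def decode_void(data: bytes, pos: int = 0):
    return None, pos  # VOID occupies no space in the stream


def decode_union(data: bytes, decoders, pos: int = 0):
    """Return (selected_index, value, next_position) for a varint-selected union."""
    index, pos = decode_varint(data, pos)
    if index >= len(decoders):
        raise ValueError(f"union selector {index} is out of range")
    value, pos = decoders[index](data, pos)
    return index, value, pos


# UNION 0x00 02 VOID VARINT with the VOID alternative selected
print(decode_union(bytes([0x00]), [decode_void, decode_varint]))
```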
The TYPE tag indicates that the data encoded is a type description formed by the tags described here. The TYPEREF tag does not describe encoded data; it only indicates that the type of the encoded data was previously described in the same type description. It is followed by a varint indicating the negative offset (in bytes) to the description of the type, with zero being the byte just before the tag.
Tag: TYPE == 0x06
Metadata: TYPE
Encoded: a metadata type
Tag: TYPEREF == 0x07
Metadata: TYPEREF varint<byteoffset>
Encoded: encoded value of the type typeref points to
Value = "union LinkedListNode { LinkedListNode, void };"
Metadata = TYPE
Encoded = UNION
02 -- varint<2>
TYPEREF
02 -- varint<2> # offset of two bytes before TYPEREF
VOID
This tag indicates that the data is encoded preceded by the description of its type.
Tag: DYNAMIC == 0x08
Metadata: DYNAMIC
Encoded : type<T> T<value>
Value = 32 (as uint8)
Metadata = DYNAMIC
Encoded = UINT 0x08 -- type
0x20 -- uint8<32>
Value = 127 (as varint)
Metadata = DYNAMIC
Encoded = VARINT -- type
7F -- varint<127>
Value = {1,127,128,255,256} (as list of varint)
Metadata = DYNAMIC
Encoded = LIST 0x00 VARINT -- type
05 -- varint<N=5>
01 -- varint<1>
7F -- varint<127>
80 01 -- varint<128>
FF 01 -- varint<255>
80 02 -- varint<256>
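A DYNAMIC value can be decoded in two steps: parse the inline type description into a decoder, then apply that decoder to the bytes that follow. The Python sketch below does this for a small subset only (VARINT, UINT 0x08, and LIST 0x00 of a supported type), using the tag values given in this document. The function names are hypothetical.

```python
VARINT, UINT, LIST = 0x02, 0x09, 0x0E  # tag values from this specification


def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next_position) for the varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_uint8(data: bytes, pos: int):
    return data[pos], pos + 1


def parse_type(data: bytes, pos: int):
    """Parse a type description; return (decoder, next_position)."""
    tag = data[pos]
    pos += 1
    if tag == VARINT:
        return decode_varint, pos
    if tag == UINT:
        bitsize, pos = decode_varint(data, pos)
        if bitsize != 8:
            raise NotImplementedError("only UINT 0x08 is sketched here")
        return decode_uint8, pos
    if tag == LIST:
        bitsize, pos = decode_varint(data, pos)
        if bitsize != 0:
            raise NotImplementedError("only varint list lengths are sketched here")
        item_decoder, pos = parse_type(data, pos)

        def decode_list(d: bytes, p: int):
            count, p = decode_varint(d, p)
            items = []
            for _ in range(count):
                item, p = item_decoder(d, p)
                items.append(item)
            return items, p

        return decode_list, pos
    raise NotImplementedError(f"tag 0x{tag:02x} is not covered by this sketch")


def decode_dynamic(data: bytes, pos: int = 0):
    """Decode 'type<T> T<value>' and return (value, next_position)."""
    decoder, pos = parse_type(data, pos)
    return decoder(data, pos)


print(decode_dynamic(bytes.fromhex("0e000205017f8001ff018002")))
```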
This tag indicates that the encoded data has a unique identity of its own (selfness) associated with it in this stream. As a result, the same value might appear again in the stream and shall be decoded as the same value, not as another value with identical contents (a copy of it). After the tag there must be the description of the type of the data encoded.
The value is encoded as a varint indicating if the value is encoded for the very first time in the stream (varint is 0) or if the value is reappearing. In the latter case the varint is a negative offset to the first byte of the first (and only) serialization of the value.
Tag: OBJECT == 0x12
Metadata: OBJECT type<T>
Encoded : varint<offset> (offset == 0 ? T<value> : <empty>)
Value = A=(1,B) B=(2,C) C=(3,A)
Metadata = OBJECT TUPLE 02 VARINT TYPEREF 04
Encoded = 00 -- varint<0> # offset zero, value follows
01 -- varint<1> # 1st value of tuple 'A'
00 -- varint<0> # offset zero, value follows
02 -- varint<2> # 1st value of tuple 'B'
00 -- varint<0> # offset zero, value follows
03 -- varint<3> # 1st value of tuple 'C'
06 -- varint<6> # offset of six bytes before
This tag indicates that the data is encoded preceded by a varint containing the size of the encoded representation of the data. Data encoded this way can be ignored or partially decoded without compromising the decoding of the remainder of the stream. The tag is followed by the description of the type of the data encoded.
Tag: EMBEDDED == 0x13
Metadata: EMBEDDED type<T>
Encoded : varint<length> T<value>
Value = {1,127,128,255,256}
Metadata = EMBEDDED LIST 0x00 VARINT
Encoded = 09 -- varint<9> # encoded length
05 -- varint<N=5>
01 -- varint<1>
7F -- varint<127>
80 01 -- varint<128>
FF 01 -- varint<255>
80 02 -- varint<256>
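The length prefix is what allows a reader to skip embedded data without understanding its type. A minimal Python sketch, with hypothetical function names, could look like this:

```python
def encode_varint(value: int) -> bytes:
    """Unsigned varint, seven bits per byte, least significant group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next_position) for the varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_embedded(payload: bytes) -> bytes:
    """Prefix already-encoded data with its length in bytes."""
    return encode_varint(len(payload)) + payload


def skip_embedded(data: bytes, pos: int = 0) -> int:
    """Return the position just past an embedded value without decoding it."""
    length, pos = decode_varint(data, pos)
    return pos + length


encoded_list = bytes.fromhex("05017f8001ff018002")  # LIST 0x00 VARINT data
print(encode_embedded(encoded_list).hex(" "))
```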
The second set of tags is used to associate additional semantics with a type. These additional semantics might indicate a language-specific or application-specific concept (like a class) that shall realize the decoded value. For example, we may describe two new types with the same structure 'tag tag', where one describes a Lua table containing integer numbers and the other a Lua 'userdata' containing a C array of 'long'. These additional semantics should be described in each type representation. To do so we use the tags described in this section.
This tag indicates that the data encoded is associated with additional semantics indicated by an ID, which is encoded as a stream just after the tag. The actual type of the encoded data is described after this stream. A stream in this context is a varint length field followed by a string of bytes.
Tag: SEMANTIC == 0x14
Metadata: SEMANTIC stream<ID> type<T>
Encoded : T<value>
[Non-Standard Tags]
All tag values larger than 127 (a varint of 2 bytes) indicate that the data encoded is associated with additional semantics indicated by the tag value. The actual type of the encoded data is described after the tag value.
The purpose of these tags is similar to that of SEMANTIC, but they are intended for domain-specific extensions to this specification.
Metadata: varint< N>127 > type<T>
Encoded : T<value>
Some value semantics are very common in programming languages in general. Therefore we define some abbreviation for usual semantics associated with common structure used to represent them.
A value associated with null semantics. Some languages distinguish between a null value and the absence of a value (void), so this additional tag might be useful to make this distinction.
Tag: NULL == 0x01
Metadata: NULL = SEMANTIC "null" VOID
This tag represents a true or false value and is stored in a single byte. The value 0x00 is used for false and the value 0x01 is used for true.
Tag: BOOLEAN == 0x1b
Metadata: BOOLEAN = SEMANTIC "boolean" UINT 0x08
This tag represents a single-bit encoded value. If the bit is 1 the value represents a positive sign, and if the bit is 0 the value represents a negative sign.
Tag: SIGN == 0x16
Metadata: SIGN = SEMANTIC "sign" UINT 0x01
This tag represents a single bit boolean value. If the bit is 1 the value is true, if the bit is 0 the value is false.
Tag: FLAG == 0x15
Metadata: FLAG = SEMANTIC "flag" UINT 0x01
Tags for the common integer types present in many programming languages. These integers use the native endianness of the platform.
Tag: UINT8 == 0x1c
UINT8 = ALIGN 0x00 UINT 0x08
Tag: UINT16 == 0x1d
UINT16 = ALIGN 0x00 UINT 0x10
Tag: UINT32 == 0x1e
UINT32 = ALIGN 0x00 UINT 0x20
Tag: UINT64 == 0x1f
UINT64 = ALIGN 0x00 UINT 0x40
Tag: SINT8 == 0x20
SINT8 = ALIGN 0x00 SINT 0x08
Tag: SINT16 == 0x21
SINT16 = ALIGN 0x00 SINT 0x10
Tag: SINT32 == 0x22
SINT32 = ALIGN 0x00 SINT 0x20
Tag: SINT64 == 0x23
SINT64 = ALIGN 0x00 SINT 0x40
Tags to indicate characters encoded in a single byte (UTF-8) or in two bytes (UTF-16).
CHAR = SEMANTIC "UTF8" UINT8 # interpreted as character
WCHAR = SEMANTIC "UTF16" UINT16 # interpreted as wide-character of two-bytes
Tags to indicate byte streams and strings of simple or wide characters.
STREAM = LIST 0x00 UINT8
STRING = LIST 0x00 CHAR
WSTRING = LIST 0x00 WCHAR
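Since STRING is defined as LIST 0x00 CHAR, a string is encoded as a varint count of CHAR elements followed by the elements themselves. The Python sketch below assumes the text is stored as UTF-8, so the count is a number of bytes rather than a number of code points. The function names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Unsigned varint, seven bits per byte, least significant group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next_position) for the varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_string(text: str) -> bytes:
    """Encode text as STRING data: varint<N> followed by N one-byte CHAR values."""
    raw = text.encode("utf-8")
    return encode_varint(len(raw)) + raw


def decode_string(data: bytes, pos: int = 0):
    """Return (text, next_position)."""
    length, pos = decode_varint(data, pos)
    end = pos + length
    return data[pos:end].decode("utf-8"), end


print(encode_string("text").hex(" "))
```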
A tag to indicate a varint Zig-Zag as described by Google's Protocol Buffers.
VARINTZZ = SEMANTIC "Zig-Zag" VARINT # interpreted as varint zig-zag
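Zig-zag encoding maps signed integers to unsigned ones so that values of small magnitude stay small when written as a varint (0, -1, 1, -2 become 0, 1, 2, 3). A minimal Python sketch of the mapping, with hypothetical function names, is shown here; the result would then be written as an ordinary varint.

```python
def zigzag_encode(value: int) -> int:
    """Map a signed integer to a non-negative integer (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    if value >= 0:
        return value << 1
    return (-value << 1) - 1


def zigzag_decode(encoded: int) -> int:
    """Inverse of zigzag_encode."""
    return (encoded >> 1) ^ -(encoded & 1)


for number in (0, -1, 1, -2, 2):
    print(number, zigzag_encode(number))
```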
Tags to indicate a real number encoded in the floating-point format defined by IEEE 754, using the native endianness of the platform. The HALF and QUAD tags are currently not a required part of the format. They will, however, likely be required in a future version of the format, so implementors are recommended to implement them as well.
HALF = SEMANTIC "half" UINT16 # IEEE 754 half
SIMPLE = SEMANTIC "single" UINT32 # IEEE 754 single
DOUBLE = SEMANTIC "double" UINT64 # IEEE 754 double
QUAD = SEMANTIC "quad" ALIGN 0x00 UINT 128 # IEEE 754 quadruple
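The half, single and double formats can be produced with Python's standard `struct` module, as sketched below. The function name `encode_float` and its parameters are hypothetical. The byte order defaults to that of the running platform, in line with the native-endianness rule, and can be overridden for reproducible output. Quadruple precision is omitted because `struct` has no format code for it.

```python
import struct
import sys

# struct format codes for the IEEE 754 binary16, binary32 and binary64 formats
_FORMATS = {"half": "e", "single": "f", "double": "d"}


def encode_float(value: float, kind: str = "double", byteorder: str = sys.byteorder) -> bytes:
    """Encode a float as IEEE 754 bytes in the requested byte order."""
    if kind not in _FORMATS:
        raise ValueError(f"unsupported floating-point kind: {kind!r}")
    prefix = "<" if byteorder == "little" else ">"
    return struct.pack(prefix + _FORMATS[kind], value)


print(encode_float(1.0, "single", "little").hex(" "))
```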
A tag to indicate a list of values without repetition and where order is irrelevant.
SET varint<bitsz> type<T> = SEMANTIC "set" LIST varint<bitsz> type<T>
A tag to indicate a map of values of one type to values of another type, described as a list of pairs, each containing a key and its related value. The keys do not repeat and the order of the pairs is irrelevant.
MAP varint<bitsz> type<K> type <T> = SEMANTIC "map" LIST varint<bitsz> TUPLE 02 type<K> type <T>
####Alignments
There are some tags for common alignments.
ALIGN1 type<T> = ALIGN 0x01 type<T>
ALIGN2 type<T> = ALIGN 0x02 type<T>
ALIGN4 type<T> = ALIGN 0x04 type<T>
ALIGN8 type<T> = ALIGN 0x08 type<T>
###Typed Stream Layout
It is important for TIER applications to be able to parse any typed TIER stream without previous knowledge of the contents of the stream. This section specifies how the stream should be laid out in memory to make this possible.
The basic layout for encoding N different application values is the following:
-----------------------------------------------------------------------------------------------------
| MSIZE-0 | METADATA-0 | ENCODED VALUE-0 | ... | MSIZE-(N-1) | METADATA-(N-1) | ENCODED VALUE-(N-1) |
-----------------------------------------------------------------------------------------------------
Here MSIZE-M is a varint encoded length that represents the size of metadata for the Mth message. METADATA-M represents the metadata for the message and ENCODED VALUE-M is the actual encoded data for the message.
Example
--This is an example of different metadata encodings
Metadata = TUPLE 02 VARINT BOOLEAN
Encoded = 0x0c -- TUPLE
0x02 -- tuple has 2 members
0x02 -- VARINT
0x1b -- BOOLEAN
Metadata = ALIGN 0x00 VARINT
EXAMPLE
Value0 = 32
Metadata0 = UINT8
Value1 = { 10, true }
Metadata1 = TUPLE 02 SINT8 BOOLEAN
Stream = 0x01 -- Size of Metadata0
0x1c -- UINT8
0x20 -- 32 in hex
0x04 -- Size of Metadata1
0x0c -- TUPLE
0x02 -- tuple has 2 items
0x20 -- SINT8
0x1b -- BOOLEAN
0x0A -- 10 in hex
0x01 -- true
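The layout can be sketched in Python as a function that writes each message as its metadata size, its metadata and its encoded value. The function names are hypothetical, and the metadata and value bytes are assumed to have been encoded already.

```python
def encode_varint(value: int) -> bytes:
    """Unsigned varint, seven bits per byte, least significant group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def frame_messages(messages) -> bytes:
    """Build a typed stream from (metadata, encoded_value) byte pairs."""
    out = bytearray()
    for metadata, encoded_value in messages:
        out += encode_varint(len(metadata))  # MSIZE
        out += metadata                      # METADATA
        out += encoded_value                 # ENCODED VALUE
    return bytes(out)


stream = frame_messages([
    (bytes([0x1C]), bytes([0x20])),                          # UINT8, value 32
    (bytes([0x0C, 0x02, 0x20, 0x1B]), bytes([0x0A, 0x01])),  # TUPLE 02 SINT8 BOOLEAN, {10, true}
])
print(stream.hex(" "))
```

Note that only the metadata carries a size prefix. A reader has to decode each value according to its metadata to find where the next message begins.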
As we can see from the example the metadata is interleaved with the encoded data in the stream. Thus if you are only interested in the typing information in the stream you typically have to parse the entire stream. However, it is possible to place all metadata at the beginning of the stream by wrapping all encoded values in a tuple. If this is done only the first part of the stream needs to be checked to extract the metadata. However it is legal to append extra encoded values to the end of such a stream so there are no guarantees that just interpreting the first part of the stream will provide all the typing information in the stream.
Example
--An example showing how you can wrap metadata in a tuple
--to place all metadata at the beginning of the stream.
Value0 = 32
Metadata0 = UINT8
Value1 = { 10, true }
Metadata1 = TUPLE 02 SINT8 BOOLEAN
Value2 = -1
Metadata2 = SINT8
MetadataTuple = TUPLE 03
UINT8
TUPLE 02 SINT8 BOOLEAN
SINT8
Stream = 0x08 -- Size of MetadataTuple
0x0C -- TUPLE
0x03 -- wrapping 3 values
0x1C -- UINT8
0x0C -- TUPLE
0x02 -- tuple length
0x20 -- SINT8
0x1b -- BOOLEAN
0x20 -- SINT8
0x20 -- 32 in hex (uint8)
0x0a -- 10 in hex (sint8)
0x01 -- true (boolean)
0xFF -- -1 in hex (sint8)