H3 serialization is used in Baratine for document storage. Its requirements are:
- Must be searchable for document queries
- Reasonably fast and compact
- Supports Java serialization
- Supports self-described, inline schema
- Allows graphs but does not require them
- Allows for predefined schema
- Reasonably simple, avoiding many exception cases
0x00 - 0x7f   # integer
0x80 - 0xbf   # utf-8 strings
0xc0 - 0xcf   # binary data
0xd0          # object definition
0xd1 - 0xef   # objects
0xf0          # null
0xf1          # false
0xf2          # true
0xf3          # 64-bit double
0xf4          # 32-bit float
0xf5          # chunked string
0xf6          # chunked binary
0xf7          # back reference for graphs
0xf8          # graph reference for next object
0xf9          # graph references for all following objects
0xfa - 0xfe   # reserved
0xff          # error/invalid (reserved)
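The leading-byte ranges above map naturally onto a dispatch function. The following is a minimal sketch, not Baratine's actual reader: the class name `H3LeadByte`, the method `kindOf`, and the returned labels are hypothetical and chosen only for illustration.

```java
/** Hypothetical helper: classifies an H3 leading byte by the ranges listed above. */
final class H3LeadByte {
    /** Returns a short label for the kind of value that a leading byte introduces. */
    static String kindOf(int b) {
        b &= 0xff;                        // treat the byte as unsigned
        if (b <= 0x7f) return "integer";
        if (b <= 0xbf) return "string";
        if (b <= 0xcf) return "binary";
        if (b == 0xd0) return "object definition";
        if (b <= 0xef) return "object";
        switch (b) {
            case 0xf0: return "null";
            case 0xf1: return "false";
            case 0xf2: return "true";
            case 0xf3: return "double";
            case 0xf4: return "float";
            case 0xf5: return "chunked string";
            case 0xf6: return "chunked binary";
            case 0xf7: return "back reference";
            case 0xf8: return "graph reference (next object)";
            case 0xf9: return "graph references (all following objects)";
            case 0xff: return "invalid";
            default:   return "reserved"; // 0xfa - 0xfe
        }
    }
}
```

For example, `H3LeadByte.kindOf(0x85)` yields `"string"` and `H3LeadByte.kindOf(0xd9)` yields `"object"`.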
Integers use a variable-length encoding with the low-order chunks first. In each byte, the high bit is set if additional bytes are required.
For a value that starts in one of the pre-defined ranges, the continuation flag of the first byte is the high bit of that range rather than 0x80. For integers (0x00 - 0x7f) that bit is 0x40, so 0x00 is the final byte of an integer, while 0x40 requires additional bytes.
0x80            # zero-character string
0x81 'a'        # one-character string
0xb2 0x01 ...   # 18-character string (1 * 16 + 2)
Signed integers use zig-zag encoding. The low-order bit is the sign bit, so non-negative and negative values alternate: 0, -1, 1, -2, and so on.
0x00        # integer 0
0x01        # integer -1
0x02        # integer 1
0x03        # integer -2
0x40 0x01   # integer 32 (1 * 0x40 + 0 converted using zig-zag)
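The integer examples above can be reproduced with a short decoder. This is a sketch under stated assumptions rather than Baratine's implementation: the lead byte is taken to carry six data bits with 0x40 as its continuation flag (as the 0x40 0x01 example implies), and every following byte is assumed to carry seven data bits with 0x80 as its flag. The class and method names are hypothetical.

```java
/** Hypothetical sketch of H3 integer decoding: variable length plus zig-zag. */
final class H3IntSketch {
    /** Maps a signed value to its zig-zag form (sign in the lowest bit). */
    static long zigZagEncode(long value) {
        return (value << 1) ^ (value >> 63);
    }

    /** Reverses zig-zag: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ... */
    static long zigZagDecode(long raw) {
        return (raw >>> 1) ^ -(raw & 1);
    }

    /** Decodes one integer whose lead byte is buf[0]. */
    static long decodeInt(byte[] buf) {
        int first = buf[0] & 0xff;
        if (first > 0x7f) {
            throw new IllegalArgumentException("not an integer lead byte");
        }
        long raw = first & 0x3f;              // six data bits in the lead byte
        boolean more = (first & 0x40) != 0;   // high bit of the integer range
        int shift = 6;
        int i = 1;
        while (more) {
            if (shift > 62) {
                throw new IllegalArgumentException("integer too long");
            }
            int b = buf[i++] & 0xff;
            // ASSUMPTION: continuation bytes hold 7 data bits, 0x80 = more
            raw |= (long) (b & 0x7f) << shift;
            more = (b & 0x80) != 0;
            shift += 7;
        }
        return zigZagDecode(raw);
    }

    public static void main(String[] args) {
        System.out.println(decodeInt(new byte[] {0x03}));        // -2
        System.out.println(decodeInt(new byte[] {0x40, 0x01}));  // 32
    }
}
```

In the last case the raw value is 1 * 0x40 + 0 = 64, which zig-zag decoding turns into 32.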
Objects encode their type followed by their data. There are currently four object types:
class   # fixed named fields
list    # variable-length data
map     # key, value pairs
enum    # enumeration using integers to select the value
Object definitions are used to self-describe the schema. The format looks like:
class def
---------
uint     # object id
string   # type name (java class name)
uint     # type (class, list, map, enum)
uint     # number of fields

field def
---------
string   # name
uint     # type (must be 1)
Class-typed objects are encoded as a reference to their object definition followed by the field values. The field values appear in the same order as the fields in the class definition.
0xe1 0x04 0x85 hello # def 65 field 1="hello"
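The definition id in this example can be recovered from the leading bytes. The sketch below is an inference from the examples in this document, not a confirmed part of the format: it assumes the object range is based at 0xd0 with four data bits and 0x10 as the continuation flag, so that 0xe1 0x04 gives 1 + 4 * 16 = 65, and that following bytes carry seven data bits with 0x80 as their flag. The names `H3ObjectRefSketch` and `decodeDefId` are hypothetical.

```java
/** Hypothetical sketch: extracts the definition id from an object lead byte. */
final class H3ObjectRefSketch {
    /** Decodes the definition id of an object whose lead byte is buf[0]. */
    static int decodeDefId(byte[] buf) {
        int first = buf[0] & 0xff;
        if (first < 0xd1 || first > 0xef) {
            throw new IllegalArgumentException("not an object lead byte");
        }
        int offset = first - 0xd0;
        int id = offset & 0x0f;                 // ASSUMPTION: four data bits
        boolean more = (offset & 0x10) != 0;    // ASSUMPTION: 0x10 = more bytes
        int shift = 4;
        int i = 1;
        while (more) {
            if (shift >= 25) {
                throw new IllegalArgumentException("definition id too long");
            }
            int b = buf[i++] & 0xff;
            // ASSUMPTION: continuation bytes hold 7 data bits, 0x80 = more
            id |= (b & 0x7f) << shift;
            more = (b & 0x80) != 0;
            shift += 7;
        }
        return id;
    }
}
```

Under these assumptions, `decodeDefId(new byte[] {(byte) 0xe1, 0x04})` returns 65, and a single byte 0xd9 decodes to definition 9.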
Lists are variable length and composed of chunks. Each chunk has a length integer, where the lowest bit is a ‘more’ flag. If the flag is non-zero, more chunks follow.
0xd9 0x02 0x85 hello # def 9 (list), length=1, more=0, string "hello"
Maps are variable length and composed of chunks. Each chunk has a length integer, where the lowest bit is a ‘more’ flag. If the flag is non-zero, more chunks follow.
0xd8 0x02 0x83 key 0x85 hello   # def 8 (map), length=1, more=0
                                # key "key", value "hello"
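The chunk length integer used by lists and maps splits into a count and a 'more' flag. The following sketch assumes the header has already been read as an unsigned value, which matches the examples where 0x02 means length=1, more=0. The class `H3ChunkHeader` and its members are hypothetical names.

```java
/** Hypothetical sketch: splits a list/map chunk header into length and 'more'. */
final class H3ChunkHeader {
    final long length;   // number of entries in this chunk
    final boolean more;  // true if another chunk follows

    private H3ChunkHeader(long length, boolean more) {
        this.length = length;
        this.more = more;
    }

    /** Parses an already-decoded unsigned chunk header value. */
    static H3ChunkHeader parse(long header) {
        if (header < 0) {
            throw new IllegalArgumentException("header must be unsigned");
        }
        return new H3ChunkHeader(header >>> 1, (header & 1) != 0);
    }

    /** Builds the header value for a chunk, the inverse of parse. */
    static long encode(long length, boolean more) {
        return (length << 1) | (more ? 1 : 0);
    }
}
```

A reader would loop, consuming `length` entries per chunk, until it sees a header whose `more` flag is false.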