H3 Serialization

Baratine uses H3 serialization for its document storage. The format has these requirements:

  • Searchable, to support document queries
  • Reasonably fast and compact
  • Supports Java serialization
  • Supports a self-describing, inline schema
  • Allows object graphs but does not require them
  • Allows predefined schemas
  • Reasonably simple, with few special cases

Byte Encoding

0x00 - 0x7f  # integer
0x80 - 0xbf  # utf-8 strings
0xc0 - 0xcf  # binary data
0xd0         # object definition
0xd1 - 0xef  # objects
0xf0         # null
0xf1         # false
0xf2         # true
0xf3         # 64-bit double
0xf4         # 32-bit float
0xf5         # chunked string
0xf6         # chunked binary
0xf7         # back reference for graphs
0xf8         # graph reference for next object
0xf9         # graph references for all following objects
0xfa - 0xfe  # reserved
0xff         # error/invalid (reserved)
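
The table above can be read as a dispatch on the first byte of a value. The following is a minimal sketch of that dispatch, not Baratine's implementation; the function name `classify` and the returned labels are chosen here for illustration.

```python
def classify(lead: int) -> str:
    """Return the kind of value introduced by a lead byte, per the table above."""
    if not 0 <= lead <= 0xff:
        raise ValueError("lead byte must be in 0x00..0xff")
    if lead <= 0x7f:
        return "integer"
    if lead <= 0xbf:
        return "string"
    if lead <= 0xcf:
        return "binary"
    if lead == 0xd0:
        return "object definition"
    if lead <= 0xef:
        return "object"
    # Single-byte opcodes; anything not listed in 0xfa..0xfe is reserved.
    fixed = {
        0xf0: "null",
        0xf1: "false",
        0xf2: "true",
        0xf3: "double",
        0xf4: "float",
        0xf5: "chunked string",
        0xf6: "chunked binary",
        0xf7: "back reference",
        0xf8: "graph reference (next object)",
        0xf9: "graph references (all following objects)",
        0xff: "invalid",
    }
    return fixed.get(lead, "reserved")
```

A reader would call `classify` on the next unread byte and then hand off to the decoder for that kind.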

Integer Encoding

Integers use a variable-length encoding. The low-order chunk comes first, and the high bit of a byte is set when additional bytes are required.

When the first byte falls inside one of the pre-defined ranges, the continuation flag is the highest bit available within that range. For the integer range (0x00 - 0x7f) that bit is 0x40, so 0x00 is the final byte of an integer, while 0x40 requires additional bytes.

Examples, using the length prefix of the string range (0x80 - 0xbf):

0x80          # zero-character string
0x81 'a'      # one-character string
0xb2 0x01 ... # 50-character string (0x12 + 1 * 0x20)
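
The range rule can be sketched as a small reader. This is an illustration under stated assumptions rather than the reference decoder: the name `read_ranged` is hypothetical, and the layout of the bytes after the first one (7 payload bits, with 0x80 as the "more" flag) is assumed from the sentence that the high bit marks additional bytes.

```python
from typing import Tuple


def read_ranged(data: bytes, pos: int, base: int, flag: int) -> Tuple[int, int]:
    """Decode an unsigned value whose first byte lies in an opcode range.

    base: first byte of the range (0x00 integers, 0x80 strings, 0xd0 objects)
    flag: highest bit inside the range (0x40, 0x20 and 0x10 respectively)
    Returns (value, position after the last byte consumed).
    """
    first = data[pos] - base
    pos += 1
    value = first & (flag - 1)   # payload bits below the continuation flag
    more = bool(first & flag)
    scale = flag                 # weight of the next chunk
    while more:
        byte = data[pos]
        pos += 1
        value += (byte & 0x7f) * scale  # assumption: 7 payload bits per later byte
        more = bool(byte & 0x80)        # assumption: 0x80 marks "more bytes"
        scale <<= 7
    return value, pos
```

For example, `read_ranged(bytes([0x81, 0x61]), 0, 0x80, 0x20)` yields a length of 1 and leaves the position on the character data.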

Signed integers use zig-zag encoding. The low-order bit is the sign bit, so non-negative and negative numbers alternate.

Examples:

0x00       # integer 0
0x01       # integer -1
0x02       # integer 1
0x03       # integer -2
0x40 0x01  # integer 32 (1 * 0x40 + 0 = 64, then converted using zig-zag)
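
The zig-zag mapping in the examples above can be written as two small functions. This is a sketch of the standard zig-zag transform; the function names are chosen here and are not taken from Baratine.

```python
def zigzag_encode(n: int) -> int:
    """Map a signed integer to an unsigned one: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def zigzag_decode(u: int) -> int:
    """Inverse mapping: the low-order bit is the sign, the rest is the magnitude."""
    return (u >> 1) ^ -(u & 1)
```

The unsigned value 64 read from `0x40 0x01` decodes to 32, matching the last example.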

Object Encoding

Objects encode their type followed by their data. There are currently four object types:

class  # fixed named fields
list   # variable-length data
map    # key, value pairs
enum   # enumeration using integers to select the value

Object Definition

Object definitions are used to self-describe the schema. The format looks like:

class def
---------

uint   # object id
string # type name (java class name)
uint   # type (class, list, map, enum)
uint   # number of fields

field def
---------
  string # name
  uint   # type (must be 1)

Class Object Value

A class-typed object is encoded as its object definition id followed by its field values. The field values appear in the same order as the fields in the class definition.

0xe1 0x04 0x85 hello  # def 65 (1 + 4 * 16), field 1 = "hello"
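
The example bytes can be reproduced with a small writer. This is a sketch under assumptions, not Baratine's encoder: the helper names are hypothetical, later bytes are assumed to carry 7 payload bits with 0x80 as the "more" flag, and field values are limited to ASCII strings so that character count and byte count agree.

```python
from typing import List


def write_ranged(value: int, base: int, flag: int) -> bytes:
    """Encode an unsigned value starting inside an opcode range.

    base is the first byte of the range, flag the continuation bit within it.
    """
    if value < 0:
        raise ValueError("value must be unsigned")
    low = value & (flag - 1)                 # bits that fit in the first byte
    rest = value >> (flag.bit_length() - 1)  # bits left for later bytes
    if rest == 0:
        return bytes([base + low])
    out = bytearray([base + flag + low])
    while rest:
        chunk = rest & 0x7f                  # assumption: 7 bits per later byte
        rest >>= 7
        out.append((chunk | 0x80) if rest else chunk)
    return bytes(out)


def write_string(text: str) -> bytes:
    """Length prefix in the string range, then the characters (ASCII only here)."""
    raw = text.encode("ascii")
    return write_ranged(len(raw), 0x80, 0x20) + raw


def encode_class_value(def_id: int, fields: List[str]) -> bytes:
    """Object definition id followed by the field values in definition order."""
    if def_id < 1:
        raise ValueError("id 0 would collide with the 0xd0 definition opcode")
    return write_ranged(def_id, 0xd0, 0x10) + b"".join(write_string(f) for f in fields)
```

Calling `encode_class_value(65, ["hello"])` produces the five-character example above, starting with `0xe1 0x04 0x85`.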

List Value

Lists are variable length and are made up of chunks. Each chunk begins with a length integer whose lowest bit is a ‘more’ flag; the remaining bits hold the number of entries in the chunk. If the flag is non-zero, more chunks follow.

0xd9 0x02 0x85 hello # def 9 (list), length=1, more=0, string "hello"
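
The chunk header arithmetic can be sketched as follows. The helper names are hypothetical, and the header is treated as an unsigned value, which is an assumption inferred from the example where 0x02 means length=1, more=0.

```python
from typing import List, Tuple


def chunk_header(length: int, more: bool) -> int:
    """Pack a chunk length and the 'more' flag into one unsigned value."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return (length << 1) | (1 if more else 0)


def parse_chunk_header(header: int) -> Tuple[int, bool]:
    """Split a header into (length, more)."""
    return header >> 1, bool(header & 1)


def plan_chunks(total: int, max_chunk: int) -> List[int]:
    """Headers needed to send `total` entries in chunks of at most `max_chunk`."""
    if max_chunk < 1:
        raise ValueError("max_chunk must be at least 1")
    headers = []
    remaining = total
    while remaining > max_chunk:
        headers.append(chunk_header(max_chunk, True))   # more chunks follow
        remaining -= max_chunk
    headers.append(chunk_header(remaining, False))      # final chunk
    return headers
```

Chunking lets a writer start emitting a list before it knows the total size, since only the final chunk clears the flag.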

Map Value

Maps are variable length and are made up of chunks. Each chunk begins with a length integer whose lowest bit is a ‘more’ flag; the remaining bits hold the number of key, value pairs in the chunk. If the flag is non-zero, more chunks follow.

0xd8 0x02 0x83 key 0x85 hello # def 8 (map), length=1, more=0
                              # key "key", value "hello"
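
Reading the map example back can be sketched as below. This is a deliberately narrow illustration, not a full decoder: the function names are hypothetical, it handles only single-byte chunk headers and short strings (no continuation bytes), it assumes keys and values are strings, and it treats the string length as a byte count.

```python
from typing import Dict, Tuple

MAP_DEF = 0xd8  # lead byte of the predefined map definition (def 8)


def read_short_string(data: bytes, pos: int) -> Tuple[str, int]:
    """Read a string whose length fits in the lead byte (0x80..0x9f)."""
    lead = data[pos]
    if not 0x80 <= lead <= 0x9f:
        raise ValueError("only short strings are handled in this sketch")
    start = pos + 1
    end = start + (lead - 0x80)
    return data[start:end].decode("utf-8"), end


def decode_string_map(data: bytes) -> Dict[str, str]:
    """Decode a map of string keys to string values, chunk by chunk."""
    if data[0] != MAP_DEF:
        raise ValueError("not a map value")
    pos = 1
    result: Dict[str, str] = {}
    more = True
    while more:
        header = data[pos]
        pos += 1
        if header >= 0x40:
            raise ValueError("multi-byte chunk headers are not handled here")
        more = bool(header & 1)   # lowest bit: more chunks follow
        count = header >> 1       # remaining bits: pairs in this chunk
        for _ in range(count):
            key, pos = read_short_string(data, pos)
            value, pos = read_short_string(data, pos)
            result[key] = value
    return result
```

Applied to the example bytes, `decode_string_map` returns a one-entry map from "key" to "hello".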