3.3. Data Encoding

The purpose of the network protocol is to allow data to be serialized and transmitted over a network in a way that is independent of the machine type of the transmitting end. This requires a well-defined representation for each kind of value, which this section specifies. The discussion is split into subsections, one for each major type of value required by Verse: integers, reals (floating-point numbers), text strings, enumerations, and structures. There is also a description of how arrays of these value types are encoded.

Being external representations, all numerical types have well-defined sizes, in bits.

3.3.1. Integers

All the various integers defined in Verse are encoded as sequences of 8-bit bytes; the number of bytes follows directly from the bit size of the integer type (simply divide by eight). Encoding is done in "network byte order", with the most significant byte first (towards the start of a packet).

Signed integers are encoded in two's complement form, with the most significant bit being the sign bit.

The various integer aliases encode as the types they alias, respectively.
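
The integer rules above can be sketched in a few lines of Python. This is only an illustration, not part of the reference implementation: the helper names encode_int and decode_int are hypothetical, and the sketch relies on the standard int.to_bytes and int.from_bytes methods to produce big-endian, two's complement output.

```python
def encode_int(value: int, bits: int, signed: bool) -> bytes:
    """Encode an integer as bits/8 bytes, most significant byte first."""
    if bits % 8 != 0:
        raise ValueError("bit size must be a multiple of eight")
    # With signed=True, to_bytes emits two's complement; it raises
    # OverflowError if the value does not fit in the requested size.
    return value.to_bytes(bits // 8, byteorder="big", signed=signed)


def decode_int(data: bytes, signed: bool) -> int:
    """Decode a big-endian integer that occupies all of `data`."""
    return int.from_bytes(data, byteorder="big", signed=signed)
```

For instance, encode_int(-4711, 32, True) yields the four bytes 0xff 0xff 0xed 0x99, matching Example 3-1.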

3.3.2. Reals

Reals are encoded as IEEE 754 numbers, and occupy four or eight bytes each depending on precision. The first byte in the encoded form holds the sign bit in its most significant bit, followed by the biased exponent and finally the mantissa. The general form for real numbers is thus (field widths in bits):

Type      Sign Bit   Exponent   Mantissa
real32    1          8          23
real64    1          11         52
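
As a rough sketch of the layout above, Python's struct module can produce both encodings. The function names are hypothetical, and the big-endian byte order (format prefix ">") is an assumption inferred from the statement that the first byte holds the sign bit.

```python
import struct


def encode_real32(value: float) -> bytes:
    """Pack a float as a 4-byte IEEE 754 single, sign bit in the first byte."""
    return struct.pack(">f", value)


def encode_real64(value: float) -> bytes:
    """Pack a float as an 8-byte IEEE 754 double, sign bit in the first byte."""
    return struct.pack(">d", value)


def split_real32(data: bytes):
    """Return (sign, biased exponent, mantissa) of an encoded real32."""
    (raw,) = struct.unpack(">I", data)
    sign = raw >> 31                 # 1 bit
    exponent = (raw >> 23) & 0xFF    # 8 bits
    mantissa = raw & 0x7FFFFF        # 23 bits
    return sign, exponent, mantissa
```

Splitting the encoded form of -2.0 gives a sign of 1, a biased exponent of 128 and a mantissa of 0.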

3.3.3. Strings

Strings are encoded in UTF-8. Briefly, this means that characters with Unicode code points U+0001 to U+007f (inclusive) are encoded as single bytes with the value of the code point, while code points outside that range require multiple bytes. A single character requires at most four bytes to encode.

Strings use a byte with the value 0 to mark the end, so no string can contain that byte legally. An empty string is thus encoded as a single byte with the value 0.
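
A minimal sketch of the string rules, using hypothetical helper names. The decoder returns the offset of the byte following the terminator, on the assumption that a caller wants to continue decoding further fields from the same buffer.

```python
def encode_string(text: str) -> bytes:
    """Encode text as UTF-8 followed by a terminating zero byte."""
    if "\x00" in text:
        raise ValueError("a string cannot contain the terminator byte")
    return text.encode("utf-8") + b"\x00"


def decode_string(data: bytes, offset: int = 0):
    """Decode one string starting at `offset`.

    Returns the text and the offset just past the terminator.
    bytes.index raises ValueError if no terminator is found.
    """
    end = data.index(b"\x00", offset)
    return data[offset:end].decode("utf-8"), end + 1
```

Under these rules encode_string("") produces the single byte 0x00, as stated above.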

3.3.4. Enumerations

All enumerations encode as the uint8 integer type, and all their values have therefore been chosen to fit in the numerical range of that type.
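
The following sketch shows one way to map an enumeration onto a single byte. ExampleKind and its members are invented purely for illustration and do not correspond to any enumeration defined by Verse.

```python
from enum import IntEnum


class ExampleKind(IntEnum):
    """Hypothetical enumeration, used only to illustrate the encoding."""
    FIRST = 0
    SECOND = 1
    THIRD = 2


def encode_enum(member: IntEnum) -> bytes:
    """Encode an enumeration value as a single uint8 byte."""
    value = int(member)
    if not 0 <= value <= 0xFF:
        raise ValueError("enumeration value does not fit in a uint8")
    return bytes([value])


def decode_enum(enum_type, data: bytes, offset: int = 0):
    """Decode one byte at `offset`; raises ValueError for unknown values."""
    return enum_type(data[offset])
```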

3.3.5. Arrays

Arrays are encoded simply as a sequence of values of the basic type, one after the other. The number of elements is not encoded as part of the array itself, so for variable-length arrays the length must be carried in some preceding field to allow proper decoding.

For arrays of unions, it must be specified where the information needed to resolve which union member is present is stored, and whether it is shared by all slots or not. This information is needed because an encoded union takes only the space of the member actually present, unlike in programming languages such as C and C++, where unions always occupy enough space to accommodate their largest possible member.
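
The variable-length case can be sketched as follows. The placement of a uint16 element count immediately before the array is an assumption made for this example; the text only requires that the length be available in some preceding field. The function names are hypothetical.

```python
def encode_uint16_array(values) -> bytes:
    """Encode a uint16 count (assumed preceding field), then the elements."""
    out = bytearray(len(values).to_bytes(2, "big"))
    for value in values:
        # Elements follow one after the other, with nothing in between.
        out += value.to_bytes(2, "big")
    return bytes(out)


def decode_uint16_array(data: bytes, offset: int = 0):
    """Return the decoded list and the offset just past the array."""
    count = int.from_bytes(data[offset:offset + 2], "big")
    offset += 2
    values = [
        int.from_bytes(data[offset + 2 * i:offset + 2 * i + 2], "big")
        for i in range(count)
    ]
    return values, offset + 2 * count
```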

3.3.6. Structures and Unions

A structure is encoded by encoding its fields end-to-end, in the order they appear in the definition. There is no padding between fields.

A union type is encoded by picking one of the fields, using external information, and then applying whatever encoding rules apply to that field's type. The encoded size of a union is exactly the size of whichever field was encoded; there is no padding of the union as a whole to the size of the largest field (unlike the in-memory representation of unions in the C language).
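
A small sketch of both rules, using made-up types: a structure with a uint8 field followed by a uint32 field, and a union that holds either a uint8 or a real64. The `kind` argument stands in for the external information that selects the union member; its values, like all names here, are assumptions for this example. The ">" prefix tells Python's struct module to use standard sizes with no alignment padding.

```python
import struct


def encode_example_struct(flag: int, count: int) -> bytes:
    """Hypothetical struct {uint8 flag; uint32 count}: 5 bytes, no padding."""
    return struct.pack(">BI", flag, count)


def encode_example_union(kind: int, value) -> bytes:
    """Hypothetical union of uint8 (kind 0) and real64 (kind 1).

    `kind` is external information and is not written by this function.
    """
    if kind == 0:
        return struct.pack(">B", value)   # 1 byte
    if kind == 1:
        return struct.pack(">d", value)   # 8 bytes
    raise ValueError("unknown union member")
```

Note that the encoded structure is five bytes long, whereas a C compiler would typically lay the same two fields out in eight bytes of memory.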

3.3.7. Alignment

Verse data is always encoded completely without padding after or between values. Since all commands begin with the command byte, the fields that follow are almost guaranteed not to be aligned in any particular manner. Code that encodes and decodes Verse data must therefore be written carefully.

Accessing the command buffers byte by byte is the easiest way to do this safely. Reading or writing a whole value of a data type bigger than one byte (such as a double, which is generally how the real64 data type is implemented in C) through a non-aligned address can often cause bad performance, or even crashes.
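
The byte-by-byte approach can be illustrated as below. Python itself does not suffer from alignment faults, so this is only a sketch of the access pattern, which carries over directly to C; the function names and the buffer layout in the usage lines are assumptions.

```python
import struct


def read_uint32(buf: bytes, offset: int) -> int:
    """Assemble a big-endian uint32 one byte at a time, at any offset."""
    return (
        (buf[offset] << 24)
        | (buf[offset + 1] << 16)
        | (buf[offset + 2] << 8)
        | buf[offset + 3]
    )


def read_real64(buf: bytes, offset: int) -> float:
    """Read a big-endian real64 at any offset.

    With the ">" prefix, struct.unpack_from places no alignment
    requirement on `offset`.
    """
    return struct.unpack_from(">d", buf, offset)[0]


# A one-byte command identifier followed by two unaligned fields.
buffer = b"\x20" + (0x01020304).to_bytes(4, "big") + struct.pack(">d", 1.5)
count = read_uint32(buffer, 1)
scale = read_real64(buffer, 5)
```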

The reference API implementation hides these details from users; application programmers using it need never concern themselves with how Verse data is transported in a network.

No data type is smaller than a single byte. Verse data is encoded as a sequence of whole bytes with no overlap: all bits of a given byte always belong to the same logical field.

3.3.8. Examples

Here are a few simple examples of how values of the various types are encoded on the network. Encoded data is shown as a sequence of bytes, where the leftmost byte would be the one towards the beginning of the packet.

Example 3-1. A signed integer

The integer -4711, taken as an int32, would be encoded like this:

0xff 0xff 0xed 0x99

Example 3-2. An unsigned integer

The integer 711, taken as a uint16, would be encoded like this:

0x02 0xc7

Example 3-3. A simple string

The 10-character string "Verse test" would encode as the following sequence of eleven 8-bit bytes:

0x56 0x65 0x72 0x73 0x65 0x20 0x74 0x65 0x73 0x74 0x00

Example 3-4. An international string

The 7-character string "smörgås" [1] encodes as the following 10-byte sequence using UTF-8:

0x73 0x6d 0xc3 0xb6 0x72 0x67 0xc3 0xa5 0x73 0x00

Here, it is clear that a string using characters outside the range U+0001 to U+007f (inclusive) uses more space in its encoded form than a string of the same length that uses only characters from inside the range.

Notes

[1] Swedish for "sandwich"