2.11. The Audio Node

The audio node makes it possible to work with digital audio in Verse applications. Audio in Verse is represented as raw, uncompressed, pulse-code modulated (PCM) samples. This simply means that audio is described by providing a linear sequence of amplitude values, called samples, and a desired replay frequency. This way of describing audio digitally is fairly standardized and well supported by typical hardware.

Note (Terminology): The word sample is used to denote a single numerical value in a PCM sequence; it is not used to refer to the whole sound.

Verse audio is always monaural; a single sound cannot be in stereo. This better mimics the properties of real-world audio sources, which are considered to be points in space from which audio is emitted. It is up to whatever software is used to "render", i.e. play, the audio to create different versions for a human listener's left and right ears, if so desired.
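As one illustration of what rendering software might do with a monaural source, the sketch below spreads a mono signal over two output channels using a simple linear pan. Verse does not prescribe any rendering method; the function name, the interleaved output layout, and the use of 32-bit floats are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Verse: render one mono source into
 * interleaved stereo (left, right, left, right, ...) with a linear pan.
 * pan = 0.0 is fully left, 1.0 is fully right, 0.5 is centered. */
static void render_mono_to_stereo(const float *mono, size_t count,
                                  float pan, float *stereo_out)
{
    assert(pan >= 0.0f && pan <= 1.0f);
    for (size_t i = 0; i < count; i++) {
        stereo_out[2 * i]     = mono[i] * (1.0f - pan);  /* left ear  */
        stereo_out[2 * i + 1] = mono[i] * pan;           /* right ear */
    }
}
```

A real renderer would more likely derive the pan (and possibly delay and filtering) from the position of the source relative to the listener; the linear pan is only the simplest stand-in.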

Audio, in both buffers and streams, is represented as uncompressed PCM (pulse-code modulation) data, meaning linear sequences of digital values that represent the audio amplitude at some point in time. The two main parameters that control the quality of the sound then become the sampling frequency, i.e. how many such values exist per unit of time, and the sampling resolution, i.e. how many bits are used to represent each value. Verse supports arbitrary sampling frequencies, and six different sample formats of both fixed-point (integer) and floating-point varieties.

2.11.1. Audio Data Organization

Verse supports samples in the following formats:

Format               Description
VN_A_BLOCK_INT8      8-bit signed integers. The most space-efficient format supported, but not a very high-quality one.
VN_A_BLOCK_INT16     16-bit signed integers. Perhaps the format that is most commonly used in low to medium-end audio applications, such as typical games on PCs and consoles.
VN_A_BLOCK_INT24     24-bit signed integers. Since most general-purpose CPUs don't have a native 24-bit integer data type, these are represented as three unsigned 8-bit bytes, stored in big-endian order without any padding.
VN_A_BLOCK_INT32     32-bit signed integers, for added precision.
VN_A_BLOCK_REAL32    32-bit floating point numbers, stored in IEEE-754 big-endian format. For high-end processing applications.
VN_A_BLOCK_REAL64    64-bit floating point numbers, stored in IEEE-754 big-endian format. For very high-end processing applications.

Verse does not support unsigned samples, and neither does it support the compression of audio data.
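The 24-bit representation is the only one without a matching native C type, so a short sketch may help. The two helpers below are hypothetical (they are not Verse API functions) and assume the sample is held in an int32_t within the signed 24-bit range; they show one way to convert to and from three big-endian bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: store a signed 24-bit sample as three bytes,
 * most significant byte first (big-endian), with no padding. */
static void pack_int24(int32_t sample, uint8_t out[3])
{
    assert(sample >= -8388608 && sample <= 8388607);
    uint32_t u = (uint32_t) sample;          /* two's complement bit pattern */
    out[0] = (uint8_t) ((u >> 16) & 0xFF);
    out[1] = (uint8_t) ((u >> 8) & 0xFF);
    out[2] = (uint8_t) (u & 0xFF);
}

/* Hypothetical helper: rebuild the signed value from three big-endian bytes. */
static int32_t unpack_int24(const uint8_t in[3])
{
    uint32_t u = ((uint32_t) in[0] << 16) | ((uint32_t) in[1] << 8) | in[2];
    if (u & 0x800000u)                       /* 24-bit sign bit is set */
        return (int32_t) u - 0x1000000;      /* sign-extend to 32 bits */
    return (int32_t) u;
}
```

Working on the unsigned bit pattern avoids shifting negative values, which is not portable in C.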

Samples are always transmitted and received collected into blocks; it is not possible to send a single sample value. This coarser granularity helps reduce the overhead of transmitting bulk data such as audio. The number of samples in a single block depends on the chosen sample format, according to the following table:

Format               Block Size
VN_A_BLOCK_INT8      1024
VN_A_BLOCK_INT16     512
VN_A_BLOCK_INT24     384
VN_A_BLOCK_INT32     256
VN_A_BLOCK_REAL32    256
VN_A_BLOCK_REAL64    128

These numbers have been chosen to optimize packet sizes for the network transmission of audio. There is no way to send a partially full packet; you must always provide (and expect) the number of samples that corresponds to the layer type. Pad with zeroes towards the end to create silence, if needed.
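The block-size rule and the zero-padding advice can be sketched as follows. The enum below is a local stand-in that reuses the format names from the tables; its numeric values, the type name BlockType, and both function names are assumptions for this example rather than the actual Verse declarations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for the block formats; numeric values are assumed. */
typedef enum {
    VN_A_BLOCK_INT8, VN_A_BLOCK_INT16, VN_A_BLOCK_INT24,
    VN_A_BLOCK_INT32, VN_A_BLOCK_REAL32, VN_A_BLOCK_REAL64
} BlockType;

/* Number of samples in one block, taken from the block size table. */
static size_t block_sample_count(BlockType type)
{
    static const size_t counts[] = { 1024, 512, 384, 256, 256, 128 };
    assert((unsigned) type <= (unsigned) VN_A_BLOCK_REAL64);
    return counts[type];
}

/* Hypothetical helper: copy `count` 16-bit samples into a full block and
 * zero-fill the remainder, so that the tail plays back as silence. */
static void fill_int16_block(const int16_t *src, size_t count, int16_t *block)
{
    size_t size = block_sample_count(VN_A_BLOCK_INT16);
    assert(count <= size);
    memcpy(block, src, count * sizeof *src);
    memset(block + count, 0, (size - count) * sizeof *block);
}
```

The destination passed to fill_int16_block must have room for a complete 512-sample block, since a partial block can never be sent.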

The sampling frequency associated with audio data is specified using a 64-bit floating point number, and is expressed in Hertz (Hz). So, for CD-quality audio, you would use a VN_A_BLOCK_INT16-type layer at a frequency of 44,100.0 Hz.

2.11.2. Node Structure

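The audio node provides two distinct kinds of support for handling audio data: buffers, which store audio data for editing, and streams, which carry audio data for immediate playback.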

These two variants of support complement each other, and make the overall audio support richer than having just one would. The buffer and stream mechanisms are described in more detail below.

Because the number of samples per block varies with the block's type, and the playback duration of a single sample varies with the frequency (at N Hz, N samples are needed per second), there is no single, fixed playback duration for a block of audio; it depends on both the block type and the sampling frequency.
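For a particular combination of block type and frequency, the duration of one block still follows directly from the two quantities: it is the number of samples in the block divided by the sampling frequency. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: playback time of one block, in seconds.
 * samples_per_block comes from the block size table for the chosen format. */
static double block_duration_seconds(size_t samples_per_block,
                                     double frequency_hz)
{
    assert(frequency_hz > 0.0);
    return (double) samples_per_block / frequency_hz;
}
```

For example, a 512-sample VN_A_BLOCK_INT16 block at 44,100 Hz lasts 512 / 44100, or roughly 11.6 milliseconds, while the same block at 22,050 Hz lasts twice as long.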

2.11.2.1. Buffers

Buffers, as in the text node, are used to store audio data for editing. A buffer is simply a named container that can hold blocks of samples. Each such block is given an index, simply an integer that tells you the location of the block in the buffer as a whole. There can be gaps in the index sequence; such gaps represent silence.

The intended use for audio buffers is creating audio editing applications; they provide a host-side "back end" for audio storage. By storing the samples as blocks of the same size used by streams (see below), the transition from passive storage to active playback can be made easier.

The following image (Figure 2-5) illustrates how buffer blocks form a sequence, and that there can be gaps where the data is "clear", i.e. the amplitude is zero and the audio silent. The vertical lines at regular intervals illustrate block boundaries (these blocks are 128 samples each), and the digits below are the block indices of each. Note how they start off at zero, and how the index of the silent block is still a valid index.

Figure 2-5. Audio Buffer Blocks
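The indexing scheme with gaps can be sketched as a sparse container on the client side. Everything below is a hypothetical illustration, not how any Verse host or client actually stores buffers: the type and function names, the fixed capacity, and the choice of 16-bit blocks (512 samples each) are assumptions. The point is that a sample position maps to a block index and an offset, and that a missing index simply reads as zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SAMPLES 512   /* 16-bit block size from the table above */
#define MAX_BLOCKS    16    /* arbitrary capacity for this sketch */

typedef struct {
    int      used;                     /* non-zero if this slot holds a block */
    uint32_t index;                    /* block index within the buffer */
    int16_t  samples[BLOCK_SAMPLES];
} StoredBlock;

typedef struct {
    StoredBlock blocks[MAX_BLOCKS];    /* zero-initialize before use */
} SparseBuffer;

/* Store (or overwrite) the block at `index`. Returns 0, or -1 if full. */
static int store_block(SparseBuffer *buf, uint32_t index, const int16_t *samples)
{
    StoredBlock *slot = NULL;
    assert(buf != NULL && samples != NULL);
    for (size_t i = 0; i < MAX_BLOCKS; i++) {
        StoredBlock *b = &buf->blocks[i];
        if (b->used && b->index == index) { slot = b; break; }
        if (!b->used && slot == NULL) slot = b;   /* remember first free slot */
    }
    if (slot == NULL)
        return -1;
    slot->used = 1;
    slot->index = index;
    memcpy(slot->samples, samples, sizeof slot->samples);
    return 0;
}

/* Read one sample by absolute position; gaps read as silence (zero). */
static int16_t read_sample(const SparseBuffer *buf, uint64_t position)
{
    uint32_t index  = (uint32_t) (position / BLOCK_SAMPLES);
    size_t   offset = (size_t) (position % BLOCK_SAMPLES);
    for (size_t i = 0; i < MAX_BLOCKS; i++)
        if (buf->blocks[i].used && buf->blocks[i].index == index)
            return buf->blocks[i].samples[offset];
    return 0;
}
```

Storing blocks 0 and 2 but not block 1 leaves a gap: reading any position from 512 to 1023 then yields zero, which matches the silent block shown in the figure.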

2.11.2.2. Streams

Streams are simply independent "places" where audio data can be sent for playback. It might be helpful to think of them as channels of a "radio" (the node), each of which can be subscribed to individually.

Unlike almost all other data in Verse, stream data is transferred in an unreliable way; any dropped audio commands will not be resent. This is because the commands are intended to contain data that is to be replayed imminently, so there would not be enough time to do a resend.

Data in streams arrives in time-stamped blocks, one per command. The size of the blocks, in number of samples, varies with the data type chosen as per the table above. For network data encoding/decoding reasons, each block specifies its data type and sample frequency, although varying these within a single stream is not recommended, as that creates rather complex problems during replay.

The timestamp in each block lets the receiver know when that block is supposed to start playing. A typical stream playing client will put the block in a queue, sorted on the timestamp. Blocks are then de-queued and played, possibly employing some kind of double buffering scheme, as the current time reaches that of the block's timestamp.
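The queueing behaviour described above can be sketched with a small sorted array. This is only an illustration under stated assumptions: the timestamp is modelled as a double holding seconds (the source does not say how timestamps are encoded), the block payload is reduced to an identifier, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAPACITY 32   /* arbitrary limit for this sketch */

typedef struct {
    double timestamp;   /* assumed: playback start time, in seconds */
    int    block_id;    /* stand-in for the block's sample data */
} QueuedBlock;

typedef struct {
    QueuedBlock items[QUEUE_CAPACITY];   /* sorted, earliest first */
    size_t      count;
} PlayQueue;

/* Insert a block, keeping the queue sorted by ascending timestamp.
 * Returns 0 on success, or -1 if the queue is full. */
static int queue_insert(PlayQueue *q, QueuedBlock block)
{
    assert(q != NULL);
    if (q->count == QUEUE_CAPACITY)
        return -1;
    size_t i = q->count;
    while (i > 0 && q->items[i - 1].timestamp > block.timestamp) {
        q->items[i] = q->items[i - 1];   /* shift later blocks back */
        i--;
    }
    q->items[i] = block;
    q->count++;
    return 0;
}

/* If the earliest block is due at time `now`, remove it into *out and
 * return 1; otherwise leave the queue untouched and return 0. */
static int queue_pop_due(PlayQueue *q, double now, QueuedBlock *out)
{
    assert(q != NULL && out != NULL);
    if (q->count == 0 || q->items[0].timestamp > now)
        return 0;
    *out = q->items[0];
    for (size_t i = 1; i < q->count; i++)
        q->items[i - 1] = q->items[i];
    q->count--;
    return 1;
}
```

A playback loop would call queue_insert for each arriving command and queue_pop_due repeatedly with the current time. Because stream data is unreliable, the loop should expect missing blocks and simply play nothing for the interval a dropped block would have covered.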