Making a generic wrapper for a niche binary format using meta tags

In this article I’ll show how meta tags turned out to be a surprisingly powerful feature and how I used them to implement core parts of a polymorphic wrapper around a binary message format.

Problem overview

I work for a company that makes various software for financial markets. Some of those markets use the FIX SBE format (or simply SBE) for message encoding. The details of this format don’t really matter here, but a small example will make the overall picture clearer. In SBE, the message layout is specified using an XML schema:

<!-- user-defined types... -->
<type name="uint32_t" primitiveType="uint32"/>

<!-- user-defined messages... -->
<sbe:message name="msg_a" id="1">
    <field name="field_a" type="uint32_t" id="1"/>
    <field name="field_b" type="uint32_t" id="2"/>
</sbe:message>

In memory this message looks roughly like:

struct msg_a_layout{
    // common for all messages from the same schema, provides message ID
    message_header header;
    uint32_t field_a;
    uint32_t field_b;
};

Thanks to its binary nature, the format is very fast to work with, much faster than its ancestor, text-based FIX.

The typical way of working with SBE is to generate language-specific code from the XML schema (so the generated code is schema-specific) and include it in the final app. This workflow is very similar to protobuf or flatbuffers.

At some point, being unhappy with our proprietary implementation, I decided to write my own one - sbepp. The main goals I had in mind for that project were:

zero overhead, in financial apps we care about performance
provide only the basic functionality, building blocks that allow users to create their own efficient solutions to their specific problems
provide all information from SBE schema, high-level things like field description and low-level details like memory offsets and sizes

As a result, generated code never allocates or does more work than one would do by hand.

sbepp compiles an XML schema into a header-only C++ library. For most SBE entities it generates a class with field-named accessors:

// SBE-specific wrapper around `std::uint32_t`
class uint32_t{
    std::uint32_t value() const;
    // ...
};

class msg_a{
    uint32_t field_a();      // getter
    void field_a(uint32_t);  // setter

    uint32_t field_b();
    void field_b(uint32_t);
};

// other messages, types, etc.

Since SBE is a format from the financial world, its primary use case is encoding and decoding market messages. It’s quite natural to generate and include schema-specific headers in a market-specific component. We began to use sbepp for this purpose and everything was quite good.

However, we also have a project that can work with multiple markets and SBE is used internally as their common message format. The main market logic is still located in market-specific components but there’s also a core component that has to be able to work with any SBE message in a generic way. In particular, it should be able to:

convert to/from JSON
create and fill message given its schema/message/field name/value

And here’s the problem: how do you make a generic, runtime-polymorphic wrapper around a bunch of unrelated classes, each with its own set of field names and types? There’s no way to convert msg->get_by_name("field_a") into msg.field_a() in a generic fashion. Even the return type of such a function is unclear.

In theory, sbepp knows everything during code generation, but again, it’s not possible to generate an interface/implementation that will satisfy all users. Not to mention that most users don’t even need this polymorphic wrapper; it’s a very ad-hoc problem.

So the goal of this task became to provide a minimal mechanism that allows users to do literally anything they want without overhead. No ad-hoc solutions, only building blocks for them.

However, there was a glimpse of light. If you remember, one of the initial sbepp goals was providing all information from SBE schema. That means field names, descriptions, offsets, basically every attribute that SBE schema defines. The way it’s been done is a standalone set of meta tags that represent SBE schema structure and various traits to access properties via those tags:

// tags
struct schema_name::schema{
    struct types{
        struct uint32_t{};
        // other types...
    };

    struct messages{
        struct msg_a{
            struct field_a{};
            // other fields...
        };
        // other messages...
    };
};

// traits
struct type_traits<schema_name::schema::types::uint32_t>{/*...*/};
struct message_traits<schema_name::schema::messages::msg_a>{/*...*/};
struct field_traits<schema_name::schema::messages::msg_a::field_a>{/*...*/};

// representation classes
class schema_name::messages::msg_a{
    uint32_t field_a();
    void field_a(uint32_t);
    // other fields...
};

// a mapping between them is available too
// representation type to meta tag
static_assert(std::same_as<
    sbepp::traits_tag_t<msg_a>,
    schema_name::schema::messages::msg_a>);
// meta tag to representation type
static_assert(std::same_as<
    message_traits<schema_name::schema::messages::msg_a>::value_type,
    schema_name::messages::msg_a>);

Solution

It turns out that both generic de/serialization and by-name accessors need similar functionality: the ability to navigate over the fields at compile time to access their meta information (e.g. name) and, at the same time, to access field values at run-time if needed.

From the previous section we know that sbepp provides a way to access both kinds of information via two separate APIs which, however, cannot be combined generically:

// access runtime value
auto field_value = msg.field_a();
// access meta-information
constexpr auto field_name = field_traits<message_a::field_a>::name();
// but no way to combine them

To solve the first part of the problem, compile-time navigation, we can use a very simple tool - a type list.

template<typename... Types>
struct type_list{};

The existing meta tags and their traits already provide all the information we need; we only have to put the tags into a list and expose it via a trait:

// concrete message traits
template<>
struct message_traits<schema_name::schema::messages::msg_a>{
    using field_tags = type_list<
        schema_name::schema::messages::msg_a::field_a,
        schema_name::schema::messages::msg_a::field_b>;
};

The same approach is used to represent all child tags within their parents’ traits, not just message fields. For more convenience, let’s add is_ABC_tag traits so the user can figure out the tag kind and use the proper trait to access its properties, e.g. is_enum_tag<T> followed by enum_traits<T>.

The next step is runtime access by compile-time tag. For that we add a couple of functions: get_by_tag<Tag> and set_by_tag<Tag>. Every generated class that has children to access by tag implements them by simply forwarding to the proper name-based accessor:

class msg_a{
    // normal accessors
    uint32_t field_a();
    void field_a(uint32_t);

    // for exposition only
    auto get_by_tag(message_a::field_a){
        return field_a();
    }

    void set_by_tag(message_a::field_a, uint32_t value){
        field_a(value);
    }
};

// public methods forward to internal ones above
sbepp::get_by_tag<message_a::field_a>(msg);
sbepp::set_by_tag<message_a::field_a>(msg, 123);
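
To make the forwarding idea concrete, here is a small self-contained sketch of tag-based dispatch. It is not the generated sbepp code: field_a_tag, field_b_tag and this msg_a (which owns its data to keep the sketch short) are hypothetical stand-ins, and C++17 is assumed.

```cpp
#include <cstdint>

// Hypothetical tags standing in for the generated field tags.
struct field_a_tag{};
struct field_b_tag{};

// Hypothetical message that owns its data, to keep the sketch self-contained.
class msg_a{
public:
    // normal accessors
    constexpr std::uint32_t field_a() const{ return a_; }
    constexpr void field_a(const std::uint32_t v){ a_ = v; }
    constexpr std::uint32_t field_b() const{ return b_; }
    constexpr void field_b(const std::uint32_t v){ b_ = v; }

    // One overload per tag: overload resolution selects the accessor
    // at compile time, so there is no runtime lookup.
    constexpr auto get_by_tag(field_a_tag) const{ return field_a(); }
    constexpr auto get_by_tag(field_b_tag) const{ return field_b(); }
    constexpr void set_by_tag(field_a_tag, const std::uint32_t v){ field_a(v); }
    constexpr void set_by_tag(field_b_tag, const std::uint32_t v){ field_b(v); }

private:
    std::uint32_t a_{};
    std::uint32_t b_{};
};

// Public entry points: turn the tag type into a tag value and forward.
template<typename Tag, typename Message>
constexpr auto get_by_tag(const Message& m){
    return m.get_by_tag(Tag{});
}

template<typename Tag, typename Message, typename Value>
constexpr void set_by_tag(Message& m, const Value value){
    m.set_by_tag(Tag{}, value);
}
```

Because the tag is an empty type used only for overload selection, a call like get_by_tag<field_a_tag>(m) should compile down to the same code as m.field_a().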

Now let’s see how these two seemingly minor additions allowed a very rich set of functionality. Actually, they allow nearly everything one can imagine about working with SBE data.

Examples

Disclaimer. Examples are intentionally simplified and are not intended for direct use as-is. The main goal is to demonstrate the overall approach; the interested reader can look near the bottom of the Examples page for examples that do compile.

A common technique in most examples is iteration over a type list that contains child tags of various kinds. Its implementation is not important here, just assume it has this signature:

// For each tag `T` in `TypeList`, calls `cb(T{})` until it returns `true`.
// Returns `true` if `cb` returned `true`, `false` otherwise.
template<typename TypeList>
auto for_each_tag_until(auto cb);
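
For readers who want something concrete, here is one way such a helper could look. This is a minimal sketch under my own assumptions (C++17, a callback that returns something convertible to bool), not the implementation from the sbepp examples. The names tag_a, tag_b, tag_c and visits_until are hypothetical and exist only to exercise it.

```cpp
#include <type_traits>

// Same shape as the `type_list` shown earlier.
template<typename... Types>
struct type_list{};

namespace detail{
// Unpacks `type_list<Tags...>` so the tags can be used in a fold expression.
template<typename... Tags, typename Cb>
constexpr bool for_each_tag_until_impl(
    type_list<Tags...>, [[maybe_unused]] Cb& cb){
    // `||` evaluates left to right and short-circuits, so iteration stops
    // at the first callback that returns `true`. An empty list gives `false`.
    return (static_cast<bool>(cb(Tags{})) || ...);
}
} // namespace detail

template<typename TypeList, typename Cb>
constexpr bool for_each_tag_until(Cb cb){
    return detail::for_each_tag_until_impl(TypeList{}, cb);
}

// Hypothetical tags, for illustration only.
struct tag_a{};
struct tag_b{};
struct tag_c{};

// Counts how many tags are visited before `Target` is found.
template<typename Target, typename TypeList>
constexpr int visits_until(){
    int visited = 0;
    for_each_tag_until<TypeList>([&visited](auto tag){
        ++visited;
        return std::is_same_v<decltype(tag), Target>;
    });
    return visited;
}

// `tag_b` is second in the list, so the third tag is never visited.
static_assert(visits_until<tag_b, type_list<tag_a, tag_b, tag_c>>() == 2);
```

The fold over `||` is what provides the "until" part: no explicit loop or recursion is needed, and the compiler sees a flat chain of conditions.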

Also, I won’t specify sbepp namespace to make the code more compact.

Enum to string

sbepp represents SBE enums as scoped C++ enums for which there’s a classic problem: you have an enum and want to log its name. To solve it we can iterate over all known enumerator values to find a match with the current one:

// returns `nullptr` if value is unknown
template<typename Enum>
const char* enum_to_string(Enum e){
    const char* res{};

    for_each_tag_until<enum_traits<traits_tag_t<Enum>>::value_tags>(
        [e, &res]<typename Tag>(Tag){
            if(enum_value_traits<Tag>::value() == e){
                res = enum_value_traits<Tag>::name();
                return true;
            }
            return false;
    });

    return res;
}

Btw, here’s how to do the opposite conversion:

template<typename Enum>
Enum string_to_enum(std::string_view name){
    Enum res{};

    for_each_tag_until<enum_traits<traits_tag_t<Enum>>::value_tags>(
        [name, &res]<typename Tag>(Tag){
        if(enum_value_traits<Tag>::name() == name){
            res = enum_value_traits<Tag>::value();
            return true;
        }
        return false;
    });

    return res;
}

Note that this is not the only possible implementation. For enum-to-string conversion on relatively small enums we can expect the compiler to optimize the search into a lookup table. For the opposite conversion, a compile-time hash table can be used to improve performance in complex cases (again, this can be decided at compile time based on the number of enumerators).

Handle all schema messages

Another common use case is when you receive an SBE message from the network and need to handle it. To do that, you first need to figure out the message type by looking at the message header and then create the corresponding message wrapper. Instead of writing a bunch of if-s or case-s for each message type manually, we can make a generic helper that will do that for us:

// calls `cb(Message)` when a buffer given by `data`/`size` pair represents
// any `Message` from schema represented by `SchemaTag`.
// returns `false` if message is unknown
template<typename SchemaTag>
bool handle_schema_message(const char* data, const size_t size, auto cb){
    // implementation of this function is not important here
    const auto msg_id = get_message_id_from_header<SchemaTag>(data, size);

    return for_each_tag_until<schema_traits<SchemaTag>::message_tags>(
        [data, size, msg_id, &cb]<typename Tag>(Tag){
            if(message_traits<Tag>::id() == msg_id){
                cb(message_traits<Tag>::value_type{data, size});
                return true;
            }
            return false;
    });
}

// usage:
handle_schema_message<MySchema::schema>(data, size, overloaded{
    [](MySchema::messages::message_a msg){},
    [](MySchema::messages::message_b msg){},
    // other messages...
});

Note that this helper lets you either ensure that all messages are handled (without a catch-all, an unhandled message type fails to compile) or handle only some target messages by adding [](auto){} at the end of the overload set. The optimizer can see it and eliminate branches for ignored messages.
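
The overloaded helper used in the call above is not part of the snippet. A common way to write it is the C++17 idiom sketched below. The message types and the dispatch function are hypothetical and only demonstrate how the catch-all interacts with the exact overloads.

```cpp
// C++17 "overloaded" idiom: inherit the call operators of all lambdas.
template<typename... Ts>
struct overloaded : Ts...{
    using Ts::operator()...;
};
// Deduction guide: required in C++17, harmless in C++20.
template<typename... Ts>
overloaded(Ts...) -> overloaded<Ts...>;

// Hypothetical message types, for illustration only.
struct message_a{};
struct message_b{};
struct message_c{};

// Returns a code that identifies which overload was selected.
template<typename Message>
constexpr int dispatch(const Message msg){
    return overloaded{
        [](message_a){ return 1; },
        [](message_b){ return 2; },
        // Catch-all: chosen only when no exact overload matches, because
        // a non-template call operator wins over a generic one.
        [](auto){ return 0; }
    }(msg);
}

static_assert(dispatch(message_a{}) == 1);
static_assert(dispatch(message_c{}) == 0);
```

Removing the [](auto) lambda turns dispatch(message_c{}) into a compile error, which is the "all messages must be handled" mode.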

Access by name

So far we’ve only seen tag iteration; let’s add get/set_by_tag into the mix. Here’s a simplified by-name getter:

template<typename Message>
uint64_t get_by_name(Message m, std::string_view field_name){
    uint64_t res{};

    for_each_tag_until<message_traits<traits_tag_t<Message>>::field_tags>(
        [field_name, m, &res]<typename Tag>(Tag){
        if(field_traits<Tag>::name() == field_name){
            res = get_by_tag<Tag>(m).value();
            return true;
        }
        return false;
    });

    return res;
}

It assumes that every message field’s value can be represented by uint64_t. A full implementation can return a variant of basic types or even some extra info, for example the enumerator name along with its value. It should also handle complex field paths instead of just a single field name, e.g. field_a.sub_field_b. That’s all doable using the same technique. Note that this approach relies on string comparison: it compares the given field name to all known field names at runtime, which might be suboptimal for some applications. Further optimization is straightforward: we can compare numbers instead of strings by relying on a compile-time hash function:

// constexpr-friendly hash implementation
constexpr uint64_t get_hash(const std::string_view str);

const auto field_name_hash = get_hash(field_name);

for_each_tag_until<message_traits<traits_tag_t<Message>>::field_tags>(
    [field_name, field_name_hash, m, &res]<typename Tag>(Tag){
        if((get_hash(field_traits<Tag>::name()) == field_name_hash)
            && (field_traits<Tag>::name() == field_name)){
                res = get_by_tag<Tag>(m).value();
                return true;
        }
        return false;
});
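
The snippet above only declares get_hash. One possible definition is 64-bit FNV-1a, shown here as a sketch. The choice of FNV-1a is my assumption for illustration, and any constexpr-friendly hash would do. names_match is a hypothetical helper that mirrors the condition in the lambda.

```cpp
#include <cstdint>
#include <string_view>

// 64-bit FNV-1a, usable in constant expressions since C++17.
constexpr std::uint64_t get_hash(const std::string_view str){
    std::uint64_t hash = 0xcbf29ce484222325ull; // FNV offset basis
    for(const char c : str){
        hash ^= static_cast<unsigned char>(c);
        hash *= 0x100000001b3ull;               // FNV prime
    }
    return hash;
}

// Hash check first (cheap), string comparison second (guards against
// collisions). `known` plays the role of `field_traits<Tag>::name()`.
constexpr bool names_match(
    const std::string_view known,
    const std::string_view given,
    const std::uint64_t given_hash){
    return (get_hash(known) == given_hash) && (known == given);
}

// The hash of a compile-time name is itself a compile-time constant.
static_assert(get_hash("a") == 0xaf63dc4c8601ec8cull);
```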

Because field names are available at compile time, their hash can be computed at compile time too so at runtime this will mostly compare numbers. One string comparison is unavoidable if we want to protect against collisions. To give you an idea how to extend it to support, for example, SBE enums, we can leverage if constexpr to distinguish field kind:

for_each_tag_until<...>(
    [...]<typename Tag>(Tag){
    // as before
    if(names_match){
        // field tag != its value type tag
        using value_type_tag = field_traits<Tag>::value_type_tag;
        if constexpr(is_enum_tag_v<value_type_tag>){
            res = to_underlying(get_by_tag<Tag>(m));
        }
        else{
            res = get_by_tag<Tag>(m).value();
        }
        return true;
    }
    return false;
});

Since SBE enums are scoped C++ enum-s, we use to_underlying() to get their underlying value instead of value() for numeric fields.

OK, let’s see the setter:

template<typename Message>
bool set_by_name(Message m, std::string_view field_name, uint64_t value){
    return for_each_tag_until<
        message_traits<traits_tag_t<Message>>::field_tags>(
        [m, value, field_name]<typename Tag>(Tag){
            if(field_traits<Tag>::name() == field_name){
                set_by_tag<Tag>(m, value);
                return true;
            }
            return false;
    });
}

Again, the real implementation will need to handle various field types and check that given value can be properly converted to the underlying field type.
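
As a sketch of what such a check could look like for integral fields, here is a small range test. fits_in is a hypothetical helper of mine, not something sbepp provides, and it covers only the case where the incoming value is a uint64_t as in set_by_name above.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper: returns `true` if `value` can be stored in `Target`
// without losing information.
template<typename Target>
constexpr bool fits_in(const std::uint64_t value){
    static_assert(
        std::is_integral_v<Target>, "only integral targets are supported");
    // `max()` is never negative, so the cast to `uint64_t` is lossless.
    return value
        <= static_cast<std::uint64_t>(std::numeric_limits<Target>::max());
}

static_assert(fits_in<std::uint8_t>(255));
static_assert(!fits_in<std::uint8_t>(256));
```

Inside the setter lambda, a check like this could run before set_by_tag and make the lambda report failure for out-of-range values, given the field's underlying integral type.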

De/serialization

Conversion to/from any other format is not a big deal now, as we can freely iterate over the message structure and get any representation of its fields. Let’s imagine we want JSON:

template<typename Message>
std::string to_json(Message m){
    json j;

    for_each_tag_until<message_traits<traits_tag_t<Message>>::field_tags>(
        [m, &j]<typename Tag>(Tag){
            j[field_traits<Tag>::name()] = get_by_tag<Tag>(m).value();
            return false;
        });

    return j.to_string();
}

template<typename Message>
Message from_json(const json& j){
    Message m;

    for_each_tag_until<message_traits<traits_tag_t<Message>>::field_tags>(
        [&m, &j]<typename Tag>(Tag){
            set_by_tag<Tag>(m, j[field_traits<Tag>::name()]);
            return false;
        });

    return m;
}

Of course, the real implementation is a bit harder than that because a message can have a nested structure and many other field types. But what’s important is that it’s possible to tailor the code to specific needs. For example, in our internal implementation enums are represented as JSON strings (or numbers for unknown values), optional numeric fields are numbers or JSON null, arrays are JSON strings or JSON arrays of numbers, and so on.

Final thoughts

Being able to navigate over the message structure at compile time and have access to both its compile-time and run-time properties has become a surprisingly powerful mechanism. I have already used it in sbepp to replace some generated code with a single generic implementation that works for all cases. When I initially wrote this project, I was sure that “normal” accessors were the key part and that meta tags just provided extra info that is useful in some relatively rare cases. But with this new mechanism it’s kinda vice versa. Meta tags and their traits actually contain all information about the schema. Instead of forwarding get/set_by_tag to normal accessors that contain hardcoded values and types, it’s possible to reverse it and let get/set_by_tag actually calculate them at compile time. In fact, it’s possible to generate an SBE implementation for any other language now without touching sbepp’s own schema compiler. It effectively becomes a project-specific reflection mechanism and it works even in C++11.