Beta Protoc Compiler is a command-line interface (CLI) tool that generates serialization code and data structures from JSON message definitions. It is built for serial communication and is intended to be used alongside the beta_com library, which handles framing.
- Optimized Binary Protocol: Employs Varint and ZigZag encoding for efficient, language-agnostic serialization.
- Versatile Array Support: Natively handles static and dynamic arrays with optimized serialization for primitive types.
- JSON Schema Definition: Uses a clear, human-readable JSON format for defining message structures.
- Robust Compile-Time Validation: Rigorously validates schemas before code generation to prevent runtime errors.
- Automatic Dependency Resolution: Automatically manages dependencies between nested messages.
- Modular & Extensible: A template-based architecture (Jinja2) allows for easy addition of new target languages.
- Optional C Dispatcher: Generates a dispatcher in C for simplified message routing and handling.
This project is packaged using pyproject.toml. You can install it locally using pip:
# From the project root
pip install .
For development or to modify the source code, install it in editable mode:
pip install -e .
Once installed, the beta_protoc_compiler command is available in your terminal.
beta_protoc_compiler path/to/your_schema.json
| Argument | Description | Default |
|---|---|---|
| filepath | The path to the input JSON file containing message definitions. | (Required) |
| -l, --lang | The output language(s) for the generated files. Can be one or more. | All supported languages |
| -o, --out | The directory where the generated files will be saved. | ./generated |
| --clean | Deletes the output directory before generating new files. | False |
Example:
beta_protoc_compiler my_protocol.json -l c -o ./src/protocol --clean
The input file must follow a specific JSON structure defining a list of messages.
{
  "messages": [
    {
      "name": "MessageName",
      "id": 1,
      "fields": [
        {
          "name": "field_name",
          "id": 1,
          "type": "FieldType"
        }
      ]
    }
  ]
}
You can use the following primitive types or the name of another message defined in your JSON file.
| Category | Types |
|---|---|
| Unsigned Integers | uint8, uint16, uint32, uint64 |
| Signed Integers | int8, int16, int32, int64 |
| Floating Point | float32, float64 |
| Other | char, bool |
Additionally, both static and dynamic arrays are supported for any data type (including nested messages).
- Static array: type[SIZE], for example int32[10] for an array of 10 integers.
- Dynamic array: type[], for example float32[].
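The type syntax above can be illustrated with a small parser. This is only a sketch of how a type string might be split into its parts; the function name `parse_field_type` and the returned tuple layout are hypothetical and are not taken from the compiler's source.

```python
import re

# Optional suffix: "[]" (dynamic) or "[SIZE]" (static) after a base type name.
_TYPE_RE = re.compile(r"([A-Za-z][A-Za-z0-9_]*)(?:\[(\d*)\])?")


def parse_field_type(type_str):
    """Split a field type string into (base_type, array_kind, size)."""
    match = _TYPE_RE.fullmatch(type_str)
    if match is None:
        raise ValueError(f"malformed type: {type_str!r}")
    base, size = match.group(1), match.group(2)
    if size is None:      # no brackets at all: a single value
        return base, "scalar", None
    if size == "":        # empty brackets: dynamic array
        return base, "dynamic", None
    return base, "static", int(size)


print(parse_field_type("int32[10]"))   # ('int32', 'static', 10)
print(parse_field_type("float32[]"))   # ('float32', 'dynamic', None)
print(parse_field_type("MyMessage"))   # ('MyMessage', 'scalar', None)
```

The base type returned here can be either a primitive or the name of another message, which is why the sketch does not check it against the primitive table.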
- Naming Conventions: Message and field names must start with a letter and can only contain letters, digits, or underscores (_). In addition, CamelCase is advised to get the right conversion for all languages.
- Unique IDs: Message IDs must be unique across all messages. Field IDs must be unique within a single message.
- Dependencies: If message A is used as a field type inside message B, message A must be defined within the messages list. The compiler will automatically generate the required dependencies (e.g., #include "A.h").
- Order: The order in which messages are defined in the JSON file does not matter; the compiler resolves dependencies automatically.
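To make these rules concrete, here is a rough Python sketch of the checks they imply. It is not the compiler's actual validator: `validate_schema`, the error strings, and the example `Track` and `Position` messages are all hypothetical. Message names are collected before any field is inspected, so definition order has no effect.

```python
import re

PRIMITIVES = {
    "uint8", "uint16", "uint32", "uint64",
    "int8", "int16", "int32", "int64",
    "float32", "float64", "char", "bool",
}
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def validate_schema(schema):
    """Return a list of rule violations (empty when the schema is valid)."""
    errors = []
    messages = schema.get("messages", [])
    known = {msg["name"] for msg in messages}  # collected first: order-independent
    seen_msg_ids = set()
    for msg in messages:
        if not NAME_RE.fullmatch(msg["name"]):
            errors.append(f"invalid message name: {msg['name']}")
        if msg["id"] in seen_msg_ids:
            errors.append(f"duplicate message id: {msg['id']}")
        seen_msg_ids.add(msg["id"])
        seen_field_ids = set()  # field IDs only need to be unique per message
        for field in msg["fields"]:
            where = f"{msg['name']}.{field['name']}"
            if not NAME_RE.fullmatch(field["name"]):
                errors.append(f"{where}: invalid field name")
            if field["id"] in seen_field_ids:
                errors.append(f"{where}: duplicate field id {field['id']}")
            seen_field_ids.add(field["id"])
            base = field["type"].split("[", 1)[0]  # drop any array suffix
            if base not in PRIMITIVES and base not in known:
                errors.append(f"{where}: unknown type {base}")
    return errors


schema = {
    "messages": [
        # Track uses Position before Position is defined.
        {"name": "Track", "id": 2, "fields": [
            {"name": "points", "id": 1, "type": "Position[]"},
        ]},
        {"name": "Position", "id": 1, "fields": [
            {"name": "x", "id": 1, "type": "float32"},
            {"name": "y", "id": 2, "type": "float32"},
        ]},
    ]
}
print(validate_schema(schema))  # prints []
```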
The compiler generates code that adheres to a simple, efficient, and language-agnostic binary protocol. The following sections describe the structure and encoding rules of this protocol.
The serializer converts a data structure into a binary message. It's the user's responsibility to handle message framing (e.g., adding start/end bytes) if the communication channel requires it.
All multi-byte integers are encoded in little-endian format.
Message Format:
[PROTOCOL_VERSION, MESSAGE_ID, MESSAGE_LEN, PAYLOAD...]
Overall Message Structure:
| Field | Description | Size |
|---|---|---|
| PROTOCOL_VERSION | The version of the serialization protocol. | 1 byte |
| MESSAGE_ID | The unique identifier for the message (from JSON). | 2 bytes |
| MESSAGE_LEN | The length of the PAYLOAD in bytes. | Varint (1-10 bytes) |
| PAYLOAD | The serialized data fields. | MESSAGE_LEN bytes |
Payload Field Structure:
The PAYLOAD consists of a sequence of fields, where each field is encoded as follows:
[FIELD_ID, FIELD_LEN, FIELD_VALUE...]
| Field | Description | Size |
|---|---|---|
| FIELD_ID | The unique identifier for the field (from JSON). | Varint (1-10 bytes) |
| FIELD_LEN | The length of the FIELD_VALUE in bytes. | Varint (1-10 bytes) |
| FIELD_VALUE | The binary value of the field. | FIELD_LEN bytes |
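As a minimal sketch of the two layouts above, the following Python builds a message by hand. It assumes that "varint" means the usual base-128 encoding (7 data bits per byte, least significant group first, high bit set on every byte except the last). The PROTOCOL_VERSION value of 1, the message ID, and the field IDs are hypothetical placeholders chosen for illustration.

```python
import struct

PROTOCOL_VERSION = 1  # hypothetical value; the real version byte is not stated here


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_field(field_id, value_bytes):
    """Build [FIELD_ID, FIELD_LEN, FIELD_VALUE...]."""
    return encode_varint(field_id) + encode_varint(len(value_bytes)) + value_bytes


def encode_message(message_id, payload):
    """Build [PROTOCOL_VERSION, MESSAGE_ID, MESSAGE_LEN, PAYLOAD...]."""
    header = struct.pack("<BH", PROTOCOL_VERSION, message_id)  # 1 byte + 2 bytes LE
    return header + encode_varint(len(payload)) + payload


# A hypothetical "Position" message (id 1) with float32 fields x (id 1) and y (id 2).
payload = encode_field(1, struct.pack("<f", 10.5)) + encode_field(2, struct.pack("<f", -2.5))
message = encode_message(1, payload)
print(message.hex(" "))  # 01 01 00 0c 01 04 00 00 28 41 02 04 00 00 20 c0
```

Reading the output left to right: version 0x01, message ID 0x0001 in little-endian, payload length 12, then two 6-byte fields (ID, length 4, four float32 bytes).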
The way data types are serialized into FIELD_VALUE depends on their type:
- Integers (8-bit and 16-bit): int8, uint8, int16, uint16 are written directly in little-endian format. They are not converted to varints, as the overhead would negate any potential space savings. For 16-bit integers, the average size is roughly equivalent, and using varints would add computational cost for minimal gain.
- Integers (32-bit and 64-bit):
  - uint32, uint64: Encoded as standard varints. This is efficient for small, non-negative numbers.
  - int32, int64: Encoded using ZigZag encoding first, and then the result is encoded as a varint. ZigZag re-maps signed integers to unsigned integers so that small negative numbers (like -1) are encoded as small varints, which is highly efficient.
- Floating-Point Numbers: float32, float64 are written directly in IEEE 754 binary format (little-endian).
- Other Types:
  - bool: Encoded as a single byte (0x00 for false, 0x01 for true).
  - char: Encoded as a single byte.
- Nested Messages: The field value is the serialized sub-message itself (following the same [PROTOCOL_VERSION, MESSAGE_ID, ...] structure).
- Arrays: The encoding of arrays depends on the type of their elements.
  - Arrays of Nested Messages: For arrays of complex types (other messages), each element is serialized as a separate [FIELD_ID, FIELD_LEN, FIELD_VALUE] block. This allows for lists of different-sized objects.
  - Arrays of Primitive Types (Optimization): For arrays of primitive types (e.g., int32, float32), a significant optimization is applied. The entire array is treated as a single field. The FIELD_ID is written once, followed by a FIELD_LEN that represents the total byte size of all elements combined. The FIELD_VALUE then consists of the raw, concatenated values of the array elements. This reduces overhead by removing the need for repeated ID and length tags for each element. For example, an array of 10 uint32 integers will be encoded as one field, not ten.
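The per-type rules above can be sketched in Python as follows. This is an illustration, not the generated code: it assumes the standard base-128 varint and the usual ZigZag mapping (0 to 0, -1 to 1, 1 to 2, -2 to 3, and so on), and the helper names are hypothetical. The packed-array helper is only exercised with float32 elements, because the text does not spell out how elements of varint-encoded types are laid out inside a packed array.

```python
import struct

# Types written directly in little-endian, keyed to struct format strings.
FIXED_FORMATS = {
    "int8": "<b", "uint8": "<B", "int16": "<h", "uint16": "<H",
    "float32": "<f", "float64": "<d", "bool": "<?",
}


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def zigzag64(n):
    """Map a signed 64-bit integer onto an unsigned one."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def encode_value(type_name, value):
    """Encode one scalar as its FIELD_VALUE bytes."""
    if type_name in FIXED_FORMATS:
        return struct.pack(FIXED_FORMATS[type_name], value)
    if type_name in ("uint32", "uint64"):
        return encode_varint(value)
    if type_name in ("int32", "int64"):
        return encode_varint(zigzag64(value))
    raise ValueError(f"unsupported type: {type_name}")


def encode_packed_array(field_id, type_name, values):
    """One FIELD_ID and one FIELD_LEN, followed by all elements back to back."""
    body = b"".join(encode_value(type_name, v) for v in values)
    return encode_varint(field_id) + encode_varint(len(body)) + body


print(encode_value("int32", -1).hex())    # 01   (ZigZag turns -1 into 1)
print(encode_value("uint32", 300).hex())  # ac02
print(encode_value("int16", -2).hex())    # feff
print(len(encode_packed_array(3, "float32", [1.0, 2.0, 3.0])))  # 14
```

The last line shows the saving from packing: three float32 values cost 14 bytes as one field, whereas three separate [FIELD_ID, FIELD_LEN, FIELD_VALUE] blocks would cost 18.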
To create a null-terminated string, you can use an array of char (e.g., char[64]). The deserializer will automatically add a null terminator \0 at the end of the data. Furthermore, during serialization, if a \0 character is found before the end of the array's specified size, the serialization will stop at that point, saving space in the final message.
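The string behaviour just described can be modelled in a few lines of Python. This is a sketch of the described behaviour only; the function names are hypothetical, and how the added terminator interacts with a completely full array is not specified in the text, so the model does not address it.

```python
def serialize_char_array(chars, size):
    """Return the bytes written for a char[size] field: stop at the first NUL."""
    data = chars[:size]
    nul = data.find(b"\x00")
    return data if nul == -1 else data[:nul]


def deserialize_char_array(value):
    """Return the received characters with a NUL terminator appended."""
    return value + b"\x00"


buffer = b"hello" + bytes(59)            # a char[64] holding "hello" padded with NULs
wire = serialize_char_array(buffer, 64)
print(len(buffer), len(wire))            # 64 5
print(deserialize_char_array(wire))      # b'hello\x00'
```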
- Message ID: The MESSAGE_ID is a 2-byte integer, allowing for up to 65,536 unique messages.
- Field ID: The FIELD_ID is encoded as a variable-length integer (varint), allowing for up to 2^64 unique fields per message (depending on the architecture).
- Message and Field Size: MESSAGE_LEN and FIELD_LEN are also encoded as varints, giving payloads and fields a theoretical maximum length of 2^64 bytes (depending on the architecture).
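The "1-10 bytes" range quoted for varints follows from these limits. A short sketch, assuming the standard base-128 varint in which each byte carries 7 bits of the value (`varint_size` is a hypothetical helper name):

```python
def varint_size(value):
    """Number of bytes a base-128 varint needs for a non-negative integer."""
    size = 1
    while value >= 0x80:  # more than 7 bits left: another byte is needed
        value >>= 7
        size += 1
    return size


for value in (0, 127, 128, 16383, 16384, 2**32 - 1, 2**64 - 1):
    print(value, varint_size(value))
# 0 and 127 fit in 1 byte, 128 and 16383 need 2, 16384 needs 3,
# 2**32 - 1 needs 5, and 2**64 - 1 needs the full 10.
```

In practice, field IDs and lengths below 128 cost a single byte each, so a small field carries only two bytes of overhead.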
The deserialization process performs the reverse operation of serialization:
- It reads the message header (PROTOCOL_VERSION, MESSAGE_ID, MESSAGE_LEN) to validate and identify the message.
- It iterates through the PAYLOAD, reading each field's header (FIELD_ID, FIELD_LEN).
- It extracts the FIELD_VALUE and populates the corresponding member of the data structure.
- Multi-byte values are converted from little-endian back to the host's native byte order.
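The steps above can be sketched as a small Python parser. It is a model of the described wire format rather than the generated C code: the function names are hypothetical, varints are assumed to be standard base-128, and the sample bytes represent a hypothetical message with ID 1 and two float32 fields. Values are collected in a list per field ID because arrays of nested messages emit one block per element.

```python
import struct


def decode_varint(buf, pos):
    """Decode a base-128 varint at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7
        if shift >= 70:
            raise ValueError("varint longer than 10 bytes")


def parse_message(buf):
    """Return (version, message_id, {field_id: [raw FIELD_VALUE bytes, ...]})."""
    if len(buf) < 3:
        raise ValueError("truncated header")
    version, message_id = struct.unpack_from("<BH", buf, 0)
    payload_len, pos = decode_varint(buf, 3)
    end = pos + payload_len
    if end > len(buf):
        raise ValueError("truncated payload")
    fields = {}
    while pos < end:
        field_id, pos = decode_varint(buf, pos)
        field_len, pos = decode_varint(buf, pos)
        if pos + field_len > end:
            raise ValueError("field overruns payload")
        fields.setdefault(field_id, []).append(bytes(buf[pos:pos + field_len]))
        pos += field_len
    return version, message_id, fields


raw = bytes.fromhex("0101000c0104000028410204000020c0")
version, message_id, fields = parse_message(raw)
x = struct.unpack("<f", fields[1][0])[0]  # "<" converts from little-endian
y = struct.unpack("<f", fields[2][0])[0]
print(version, message_id, x, y)  # 1 1 10.5 -2.5
```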
Currently, the compiler supports:
- C (.c, .h): Generates structs and dependent message headers.
This section describes how to use the C code generated by the compiler.
The way arrays are handled in the generated C code depends on whether they are static or dynamic.
A field defined with a fixed size, like "type": "uint8[16]" in JSON, will be generated in the C struct as follows:
// In your message struct:
uint8_t my_field[16];
size_t my_field_count;
- my_field[16]: A standard C array with the specified size.
- my_field_count: A size_t variable indicating how many elements are actually in use. When serializing, the generated code only writes my_field_count elements. When deserializing, this field is populated with the number of elements read from the buffer.
A field defined as a dynamic array, like "type": "MyMessage[]", requires you to manage the memory. The generated C struct will contain:
// In your message struct:
MyMessage* my_field;
size_t my_field_count;
size_t my_field_max_count;
- my_field: A pointer to a block of memory that you must allocate.
- my_field_count: The number of elements currently stored in the allocated memory.
- my_field_max_count: The total capacity of the allocated memory block (i.e., the maximum number of elements it can hold).
Before serializing or deserializing, you are responsible for allocating the memory for the dynamic array and setting my_field_max_count to the capacity of your buffer. The generated code checks that my_field_count does not exceed my_field_max_count to prevent buffer overflows.
Once you have generated the code from your JSON schema, you can use it to create, serialize, and deserialize messages.
For each message (e.g., MyMessage), the following files are created:
- MyMessage.h: The header file defining the data structure and function prototypes.
- MyMessage.c: The implementation of the serialization and deserialization functions.
The generated C code has an external dependency that must be included in your project's build system (e.g., CMakeLists.txt) to compile correctly.
- beta_protoc common code: The core serialization helpers are located in the protoc_common_code/C/ directory of this repository. You must add beta_protoc.c and beta_protoc.h to your project.
To simplify message handling, the compiler also generates a dispatcher. It is a set of files (dispatcher.c and dispatcher.h) that can automatically read an incoming byte stream, identify a message, deserialize it, and call a user-defined callback function, while also passing a user-defined context.
Features:
- Automatic Message Identification: Reads the message header and determines the message type.
- Callback System with Context: For each message MyMessage, it calls a weak function on_MyMessage_received(MyMessage *msg, void *ctx) that you can implement in your application. The ctx parameter allows you to pass a custom context (e.g., a pointer to an object or state) to your callbacks.
- Stream-Safe: The dispatcher can be fed bytes one by one or in chunks, and it will find messages in the stream.
To use it, include dispatcher.h in your project and implement the on_<MessageName>_received functions for the messages you want to handle. Then, feed your incoming data stream and your context to the protoc_dispatch function.
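To illustrate the control flow, here is a rough Python model of a stream dispatcher. It is not the generated C dispatcher and does not reproduce its API: the `StreamDispatcher` class, its internal buffering, and the handler table are assumptions made for the sketch. The protocol version check is omitted, and unknown message IDs are silently skipped, whereas the C dispatcher reports them through its error codes.

```python
import struct


class StreamDispatcher:
    """Buffers incoming bytes and calls a handler for each complete message."""

    def __init__(self, handlers, ctx):
        self.handlers = handlers  # message_id -> callback(payload, ctx)
        self.ctx = ctx
        self.buffer = bytearray()

    def feed(self, chunk):
        """Add bytes to the stream; return how many messages were dispatched."""
        self.buffer.extend(chunk)
        dispatched = 0
        while True:
            frame = self._try_extract()
            if frame is None:
                return dispatched
            message_id, payload = frame
            handler = self.handlers.get(message_id)
            if handler is not None:
                handler(payload, self.ctx)
                dispatched += 1

    def _try_extract(self):
        """Pop one complete message from the buffer, or return None."""
        buf = self.buffer
        if len(buf) < 4:  # version + 2-byte ID + at least one length byte
            return None
        message_id = struct.unpack_from("<H", buf, 1)[0]
        length, shift, pos = 0, 0, 3
        while True:  # decode the MESSAGE_LEN varint
            if pos >= len(buf):
                return None  # length not fully received yet
            byte = buf[pos]
            pos += 1
            length |= (byte & 0x7F) << shift
            if not byte & 0x80:
                break
            shift += 7
        if len(buf) < pos + length:
            return None  # payload not fully received yet
        payload = bytes(buf[pos:pos + length])
        del buf[:pos + length]
        return message_id, payload


def on_position(payload, ctx):
    ctx["messages_processed"] += 1


ctx = {"messages_processed": 0}
dispatcher = StreamDispatcher({1: on_position}, ctx)
raw = bytes.fromhex("0101000c0104000028410204000020c0")
for i in range(len(raw)):  # feed the stream one byte at a time
    dispatcher.feed(raw[i:i + 1])
print(ctx["messages_processed"])  # 1
```

The handler fires only once the final byte arrives, which is the property the "Stream-Safe" bullet describes.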
For each message, the following functions are generated to facilitate serialization and deserialization:
- struct <MessageName>: The C struct representing your message. You must first populate this struct with the data you want to send, or use it to receive deserialized data.
  // Example for a "Position" message
  Position my_pos;
  my_pos.x = 12.34;
  my_pos.y = -56.78;
- int get_<MessageName>_size(<MessageName> data, size_t *size): Calculates the total size in bytes that the message payload will occupy once serialized.
- int <MessageName>_to_buff(<MessageName> data, uint8_t **buff, size_t *buff_len): Serializes only the payload (the fields) of the struct into a provided buffer. This function is mainly used internally by <MessageName>_to_message.
- int <MessageName>_to_message(<MessageName> data, uint8_t **buff, size_t *buff_len): This is the main function to use for serialization. It takes the populated struct and serializes it into a binary message format (header + payload).
- int <MessageName>_from_buff(<MessageName> *data, uint8_t **buff, size_t *rem_buff): Deserializes the payload from a buffer and populates the provided struct. This function is mainly used internally by <MessageName>_from_message.
- int <MessageName>_from_message(<MessageName> *data, uint8_t **buff, size_t *rem_buff): This is the main function to use for deserialization. It takes a buffer containing a binary message, validates the header, and deserializes the payload into the provided struct.
All serialization and deserialization functions return an integer value of type beta_protoc_err_t to indicate the outcome of the operation. A return value of 0 (BETA_PROTOC_SUCCESS) means the operation was successful. Any negative value indicates an error.
Here is a list of possible error codes:
| Code | Name | Description |
|---|---|---|
| 0 | BETA_PROTOC_SUCCESS | The operation completed successfully. |
| -1 | BETA_PROTOC_ERR_INVALID_ARGS | One or more arguments (e.g., a null pointer) passed to the function were invalid. |
| -2 | BETA_PROTOC_ERR_BUFFER_TOO_SMALL | The provided buffer was not large enough to complete the serialization or deserialization. |
| -3 | BETA_PROTOC_ERR_INVALID_ID | The message ID in the buffer does not match the expected ID for the message type. |
| -4 | BETA_PROTOC_ERR_INVALID_PROTOC_VERSION | The protocol version in the buffer does not match the version supported by the generated code. |
| -5 | BETA_PROTOC_VALUE_EXCEEDS_ARCH_LIMIT | A value (e.g., a varint) is too large to be represented on the target architecture. |
| -6 | BETA_PROTOC_ERR_INVALID_DATA | The data in the buffer is corrupted or does not follow the expected format. |
| -7 | BETA_PROTOC_ERR_ARRAY_SIZE_EXCEEDED | An attempt was made to write more elements into a fixed-size array than its capacity allows. |
| -8 | BETA_PROTOC_ERR_NULL_ARRAY_POINTER | A pointer to a dynamic array was null when it was expected to be allocated. |
The dispatcher also has its own set of error codes, of type dispatcher_err_t:
| Code | Name | Description |
|---|---|---|
| 0 | DISPATCHER_SUCCESS | The message was successfully dispatched. |
| -100 | DISPATCHER_ERR_INVALID_DATA | The input buffer contains invalid or corrupted data. |
| -101 | DISPATCHER_ERR_INVALID_PROTOC_VERSION | The protocol version of the message is not supported by the dispatcher. |
| -102 | DISPATCHER_ERR_UNKNOWN_MESSAGE_ID | The message ID is not recognized by the dispatcher. |
Here is an example demonstrating serialization and deserialization using the dispatcher.
#include "Position.h"
#include "dispatcher.h"
#include <stdio.h>
// --- User-defined context structure ---
typedef struct {
    int messages_processed;
} AppContext;
// --- User-defined callback ---
// This function is called by the dispatcher when a Position message is received.
void on_position_received(Position *msg, void *ctx) {
    AppContext *app_ctx = (AppContext *)ctx;
    app_ctx->messages_processed++;
    printf("Callback triggered! (Message count: %d)\n", app_ctx->messages_processed);
    printf("Received Position: x=%.2f, y=%.2f\n", msg->x, msg->y);
}
int main(void) {
    // --- Initialization ---
    AppContext my_context = {0};
    // --- Serialization ---
    // 1. Populate the message struct
    Position my_pos;
    my_pos.x = 10.5f;
    my_pos.y = -2.3f;
    // 2. Prepare a buffer to receive the encoded message
    uint8_t output_buffer[256];
    uint8_t *p_buffer = output_buffer;
    size_t buffer_len = sizeof(output_buffer);
    // 3. Serialize the message
    int result = position_to_message(my_pos, &p_buffer, &buffer_len);
    if (result == 0) {
        size_t message_size = sizeof(output_buffer) - buffer_len;
        printf("Message encoded successfully (%zu bytes)!\n", message_size);
        // The 'output_buffer' now contains the binary message.
        // In a real application, you would receive this data from a serial port, socket, etc.
        // --- Deserialization with Dispatcher ---
        // 4. Feed the buffer and context to the dispatcher
        uint8_t *p_read_buffer = output_buffer;
        size_t read_buffer_len = message_size;
        while (read_buffer_len > 0) {
            size_t len_before = read_buffer_len;
            int dispatch_result = protoc_dispatch(&p_read_buffer, &read_buffer_len, &my_context);
            if (dispatch_result == DISPATCHER_SUCCESS) {
                printf("Dispatcher found and processed a message.\n");
            } else if (read_buffer_len == len_before) {
                // Error with no bytes consumed: stop rather than loop forever.
                fprintf(stderr, "Dispatch error: %d\n", dispatch_result);
                break;
            }
            // The dispatcher advances the buffer pointer automatically.
        }
    } else {
        fprintf(stderr, "Error encoding message: %d\n", result);
    }
    return 0;
}
The architecture is modular. To add support for a new language (e.g., Python, C++), follow these two steps:
Open compiler/core/language.py and add an entry to the SUPPORTED_LANGUAGES list. You must provide:
- The mapping between internal types (DataType) and the target language types.
- The naming convention (case) to use for generating names (e.g., function names like read_<message_name>).
# Example for Python
from compiler.core.language import Language, Case
from compiler.common.data_types import DataType
Language(
    name="Python",
    case=Case.SNAKE,  # Use snake_case for names (e.g., my_function)
    src_ext="py",
    header_ext=None,  # Python does not use header files
    types_mapping={
        DataType.UINT8: "",
        # ... define mappings for all DataType enum members
    }
)
Create a directory with the exact name of the language in compiler/templates/ (e.g., compiler/templates/Python/).
Add the Jinja2 templates corresponding to the extensions defined in the Language object:
- message.py.j2 (if src_ext="py")