Data format support
Apache Kafka is payload agnostic: it treats each message as an array of bytes. Those bytes can hold data serialized as:
* AVRO
* PROTOBUF
* XML
* JSON
* JSON with schema
The table below lists the supported formats and the schema evolution each one allows. For AVRO and PROTOBUF, a schema registry is expected to be in place.
Data Format | Supported | Schema Evolution |
---|---|---|
AVRO | Yes | Only adding nullable columns |
PROTOBUF | Yes | Only adding nullable columns |
JSON | Partially | If each message contains all the fields and no field is nullable. |
JSON with schema | Yes | Only adding nullable columns |
XML | No | - |
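The "only adding nullable columns" rule can be sketched as a small check. This is an illustration, not the connector's implementation: the function name `is_supported_evolution` and the `(type, nullable)` field representation are hypothetical, and it assumes that existing fields must keep their type and nullability unchanged.

```python
def is_supported_evolution(old_fields, new_fields):
    """Return True if new_fields only adds nullable fields to old_fields.

    Both arguments map a field name to a (type, nullable) tuple.
    This representation is an assumption made for the example.
    """
    for name, spec in old_fields.items():
        # Assumption: existing fields cannot be removed or altered.
        if new_fields.get(name) != spec:
            return False
    for name, (_type, nullable) in new_fields.items():
        # Every newly added field has to be nullable.
        if name not in old_fields and not nullable:
            return False
    return True


v1 = {"firstName": ("string", False), "lastName": ("string", False)}
v2_nullable = {**v1, "age": ("int64", True)}    # adds a nullable field
v2_required = {**v1, "age": ("int64", False)}   # adds a required field
v2_removed = {"firstName": ("string", False)}   # drops lastName

print(is_supported_evolution(v1, v2_nullable))
print(is_supported_evolution(v1, v2_required))
print(is_supported_evolution(v1, v2_removed))
```

Only the first change passes the check; adding a required field or dropping an existing one is rejected.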
The JSON data format needs special attention. Although it is a simple and human-readable storage format, it is far from ideal for enforcing data quality and schema validation.
Each Kafka message is self-contained: the information it carries does not depend on previous messages. As a result, the connector can infer the schema only at the message level. However, a JSON document can contain a field set to null, such as "address": null, or the nullable field may be left out of the payload altogether, for example for performance reasons. In either case there is no way to infer the type correctly, and tracking fields across messages is not a bulletproof solution. In the two messages below, address is null both times, so its type stays unknown, and age appears only in the second message.
{ "firstName": "Alexandra", "lastName": "Jones", "address": null }
{ "firstName": "Alexandra", "lastName": "Jones", "address": null, "age": 32 }
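The following sketch shows why per-message inference falls short for the two messages above. It is not the connector's inference logic; `infer_types` is a hypothetical helper that maps each field to the Python type name of its value, or to `None` when the value is a JSON null.

```python
import json


def infer_types(raw):
    """Map each field of one JSON message to a type name, or None if unknown."""
    inferred = {}
    for name, value in json.loads(raw).items():
        # A JSON null carries no type information at all.
        inferred[name] = None if value is None else type(value).__name__
    return inferred


first = infer_types(
    '{ "firstName": "Alexandra", "lastName": "Jones", "address": null }'
)
second = infer_types(
    '{ "firstName": "Alexandra", "lastName": "Jones", "address": null, "age": 32 }'
)

print(first)
print(second)
```

The type of address cannot be determined from either message, and a consumer that has seen only the first message has no way of knowing that age exists.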
In some use cases, the JSON payload carries its own schema; as the table above shows, support is better in that case. Here is an example of a JSON message with an embedded schema that the Kafka Connect JsonConverter can interpret.
{ "schema": { "type": "struct", "fields": [ { "type": "int64", "optional": false, "field": "registertime" }, { "type": "string", "optional": false, "field": "userid" }, { "type": "string", "optional": false, "field": "regionid" }, { "type": "string", "optional": false, "field": "gender" } ], "optional": false, "name": "ksql.users" }, "payload": { "registertime": 1493819497170, "userid": "User_1", "regionid": "Region_5", "gender": "MALE" } }
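Because the schema travels with the payload, a consumer can check every field without guessing. JsonConverter expects this schema-plus-payload envelope when its `schemas.enable` setting is true. The sketch below is only an illustration of that idea, not how JsonConverter itself works: `validate_envelope` and the `PYTHON_TYPES` mapping are hypothetical and cover just a few Kafka Connect type names.

```python
import json

# Illustrative subset of Kafka Connect type names (an assumption for this example).
PYTHON_TYPES = {"int32": int, "int64": int, "string": str, "boolean": bool}


def validate_envelope(raw):
    """Check the payload against the embedded schema; return a list of problems."""
    envelope = json.loads(raw)
    schema, payload = envelope["schema"], envelope["payload"]
    problems = []
    for field in schema["fields"]:
        name = field["field"]
        value = payload.get(name)
        if value is None:
            # Null or absent values are acceptable only for optional fields.
            if not field.get("optional", False):
                problems.append(f"missing required field: {name}")
            continue
        expected = PYTHON_TYPES.get(field["type"])
        if expected is None:
            problems.append(f"unsupported type for {name}: {field['type']}")
        elif type(value) is not expected:
            problems.append(f"wrong type for {name}: expected {field['type']}")
    return problems


message = json.dumps({
    "schema": {
        "type": "struct",
        "fields": [
            {"type": "int64", "optional": False, "field": "registertime"},
            {"type": "string", "optional": False, "field": "userid"},
        ],
        "optional": False,
        "name": "ksql.users",
    },
    "payload": {"registertime": 1493819497170, "userid": "User_1"},
})

# Same schema, but userid is a number and registertime is missing.
broken = json.loads(message)
broken["payload"] = {"userid": 42}
broken_message = json.dumps(broken)

print(validate_envelope(message))
print(validate_envelope(broken_message))
```

The first call reports no problems, while the second flags the missing registertime and the mistyped userid.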