Apache Avro, Java, File Format, Data Serialization

Apache Avro Demystified

Data serialization with Apache Avro

Kapil Sreedharan

--

What is Apache Avro?

Avro is an open-source, language-agnostic data serialization framework. The schema of an Avro file is specified in JSON format, making it easy to read and interpret. Files that store Avro data also embed the schema for that data in the same file.

Avro includes APIs for C, C++, C#, Java, JS, Perl, PHP, Python, and Ruby. Being language agnostic, files stored using Avro can be passed between programs written in different languages.

You can find the source code for this tutorial here: https://github.com/ksree/apache-avro-demistified

Avro File Format

An Avro file consists of a file header, followed by one or more file data blocks.

The data is written according to a schema that is stored within the file. These data objects are stored in blocks that may be compressed.

A file header consists of:

  • Four bytes: ASCII ‘O’, ‘b’, ‘j’, followed by the version byte 1
  • A map containing file metadata, including the avro.schema property, which stores the schema as JSON data
  • A 16-byte, randomly-generated sync marker for this file

File data block consists of:

  • A long indicating the number of objects in this block
  • A long indicating the size in bytes of the serialized objects in the current block, after any codec is applied
  • The serialized objects, compressed by the codec if one is specified
  • The file’s 16-byte sync marker

Defining Avro Schemas

Avro schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed).

To describe an Avro schema, you create a JSON record which identifies your data definition, like this:
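The original schema file isn't reproduced here; a minimal employee.avsc consistent with the discussion below might look like this (the exact field list and the nested Department record are an illustrative reconstruction, not the repository's original file):

```json
{
  "namespace": "com.ksr.avro",
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
    {"name": "alias", "type": ["null", "string"], "default": null},
    {"name": "department", "type": {
      "type": "record",
      "name": "Department",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "code", "type": "int"}
      ]
    }}
  ]
}
```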

This schema defines a record representing an employee.

A record definition must include its:

  • type (“type”: “record”): identifies the type of the schema
  • fields: in this case first, last, alias, and so on. The type can be either a primitive or complex type.
  • namespace (“namespace”: “com.ksr.avro”): which describes the namespace where the given Schema belongs to
  • name (“name”: “Employee”): the name of the schema, which together with the namespace attribute defines the “full name” of the schema (com.ksr.avro.Employee in this case).

Serializing and deserializing: With Code Generation

Now that we have a schema defined, let's use it to generate classes. Once we have the relevant classes defined, there is no need to use the schema directly.

Add the following maven dependency to your POM:
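Something along these lines (the version shown is illustrative; use the current Avro release):

```xml
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.9.2</version>
</dependency>
```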

Also, the Avro Maven plugin, to perform code generation:
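A typical configuration, following the standard avro-maven-plugin setup (the version and paths are assumptions matching the project layout described below):

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.9.2</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```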

Compiling the schema and code generation:

Place the Avro schema file employee.avsc under /src/main/avro/

The maven plugin automatically performs code generation on all .avsc files present under the configured source directory.

In our case, it will generate Employee and Department classes under /src/main/java/.
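Because the plugin is bound to the generate-sources phase, any regular build triggers it; you can also invoke it directly:

```shell
mvn generate-sources
```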

Serializing and deserializing employee objects:

Now that we have autogenerated Employee and Department classes, let's create some employees, serialize them to a data file on disk, and then read back the file and deserialize the Employee objects.

First, let's create some Employees:
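A sketch along these lines, assuming the generated classes expose builders and setters matching the schema's field names (the fields shown are illustrative):

```java
import com.ksr.avro.Department;
import com.ksr.avro.Employee;

// A department shared by the employees below
Department eng = Department.newBuilder()
        .setName("Engineering")
        .setCode(10)
        .build();

// Construct via the all-args constructor (argument order follows the schema) ...
Employee e1 = new Employee("John", "Doe", "jdoe", eng);

// ... or via a builder, which applies schema defaults and validates as fields are set
Employee e2 = Employee.newBuilder()
        .setFirst("Jane")
        .setLast("Roe")
        .setAlias(null)          // optional field, defaults to null in the schema
        .setDepartment(eng)
        .build();

Employee e3 = Employee.newBuilder()
        .setFirst("Ada")
        .setLast("Lovelace")
        .setAlias("ada")
        .setDepartment(eng)
        .build();
```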

As shown above, Avro objects can be created by invoking a constructor or by using a builder. Unlike constructors, builders set any default values specified in the schema as well as validate the data as it is set.

Serializing with SpecificDatumWriter:

SpecificDatumWriter is used with generated classes. It extracts the schema from the specified type.

Let's serialize the 3 employee objects created above to disk:
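A minimal sketch using Avro's standard file-writing API:

```java
import java.io.File;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

DatumWriter<Employee> datumWriter = new SpecificDatumWriter<>(Employee.class);
try (DataFileWriter<Employee> dataFileWriter = new DataFileWriter<>(datumWriter)) {
    // create() writes the file header, including the schema and the sync marker
    dataFileWriter.create(e1.getSchema(), new File("employee.avro"));
    dataFileWriter.append(e1);
    dataFileWriter.append(e2);
    dataFileWriter.append(e3);
}
```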

Here the DatumWriter converts our Java objects into an in-memory serialized format, with SpecificDatumWriter deriving the schema from the generated Employee class.

Finally, we use the DataFileWriter to write the serialized records, along with the schema, to a file named employee.avro.

Deserializing with SpecificDatumReader:

Now let's deserialize the Avro data file (employee.avro) we just created.
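A sketch of the read path, using DataFileReader with the generated Employee class:

```java
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;

DatumReader<Employee> datumReader = new SpecificDatumReader<>(Employee.class);
try (DataFileReader<Employee> dataFileReader =
         new DataFileReader<>(new File("employee.avro"), datumReader)) {
    Employee employee = null;
    while (dataFileReader.hasNext()) {
        // Reuse the employee object to avoid allocating one per record
        employee = dataFileReader.next(employee);
        System.out.println(employee);
    }
}
```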

SpecificDatumReader converts in-memory serialized items into instances of our generated classes, in this case Employee and Department.


We pass the DatumReader and the Avro data file to the DataFileReader. DataFileReader reads the data using both the writer's schema included in the file and the schema provided by the reader, in this case the one from the Employee class. The writer's schema is required to know the order in which the fields were written; the reader's schema is required to know the expected fields and the default values for fields added since the file was created (schema evolution). Any mismatch between the reader's and writer's schemas is resolved using the rules described at https://avro.apache.org/docs/current/spec.html#Schema+Resolution

And finally, we iterate through dataFileReader and print out the deserialized object.

Serializing and deserializing: Without Code Generation

Avro data is always stored with its corresponding schema. Schema parser libraries allow us to perform serialization and deserialization without code generation.

Serializing and deserializing employee objects:

First, read the employee schema using Schema Parser:

Schema schema = new Schema.Parser()
.parse(new File("employee.avsc"));

Create GenericRecords according to the employee schema to represent the employees.
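Assuming the field names used earlier (first, last, alias, department; an illustrative layout, not necessarily the repository's exact schema), creating the records might look like:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// The department sub-record's schema is nested inside the employee schema
Schema departmentSchema = schema.getField("department").schema();

GenericRecord eng = new GenericData.Record(departmentSchema);
eng.put("name", "Engineering");
eng.put("code", 10);

GenericRecord employee1 = new GenericData.Record(schema);
employee1.put("first", "John");
employee1.put("last", "Doe");
employee1.put("alias", "jdoe");
employee1.put("department", eng);

GenericRecord employee2 = new GenericData.Record(schema);
employee2.put("first", "Jane");
employee2.put("last", "Roe");
employee2.put("department", eng);
```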

Serializing using GenericDatumWriter

To serialize and write employee data to disk, we use the GenericDatumWriter.

DatumWriter converts Java objects into an in-memory serialized format. Here we use GenericDatumWriter, which requires the schema to determine how to write the GenericRecords and to perform simple field-level validation.
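A sketch, reusing the parsed schema and the GenericRecords created above:

```java
import java.io.File;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
    // As in the code-generation version, the schema is written into the file header
    dataFileWriter.create(schema, new File("employee.avro"));
    dataFileWriter.append(employee1);
    dataFileWriter.append(employee2);
}
```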

Deserializing using GenericDatumReader

Now let’s deserialize the Avro data file (employee.avro) we just created.

GenericDatumReader converts in-memory serialized items into GenericRecords.

As before, we pass the DatumReader and the Avro data file to the DataFileReader, which resolves the writer's schema stored in the file against the reader's schema, in this case the one we parsed from employee.avsc.
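A minimal sketch of the generic read path:

```java
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
try (DataFileReader<GenericRecord> dataFileReader =
         new DataFileReader<>(new File("employee.avro"), datumReader)) {
    GenericRecord record = null;
    while (dataFileReader.hasNext()) {
        // Reuse the record object to avoid allocating one per entry
        record = dataFileReader.next(record);
        System.out.println(record);
    }
}
```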

And finally, we iterate through dataFileReader and print out the deserialized object.

Compiling and running the project:

You can find the project repository here: https://github.com/ksree/apache-avro-demistified
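To build it locally (assuming Maven is installed):

```shell
git clone https://github.com/ksree/apache-avro-demistified.git
cd apache-avro-demistified
mvn clean package
```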


To Conclude

We have seen how to use Avro as a data serialization system.

The Avro format is compact and fast, with an embedded JSON schema and bindings for a variety of programming languages, which makes it a very popular data format in the Hadoop ecosystem as well as with Apache Kafka.
