Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages.
The Hadoop ecosystem includes a new binary data serialization system — Avro.
Avro provides:
· Rich data structures.
· A compact, fast, binary data format.
· A container file, to store persistent data.
· Remote procedure call (RPC).
· Simple integration with dynamic languages. Code generation is not required
to read or write data files, nor to use or implement RPC protocols. Code
generation is an optional optimization, only worth implementing for statically
typed languages.
Its functionality is similar to that of
other marshaling systems, such as Thrift and Protocol Buffers.
The main differentiators of Avro
include the following:
Dynamic typing — The Avro implementation always keeps data and its
corresponding schema together. As a result, marshaling/unmarshaling operations
do not require either code generation or static data types. This also allows
generic data processing.
Untagged data — Because it keeps data and schema together, Avro
marshaling/unmarshaling does not
require type/size information or manually assigned IDs to be encoded in data.
As a result, Avro serialization produces a smaller output.
Enhanced versioning support — In the case of schema changes, Avro contains
both schemas (the one used to write the data and the one the reader expects),
which enables you to resolve differences symbolically based on the field names.
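To make the last two points concrete, here is a minimal sketch in plain Python. It does not use the Avro library. It hand-rolls the zigzag variable-length encoding that Avro uses for int and long values and the length-prefixed UTF-8 encoding it uses for strings, then writes a record as nothing more than its field values in schema order. The field lists, the record contents, and the helper names (`encode_record`, `decode_record`, `resolve`) are assumptions made up for this illustration, and `resolve` is a heavily simplified stand-in for real schema resolution.

```python
import io

# Toy writer schema: an ordered list of (field name, type). Only "long" and
# "string" are handled; real Avro schemas are JSON and support many more types.
WRITER_FIELDS = [("name", "string"), ("age", "long")]

# Toy reader schema: (field name, default). It drops "name" and adds "email".
READER_FIELDS = [("age", 0), ("email", "")]


def encode_long(n: int) -> bytes:
    """Zigzag-encode a signed integer, then emit it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf: io.BytesIO) -> int:
    """Read one varint from the buffer and undo the zigzag mapping."""
    z, shift = 0, 0
    while True:
        byte = buf.read(1)[0]
        z |= (byte & 0x7F) << shift
        if byte < 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


def encode_record(fields, record) -> bytes:
    """Concatenate field values in schema order: no names, tags, or IDs."""
    out = bytearray()
    for name, kind in fields:
        if kind == "long":
            out += encode_long(record[name])
        elif kind == "string":
            raw = record[name].encode("utf-8")
            out += encode_long(len(raw)) + raw
        else:
            raise ValueError(f"unsupported type in this sketch: {kind}")
    return bytes(out)


def decode_record(fields, data: bytes) -> dict:
    """Decoding only works with the writer's field list to drive it."""
    buf = io.BytesIO(data)
    record = {}
    for name, kind in fields:
        if kind == "long":
            record[name] = decode_long(buf)
        elif kind == "string":
            length = decode_long(buf)
            record[name] = buf.read(length).decode("utf-8")
        else:
            raise ValueError(f"unsupported type in this sketch: {kind}")
    return record


def resolve(reader_fields, written: dict) -> dict:
    """Match by field name: ignore unknown fields, fill missing ones from defaults."""
    return {name: written.get(name, default) for name, default in reader_fields}


payload = encode_record(WRITER_FIELDS, {"name": "Ada", "age": 36})
decoded = decode_record(WRITER_FIELDS, payload)
resolved = resolve(READER_FIELDS, decoded)
print(payload.hex(), decoded, resolved)
```

The payload here is five bytes long: one length byte, the three bytes of "Ada", and one byte for the age. No field names or IDs appear in it, which is why the writer's schema has to travel with the data.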
Because of its high performance,
small codebase, and compact output, Avro has been widely adopted,
not only in the Hadoop community but also by many NoSQL implementations
(including Cassandra).
At the heart of Avro is a data
serialization system. Avro can either use reflection to dynamically generate
schemas of the existing Java objects, or use an explicit Avro schema — a
JavaScript Object Notation (JSON) document describing the data format. Avro
schemas can contain both simple and complex types.
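As an illustration of the explicit form, the sketch below shows what such a JSON schema document might look like and reads it with Python's standard `json` module. The record name `User`, the namespace `example.avro`, and the field names are made up for this example; only the overall layout (`type`, `name`, `namespace`, `fields`) follows Avro's schema format.

```python
import json

# Hypothetical schema for illustration; "User" and "example.avro" are made up.
USER_SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name",   "type": "string"},
    {"name": "age",    "type": "int"},
    {"name": "active", "type": "boolean"}
  ]
}
"""

# An Avro schema is ordinary JSON, so any JSON parser can inspect it.
schema = json.loads(USER_SCHEMA_JSON)
full_name = schema["namespace"] + "." + schema["name"]
field_types = {field["name"]: field["type"] for field in schema["fields"]}
print(full_name, field_types)
```

Because the schema is plain JSON, tools in any language can read or generate it without Avro-specific code.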
Simple data types supported by Avro
include null, boolean, int, long, float, double, bytes, and string. Here, null
is a special type corresponding to no data; it is typically combined with
another type in a union to mark a value as optional.
Complex types supported by Avro
include the following:
Record — This is roughly equivalent to a C structure. A
record has a name and an optional namespace, documentation string, and aliases.
It contains a list of named fields, each of which can be of any Avro type.
Enum — This is an enumeration of values. Enum has a name,
an optional namespace, documentation string, and aliases, and contains a list
of symbols (given as JSON strings).
Array — This is a collection of items of the same type.
Map — This is a map of keys of type string and values of
the specified type.
Union — This represents a choice among several types for the value. A common
use for unions is to specify nullable values.
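The following sketch puts the complex types side by side in one schema and checks Python values against it. It is not the Avro library's validator. The `Song` schema, its field names, and the `matches` helper are assumptions invented for this example, and the check is deliberately simplified: it ignores integer ranges and does not handle references to named types.

```python
import json

# Hypothetical schema using each complex type once.
SCHEMA = json.loads("""
{
  "type": "record", "name": "Song",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "genre", "type": {"type": "enum", "name": "Genre",
                               "symbols": ["ROCK", "JAZZ"]}},
    {"name": "tags",  "type": {"type": "array", "items": "string"}},
    {"name": "plays", "type": {"type": "map", "values": "long"}},
    {"name": "album", "type": ["null", "string"]}
  ]
}
""")


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


PRIMITIVES = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    "int": _is_int,
    "long": _is_int,
    "float": lambda v: isinstance(v, float),
    "double": lambda v: isinstance(v, float),
    "bytes": lambda v: isinstance(v, bytes),
    "string": lambda v: isinstance(v, str),
}


def matches(schema, value) -> bool:
    """Return True if value fits the schema (simplified structural check)."""
    if isinstance(schema, str):          # primitive type name
        return PRIMITIVES[schema](value)
    if isinstance(schema, list):         # union: any one branch may match
        return any(matches(branch, value) for branch in schema)
    kind = schema["type"]
    if kind in PRIMITIVES:               # primitive written in object form
        return PRIMITIVES[kind](value)
    if kind == "record":
        return isinstance(value, dict) and all(
            field["name"] in value and matches(field["type"], value[field["name"]])
            for field in schema["fields"]
        )
    if kind == "enum":
        return value in schema["symbols"]
    if kind == "array":
        return isinstance(value, list) and all(
            matches(schema["items"], item) for item in value
        )
    if kind == "map":                    # keys are always strings
        return isinstance(value, dict) and all(
            isinstance(key, str) and matches(schema["values"], item)
            for key, item in value.items()
        )
    raise ValueError(f"unsupported type in this sketch: {kind}")


song = {"title": "So What", "genre": "JAZZ", "tags": ["modal"],
        "plays": {"2024": 12}, "album": None}
print(matches(SCHEMA, song))
print(matches(SCHEMA, {**song, "genre": "POP"}))
```

The `album` field shows the nullable pattern mentioned above: a union of null and string accepts either `None` or a string, while a plain string field would reject `None`.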