Apache Arrow Rust: arrow-avro

Major contributor to arrow-avro (Arrow and Avro conversion) and author of the public launch announcement.

  • rust
  • apache-arrow
  • avro
  • open-source
Highlights

  • Implemented Arrow-to-Avro conversion paths that preserve schema fidelity for nested data.
  • Improved developer ergonomics through clearer APIs and worked examples for integration workflows.
  • Authored launch messaging and technical documentation for the public release.

Screenshots

Arrow columns mapped into Avro records in a Rust data pipeline.
Schema conversion and serialization flow.

What it is

arrow-avro (imported as arrow_avro) is Apache Arrow Rust’s official Arrow-native Avro bridge: it converts between Apache Avro and Apache Arrow by encoding and decoding column by column, moving data as Arrow RecordBatches (batches in, batches out) rather than materializing per-row Avro values and rebuilding columns afterward.

It’s designed to cover both files and streaming / schema-registry pipelines:

  • OCF (Object Container Files) for file-based I/O (with optional block compression)
  • SOE (Single‑Object Encoding) plus Confluent and Apicurio Schema Registry wire formats for message streams

The API is intentionally Arrow-first and minimal: tunable batch sizing, projection, schema resolution/evolution (reader vs. writer schemas), and optional StringViewArray support for faster string handling, so downstream compute stays vectorized end to end. See the official docs and the launch announcement.

What I contributed

  • Played a major role in taking arrow-avro to a production-ready release within the Arrow Rust ecosystem, focusing on correctness, performance, and an ergonomic RecordBatch-first API.
  • Helped drive feature completeness across real-world Avro pipelines: OCF ingestion/egress, streaming decoders/encoders for SOE and schema-registry framing (Confluent + Apicurio), and schema evolution via reader/writer schema resolution.
  • Authored the public launch announcement and wrote the bulk of the official crate documentation (runnable quickstarts, streaming examples, and “which API should I use?” guidance).

Outcome / impact

  • Made Avro and Arrow interchange significantly faster and more “Arrow-native” by aligning conversion with Arrow’s vectorized execution model (decode directly into Arrow builders; avoid per-row overhead).
  • Reduced integration friction for teams that use Avro on disk and on the wire, enabling a single, upstream crate for OCF files and Kafka-style schema-registry messages.
  • In the public launch benchmarks, the Arrow-first approach delivered order-of-magnitude improvements over a row-centric pipeline (up to ~33× faster reads with projection pushdown, and up to ~18× faster writes in the benchmarked cases).

Tech (high-level)

Rust · Apache Arrow (arrow-rs) · Apache Avro · Confluent & Apicurio Schema Registry · Schema resolution · Projection pushdown