Apache Arrow Rust: arrow-avro
Major contributor to arrow-avro (Arrow and Avro conversion) and author of the public launch announcement.
- rust
- apache-arrow
- avro
- open-source
Highlights
- Implemented Arrow-to-Avro conversion paths that preserve schema fidelity for nested data.
- Improved developer ergonomics with clearer APIs and examples for integration workflows.
- Authored launch messaging and technical documentation for the public release.
What it is
arrow-avro (imported as arrow_avro) is Apache Arrow Rust’s official Arrow-native Avro bridge. It converts between Apache Avro and Apache Arrow by decoding and encoding column-by-column and moving data in Arrow RecordBatches (batch in, batch out), instead of materializing per-row Avro values and rebuilding columns afterward.
It’s designed to cover both files and streaming / schema-registry pipelines:
- OCF (Object Container Files) for file-based I/O (with optional block compression)
- SOE (Single‑Object Encoding) plus Confluent and Apicurio Schema Registry wire formats for message streams
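To make the framing difference concrete, here is a minimal sketch that splits a message into its schema identifier and Avro body. It is not the arrow-avro API: `Framing` and `split_frame` are hypothetical names used only for illustration. It assumes the layouts in the Avro specification (Single-Object Encoding: the marker bytes 0xC3 0x01 followed by an 8-byte little-endian schema fingerprint) and in the Confluent documentation (a 0x00 magic byte followed by a 4-byte big-endian schema ID). Apicurio framing is left out here.

```rust
/// Which message prefix was recognized (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum Framing {
    /// Avro Single-Object Encoding: 0xC3 0x01 + 8-byte little-endian fingerprint.
    SingleObject { fingerprint: u64 },
    /// Confluent wire format: 0x00 + 4-byte big-endian schema ID.
    Confluent { schema_id: u32 },
}

/// Split a framed message into (framing, Avro-encoded body).
/// Returns `None` if the prefix is unknown or the message is too short.
fn split_frame(msg: &[u8]) -> Option<(Framing, &[u8])> {
    match msg {
        [0xC3, 0x01, rest @ ..] if rest.len() >= 8 => {
            let (head, body) = rest.split_at(8);
            let mut fp = [0u8; 8];
            fp.copy_from_slice(head);
            let fingerprint = u64::from_le_bytes(fp);
            Some((Framing::SingleObject { fingerprint }, body))
        }
        [0x00, rest @ ..] if rest.len() >= 4 => {
            let (head, body) = rest.split_at(4);
            let mut id = [0u8; 4];
            id.copy_from_slice(head);
            let schema_id = u32::from_be_bytes(id);
            Some((Framing::Confluent { schema_id }, body))
        }
        _ => None,
    }
}
```

A caller would look up the writer schema by the returned fingerprint or ID and then decode the body against it. In this sketch the framing is detected from the prefix alone; a real pipeline would normally know which framing to expect from its configuration.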
The API is intentionally Arrow-first and minimal: tunable batch sizing, projection and schema resolution/evolution (reader vs. writer schemas), and optional StringViewArray support for faster string handling, so downstream compute stays vectorized end-to-end. See the official docs and the launch announcement.
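The column-by-column idea can be sketched without any Arrow types. The toy decoder below is not arrow-avro code: it assumes a hypothetical two-field record `{ id: long, name: string }`, uses plain `Vec`s as stand-ins for Arrow builders, and the names `read_long` and `decode_batch` are made up for illustration. It relies only on Avro's binary encoding (zigzag varint longs, length-prefixed UTF-8 strings) and shows values being appended straight to per-column buffers, with a projected-out column skipped rather than materialized.

```rust
/// Read one Avro `long` (zigzag varint) from the front of `buf`, advancing it.
fn read_long(buf: &mut &[u8]) -> Option<i64> {
    let mut value: u64 = 0;
    let mut shift = 0u32;
    loop {
        let cur: &[u8] = *buf; // copy the inner slice so we can reassign `*buf`
        let (&byte, rest) = cur.split_first()?;
        *buf = rest;
        if shift >= 64 {
            return None; // more than 10 bytes: malformed varint
        }
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    Some((value >> 1) as i64 ^ -((value & 1) as i64))
}

/// Decode `rows` records of the assumed shape { id: long, name: string }
/// directly into column vectors. When `want_name` is false, the string
/// bytes are skipped and never allocated (a simple form of projection).
fn decode_batch(mut buf: &[u8], rows: usize, want_name: bool) -> Option<(Vec<i64>, Vec<String>)> {
    let mut ids = Vec::with_capacity(rows);
    let mut names = Vec::new();
    for _ in 0..rows {
        ids.push(read_long(&mut buf)?);

        let len = read_long(&mut buf)?;
        if len < 0 || len as u64 > buf.len() as u64 {
            return None; // truncated or malformed string
        }
        let (bytes, rest) = buf.split_at(len as usize);
        buf = rest;
        if want_name {
            names.push(String::from_utf8(bytes.to_vec()).ok()?);
        }
    }
    Some((ids, names))
}
```

For example, the bytes `[0x02, 0x04, 0x61, 0x62, 0x03, 0x02, 0x63]` encode the two records `(1, "ab")` and `(-2, "c")`. Calling `decode_batch` with `want_name` set to `false` yields only the `id` column and never builds a `String`. No intermediate per-row value is constructed in either case.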
What I contributed
- Played a major role in taking arrow-avro to a production-ready release within the Arrow Rust ecosystem, focusing on correctness, performance, and an ergonomic RecordBatch-first API.
- Helped drive feature completeness across real-world Avro pipelines: OCF ingestion/egress, streaming decoders/encoders for SOE and schema-registry framing (Confluent + Apicurio), and schema evolution via reader/writer schema resolution.
- Authored the public launch announcement and wrote the bulk of the official crate documentation (runnable quickstarts, streaming examples, and “which API should I use?” guidance).
Outcome / impact
- Made Avro and Arrow interchange significantly faster and more “Arrow-native” by aligning conversion with Arrow’s vectorized execution model (decode directly into Arrow builders; avoid per-row overhead).
- Reduced integration friction for teams that use Avro on disk and on the wire, enabling a single, upstream crate for OCF files and Kafka-style schema-registry messages.
- In the public launch benchmarks, the Arrow-first approach delivered order-of-magnitude improvements over a row-centric pipeline (up to ~33× faster reads with projection pushdown, and up to ~18× faster writes in the benchmarked cases).
Tech (high-level)
Rust · Apache Arrow (arrow-rs) · Apache Avro · Confluent & Apicurio Schema Registry · Schema resolution · Projection pushdown