Ray Data ClickHouse Connector

Contributed ClickHouse source and sink connectors to Ray Data to enable scalable reads/writes in Ray pipelines.

  • python
  • ray
  • clickhouse
  • data-engineering
Ray Data pipelines reading from and writing to ClickHouse clusters.

Highlights

  • Implemented first-class ClickHouse reader and writer APIs for Ray Data users.
  • Validated Arrow type mapping and retry handling for large batch workloads.
  • Documented usage patterns for partitioning, auth, and throughput tuning.
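The retry handling mentioned above follows a standard exponential-backoff pattern for transient failures. A minimal, self-contained sketch of the idea (all names here are illustrative, not the connector's actual internals):

```python
import random
import time

def insert_with_retries(insert_fn, batch, max_attempts=4, base_delay=0.5):
    """Call insert_fn(batch), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return insert_fn(batch)
        except ConnectionError:  # treat connection drops as transient
            if attempt == max_attempts:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff with jitter to avoid synchronized retry storms.
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random())
            time.sleep(delay)

# Example: an insert that fails twice before succeeding.
calls = {"n": 0}
def flaky_insert(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return len(batch)

assert insert_with_retries(flaky_insert, [1, 2, 3], base_delay=0.01) == 3
```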

Screenshots

Ray Data read and write flow across ClickHouse clusters: connector and data path.

What it is

First-class ClickHouse Source and Sink support for Ray Data: Ray Datasets can be created directly from ClickHouse tables and views (ray.data.read_clickhouse) and written back efficiently (Dataset.write_clickhouse), with scaling and ergonomics aligned with Ray’s distributed, block-based execution model.
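An end-to-end usage sketch of the two APIs. The DSN, table names, and column list below are placeholders, and running it requires Ray plus a reachable ClickHouse cluster, so the example only defines a function rather than executing anything:

```python
def run_pipeline():
    """Read from ClickHouse, transform with Ray Data, and write results back.

    Requires `ray` (with `clickhouse-connect`) and a live ClickHouse cluster,
    so imports live inside the function and nothing connects at import time.
    """
    import ray

    ds = ray.data.read_clickhouse(
        table="default.events",  # placeholder table/view name
        dsn="clickhouse://user:pass@host:8123/default",  # placeholder DSN
        columns=["user_id", "ts", "value"],  # column projection
        filter="ts >= '2024-01-01'",         # optional SQL filter
        order_by=(["ts"], False),  # explicit ordering enables deterministic parallel reads
    )

    ds = ds.map_batches(lambda batch: batch)  # stand-in for a real transform

    ds.write_clickhouse(
        table="default.events_out",
        dsn="clickhouse://user:pass@host:8123/default",
    )

# run_pipeline() would execute against a real cluster; it is not called here.
```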

What I contributed

  • Implemented and upstreamed the ClickHouse Source (ray.data.read_clickhouse) and Sink (Dataset.write_clickhouse) connectors for Ray Data.
  • Built the read path end-to-end: DSN-based connectivity, column projection, optional SQL filtering, and deterministic parallel reads when an explicit order_by is provided (with clear behavior when parallelism isn’t possible).
  • Built the write path with production-friendly semantics: CREATE/APPEND/OVERWRITE modes, optional schema inference/enforcement via pyarrow.Schema, and configurable table creation via ClickHouseTableSettings (engine, ORDER BY, partitioning, primary key, and settings).
  • Worked through edge cases around Arrow/ClickHouse type mapping, batching and chunking of very large blocks, parallelism and concurrency controls, and other sharp edges that affect operational reliability.
  • Authored the user-facing documentation for both APIs, including runnable examples, performance guidance (e.g., using .repartition() to control write parallelism), and notes on common gotchas.

Outcome / impact

  • Made ClickHouse a first-class I/O target in Ray Data, enabling scalable ETL and analytics pipelines where ClickHouse is the system of record and Ray provides distributed compute.
  • Replaced bespoke “glue code” with a supported connector that standardizes connection configuration, schema/table management, and parallel execution patterns in Ray workflows.
  • Improved day-2 usability with clear documentation and well-defined scaling knobs, helping teams confidently run backfills, batch transforms, and data publishing pipelines on Ray.

Tech (high-level)

Python · Ray Data · ClickHouse · clickhouse-connect · PyArrow (Arrow schema/streaming) · Distributed task parallelism