Ray Data ClickHouse Connector

Contributed ClickHouse source and sink connectors to Ray Data to enable scalable reads/writes in Ray pipelines.

  • python
  • ray
  • clickhouse
  • data-engineering
Ray Data pipelines reading from and writing to ClickHouse clusters.

Highlights

  • Implemented first-class ClickHouse reader and writer APIs for Ray Data users.
  • Validated Arrow type mapping and retry handling for large batch workloads.
  • Documented usage patterns for partitioning, auth, and throughput tuning.
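The retry handling mentioned above follows a standard exponential-backoff pattern for transient failures. A minimal, self-contained sketch of the idea (all names here are illustrative, not the connector's actual internals):

```python
import random
import time

def insert_with_retries(insert_fn, batch, max_attempts=4, base_delay=0.5):
    """Call insert_fn(batch), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return insert_fn(batch)
        except ConnectionError:  # treat connection drops as transient
            if attempt == max_attempts:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff with jitter to avoid synchronized retry storms.
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random())
            time.sleep(delay)

# Example: an insert that fails twice before succeeding.
calls = {"n": 0}
def flaky_insert(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return len(batch)

assert insert_with_retries(flaky_insert, [1, 2, 3], base_delay=0.01) == 3
```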

Screenshots

Ray Data read and write flow across ClickHouse clusters: connector and data path.

What it is

First-class ClickHouse Source and Sink support for Ray Data: Ray Datasets can be created directly from ClickHouse tables and views (ray.data.read_clickhouse) and written back efficiently (Dataset.write_clickhouse), with scaling and ergonomics aligned with Ray’s distributed, block-based execution model.
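An end-to-end usage sketch of the two APIs. The DSN, table names, and column list below are placeholders, and running it requires Ray plus a reachable ClickHouse cluster, so the example only defines a function rather than executing anything:

```python
def run_pipeline():
    """Read from ClickHouse, transform with Ray Data, and write results back.

    Requires `ray` (with `clickhouse-connect`) and a live ClickHouse cluster,
    so imports live inside the function and nothing connects at import time.
    """
    import ray

    ds = ray.data.read_clickhouse(
        table="default.events",  # placeholder table/view name
        dsn="clickhouse://user:pass@host:8123/default",  # placeholder DSN
        columns=["user_id", "ts", "value"],  # column projection
        filter="ts >= '2024-01-01'",         # optional SQL filter
        order_by=(["ts"], False),  # explicit ordering enables deterministic parallel reads
    )

    ds = ds.map_batches(lambda batch: batch)  # stand-in for a real transform

    ds.write_clickhouse(
        table="default.events_out",
        dsn="clickhouse://user:pass@host:8123/default",
    )

# run_pipeline() would execute against a real cluster; it is not called here.
```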

What I contributed

  • Implemented and upstreamed the ClickHouse Source (ray.data.read_clickhouse) and Sink (Dataset.write_clickhouse) connectors for Ray Data.
  • Built the read path end-to-end: DSN-based connectivity, column projection, optional SQL filtering, and deterministic parallel reads when an explicit order_by is provided (with clear behavior when parallelism isn’t possible).
  • Built the write path with production-friendly semantics: CREATE/APPEND/OVERWRITE modes, optional schema inference/enforcement via pyarrow.Schema, and configurable table creation via ClickHouseTableSettings (engine, ORDER BY, partitioning, primary key, and settings).
  • Worked through edge cases around Arrow/ClickHouse type mapping, batching and chunking of very large blocks, parallelism and concurrency controls, and other sharp edges that affect operational reliability.
  • Authored the user-facing documentation for both APIs, including runnable examples, performance guidance (e.g., using .repartition() to control write parallelism), and notes on common gotchas.

Outcome / impact

  • Made ClickHouse a first-class I/O target in Ray Data, enabling scalable ETL and analytics pipelines where ClickHouse is the system of record and Ray provides distributed compute.
  • Replaced bespoke “glue code” with a supported connector that standardizes connection configuration, schema/table management, and parallel execution patterns in Ray workflows.
  • Improved day-2 usability with clear documentation and well-defined scaling knobs, helping teams confidently run backfills, batch transforms, and data publishing pipelines on Ray.

Tech (high-level)

Python · Ray Data · ClickHouse · clickhouse-connect · PyArrow (Arrow schema/streaming) · Distributed task parallelism