Ray Data ClickHouse Connector
Contributed ClickHouse source and sink connectors to Ray Data to enable scalable reads/writes in Ray pipelines.
- python
- ray
- clickhouse
- data-engineering
Highlights
- Implemented first-class ClickHouse reader and writer APIs for Ray Data users.
- Validated Arrow type mapping and retry handling for large batch workloads.
- Documented usage patterns for partitioning, auth, and throughput tuning.
What it is
First-class ClickHouse Source and Sink support for Ray Data. Ray Datasets can be created directly from ClickHouse tables or views via `ray.data.read_clickhouse` and written back efficiently via `Dataset.write_clickhouse`, with scaling and ergonomics aligned with Ray's distributed, block-based execution model.
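A minimal sketch of the read side, assuming a reachable ClickHouse server; the table name, DSN, and column names are illustrative, and keyword details may vary slightly by Ray version:

```python
# Illustrative read-path sketch for the Ray Data ClickHouse Source.
# `default.events`, the columns, and the DSN are hypothetical examples.

def load_events(dsn: str):
    # Ray is imported lazily so this sketch can be read without Ray
    # installed; in real code a top-level `import ray` is typical.
    import ray

    return ray.data.read_clickhouse(
        table="default.events",  # table or view to read
        dsn=dsn,                 # e.g. "clickhouse+http://user:pass@host:8123/default"
        columns=["user_id", "ts", "event"],  # projection pushed down to ClickHouse
        filter="event = 'click'",            # optional SQL WHERE fragment
        # Deterministic parallel reads require an explicit sort key,
        # expressed as (columns, descending) in the connector's order_by.
        order_by=(["ts"], False),
    )
```

The explicit `order_by` is what makes splitting the read into parallel blocks deterministic; without it, the connector falls back to documented single-stream behavior.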
What I contributed
- Implemented and upstreamed the ClickHouse Source (`ray.data.read_clickhouse`) and Sink (`Dataset.write_clickhouse`) connectors for Ray Data.
- Built the read path end-to-end: DSN-based connectivity, column projection, optional SQL filtering, and deterministic parallel reads when an explicit `order_by` is provided (with clear behavior when parallelism isn't possible).
- Built the write path with production-friendly semantics: CREATE/APPEND/OVERWRITE modes, optional schema inference/enforcement via `pyarrow.Schema`, and configurable table creation via `ClickHouseTableSettings` (engine, `ORDER BY`, partitioning, primary key, and settings).
- Worked through edge cases around Arrow and ClickHouse type mapping, batching/chunking very large blocks, parallelism and concurrency controls, and "sharp edges" that impact operational reliability.
- Authored the user-facing documentation for both APIs, including runnable examples, performance guidance (e.g., using `.repartition()` to control write parallelism), and notes on common gotchas.
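The write path described above can be sketched roughly as follows; the table name, DSN, schema, and the import path for the settings helpers are assumptions that differ across Ray releases:

```python
# Illustrative write-path sketch for the Ray Data ClickHouse Sink.
# Names (`default.events_out`, the schema, the DSN) are hypothetical.

def publish_events(ds, dsn: str):
    import pyarrow as pa
    # The exact module exposing these helpers varies by Ray version.
    from ray.data.datasource import ClickHouseTableSettings, SinkMode

    # Each Ray block becomes one batch insert, so repartitioning is the
    # documented knob for controlling write parallelism.
    ds = ds.repartition(16)

    ds.write_clickhouse(
        table="default.events_out",
        dsn=dsn,
        mode=SinkMode.OVERWRITE,  # CREATE / APPEND / OVERWRITE semantics
        # Optional schema enforcement via PyArrow:
        schema=pa.schema([("user_id", pa.int64()), ("ts", pa.timestamp("ms"))]),
        # Table creation settings used when the sink creates the table:
        table_settings=ClickHouseTableSettings(
            engine="MergeTree()",
            order_by="ts",
            partition_by="toYYYYMM(ts)",
        ),
    )
```

Keeping table creation declarative via `ClickHouseTableSettings` lets the same pipeline code handle first-run CREATE and steady-state APPEND without bespoke DDL.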
Outcome / impact
- Made ClickHouse a first-class I/O target in Ray Data, enabling scalable ETL and analytics pipelines where ClickHouse is the system of record and Ray provides distributed compute.
- Replaced bespoke “glue code” with a supported connector that standardizes connection configuration, schema/table management, and parallel execution patterns in Ray workflows.
- Improved day-2 usability with clear documentation and well-defined scaling knobs, helping teams confidently run backfills, batch transforms, and data publishing pipelines on Ray.
Tech (high-level)
Python · Ray Data · ClickHouse · clickhouse-connect · PyArrow (Arrow schema/streaming) · Distributed task parallelism