Article

Distributed SQL and NewSQL: The Next Generation of Databases

For decades, the database world was split between two camps: **relational databases** (PostgreSQL, MySQL) with strong consistency but limited scalability, and **NoSQL databases** (MongoDB, Cassandra) with horizontal scalability but weaker consistency and limited query capabilities.

4 min readLanguage: EN EnglishFree0 claps0 comments

TechnologyEngineering ArticlesDistributedDatabasesTechnologyEngineering ArticlesNextNewSQL

Reading options

Introduction

For decades, the database world was split between two camps: relational databases (PostgreSQL, MySQL) with strong consistency but limited scalability, and NoSQL databases (MongoDB, Cassandra) with horizontal scalability but weaker consistency and limited query capabilities.

Distributed SQL (also called NewSQL) bridges this gap. It provides the full power of SQL — ACID transactions, joins, foreign keys, and familiar tooling — on a horizontally scalable, distributed architecture. The result: the consistency and query power of a relational database with the scalability of NoSQL.

The Database Spectrum

Strong Consistency
        ▲
        │           ┌───────────────────────┐
        │           │  Distributed SQL      │
        │           │  (CockroachDB,        │
        │           │   YugabyteDB,         │
        │           │   Google Spanner)     │
        │           └───────────────────────┘
        │
        │    ┌──────────────┐     ┌──────────────────┐
        │    │  Traditional │     │  Traditional      │
        │    │  SQL         │     │  NewSQL           │
        │    │  (PostgreSQL,│     │  (Vitess, TiDB)   │
        │    │   MySQL)     │     │                   │
        │    └──────────────┘     └──────────────────┘
        │
        │                              ┌──────────────────┐
        │                              │  NoSQL           │
        │                              │  (Cassandra,     │
        │                              │   DynamoDB,      │
        │                              │   CosmosDB)      │
        │                              └──────────────────┘
        └────────────────────────────────────────────────────
                              Horizontal Scalability  ──────►

What Is Distributed SQL?

Core Characteristics

Feature	Description
SQL interface	Full SQL support: JOINs, subqueries, aggregations, window functions
ACID transactions	Serializable or Snapshot Isolation across nodes
Horizontal scaling	Add nodes without downtime, data rebalances automatically
Auto-sharding	Data automatically distributed across nodes
Replication	Synchronous replication for durability (RPO=0)
High availability	Automatic failover, no manual intervention
Geo-distribution	Data can span multiple regions with latency < 50ms

Distributed SQL Architectures

Google Spanner — The Pioneer

┌──────────────────────────────────────────┐
│         Spanner Global Architecture       │
├──────────────────────────────────────────┤
│  Zone 1 (us-east1)   Zone 2 (europe-west)│
│  ┌──────────────┐    ┌──────────────┐    │
│  │ Leader Node  │    │ Follower Node│    │
│  │   ┌────────┐ │    │              │    │
│  │   │ Spanner│ │    │              │    │
│  │   │ Server │ │    │              │    │
│  │   └────────┘ │    │              │    │
│  └──────────────┘    └──────────────┘    │
├──────────────────────────────────────────┤
│          TrueTime API (GPS + Atomic)     │
│          Clock synchronization < 7ms     │
└──────────────────────────────────────────┘

Key innovation: TrueTime API — GPS and atomic clock synchronization providing bounded clock uncertainty. This enables external consistency (linearizability) across globally distributed nodes.

CockroachDB — Spanner-Inspired Open Source

                     ┌─────────────────────────┐
                     │    SQL Gateway          │
                     │  (Any node can serve)   │
                     └──────────┬──────────────┘
                                │
         ┌──────────────────────┼──────────────────────┐
         │                      │                      │
   ┌─────▼─────┐          ┌─────▼─────┐          ┌─────▼─────┐
   │  Node 1   │          │  Node 2   │          │  Node 3   │
   │ ┌───────┐ │          │ ┌───────┐ │          │ ┌───────┐ │
   │ │ Range A│ │          │ │ Range A│ │          │ │ Range A│ │
   │ │ (LEADER)│          │ │(FOLLOWER│          │ │(FOLLOWER│ │
   │ └───────┘ │          │ └───────┘ │          │ └───────┘ │
   │ ┌───────┐ │          │ ┌───────┐ │          │ ┌───────┐ │
   │ │ Range B│ │          │ │ Range B│ │          │ │ Range B│ │
   │ │(FOLLOWER│          │ │ (LEADER)│          │ │(FOLLOWER│ │
   │ └───────┘ │          │ └───────┘ │          │ └───────┘ │
   └───────────┘          └───────────┘          └───────────┘

How CockroachDB works:

Data is split into ranges (64MB by default, similar to Spanner's splits).
Each range is replicated 3x (configurable) across nodes.
One replica is the leaseholder (handles reads and writes).
Raft consensus ensures consistency across replicas.
Any node can serve queries — the SQL gateway routes to the correct range.

YugabyteDB — PostgreSQL-Compatible

YugabyteDB is designed for PostgreSQL compatibility at the wire protocol level:

-- This is standard PostgreSQL syntax — it just works
CREATE TABLE orders (
    id UUID DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL,
    total NUMERIC(10,2) NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (id)
);

-- Distributed ACID transaction across nodes
BEGIN;
INSERT INTO orders (customer_id, total, status)
VALUES ($1, $2, 'pending');
UPDATE inventory SET quantity = quantity - 1
WHERE product_id = $3 AND quantity > 0;
COMMIT;

-- Full PostgreSQL ecosystem compatibility
-- Supports: pgAdmin, Prisma, DBeaver, all standard drivers

Distributed SQL Comparison

Feature	CockroachDB	YugabyteDB	TiDB	Google Spanner
Compatibility	PostgreSQL (85%)	PostgreSQL (95%+)	MySQL	Proprietary
Consistency	Serializable	Serializable	Snapshot isolation	External consistency
Replication	Raft	Raft	Raft	Paxos
Geo-distribution	Yes	Yes	Yes	Yes
Auto-sharding	Yes (by range)	Yes (by hash/range)	Yes (by range)	Yes (by split)
Secondary indexes	Global (distributed)	Global + Local	Global + Local	Global
Foreign keys	Yes	Yes	Yes	Yes
CDC	Yes (Kafka sink)	Yes (Kafka, Elastic)	Yes (TiCDC)	Yes (Pub/Sub)
Licensing	BSL (source-available)	Apache 2.0	Apache 2.0	Proprietary
Cloud managed	CockroachDB Cloud	YugabyteDB Managed	TiDB Cloud	Cloud Spanner
Self-host	Yes	Yes	Yes	No

When to Use Distributed SQL

Ideal Use Cases

Use Case	Why Distributed SQL	Example
Global SaaS applications	Users worldwide need low latency, strong consistency	Multi-region user accounts
Financial systems	ACID transactions across shards, no data loss	Payment processing, ledger
E-commerce platforms	Inventory, orders, carts — all need consistency	Black Friday traffic spikes
Gaming	Player state, leaderboards, transactions	Cross-region multiplayer
IoT time series + metadata	SQL for analytics on distributed data	Fleet telemetry with joins

When NOT to Use Distributed SQL

Scenario	Why	Better Alternative
Simple key-value lookups	Overhead of distributed transactions not justified	Redis, DynamoDB
Single-region, small data	Complexity not needed	PostgreSQL
Full-text search	Not designed for text indexing	Elasticsearch
Graph traversals	JOINs on deep relationships are slow	Neo4j
Very high write throughput (>100K writes/sec/node)	Consensus overhead may be too high	Cassandra, ScyllaDB

Data Distribution Strategies

Range-Based Sharding (CockroachDB, Spanner)

-- Data is split by primary key ranges
Table: orders

Range 1: PK '00000000' to '3fffffff'  → Node 1 (LEADER) + Node 2, Node 3
Range 2: PK '40000000' to '7fffffff'  → Node 2 (LEADER) + Node 1, Node 3
Range 3: PK '80000000' to 'bfffffff'  → Node 3 (LEADER) + Node 1, Node 2
Range 4: PK 'c0000000' to 'ffffffff'  → Node 1 (LEADER) + Node 2, Node 3

Pros: Efficient range scans, good for sequential keys. Cons: Write hot spots if inserting sequential keys (solved by hash-sharded indexes).

Hash-Based Sharding (YugabyteDB)

-- Hash-sharded primary key
CREATE TABLE orders (
    id UUID DEFAULT gen_random_uuid(),
    customer_id UUID,
    total NUMERIC,
    PRIMARY KEY (id HASH)
);

-- Data distribution by hash
Hash('00000000-...') = 0.23 → Tablet 1 → Node 1
Hash('11111111-...') = 0.67 → Tablet 2 → Node 2
Hash('22222222-...') = 0.12 → Tablet 3 → Node 3

Pros: Even data distribution, no hot spots. Cons: Range scans across the entire table are inefficient.

Key Distributed SQL Features

Distributed Transactions (2PC + Raft)

-- This transaction spans multiple nodes
BEGIN;
UPDATE account_a SET balance = balance - 100 WHERE id = 'A';
UPDATE account_b SET balance = balance + 100 WHERE id = 'B';  -- Different node
COMMIT;

-- Under the hood:
-- 1. Transaction coordinator (any node) acquires locks
-- 2. Prepares both participants
-- 3. Raft commits the transaction record
-- 4. COMMIT acknowledged to client
-- All-or-nothing — guaranteed ACID

Global Secondary Indexes

Indexes are themselves distributed across nodes:

CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Query that uses the distributed index:
SELECT * FROM orders WHERE customer_id = $1;
-- 1. Routes to any node
-- 2. Node looks up customer_id in distributed index
-- 3. Index returns primary keys → fetch from correct ranges
-- 4. Returns results

Online Schema Migrations

Distributed SQL supports schema changes without downtime:

-- Add a column — non-blocking
ALTER TABLE orders ADD COLUMN discount NUMERIC(5,2) DEFAULT 0.00;

-- Create index — non-blocking (background build)
CREATE INDEX CONCURRENTLY idx_orders_date ON orders (created_at);

Change Data Capture (CDC)

Stream database changes to downstream systems:

-- CockroachDB CDC
CREATE CHANGEFEED FOR TABLE orders, payments
  INTO 'kafka://kafka-cluster:9092'
  WITH updated, resolved, min_checkpoint_frequency = '5s';

-- YugabyteDB CDC connector
# Deploy to Kafka Connect
{
  "name": "yb-cdc-connector",
  "config": {
    "connector.class": "io.yugabyte.cdc.YBChangeDataCaptureConnector",
    "database.hostname": "yb-1.example.com",
    "database.port": "5433",
    "table.include.list": "public.orders"
  }
}

Operational Considerations

Deployment Topologies

Single region (3 nodes):

replication:
  factor: 3
  constraints:
    - "+region=us-east-1: 3"
  # All data in one region, 3 copies, survives 2 node failures

Multi-region (survive region failure):

replication:
  factor: 3
  constraints:
    - "+region=us-east-1: 1"
    - "+region=us-west-1: 1"
    - "+region=eu-west-1: 1"
  # One copy in each region, survives any single region failure
  # Write latency: round-trip to the farthest region

Multi-region with close quorum (for low-latency writes):

replication:
  factor: 5
  constraints:
    - "+region=us-east-1: 3"  # 3 copies here
    - "+region=us-west-1: 1"  # 1 copy for DR
    - "+region=eu-west-1: 1"  # 1 copy for DR
  # Write quorum = 3, achievable within us-east-1
  # Write latency = intra-region

Latency Expectations

Topology	Write Latency (p50)	Read Latency (p50)
Single node	1-5ms	0.5-2ms
3 nodes, same region	5-15ms	1-5ms
3 nodes, 3 regions (US, EU, APAC)	100-300ms	5-50ms (closest)
5 nodes, 2 regions (close quorum)	5-15ms	1-5ms

Migration from PostgreSQL/MySQL

Strategy: NoSQL-to-(Distributed)SQL

Phase 1: Evaluate
  - Identify tables that need horizontal scaling
  - Assess which queries cross shards
  - Plan sharding key

Phase 2: Migrate schema
  - Convert CREATE TABLE statements
  - Choose primary key strategy (hash vs. range)
  - Handle sequences (use UUIDs instead of SERIAL)

Phase 3: Migrate data
  - Use CDC or batch export/import
  - Verify consistency
  - Set up replication from old DB

Phase 4: Cut over
  - Point application to new DB
  - Monitor performance
  - Keep old DB as rollback target for 1 week

Common Migration Issues

-- PostgreSQL → CockroachDB
-- ❌ Sequences (not distributed)
SERIAL PRIMARY KEY  →  UUID DEFAULT gen_random_uuid()

-- ❌ Array functions not supported
array_agg(DISTINCT x)  →  json_agg(DISTINCT x)  -- workaround

-- ❌ Long-running queries
-- Solution: add index or query hint

-- ❌ Cross-database queries (CockroachDB)
-- Each cluster = one database, no cross-DB queries

Conclusion

Distributed SQL represents the convergence of two decades of database evolution — combining the reliability and expressiveness of SQL with the elasticity and resilience of distributed systems.

Key Takeaways

Use Distributed SQL when you need SQL semantics but have outgrown a single node.
Not for small deployments — PostgreSQL with read replicas is simpler and cheaper.
Choose by compatibility — YugabyteDB for PostgreSQL, TiDB for MySQL, CockroachDB for new projects.
Plan for latency — distributed transactions have overhead. Benchmark before committing.
Monitor carefully — watch for hot ranges, slow queries, and replication lag.

Decision Framework

Q: Do you need horizontal write scaling?
  ├─ Yes
  │  Q: Do you need ACID across all data?
  │  ├─ Yes → Distributed SQL (CockroachDB, Spanner)
  │  └─ No  → NoSQL (Cassandra, DynamoDB)
  └─ No
     Q: Do you need SQL features (joins, aggregations)?
     ├─ Yes → PostgreSQL
     └─ No  → Key-value or document store

Distributed SQL is the database architecture for the global, cloud-native era — applications that must work everywhere, with strong consistency, without compromise.

Comments

0 comments

No approved comments are visible yet. New community replies may wait for moderation.