SQL is the black hole of data processing languages. Every few years, a new language or paradigm is proposed to replace SQL. And yet, inevitably, those alternatives get pulled back into its orbit.
I've seen this phenomenon play out multiple times over my career:
MapReduce was supposed to simplify how we think about large-scale data processing. Then came Hive. And now? We're back to SQL: SparkSQL, Snowflake, BigQuery.
The NoSQL movement launched a direct assault on SQL. It promised freedom from rigid schemas and ACID constraints to achieve “web scale”. But it quickly rebranded as Not-only SQL, then NewSQL, and eventually circled back to... just SQL. These days, PostgreSQL is the cool kid again. It turns 40 next year.
Graph databases promised a more expressive model for connected data. But instead of supplanting SQL, they led to increased support for explicit relationship modeling and pattern matching inside SQL itself.
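SQL:2023 made this concrete with SQL/PGQ, which brings graph pattern matching directly into SQL. A rough sketch of the syntax over a hypothetical social_graph (implementations such as Oracle 23ai support it):

-- Find who Alice knows, using SQL:2023 graph pattern matching
SELECT friend
FROM GRAPH_TABLE (
    social_graph
    MATCH (p IS person)-[IS knows]->(f IS person)
    WHERE p.name = 'Alice'
    COLUMNS (f.name AS friend)
);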
As the creator of JanusGraph and someone who got acquired into DataStax (the company behind Apache Cassandra), I experienced SQL’s gravitational pull firsthand. The familiarity, readability, and declarative nature are strong selling points.
This pattern isn’t new. Mike Stonebraker wrote a piece called What Goes Around Comes Around in 2005, cataloging similar cycles from decades before my time. Worth a read.
Why SQL Keeps Winning
Why does this keep happening? Why is SQL so sticky?
1. SQL evolves without collapsing under its own weight
Despite being over 50 years old, SQL remains remarkably adaptive. New paradigms get absorbed into SQL not because SQL refuses to die, but because it keeps learning how to live.
SQL incorporates just enough from new ideas to stay relevant—without becoming unmanageable. It focuses on the 80% of use cases that matter, and lets the long tail be handled by specialized tools.
MapReduce is still around. So is MongoDB. And Neo4j. They each have niches where they shine. But their ideas have largely been mainstreamed into SQL systems.
2. SQL stands on rock-solid foundations
Under the hood, SQL is built on relational algebra and relational calculus. That gives it a principled backbone—one that makes it amenable to optimization and extension without devolving into chaos.
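For example, because a selection that touches only one side of a join can be pushed below that join in relational algebra, an optimizer can prove the following two queries equivalent and pick whichever plan is cheaper (the orders and customers tables here are hypothetical):

-- Filter applied after the join...
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'DE';

-- ...is algebraically equivalent to filtering before the join.
SELECT o.id, c.name
FROM orders o
JOIN (SELECT id, name FROM customers WHERE country = 'DE') c
  ON o.customer_id = c.id;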
There have been more expressive or theoretically elegant query languages proposed over the years. Many were clever, like Datalog. Some even gained a bit of traction, like SPARQL. But most fizzled out because they were either too complex to learn or too niche to adopt.
SQL balances a key trade-off: it is expressive enough to support complex analytics and business logic, yet simple and readable enough to be approachable by a wide range of developers.
According to the Stack Overflow Developer Survey, SQL consistently ranks as one of the most commonly known languages—hovering near the top year after year. That matters.
3. Declarative power and decades of optimization
Data engineering has a higher bar for correctness and performance than general-purpose software development. SQL's declarative model enables systems to perform sophisticated query optimization behind the scenes.
Rather than instructing the database engine step-by-step, SQL lets you describe the shape and conditions of the result. That abstraction enables the system to rewrite and improve the execution plan without requiring changes to the query itself.
And decades of system-level investment make this possible. SQL engines now have robust query planners and optimizers (Volcano-style planning, cost-based optimization, materialized-view rewrites) that competing paradigms must re-implement from scratch.
This isn't just theoretical. It has practical implications for maintainability, portability, and performance tuning at scale.
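A concrete illustration (PostgreSQL shown; the table and index are hypothetical): the query text stays fixed while the plan improves underneath it.

-- Ask the planner how it would execute the query.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- After adding an index, the planner can switch from a sequential
-- scan to an index scan, with no change to the query itself.
CREATE INDEX idx_orders_customer ON orders (customer_id);
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;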
Who’s Next on SQL’s Menu?
Once you see this cycle, you start wondering: who’s next?
Vector Databases
Vector databases looked like strong contenders during the early AI hype cycle. Pinecone, Milvus, and others pushed the idea of a whole new class of database purpose-built for vector similarity.
But then the SQL crowd responded. PostgreSQL added pgvector. Oracle, Cassandra, and others followed suit. Vector indexes turned out to be a fairly natural extension of the SQL model. So, once again, the idea got absorbed.
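To get a feel for how natural the fit is, here is a minimal sketch using pgvector's documented operators (the table, embedding dimension, and query vector are made up):

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(3)  -- dimension of the (toy) embedding model
);

-- Approximate nearest-neighbor index using HNSW and cosine distance
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Top-5 documents most similar to a query embedding
SELECT id, content
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 5;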
Stream and Incremental Data Processing
SQL abstractions have taken longer to mature in stream processing and data pipelines. Systems like Apache Flink, Kafka Streams, and event-driven pipelines emerged with procedural APIs in Java, Scala, or Python. These offered precise control and flexibility—which many developers preferred.
But even here, SQL is closing the gap. Flink SQL has made real strides in bringing continuous data processing into the declarative fold.
It offers a relational abstraction over streaming data, distinguishing between append-only streams and those with updates or retractions. The concept isn't new, but Flink SQL's implementation is a practical, scalable realization of it.
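For context, here is roughly how the clickstream used below might be declared as a table over a Kafka topic (the schema and connector options are illustrative):

CREATE TABLE Clickstream (
    userid     STRING,
    url        STRING,
    event_time TIMESTAMP(3),
    -- Declare event time and tolerate 5 seconds of out-of-order data
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'clicks',
    'properties.bootstrap.servers' = 'localhost:9092',
    'format' = 'json'
);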
Here’s an example query:
SELECT beforeURL AS url, afterURL AS recommendation,
       COUNT(1) AS frequency
FROM (
    SELECT b.url AS beforeURL, a.url AS afterURL,
           a.event_time AS `timestamp`
    FROM Clickstream b
    INNER JOIN Clickstream a ON b.userid = a.userid
        AND b.event_time < a.event_time
        AND b.event_time >= a.event_time - INTERVAL '10' MINUTE
) AS transitions
GROUP BY beforeURL, afterURL
ORDER BY url ASC, frequency DESC;
This query joins a clickstream table against itself to find pairs of page visits by the same user that occurred within a 10-minute window. It then aggregates those transitions to surface popular navigational patterns—useful, say, for building a recommendation engine.
You could write this in Java or Python, but you'd also have to:
manage intermediate state,
handle late or out-of-order events,
implement fault-tolerant state recovery,
and ensure efficient joins and aggregations.
SQL lets you express the intent cleanly while pushing all that operational complexity down into the engine.
Jennifer Widom's foundational work on CQL (the Continuous Query Language) laid the theoretical groundwork for this kind of declarative stream processing (Arasu, Babu, and Widom, "The CQL Continuous Query Language: Semantic Foundations and Query Execution," VLDB Journal, 2006).
Flink SQL builds on that legacy and is now evolving to support procedural extensions as well. It has long supported user-defined functions in JVM languages and in Python. With the addition of polymorphic table functions in Flink 2.1, developers can implement custom operators as “Process Table Functions” that provide low-level control over execution. This gives Flink SQL a backdoor when SQL alone isn’t enough, without having to abandon SQL altogether. That escape hatch is a critical piece of making SQL the mainstream choice for stream and incremental data processing.
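As a sketch of what that escape hatch looks like in practice, here is how a JVM-backed function might be registered and called from Flink SQL (the function name and implementing class are hypothetical):

-- Register a scalar function implemented in Java...
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.NormalizeUrl';

-- ...and call it from SQL as if it were built in.
SELECT userid, normalize_url(url) AS url, event_time
FROM Clickstream;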
Wrapping Up
Over and over, we see the same pattern: new ideas challenge SQL, gain momentum, and then quietly get folded back into the mainstream. SQL evolves. The ecosystem adapts. And most developers go on writing queries.
That pattern now applies to incremental data processing too. With Flink SQL, we’re seeing declarative abstractions absorb continuous computation in a way that balances control with simplicity.
The gravitational pull of SQL isn’t just inertia. It’s the outcome of deep theoretical roots, practical trade-offs, and relentless adaptation.
And it’s not done yet.