Posts by Infoman

2017/09/27
  • Yahoo open sources its search engine Vespa


    https://www.oath.com/press/open-sourcing-vespa-yahoo-s-big-data-processing-and-serving-eng/

    http://news.ycombinator.com/item?id=15345483

    By Jon Bratseth, Distinguished Architect, Vespa


    Ever since we open sourced Hadoop in 2006, Yahoo – and now, Oath – has been committed to opening up its big data infrastructure to the larger developer community. Today, we are taking another major step in this direction by making Vespa, Yahoo's big data processing and serving engine, available as open source on GitHub.

    Vespa architecture overview

    Building applications increasingly means dealing with huge amounts of data. While developers can use the Hadoop stack to store and batch-process big data, and Storm to stream-process data, these technologies do not help with serving results to end users. Serving is challenging at large scale, especially when it is necessary to compute quickly over data while a user is waiting, as with applications that feature search, recommendation, and personalization.

    By releasing Vespa, we are making it easy for anyone to build applications that can compute responses to user requests, over large datasets, in real time and at internet scale – capabilities that, until now, have been within reach of only a few large companies.

    Serving often involves more than looking up items by ID or computing a few numbers from a model. Many applications need to compute over large datasets at serving time. Two well-known examples are search and recommendation. To deliver a search result or a list of recommended articles to a user, you need to find all the items matching the query, determine how good each item is for the particular request using a relevance/recommendation model, organize the matches to remove duplicates, add navigation aids, and then return a response to the user. As these computations depend on features of the request, such as the user's query or interests, it won't do to compute the result upfront. It must be done at serving time, and since a user is waiting, it has to be done fast. Combining speedy completion of the aforementioned operations with the ability to perform them over large amounts of data requires a lot of infrastructure – distributed algorithms, data distribution and management, efficient data structures and memory management, and more. This is what Vespa provides in a neatly-packaged and easy to use engine.
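
    To make the request-time pipeline concrete, here is a deliberately tiny Python sketch of the steps described above: match documents, score them with a request-dependent model, collapse duplicates, and return the best hits. The corpus, field names, and scoring weights are all invented for illustration; this is not Vespa's API.

```python
# Hypothetical serving pipeline: match, rank, organize, return.
# Corpus, fields, and weights are invented for illustration.
CORPUS = [
    {"id": 1, "title": "vespa scooter repair", "popularity": 0.2},
    {"id": 2, "title": "vespa big data serving", "popularity": 0.9},
    {"id": 3, "title": "hadoop batch processing", "popularity": 0.6},
    {"id": 4, "title": "vespa big data serving", "popularity": 0.5},  # duplicate title
]

def serve(query: str, top_k: int = 2):
    terms = set(query.lower().split())
    # 1. Matching: keep documents sharing at least one query term.
    matches = [d for d in CORPUS if terms & set(d["title"].split())]
    # 2. Ranking: request-dependent relevance (term overlap) blended with a
    #    static signal (popularity), a stand-in for a learned model.
    def score(d):
        overlap = len(terms & set(d["title"].split())) / len(terms)
        return 0.7 * overlap + 0.3 * d["popularity"]
    matches.sort(key=score, reverse=True)
    # 3. Organizing: drop duplicate titles, keeping the best-scoring copy.
    seen, results = set(), []
    for d in matches:
        if d["title"] not in seen:
            seen.add(d["title"])
            results.append(d["id"])
    return results[:top_k]

print(serve("vespa serving"))  # [2, 1]
```

    A real engine does all of this distributed across nodes and in milliseconds; the point here is only that the computation depends on the request, so it cannot be precomputed.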

    With over 1 billion users, we currently use Vespa across many different Oath brands – including Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, Flickr, and others – to process and serve billions of daily requests over billions of documents while responding to search queries, making recommendations, and providing personalized content and advertisements, to name just a few use cases. In fact, Vespa processes and serves content and ads almost 90,000 times every second with latencies in the tens of milliseconds. On Flickr alone, Vespa performs keyword and image searches on the scale of a few hundred queries per second on tens of billions of images. Additionally, Vespa makes direct contributions to our company's revenue stream by serving over 3 billion native ad requests per day via Yahoo Gemini, at a peak of 140k requests per second (per Oath internal data).

    With Vespa, our teams build applications that:

    • Select content items using SQL-like queries and text search
    • Organize all matches to generate data-driven pages
    • Rank matches by handwritten or machine-learned relevance models
    • Serve results with response times in the low milliseconds
    • Write data in real time, thousands of times per second per node
    • Grow, shrink, and re-configure clusters while serving and writing data

    To achieve both speed and scale, Vespa distributes data and computation over many machines without any single master as a bottleneck. Where conventional applications work by pulling data into a stateless tier for processing, Vespa instead pushes computations to the data. This involves managing clusters of nodes with background redistribution of data in case of machine failures or the addition of new capacity, implementing distributed low-latency query and processing algorithms, handling distributed data consistency, and a lot more. It's a ton of hard work!
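
    Against a running Vespa instance, selecting content with an SQL-like query happens over HTTP. The sketch below only builds the request URL, following the general shape of Vespa's query endpoint (/search/ with a YQL expression); the host, port, and the "title" field are assumptions for illustration.

```python
from urllib.parse import urlencode

def build_query_url(yql: str, hits: int = 10) -> str:
    # Vespa exposes queries over HTTP; the YQL expression goes in the
    # "yql" parameter and "hits" caps the number of results returned.
    params = urlencode({"yql": yql, "hits": hits})
    return f"http://localhost:8080/search/?{params}"

url = build_query_url('select * from sources * where title contains "vespa"', hits=5)
print(url)
# A running instance could then be queried with urllib.request.urlopen(url).
```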

    As the team behind Vespa, we have been working on developing search and serving capabilities ever since building alltheweb.com, which was later acquired by Yahoo. Over the last couple of years we have rewritten most of the engine from scratch to incorporate our experience into a modern technology stack. Vespa is larger in scope and lines of code than any open source project we've ever released. Now that this has been battle-proven on Yahoo's largest and most critical systems, we are pleased to release it to the world.

    Vespa gives application developers the ability to feed data and models of any size to the serving system and make the final computations at request time. This often produces a better user experience at lower cost (for buying and running hardware) and complexity compared to pre-computing answers to requests. Furthermore it allows developers to work in a more interactive way where they navigate and interact with complex calculations in real time, rather than having to start offline jobs and check the results later.

    Vespa can be run on premises or in the cloud. We provide both Docker images and rpm packages for Vespa, as well as guides for running them both on your own laptop or as an AWS cluster.

    We'll follow up this initial announcement with a series of posts on our blog showing how to build a real-world application with Vespa, but you can get started right now by following the getting started guide in our comprehensive documentation.

    Managing distributed systems is not easy. We have worked hard to make it easy to develop and operate applications on Vespa so that you can focus on creating features that make use of the ability to compute over large datasets in real time, rather than the details of managing clusters and data. You should be able to get an application up and running in less than ten minutes by following the documentation.

    We can't wait to see what you build with it!

  • 2018: The Starting Point of Hong Kong's Rate-Hike Cycle

    The US Federal Reserve has just announced that it will begin shrinking its balance sheet in October, marking the gradual withdrawal of the liquidity pumped into markets over the past nine years and a return to the conditions that prevailed before the 2008 global financial crisis. Hong Kong officials have rushed to warn, repeatedly, of capital flight and surging Hong Kong dollar interest rates that would hurt businesses and livelihoods; the Exchange Fund has gone further, pre-emptively issuing additional bills to drain funds and push up overnight and short-term interbank rates. Yet the Hong Kong dollar operates under a currency-board linked exchange rate system, so US dollar and HK dollar interest rates should move in lockstep, and hot money usually moves a step ahead. The "capital flight, rising rates" narrative is therefore both puzzling and confusing.

    Understanding where the hot money goes

    To understand why hot money is retreating, one must first understand why it flooded in. When the global financial crisis erupted in 2008, lenders and depositors alike pulled back for fear of losses and funding markets seized up. The US Federal Reserve led the response with quantitative easing, repeatedly buying financial debt to restart market liquidity and restore normal trading.

    Quantitative easing, in essence, substitutes central-bank credit for the original debt credit, untying the knot of interlocking debts and freeing up the market. The Fed's assets and liabilities expand in step: the assets are the debt it purchases; the liabilities are banks' clearing balances, which are reserve deposits. As bank reserves rise, the loan and deposit cycle expands, and banks' assets and liabilities grow together, injecting fresh blood into the economy and restoring its vitality.

    The Fed also cut interest rates to zero in support, lightening the debt burden on public and private borrowers and accelerating the effect. Over the years markets fully absorbed the new reality of easy money and zero rates; now that the Fed has confirmed the end of its easing measures, begun shrinking its balance sheet, and moved to normalise rates, markets must adjust all over again. Once rates started rising, "balance-sheet reduction" was expected, but setting things back in order cannot be done overnight.

    Under quantitative easing, US dollars flooded the world in search of speculative and investment returns, and the Hong Kong dollar bore the brunt for several reasons. First, the HK dollar is pegged to the US dollar, leaving almost no exchange-rate risk. Second, HK dollar rates fell in step with US rates, pushing up real and financial asset prices, which amounts to asset inflation. Third, after the renminbi exchange-rate reform the currency appreciated year after year, Hong Kong's offshore market boomed, and foreign capital was drawn to use the HK dollar as a conduit for speculation and rent-seeking. By official estimates, hot-money inflows in the 15 months from the final quarter of 2008 to the end of 2009 totalled the equivalent of HK$640 billion.

    As hot money flows in, Hong Kong banks' overseas US dollar deposits (assets) and customers' HK dollar deposits (liabilities) rise together, the ratio of clearing balances to demand deposits (current and savings accounts) falls accordingly, and banks must buy HK dollars from the Exchange Fund to replenish their balances.

    Under the laws of supply and demand, the HK dollar trades above its official parity, eventually reaching the Exchange Fund's strong-side Convertibility Undertaking, while the HK-US dollar interest spread widens to keep the exchange rate fixed.

    The record shows that after quantitative easing began in October 2008, the HK dollar rose from the official 7.8 to 7.52; banks' clearing-balance ratio jumped to 10%, ten times its previous level; overnight and one-week HK dollar interbank rates (HIBOR) fell below 1%, and below 0.5% by year-end; one-month HIBOR also broke below 0.5% by year-end; and banks' posted deposit and lending rates were lowered accordingly.

    The HK dollar interest-rate structure has been distorted

    With hindsight, the hot-money inflow had three hallmarks: first, the HK dollar jumped to the strong-side Convertibility Undertaking (7.75); second, HK dollar interbank liquidity surged and banks' clearing-balance ratio leapt; third, HK dollar interest rates plunged in step with the Base Rate (the discount-window rate), which is set at the US federal funds target rate (the interbank rate) plus 0.5 percentage points.

    As the Fed ends quantitative easing and funds flow back, the hot-money retreat likewise has three hallmarks: first, the HK dollar weakens back to official parity (7.8); second, HK dollar interbank liquidity shrinks and the clearing-balance ratio falls back to normal; third, HK dollar rates rise in step with the Base Rate. Set against recent market conditions, two of these are already visible: the HK dollar has fallen back to parity, and the clearing-balance ratio has dropped to 6%.

    After the 2008 crisis subsided, international regulators tightened the rules, raising standards for bank reserves and liquid assets, so the clearing-balance ratio can hardly return to its former 1%; 5% may be the new normal. Hot money is "smart money" and tends to move early; with the Fed's balance-sheet reduction now settled, withdrawing ahead of the pack is no surprise. Yet HK dollar deposit and lending rates have still not normalised, which leaves people wondering whether the hot money has actually left. If not, why is the HK dollar soft, and why has the clearing balance fallen?

    In truth, the HK dollar interest-rate structure has been distorted for the past nine years. Most obviously, the Best Lending Rate (BLR) has never tracked the Base Rate up or down, while the savings rate has sat near zero, barely there at all. The BLR has always moved with the savings rate, because savings deposits have always been the most stable source of retail funding and retail credit is priced off the BLR. The HK dollar Interest Rate Rules were fully abolished in 2001, but in practice the two remain closely linked.

    Over the 30 years from 1971 to 2001, the BLR averaged about 2.5 times the savings rate. Before the global crisis the BLR was 5% and the savings rate 2%, while the Base Rate was 3.5%; the Base Rate today is 1.5%, so applying that ratio the savings rate should be about 0.85% (= 1.5 × (2/3.5)) and the BLR about 2.125%. Mortgage rates bear this out: whether quoted as "HIBOR plus" or "BLR minus", they sit at a little over 2%, matching the estimate. Setting aside the distortion and its after-effects, HK dollar interbank rates have in fact been drifting back to normal, and the claim that HK dollar rates have yet to rise mistakes the part for the whole. All three withdrawal hallmarks should therefore be present, and the reasonable inference is that speculative hot money saw what was coming and has already taken its profits and left.
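
    The back-of-the-envelope estimate above can be reproduced in a few lines. The inputs are the article's own figures; note that the article truncates the savings rate to 0.85% before multiplying by 2.5, so it reports a BLR of 2.125% where the unrounded arithmetic gives about 2.14%.

```python
# Scaling the pre-crisis savings-rate / Base Rate relationship to today's
# Base Rate, then applying the long-run BLR / savings ratio of ~2.5x.
base_rate_now = 1.5          # % (current Base Rate, per the article)
savings_pre_crisis = 2.0     # % (pre-crisis savings rate)
base_rate_pre_crisis = 3.5   # % (pre-crisis Base Rate)
blr_to_savings = 2.5         # long-run BLR / savings-rate ratio, 1971-2001

savings_now = base_rate_now * (savings_pre_crisis / base_rate_pre_crisis)
blr_now = savings_now * blr_to_savings
print(round(savings_now, 2), round(blr_now, 3))  # 0.86 2.143
```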

    Draining funds by issuing bills treats the symptom, not the disease

    The analysis above is nonetheless incomplete. All capital movements, whether investment or speculation, ultimately settle in the "monetary and financial" items on banks' balance sheets: clearing balances, Exchange Fund bills, and the backing for banknote issuance. Changes in the total therefore indicate the direction of capital flows, and sharp short-term swings reveal hot money on the move. Banking statistics, however, do not distinguish onshore from offshore business: switching HK dollar deposits into foreign-currency deposits moves funds from the onshore to the offshore account, in theory a capital outflow (and the reverse an inflow), yet no money actually enters or leaves.

    Large swings in offshore foreign-currency deposits can therefore move clearing balances and create the illusion of hot money coming and going. Changes in offshore foreign-currency (including renminbi) deposits also affect the HK dollar exchange rate, because conversions pass through the US dollar: if deposit expansion puts US dollar demand above supply, the HK dollar softens; if contraction puts supply above demand, it firms. The HK dollar's strength or weakness, or moves in interbank rates, thus cannot on their own determine the direction of capital flows; several indicators must be cross-checked before the whole picture emerges.

    The US began normalising interest rates and liquidity some time ago, yet Hong Kong banks have been slow to adjust their posted deposit and lending rates. The root cause is that for nine years those rates never followed the market, so where is the room to hike? The Interest Rate Rules were abolished long ago, yet they seem to survive in all but name, which defies understanding. Still, with US monetary normalisation now in sight, Hong Kong's bind will resolve itself in due course; there is no need for counter-cyclical operations that try to force the pace.

    Hot money is mobile and sharp-eyed; how could it lag behind events? The latest monetary indicators fit the withdrawal profile. The Exchange Fund has nonetheless rushed to treat the symptoms, twice issuing additional bills to drain funds, as if with some other calculation in mind. In fact, Hong Kong's banks have been more than prudent for nine years: at the end of September 2008 the reserve ratio was 37% and the HK dollar loan-to-deposit ratio 80%; by the end of June this year the reserve ratio had risen to 43% and the loan-to-deposit ratio had fallen to 68%.

    Draining funds by issuing bills changes the broth but not the medicine: the HK dollar monetary base does not shrink, so it neither counters hot-money withdrawal nor helps normalise HK dollar rates. Instead it signals monetary tightening, making banks even more cautious on the eve of a rate-hike cycle, to the economy's harm and no one's benefit.

    鄭宏泰 is Assistant Director of the Hong Kong Institute of Asia-Pacific Studies, The Chinese University of Hong Kong; 陸觀豪 is a retired banker and Honorary Research Fellow of the Institute.

  • Why SQL is beating NoSQL, and what this means for the future of data


    https://blog.timescale.com/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a

    http://news.ycombinator.com/item?id=15335717

    After years of being left for dead, SQL today is making a comeback. How come? And what effect will this have on the data community?

    SQL awakens to fight the dark forces of NoSQL

    Since the dawn of computing, we have been collecting exponentially growing amounts of data, constantly asking more from our data storage, processing, and analysis technology. In the past decade, this caused software developers to cast aside SQL as a relic that couldn’t scale with these growing data volumes, leading to the rise of NoSQL: MapReduce and Bigtable, Cassandra, MongoDB, and more.

    Yet today SQL is resurging. All of the major cloud providers now offer popular managed relational database services: e.g., Amazon RDS, Google Cloud SQL, Azure Database for PostgreSQL (Azure launched just this year). In Amazon’s own words, its PostgreSQL- and MySQL-compatible Aurora database product has been the “fastest growing service in the history of AWS”. SQL interfaces on top of Hadoop and Spark continue to thrive. And just last month, Kafka launched SQL support. Your humble authors themselves are developers of a new time-series database that fully embraces SQL.

    In this post we examine why the pendulum today is swinging back to SQL, and what this means for the future of the data engineering and analysis community.

    To understand why SQL is making a comeback, let’s start with why it was designed in the first place.

    Like all good stories, ours starts in the 1970s

    Our story starts at IBM Research in the early 1970s, where the relational database was born. At that time, query languages relied on complex mathematical logic and notation. Two newly minted PhDs, Donald Chamberlin and Raymond Boyce, were impressed by the relational data model but saw that the query language would be a major bottleneck to adoption. They set out to design a new query language that would be (in their own words): “more accessible to users without formal training in mathematics or computer programming.”

    Query languages before SQL (a, b) vs SQL (c) (source)

    Think about this. Way before the Internet, before the Personal Computer, when the programming language C was first being introduced to the world, two young computer scientists realized that, “much of the success of the computer industry depends on developing a class of users other than trained computer specialists.” They wanted a query language that was as easy to read as English, and that would also encompass database administration and manipulation.

    The result was SQL, first introduced to the world in 1974. Over the next few decades, SQL would prove to be immensely popular. As relational databases like System R, Ingres, DB2, Oracle, SQL Server, PostgreSQL, MySQL (and more) took over the software industry, SQL became established as the preeminent language for interacting with a database, and became the lingua franca for an increasingly crowded and competitive ecosystem.
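
    A taste of the readability Chamberlin and Boyce were after, using Python’s built-in sqlite3 so the snippet is self-contained (the table and rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ada", "engineering", 120), ("Grace", "engineering", 140), ("Joan", "sales", 90)],
)
# The query reads almost like the English sentence it encodes: "select the
# names of engineering employees earning over 100, highest paid first."
rows = conn.execute(
    """SELECT name FROM employees
       WHERE department = 'engineering' AND salary > 100
       ORDER BY salary DESC"""
).fetchall()
print([r[0] for r in rows])  # ['Grace', 'Ada']
```

    The same statement runs, give or take dialect details, on any of the relational databases named above; that portability is what made SQL the lingua franca.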

    (Sadly, Raymond Boyce never had a chance to witness SQL’s success. He died of a brain aneurysm 1 month after giving one of the earliest SQL presentations, just 26 years of age, leaving behind a wife and young daughter.)

    For a while, it seemed like SQL had successfully fulfilled its mission. But then the Internet happened.

    While Chamberlin and Boyce were developing SQL, what they didn’t realize was that a second group of engineers in California was working on another budding project that would later widely proliferate and threaten SQL’s existence. That project was ARPANET, and on October 29, 1969, it was born.

    Some of the creators of ARPANET, which eventually evolved into today’s Internet (source)

    But SQL was actually fine until another engineer showed up and invented the World Wide Web, in 1989.

    The physicist who invented the Web (source)

    Like a weed, the Internet and Web flourished, massively disrupting our world in countless ways, but for the data community it created one particular headache: new sources generating data at much higher volumes and velocities than before.

    As the Internet continued to grow and grow, the software community found that the relational databases of that time couldn’t handle this new load. There was a disturbance in the force, as if a million databases cried out and were suddenly overloaded.

    Then two new Internet giants made breakthroughs, and developed their own distributed non-relational systems to help with this new onslaught of data: MapReduce (published 2004) and Bigtable (published 2006) by Google, and Dynamo (published 2007) by Amazon. These seminal papers led to even more non-relational databases, including Hadoop (based on the MapReduce paper, 2006), Cassandra (heavily inspired by both the Bigtable and Dynamo papers, 2008) and MongoDB (2009). Because these were new systems largely written from scratch, they also eschewed SQL, leading to the rise of the NoSQL movement.

    And boy did the software developer community eat up NoSQL, embracing it arguably much more broadly than the original Google/Amazon authors intended. It’s easy to understand why: NoSQL was new and shiny; it promised scale and power; it seemed like the fast path to engineering success. But then the problems started appearing.

    Classic software developer tempted by NoSQL. Don’t be this guy.

    Developers soon found that not having SQL was actually quite limiting. Each NoSQL database offered its own unique query language, which meant:

    • more languages to learn (and to teach to your coworkers);
    • increased difficulty in connecting these databases to applications, leading to tons of brittle glue code;
    • a lack of a third-party ecosystem, requiring companies to develop their own operational and visualization tools.

    These NoSQL languages, being new, were also not fully developed. For example, there had been years of work in relational databases to add necessary features to SQL (e.g., JOINs); the immaturity of NoSQL languages meant more complexity was needed at the application level. The lack of JOINs also led to denormalization, which led to data bloat and rigidity.
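
    The cost of a missing JOIN can be shown in a few lines: the same result computed once by the database and once by hand-written application code, sketched here with Python’s built-in sqlite3 and an invented two-table schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE orders (user_id INTEGER, item TEXT);
    INSERT INTO users VALUES (1, 'Ada'), (2, 'Joan');
    INSERT INTO orders VALUES (1, 'keyboard'), (1, 'mouse'), (2, 'monitor');
""")

# With SQL: the database performs the join in one declarative statement.
joined = conn.execute(
    "SELECT u.name, o.item FROM users u JOIN orders o ON u.id = o.user_id"
).fetchall()

# Without JOIN support: fetch both collections and stitch them together in
# application code -- the brittle "glue" the post describes.
users = dict(conn.execute("SELECT id, name FROM users").fetchall())
manual = [(users[uid], item)
          for uid, item in conn.execute("SELECT user_id, item FROM orders")]

print(sorted(joined) == sorted(manual))  # True
```

    Multiply the manual version by every query in an application and the appeal of pushing this work back into the database becomes clear.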

    Some NoSQL databases added their own “SQL-like” query languages, like Cassandra’s CQL. But this often made the problem worse. Using an interface that is almost identical to something more common actually created more mental friction: engineers didn’t know what was supported and what wasn’t.

    SQL-like query languages are like the Star Wars Holiday Special. Accept no imitations. (And always avoid the Star Wars Holiday Special.)

    Some in the community saw the problems with NoSQL early on (e.g., DeWitt and Stonebraker in 2008). Over time, through hard-earned scars of personal experience, more and more software developers joined them.

    Initially seduced by the dark side, the software community began to see the light and come back to SQL.

    First came the SQL interfaces on top of Hadoop (and later, Spark), leading the industry to “back-cronym” NoSQL to “Not Only SQL” (yeah, nice try).

    Then came the rise of NewSQL: new scalable databases that fully embraced SQL. H-Store (published 2008) from MIT and Brown researchers was one of the first scale-out OLTP databases. Google again led the way for a geo-replicated SQL-interfaced database with their first Spanner paper (published 2012) (whose authors include the original MapReduce authors), followed by other pioneers like CockroachDB (2014).

    At the same time, the PostgreSQL community began to revive, adding critical improvements like a JSON datatype (2012), and a potpourri of new features in PostgreSQL 10: better native support for partitioning and replication, full text search support for JSON, and more (release slated for later this year). Other companies like CitusDB (2016) and yours truly (TimescaleDB, released this year) found new ways to scale PostgreSQL for specialized data workloads.
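
    The kind of query a JSON datatype enables, filtering on fields inside semi-structured documents in plain SQL, can be sketched with SQLite’s JSON functions (available in most builds of Python’s bundled SQLite); PostgreSQL’s actual syntax differs, using json/jsonb columns with -> and ->> operators.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (payload TEXT)")  # JSON documents stored as text
conn.executemany("INSERT INTO events VALUES (?)", [
    ('{"type": "click", "user": "ada"}',),
    ('{"type": "view",  "user": "joan"}',),
    ('{"type": "click", "user": "joan"}',),
])
# Filter on a field inside the JSON document, entirely in SQL:
rows = conn.execute(
    "SELECT json_extract(payload, '$.user') FROM events "
    "WHERE json_extract(payload, '$.type') = 'click'"
).fetchall()
print(sorted(r[0] for r in rows))  # ['ada', 'joan']
```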

    In fact, our journey developing TimescaleDB closely mirrors the path the industry has taken. Early internal versions of TimescaleDB featured our own SQL-like query language called “ioQL.” Yes, we too were tempted by the dark side: building our own query language felt powerful. But while it seemed like the easy path, we soon realized that we’d have to do a lot more work: e.g., deciding syntax, building various connectors, educating users, etc. We also found ourselves constantly looking up the proper syntax to queries that we could already express in SQL, for a query language we had written ourselves!

    One day we realized that building our own query language made no sense. That the key was to embrace SQL. And that was one of the best design decisions we have made. Immediately a whole new world opened up. Today, even though we are just a 5-month-old database, our users can use us in production and get all kinds of wonderful things out of the box: visualization tools (Tableau), connectors to common ORMs, a variety of tooling and backup options, an abundance of tutorials and syntax explanations online, etc.

    Google has clearly been on the leading edge of data engineering and infrastructure for over a decade now. It behooves us to pay close attention to what they are doing.

    Take a look at Google’s second major Spanner paper, released just four months ago (Spanner: Becoming a SQL System, May 2017), and you’ll find that it bolsters our independent findings.

    For example, Google began building on top of Bigtable, but then found that the lack of SQL created problems (emphasis in all quotes below ours):

    “While these systems provided some of the benefits of a database system, they lacked many traditional database features that application developers often rely on. A key example is a robust query language, meaning that developers had to write complex code to process and aggregate the data in their applications. As a result, we decided to turn Spanner into a full featured SQL system, with query execution tightly integrated with the other architectural features of Spanner (such as strong consistency and global replication).”

    Later in the paper they further capture the rationale for their transition from NoSQL to SQL:

    The original API of Spanner provided NoSQL methods for point lookups and range scans of individual and interleaved tables. While NoSQL methods provided a simple path to launching Spanner, and continue to be useful in simple retrieval scenarios, SQL has provided significant additional value in expressing more complex data access patterns and pushing computation to the data.

    The paper also describes how the adoption of SQL doesn’t stop at Spanner, but actually extends across the rest of Google, where multiple systems today share a common SQL dialect:

    Spanner’s SQL engine shares a common SQL dialect, called “Standard SQL”, with several other systems at Google including internal systems such as F1 and Dremel (among others), and external systems such as BigQuery… For users within Google, this lowers the barrier of working across the systems. A developer or data analyst who writes SQL against a Spanner database can transfer their understanding of the language to Dremel without concern over subtle differences in syntax, NULL handling, etc.

    The success of this approach speaks for itself. Spanner is already the “source of truth” for major Google systems, including AdWords and Google Play, while “Potential Cloud customers are overwhelmingly interested in using SQL.”

    Considering that Google helped initiate the NoSQL movement in the first place, it is quite remarkable that it is embracing SQL today. (Leading some to recently wonder: “Did Google Send the Big Data Industry on a 10 Year Head Fake?”.)

    In computer networking, there is a concept called the “narrow waist.”

    This idea emerged to solve a key problem: On any given networked device, imagine a stack, with layers of hardware at the bottom and layers of software on top. There can exist a variety of networking hardware; similarly there can exist a variety of software and applications. One needs a way to ensure that no matter the hardware, the software can still connect to the network; and no matter the software, that the networking hardware knows how to handle the network requests.

    The Networking Narrow Waist (source)

    In networking, the role of the narrow waist is played by the Internet Protocol (IP), acting as a common interface between lower-level networking protocols designed for local-area networks and higher-level application and transport protocols. (Here’s one nice explanation.) And (in a broad oversimplification), this common interface became the lingua franca for computers, enabling networks to interconnect, devices to communicate, and this “network of networks” to grow into today’s rich and varied Internet.

    We believe that SQL has become the narrow waist for data analysis.

    We live in an era where data is becoming “the world’s most valuable resource” (The Economist, May 2017). As a result, we have seen a Cambrian explosion of specialized databases (OLAP, time-series, document, graph, etc.), data processing tools (Hadoop, Spark, Flink), data buses (Kafka, RabbitMQ), etc. We also have more applications that need to rely on this data infrastructure, whether third-party data visualization tools (Tableau, Grafana, PowerBI, Superset), web frameworks (Rails, Django) or custom-built data-driven applications.

    Like networking we have a complex stack, with infrastructure on the bottom and applications on top. Typically, we end up writing a lot of glue code to make this stack work. But glue code can be brittle: it needs to be maintained and tended to.

    What we need is a common interface that allows pieces of this stack to communicate with one another. Ideally something already standardized in the industry. Something that would allow us to swap in/out various layers with minimal friction.

    That is the power of SQL. Like IP, SQL is a common interface.

    But SQL is in fact much more than IP. Because data also gets analyzed by humans. And true to the purpose that SQL’s creators initially assigned to it, SQL is readable.

    Is SQL perfect? No, but it is the language that most of us in the community know. And while there are already engineers out there working on a more natural language oriented interface, what will those systems then connect to? SQL.

    So there is another layer at the very top of the stack. And that layer is us.

    SQL is back. Not just because writing glue code to kludge together NoSQL tools is annoying. Not just because retraining workforces to learn a myriad of new languages is hard. Not just because standards can be a good thing.

    But also because the world is filled with data. It surrounds us, binds us. At first, we relied on our human senses and sensory nervous systems to process it. Now our software and hardware systems are also getting smart enough to help us. And as we collect more and more data to make better sense of our world, the complexity of our systems to store, process, analyze, and visualize that data will only continue to grow as well.

    Master Data Scientist Yoda

    Either we can live in a world of brittle systems and a million interfaces. Or we can continue to embrace SQL. And restore balance to the force.

    Like this post? Please recommend and/or share.

    And if you’d like to learn more about TimescaleDB, please check out our GitHub (stars always appreciated), and please let us know how we can help.

    Suggested reading for those who’d like to learn more about the history of databases (aka syllabus for the future TimescaleDB Intro to Databases Class):

    • A Relational Model of Data for Large Shared Data Banks (IBM Research, 1970)
    • SEQUEL: A Structured English Query Language (IBM Research, 1974)
    • System R: Relational Approach to Database Management (IBM Research, 1976)
    • MapReduce: Simplified Data Processing on Large Clusters (Google, 2004)
    • C-Store: A Column-oriented DBMS (MIT, others, 2005)
    • Bigtable: A Distributed Storage System for Structured Data (Google, 2006)
    • Dynamo: Amazon’s Highly Available Key-value Store (Amazon, 2007)
    • MapReduce: A major step backwards (DeWitt, Stonebraker, 2008)
    • H-Store: A High-Performance, Distributed Main Memory Transaction Processing System (MIT, Brown, others, 2008)
    • Spark: Cluster Computing with Working Sets (UC Berkeley, 2010)
    • Spanner: Google’s Globally-Distributed Database (Google, 2012)
    • Early History of SQL (Chamberlin, 2012)
    • How the Internet was Born (Hines, 2015)
    • Spanner: Becoming a SQL System (Google, 2017)