Posts by Infoman

2017/09/27
  • Yahoo open sources its search engine Vespa

    182 mkagenius 6 hrs 22

    https://www.oath.com/press/open-sourcing-vespa-yahoo-s-big-data-processing-and-serving-eng/

    http://news.ycombinator.com/item?id=15345483

    By Jon Bratseth, Distinguished Architect, Vespa


    Ever since we open sourced Hadoop in 2006, Yahoo – and now, Oath – has been committed to opening up its big data infrastructure to the larger developer community. Today, we are taking another major step in this direction by making Vespa, Yahoo's big data processing and serving engine, available as open source on GitHub.

    Vespa architecture overview

    Building applications increasingly means dealing with huge amounts of data. While developers can use the Hadoop stack to store and batch-process big data, and Storm to stream-process data, these technologies do not help with serving results to end users. Serving is challenging at large scale, especially when it is necessary to compute quickly over data while a user is waiting, as with applications that feature search, recommendation, and personalization.

    By releasing Vespa, we are making it easy for anyone to build applications that can compute responses to user requests, over large datasets, in real time and at internet scale – capabilities that, until now, have been within reach of only a few large companies.

    Serving often involves more than looking up items by ID or computing a few numbers from a model. Many applications need to compute over large datasets at serving time. Two well-known examples are search and recommendation. To deliver a search result or a list of recommended articles to a user, you need to find all the items matching the query, determine how good each item is for the particular request using a relevance/recommendation model, organize the matches to remove duplicates, add navigation aids, and then return a response to the user. As these computations depend on features of the request, such as the user's query or interests, it won't do to compute the result upfront. It must be done at serving time, and since a user is waiting, it has to be done fast. Combining speedy completion of the aforementioned operations with the ability to perform them over large amounts of data requires a lot of infrastructure – distributed algorithms, data distribution and management, efficient data structures and memory management, and more. This is what Vespa provides in a neatly-packaged and easy to use engine.

    With over 1 billion users, we currently use Vespa across many different Oath brands – including Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, Flickr, and others – to process and serve billions of daily requests over billions of documents while responding to search queries, making recommendations, and providing personalized content and advertisements, to name just a few use cases. In fact, Vespa processes and serves content and ads almost 90,000 times every second with latencies in the tens of milliseconds. On Flickr alone, Vespa performs keyword and image searches on the scale of a few hundred queries per second on tens of billions of images. Additionally, Vespa makes direct contributions to our company's revenue stream by serving over 3 billion native ad requests per day via Yahoo Gemini, at a peak of 140k requests per second (per Oath internal data).

    With Vespa, our teams build applications that:

    • Select content items using SQL-like queries and text search
    • Organize all matches to generate data-driven pages
    • Rank matches by handwritten or machine-learned relevance models
    • Serve results with response times in the low milliseconds
    • Write data in real time, thousands of times per second per node
    • Grow, shrink, and re-configure clusters while serving and writing data

    To achieve both speed and scale, Vespa distributes data and computation over many machines without any single master as a bottleneck. Where conventional applications work by pulling data into a stateless tier for processing, Vespa instead pushes computations to the data. This involves managing clusters of nodes with background redistribution of data in case of machine failures or the addition of new capacity, implementing distributed low-latency query and processing algorithms, handling distributed data consistency, and a lot more. It's a ton of hard work!
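    For a flavor of what a serving-time query looks like, here is a minimal sketch of hitting a Vespa instance's HTTP search API from Python with a YQL query. The field names and the "default" ranking profile are illustrative assumptions, not taken from this announcement:

        import requests

        # Query a local Vespa instance over its HTTP search API (a sketch; the
        # document fields and rank profile name are assumptions).
        SEARCH_ENDPOINT = "http://localhost:8080/search/"

        params = {
            "yql": 'select * from sources * where default contains "vespa";',
            "hits": 10,
            "ranking": "default",
        }

        response = requests.get(SEARCH_ENDPOINT, params=params, timeout=5)
        response.raise_for_status()

        # Results arrive as JSON; each hit carries a relevance score computed
        # at request time by the ranking model.
        for hit in response.json().get("root", {}).get("children", []):
            print(hit.get("relevance"), hit.get("fields", {}).get("title"))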

    As the team behind Vespa, we have been working on developing search and serving capabilities ever since building alltheweb.com, which was later acquired by Yahoo. Over the last couple of years we have rewritten most of the engine from scratch to incorporate our experience into a modern technology stack. Vespa is larger in scope and lines of code than any open source project we've ever released. Now that it has been battle-proven on Yahoo's largest and most critical systems, we are pleased to release it to the world.

    Vespa gives application developers the ability to feed data and models of any size to the serving system and make the final computations at request time. This often produces a better user experience at lower cost (for buying and running hardware) and complexity compared to pre-computing answers to requests. Furthermore, it allows developers to work in a more interactive way, navigating and interacting with complex calculations in real time rather than starting offline jobs and checking the results later.

    Vespa can be run on premises or in the cloud. We provide both Docker images and rpm packages for Vespa, as well as guides for running them both on your own laptop or as an AWS cluster.

    We'll follow up this initial announcement with a series of posts on our blog showing how to build a real-world application with Vespa, but you can get started right now by following the getting started guide in our comprehensive documentation.

    Managing distributed systems is not easy. We have worked hard to make it easy to develop and operate applications on Vespa so that you can focus on creating features that make use of the ability to compute over large datasets in real time, rather than the details of managing clusters and data. You should be able to get an application up and running in less than ten minutes by following the documentation.

    We can't wait to see what you build with it!

  • 2018: The Starting Point of Hong Kong's Rate-Hike Cycle

    The US Federal Reserve has just announced that it will begin shrinking its balance sheet in October, marking the gradual withdrawal of the liquidity injected into markets over the past nine years and a return of global markets to their pre-2008 normal. Hong Kong officials have rushed to warn, repeatedly, of capital flight, surging Hong Kong dollar interest rates, and knock-on effects on businesses and livelihoods; the Exchange Fund has gone further, moving pre-emptively to issue additional bills and drain liquidity, pushing up overnight and short-term interbank rates. Yet the Hong Kong dollar operates under a currency-board linked exchange rate system, so HKD and USD interest rates should move in lockstep, and hot money tends to run a step ahead. The claim that capital outflows will force rate hikes is therefore both puzzling and confusing.

    Understanding the movement of hot money

    To understand why hot money is retreating, one must first understand why it flooded in. When the global financial tsunami erupted in 2008, lenders and borrowers each pulled back to protect themselves, money markets froze, and the US Federal Reserve took the lead in launching quantitative easing, repeatedly purchasing financial debt to restart market liquidity and restore normal trading order.

    Quantitative easing, in essence, substitutes central-bank credit for the original debt credit, untangling the deadlock of chained debts and loosening markets. The Fed's assets and liabilities expand in tandem: the assets are the debt purchased, and the liabilities are banks' clearing balances, which count as deposit reserves. As bank reserves rise, the loan-and-deposit cycle expands, assets and liabilities grow together, and new blood is injected into the economy, restoring its vitality.

    At the same time, the Fed cut interest rates to zero to reinforce the effect, lowering the debt-service burden on governments and households alike and accelerating the loosening. Over the years, markets digested this new world of easy money and zero rates; now that the Fed has confirmed the end of its easy monetary policy and begun shrinking the balance sheet alongside interest-rate normalization, markets must adapt anew. After rates began rising earlier, balance-sheet reduction was widely expected, but setting things right cannot happen overnight.

    Under quantitative easing, US dollars flooded out in search of speculative and investment returns, and the Hong Kong dollar bore the brunt for several reasons. First, the HKD is pegged to the USD, carrying almost no exchange-rate risk. Second, HKD interest rates followed USD rates downward, pushing up real and financial asset prices – in effect, asset inflation. Third, after the renminbi exchange-rate reform the currency appreciated year after year, and Hong Kong's thriving offshore market attracted foreign capital to use the HKD as a conduit for speculation and rent-seeking. By official estimates, in the 15 months from the last quarter of 2008 to the end of 2009, hot-money inflows totalled as much as HK$640 billion equivalent.

    As hot money flowed in, Hong Kong banks' overseas US dollar deposits (assets) and customers' HKD deposits (liabilities) rose in tandem, the ratio of clearing balances to demand deposits (checking and savings) fell correspondingly, and banks had to convert funds with the Exchange Fund into Hong Kong dollars to replenish their balances.

    Under the objective laws of supply and demand, the HKD exchange rate rose above the official parity and eventually hit the Exchange Fund's strong-side convertibility undertaking, while the HKD-USD interest-rate spread widened to sustain the fixed exchange rate.

    The record shows that after quantitative easing began in October 2008, the HKD exchange rate strengthened from the official 7.8 to 7.52; the banks' clearing-balance ratio jumped to 10%, ten times its previous level; overnight and one-week HKD interbank rates (HIBOR) fell below 1%, and below 0.5% by year-end; the one-month rate also broke below 0.5% by year-end; and banks' posted deposit and lending rates were cut accordingly.

    The HKD interest-rate structure is distorted

    In hindsight, the influx of hot money had three hallmarks: first, the HKD exchange rate surged to the strong-side convertibility undertaking (7.75); second, liquidity in the HKD interbank market surged, with the banks' clearing-balance ratio jumping; third, HKD interest rates plunged, tracking the Base Rate (the discount-window rate), which is set at the US federal funds target rate (the interbank rate) plus 0.5%.

    With the Fed ending quantitative easing and beginning to recall funds, the retreat of hot money likewise has three hallmarks: first, the HKD weakening back to the official parity (7.8); second, liquidity in the HKD interbank market falling, with the clearing-balance ratio returning to normal; third, HKD interest rates rising in step with the Base Rate. Checking against recent market conditions, two of these hallmarks have already appeared: the HKD exchange rate has fallen back to the official parity, and the clearing-balance ratio has dropped back to 6%.

    After the 2008 global financial tsunami subsided, international regulators tightened standards in its aftermath, raising requirements for bank reserves and liquid assets, so the clearing-balance ratio can hardly return to its former 1% level; 5% may be the new normal. Hot money is "smart money" and usually moves first; with the Fed's balance-sheet reduction a settled matter, an early, pre-emptive withdrawal comes as no surprise. Yet HKD deposit and lending rates have still not normalized, leaving observers to wonder whether the hot money has actually left. If not, why has the HKD weakened while clearing balances have fallen?

    In fact, the HKD interest-rate structure has been distorted for the past nine years. Most obviously, the best lending rate (BLR, the prime rate) has never moved with the Base Rate, while the savings rate has sat near zero, all but nominal. The prime rate has historically moved in step with the savings rate, because savings deposits are the most stable source of retail funding and retail credit is priced off prime. The HKD Interest Rate Rules were fully abolished in 2001, but in practice the two remain closely linked.

    Over the 30 years from 1971 to 2001, the prime rate averaged roughly 2.5 times the savings rate. Before the global financial tsunami, prime was 5%, the savings rate about 2%, and the Base Rate 3.5%. Today the Base Rate is 1.5%; applying the same ratios, the savings rate should be about 0.85% (= 1.5 × (2/3.5)) and prime roughly 2.125%. Mortgage rates corroborate this: whether quoted as "HIBOR plus" or "BLR minus", they sit at just over 2%, matching the estimate. Setting aside the distortions and their after-effects, HKD interbank rates have in fact been drifting back toward normal, so the claim that HKD rates have yet to rise mistakes the part for the whole. All three withdrawal hallmarks should therefore be present, and the reasonable inference is that speculative hot money, reading the signs early, has already taken its profits and left.
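    A quick back-of-the-envelope check of this arithmetic (a sketch in Python; the pre-crisis rates and the 2.5× prime-to-savings ratio are the article's own figures):

        # The article's figures: pre-crisis prime 5%, savings 2%, Base Rate 3.5%;
        # prime averaged ~2.5x the savings rate over 1971-2001.
        pre_crisis_savings = 2.0    # percent
        pre_crisis_base = 3.5       # percent
        prime_to_savings = 2.5      # long-run average ratio

        current_base = 1.5          # percent, today's Base Rate

        # Scale the savings rate in proportion to the Base Rate, then apply the
        # historical prime-to-savings ratio.
        est_savings = current_base * (pre_crisis_savings / pre_crisis_base)
        est_prime = prime_to_savings * est_savings

        print(f"estimated savings rate: {est_savings:.2f}%")  # ~0.86%
        print(f"estimated prime rate:   {est_prime:.2f}%")    # ~2.14%, near the article's 2.125%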

    Issuing bills to drain funds: a new bottle for the same wine

    That said, the analysis above remains incomplete. All capital moving in or out, whether for investment or speculation, ultimately "settles" in the monetary and financial accounts on banks' asset side, namely clearing balances, Exchange Fund bills, and the note-issuance reserve. Aggregate changes there thus reveal capital flows, and abrupt short-term swings reflect hot money moving in and out. Banking statistics, however, do not distinguish onshore from offshore: converting HKD deposits into foreign-currency deposits shifts funds from the onshore book to the offshore book, which in theory is a capital outflow (and the reverse an inflow), even though no money actually moves in or out.

    Large swings in offshore foreign-currency deposits can therefore move clearing balances and create the illusion of hot money entering or leaving. In fact, changes in offshore foreign-currency deposits (including renminbi) also affect the HKD exchange rate, because such transactions are intermediated through the US dollar: if deposit expansion pushes USD demand above supply, the HKD softens; if deposit contraction pushes USD supply above demand, the HKD firms. Hence neither HKD strength or weakness nor moves in interbank rates alone can settle the question of capital flows; only by cross-checking multiple indicators can one see the full picture.

    The US began normalizing rates and liquidity long ago, yet Hong Kong banks have been slow to adjust their posted deposit and lending rates. The root cause is that those rates never followed the market over the past nine years, so where is the room to raise them? The Interest Rate Rules were abolished long ago, yet they seem to live on in all but name, which is baffling. Still, with US monetary normalization now in sight, Hong Kong's bind will also resolve in time; there is no need for counter-cyclical operations that pull on the seedlings to make them grow.

    Hot money is by nature highly mobile and highly alert; how could it lag behind events? The latest monetary indicators match the hallmarks of withdrawal. Yet the Exchange Fund, impatient to treat the symptoms, has twice issued additional bills to drain funds, apparently with other calculations in mind. In truth, Hong Kong banks have been more than cautious over the past nine years: at the end of September 2008 the reserve ratio was 37% and the HKD loan-to-deposit ratio 80%; by the end of June this year the reserve ratio had risen to 43% and the loan-to-deposit ratio had fallen to 68%.

    Issuing bills to drain funds changes the bottle but not the wine: the HKD monetary base is not actually tightened, so the move neither addresses the withdrawal of hot money nor helps normalize HKD interest rates. Instead it signals monetary tightening, making banks even more cautious on the eve of a rate-hike cycle – all harm and no benefit to the economy.

    鄭宏泰 is Assistant Director of the Hong Kong Institute of Asia-Pacific Studies at the Chinese University of Hong Kong; 陸觀豪 is a retired banker and Honorary Research Fellow at the same institute.

  • Why SQL is beating NoSQL, and what this means for the future of data

    104 nreece 7 hrs 45

    https://blog.timescale.com/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a

    http://news.ycombinator.com/item?id=15335717

    After years of being left for dead, SQL today is making a comeback. How come? And what effect will this have on the data community?

    SQL awakens to fight the dark forces of NoSQL

    Since the dawn of computing, we have been collecting exponentially growing amounts of data, constantly asking more from our data storage, processing, and analysis technology. In the past decade, this caused software developers to cast aside SQL as a relic that couldn’t scale with these growing data volumes, leading to the rise of NoSQL: MapReduce and Bigtable, Cassandra, MongoDB, and more.

    Yet today SQL is resurging. All of the major cloud providers now offer popular managed relational database services: e.g., Amazon RDS, Google Cloud SQL, Azure Database for PostgreSQL (Azure launched just this year). In Amazon’s own words, its PostgreSQL- and MySQL-compatible Aurora database product has been the “fastest growing service in the history of AWS”. SQL interfaces on top of Hadoop and Spark continue to thrive. And just last month, Kafka launched SQL support. Your humble authors themselves are developers of a new time-series database that fully embraces SQL.

    In this post we examine why the pendulum today is swinging back to SQL, and what this means for the future of the data engineering and analysis community.

    To understand why SQL is making a comeback, let’s start with why it was designed in the first place.

    Like all good stories, ours starts in the 1970s

    Our story starts at IBM Research in the early 1970s, where the relational database was born. At that time, query languages relied on complex mathematical logic and notation. Two newly minted PhDs, Donald Chamberlin and Raymond Boyce, were impressed by the relational data model but saw that the query language would be a major bottleneck to adoption. They set out to design a new query language that would be (in their own words): “more accessible to users without formal training in mathematics or computer programming.”

    Query languages before SQL (a, b) vs SQL (c) (source)

    Think about this. Way before the Internet, before the Personal Computer, when the programming language C was first being introduced to the world, two young computer scientists realized that, “much of the success of the computer industry depends on developing a class of users other than trained computer specialists.” They wanted a query language that was as easy to read as English, and that would also encompass database administration and manipulation.

    The result was SQL, first introduced to the world in 1974. Over the next few decades, SQL would prove to be immensely popular. As relational databases like System R, Ingres, DB2, Oracle, SQL Server, PostgreSQL, MySQL (and more) took over the software industry, SQL became established as the preeminent language for interacting with a database, and became the lingua franca for an increasingly crowded and competitive ecosystem.

    (Sadly, Raymond Boyce never had a chance to witness SQL’s success. He died of a brain aneurysm 1 month after giving one of the earliest SQL presentations, just 26 years of age, leaving behind a wife and young daughter.)

    For a while, it seemed like SQL had successfully fulfilled its mission. But then the Internet happened.

    While Chamberlin and Boyce were developing SQL, what they didn’t realize was that a second group of engineers in California was working on another budding project that would later proliferate widely and threaten SQL’s existence. That project was ARPANET, and on October 29, 1969, it was born.

    Some of the creators of ARPANET, which eventually evolved into today’s Internet (source)

    But SQL was actually fine until another engineer showed up and invented the World Wide Web, in 1989.

    The physicist who invented the Web (source)

    Like a weed, the Internet and Web flourished, massively disrupting our world in countless ways, but for the data community it created one particular headache: new sources generating data at much higher volumes and velocities than before.

    As the Internet continued to grow and grow, the software community found that the relational databases of that time couldn’t handle this new load. There was a disturbance in the force, as if a million databases cried out and were suddenly overloaded.

    Then two new Internet giants made breakthroughs, and developed their own distributed non-relational systems to help with this new onslaught of data: MapReduce (published 2004) and Bigtable (published 2006) by Google, and Dynamo (published 2007) by Amazon. These seminal papers led to even more non-relational databases, including Hadoop (based on the MapReduce paper, 2006), Cassandra (heavily inspired by both the Bigtable and Dynamo papers, 2008) and MongoDB (2009). Because these were new systems largely written from scratch, they also eschewed SQL, leading to the rise of the NoSQL movement.

    And boy did the software developer community eat up NoSQL, embracing it arguably much more broadly than the original Google/Amazon authors intended. It’s easy to understand why: NoSQL was new and shiny; it promised scale and power; it seemed like the fast path to engineering success. But then the problems started appearing.

    Classic software developer tempted by NoSQL. Don’t be this guy.

    Developers soon found that not having SQL was actually quite limiting. Each NoSQL database offered its own unique query language, which meant: more languages to learn (and to teach to your coworkers); increased difficulty in connecting these databases to applications, leading to tons of brittle glue code; a lack of a third party ecosystem, requiring companies to develop their own operational and visualization tools.

    These NoSQL languages, being new, were also not fully developed. For example, there had been years of work in relational databases to add necessary features to SQL (e.g., JOINs); the immaturity of NoSQL languages meant more complexity was needed at the application level. The lack of JOINs also led to denormalization, which led to data bloat and rigidity.
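    To make the JOIN point concrete, here is a small illustrative sketch using Python’s built-in sqlite3 (the two-table schema is invented for this example): with a relational JOIN, a user’s name lives in one row of one table, whereas a JOIN-less store must copy the name into every event record, bloating the data and making renames rigid.

        import sqlite3

        # Invented schema: users and their events, joined at query time.
        db = sqlite3.connect(":memory:")
        db.executescript("""
            CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, action TEXT);
            INSERT INTO users  VALUES (1, 'ada'), (2, 'grace');
            INSERT INTO events VALUES (1, 1, 'login'), (2, 1, 'query'), (3, 2, 'login');
        """)

        # One JOIN answers "who did what"; a denormalized store would instead
        # duplicate each user's name into all of that user's event records.
        rows = db.execute("""
            SELECT users.name, events.action
            FROM events JOIN users ON users.id = events.user_id
        """).fetchall()
        print(rows)  # [('ada', 'login'), ('ada', 'query'), ('grace', 'login')]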

    Some NoSQL databases added their own “SQL-like” query languages, like Cassandra’s CQL. But this often made the problem worse. Using an interface that is almost identical to something more common actually created more mental friction: engineers didn’t know what was supported and what wasn’t.

    SQL-like query languages are like the Star Wars Holiday Special. Accept no imitations. (And always avoid the Star Wars Holiday Special.)

    Some in the community saw the problems with NoSQL early on (e.g., DeWitt and Stonebraker in 2008). Over time, through hard-earned scars of personal experience, more and more software developers joined them.

    Initially seduced by the dark side, the software community began to see the light and come back to SQL.

    First came the SQL interfaces on top of Hadoop (and later, Spark), leading the industry to “back-cronym” NoSQL to “Not Only SQL” (yeah, nice try).

    Then came the rise of NewSQL: new scalable databases that fully embraced SQL. H-Store (published 2008) from MIT and Brown researchers was one of the first scale-out OLTP databases. Google again led the way for a geo-replicated SQL-interfaced database with their first Spanner paper (published 2012) (whose authors include the original MapReduce authors), followed by other pioneers like CockroachDB (2014).

    At the same time, the PostgreSQL community began to revive, adding critical improvements like a JSON datatype (2012), and a potpourri of new features in PostgreSQL 10: better native support for partitioning and replication, full text search support for JSON, and more (release slated for later this year). Other companies like CitusDB (2016) and yours truly (TimescaleDB, released this year) found new ways to scale PostgreSQL for specialized data workloads.

    In fact, our journey developing TimescaleDB closely mirrors the path the industry has taken. Early internal versions of TimescaleDB featured our own SQL-like query language called “ioQL.” Yes, we too were tempted by the dark side: building our own query language felt powerful. But while it seemed like the easy path, we soon realized that we’d have to do a lot more work: e.g., deciding syntax, building various connectors, educating users, etc. We also found ourselves constantly looking up the proper syntax to queries that we could already express in SQL, for a query language we had written ourselves!

    One day we realized that building our own query language made no sense. That the key was to embrace SQL. And that was one of the best design decisions we have made. Immediately a whole new world opened up. Today, even though we are just a 5-month-old database, our users can use us in production and get all kinds of wonderful things out of the box: visualization tools (Tableau), connectors to common ORMs, a variety of tooling and backup options, an abundance of tutorials and syntax explanations online, etc.

    Google has clearly been on the leading edge of data engineering and infrastructure for over a decade now. It behooves us to pay close attention to what they are doing.

    Take a look at Google’s second major Spanner paper, released just four months ago (Spanner: Becoming a SQL System, May 2017), and you’ll find that it bolsters our independent findings.

    For example, Google began building on top of Bigtable, but then found that the lack of SQL created problems (emphasis in all quotes below ours):

    “While these systems provided some of the benefits of a database system, they lacked many traditional database features that application developers often rely on. A key example is a robust query language, meaning that developers had to write complex code to process and aggregate the data in their applications. As a result, we decided to turn Spanner into a full featured SQL system, with query execution tightly integrated with the other architectural features of Spanner (such as strong consistency and global replication).”

    Later in the paper they further capture the rationale for their transition from NoSQL to SQL:

    “The original API of Spanner provided NoSQL methods for point lookups and range scans of individual and interleaved tables. While NoSQL methods provided a simple path to launching Spanner, and continue to be useful in simple retrieval scenarios, SQL has provided significant additional value in expressing more complex data access patterns and pushing computation to the data.”

    The paper also describes how the adoption of SQL doesn’t stop at Spanner, but actually extends across the rest of Google, where multiple systems today share a common SQL dialect:

    “Spanner’s SQL engine shares a common SQL dialect, called ‘Standard SQL’, with several other systems at Google including internal systems such as F1 and Dremel (among others), and external systems such as BigQuery… For users within Google, this lowers the barrier of working across the systems. A developer or data analyst who writes SQL against a Spanner database can transfer their understanding of the language to Dremel without concern over subtle differences in syntax, NULL handling, etc.”

    The success of this approach speaks for itself. Spanner is already the “source of truth” for major Google systems, including AdWords and Google Play, while “Potential Cloud customers are overwhelmingly interested in using SQL.”

    Considering that Google helped initiate the NoSQL movement in the first place, it is quite remarkable that it is embracing SQL today. (Leading some to recently wonder: “Did Google Send the Big Data Industry on a 10 Year Head Fake?”.)

    In computer networking, there is a concept called the “narrow waist.”

    This idea emerged to solve a key problem: On any given networked device, imagine a stack, with layers of hardware at the bottom and layers of software on top. There can exist a variety of networking hardware; similarly there can exist a variety of software and applications. One needs a way to ensure that no matter the hardware, the software can still connect to the network; and no matter the software, that the networking hardware knows how to handle the network requests.

    The Networking Narrow Waist (source)

    In networking, the role of the narrow waist is played by the Internet Protocol (IP), acting as a common interface between lower-level networking protocols designed for local-area networks, and higher-level application and transport protocols. (Here’s one nice explanation.) And (in a broad oversimplification), this common interface became the lingua franca for computers, enabling networks to interconnect, devices to communicate, and this “network of networks” to grow into today’s rich and varied Internet.

    We believe that SQL has become the narrow waist for data analysis.

    We live in an era where data is becoming “the world’s most valuable resource” (The Economist, May 2017). As a result, we have seen a Cambrian explosion of specialized databases (OLAP, time-series, document, graph, etc.), data processing tools (Hadoop, Spark, Flink), data buses (Kafka, RabbitMQ), etc. We also have more applications that need to rely on this data infrastructure, whether third-party data visualization tools (Tableau, Grafana, PowerBI, Superset), web frameworks (Rails, Django) or custom-built data-driven applications.

    Like networking we have a complex stack, with infrastructure on the bottom and applications on top. Typically, we end up writing a lot of glue code to make this stack work. But glue code can be brittle: it needs to be maintained and tended to.

    What we need is a common interface that allows pieces of this stack to communicate with one another. Ideally something already standardized in the industry. Something that would allow us to swap in/out various layers with minimal friction.

    That is the power of SQL. Like IP, SQL is a common interface.

    But SQL is in fact much more than IP. Because data also gets analyzed by humans. And true to the purpose that SQL’s creators initially assigned to it, SQL is readable.

    Is SQL perfect? No, but it is the language that most of us in the community know. And while there are already engineers out there working on a more natural language oriented interface, what will those systems then connect to? SQL.

    So there is another layer at the very top of the stack. And that layer is us.

    SQL is back. Not just because writing glue code to kludge together NoSQL tools is annoying. Not just because retraining workforces to learn a myriad of new languages is hard. Not just because standards can be a good thing.

    But also because the world is filled with data. It surrounds us, binds us. At first, we relied on our human senses and sensory nervous systems to process it. Now our software and hardware systems are also getting smart enough to help us. And as we collect more and more data to make better sense of our world, the complexity of our systems to store, process, analyze, and visualize that data will only continue to grow as well.

    Master Data Scientist Yoda

    Either we can live in a world of brittle systems and a million interfaces. Or we can continue to embrace SQL. And restore balance to the force.

    Like this post? Please recommend and/or share.

    And if you’d like to learn more about TimescaleDB, please check out our GitHub (stars always appreciated), and please let us know how we can help.

    Suggested reading for those who’d like to learn more about the history of databases (aka syllabus for the future TimescaleDB Intro to Databases Class):

    1. A Relational Model of Data for Large Shared Data Banks (IBM Research, 1970)
    2. SEQUEL: A Structured English Query Language (IBM Research, 1974)
    3. System R: Relational Approach to Database Management (IBM Research, 1976)
    4. MapReduce: Simplified Data Processing on Large Clusters (Google, 2004)
    5. C-Store: A Column-oriented DBMS (MIT, others, 2005)
    6. Bigtable: A Distributed Storage System for Structured Data (Google, 2006)
    7. Dynamo: Amazon’s Highly Available Key-value Store (Amazon, 2007)
    8. MapReduce: A major step backwards (DeWitt, Stonebraker, 2008)
    9. H-Store: A High-Performance, Distributed Main Memory Transaction Processing System (MIT, Brown, others, 2008)
    10. Spark: Cluster Computing with Working Sets (UC Berkeley, 2010)
    11. Spanner: Google’s Globally-Distributed Database (Google, 2012)
    12. Early History of SQL (Chamberlin, 2012)
    13. How the Internet was Born (Hines, 2015)
    14. Spanner: Becoming a SQL System (Google, 2017)

  • YC’s Essential Startup Advice

    258 craigcannon 3 hrs 98

    https://blog.ycombinator.com/ycs-essential-startup-advice/

    http://news.ycombinator.com/item?id=15331016

    A lot of the advice we give startups is tactical, meant to be helpful on a day-to-day or week-to-week basis. But some advice is more fundamental. We’ve collected here what we at YC consider the most important, most transformative advice for startups. Whether common sense or counter-intuitive, the guidance below will help most startups find their path to success.

    The first thing we always tell founders is to launch their product right away, for the simple reason that this is the only way to fully understand customers’ problems and whether the product meets their needs. Surprisingly, launching a mediocre product as soon as possible, and then talking to customers and iterating, is much better than waiting to build the “perfect” product. This is true as long as the product contains a “quantum of utility” (Do Things That Don’t Scale by Paul Graham) for customers: enough value that it overwhelms whatever warts come with it.

    Once launched, we suggest founders do things that don’t scale (Do Things That Don’t Scale by Paul Graham). Many startup advisors persuade startups to scale way too early. This will require the building of technology and processes to support that scaling, which, if premature, will be a waste of time and effort. This strategy often leads to failure and even startup death. Rather, we tell startups to get their first customer by any means necessary, even by manual work that couldn’t be managed for more than ten, much less 100 or 1000 customers. At this stage, founders are still trying to figure out what needs to be built and the best way to do that is talk directly to customers. For example, the Airbnb founders originally offered to “professionally” photograph the homes and apartments of their earliest customers in order to make their listings more attractive to renters. Then, they went and took the photographs themselves. The listings on their site improved, conversions improved, and they had amazing conversations with their customers. This was entirely unscalable, yet proved essential in learning how to build a vibrant marketplace.

    Talking to users usually yields a long, complicated list of features to build. One piece of advice that YC partner Paul Buchheit (PB) always gives in this case is to look for the “90/10 solution”. That is, look for a way in which you can accomplish 90% of what you want with only 10% of the work/effort/time. If you search hard for it, there is almost always a 90/10 solution available. Most importantly, a 90% solution to a real customer problem that is available right away is much better than a 100% solution that takes ages to build.

    As companies begin to grow there are often tons of potential distractions: conferences, dinners, meetings with venture capitalists or large company corporate development types (Don’t Talk to Corp Dev by Paul Graham), chasing after press coverage, and so on. (YC co-founder Jessica Livingston created a pretty comprehensive list of the wrong things on which to focus [How Not To Fail by Jessica Livingston].) We always remind founders not to lose sight of the fact that the most important tasks for an early stage company are to write code and talk to users. For any company, software or otherwise, this means that in order to make something people want: you must launch something, talk to your users to see if it serves their needs, and then take their feedback and iterate. These tasks should occupy almost all of your time/focus. For great companies this cycle never ends.

    Similarly, as your company evolves there will be many times when founders are forced to choose between multiple directions for their company. Sam Altman always points out that it is nearly always better to take the more ambitious path. It is actually extraordinary how often founders manage to avoid tackling these sorts of problems and focus on other things. Sam calls this “fake work”, because it tends to be more fun than real work (The Post YC Slump by Sam Altman).

    When it comes to customers, most founders don’t realize that they get to choose customers as much as customers get to choose them. We often say that a small group of customers who love you is better than a large group who kind of like you. In other words, recruiting 10 customers who have a burning problem is much better than 1000 customers who have a passing annoyance. It is easy to make mistakes when choosing your customers, so sometimes it’s also critical for startups to fire their customers. Some customers can cost way more than they provide in either revenue or learning. For example, Justin.tv/Twitch only became a breakout success when they focused their efforts toward video game broadcasters and away from people trying to stream copyrighted content (Users You Don’t Want by Michael Seibel).

    Growth is always a focus for startups, since a startup without growth is usually a failure. However, how and when to grow is often misunderstood. YC is sometimes criticised for pushing companies to grow at all costs, but in fact we push companies to talk to their users, build what they want, and iterate quickly. Growth is a natural result of doing these three things successfully. Yet, growth is not always the right choice. If you have not yet made something your customers want – in other words, found product-market fit – it makes little sense to grow (The Real Product Market Fit by Michael Seibel). Poor retention is always the result. Also, if you have an unprofitable product, growth merely drains cash from the company. As PB likes to say, it never makes sense to take 80 cents from a customer and then hand them a dollar back. The fact that unit economics really matter shouldn’t come as a surprise, but too many startups seem to forget this basic fact (Unit Economics by Sam Altman).

    Startup founders’ intuition is always to do more, whereas the best strategy is almost always to do less, really well. For example, founders are frequently tempted to chase big deals with large companies which represent amazing, company-validating relationships. However, deals between large companies and tiny startups seldom end well for the startup. They take too long, cost too much, and often fail completely. One of the hardest things about doing a startup is choosing what to do, since you will always have an infinite list of things that could be done (Startup Priorities by Geoff Ralston). It is vital that very early on a startup choose the one or two key metrics it will use to measure success; founders should then choose what to do based nearly exclusively on how the task will impact those metrics. When your early stage product isn’t working, it is often tempting to immediately build new features to solve every problem the customer seems to have, rather than talking to the customer and focusing only on their most acute problem.

    Founders often find it surprising to hear that they shouldn’t worry if their company seems badly broken. It turns out that nearly every startup has deep, fundamental issues, even those that will end up being billion dollar companies. Success is not determined by whether you are broken at the beginning, but rather what the founders do about the inevitable problems. Your job as a founder will often seem to be continuously righting a capsized ship. This is normal.

    It is very difficult as a new startup founder not to obsess about competition, actual and potential. It turns out that spending any time worrying about your competitors is nearly always a very bad idea. We like to say that startup companies always die of suicide, not murder. There will come a time when competitive dynamics are intensely important to the success or failure of your company, but it is highly unlikely to be true in the first year or two.

    A few words on fundraising (A Guide to Seed Fundraising by Geoff Ralston). The first, best bit of advice is to raise money as quickly as possible and then get back to work. You can often tell when a company is fundraising just by looking at its growth curve: when it flattens out, they are raising money. Equally important is to understand that valuation is not equal to success or even probability of success (Fundraising Rounds are not Milestones by Michael Seibel). Some of Y Combinator’s very best companies raised on tiny initial valuations (Airbnb, Dropbox, and Twitch are all good examples). By the way, it is vital to remember that the money you raise IS NOT your money. You have a fiduciary and ethical/moral duty to spend the money only to improve the prospects of your company.

    It is also important to stay sane during the inevitable craziness of startup life. So we always tell founders to make sure they take breaks, spend time with friends and family, and get enough sleep and exercise in between bouts of extraordinarily intense, focused work. Lastly, a brief word on failure. It turns out most companies fail fast because founders fall out. The relationships with your cofounders matter more than you think, and open, honest communication between founders makes future debacles much less likely. In fact, one of the best things you can do to make your startup successful – indeed, to be successful in life – is to simply be nice (Mean People Fail by Paul Graham).

    The Pocket Guide of Essential YC Advice

    • Launch now
    • Build something people want
    • Do things that don’t scale
    • Find the 90/10 solution
    • Find 10-100 customers who love your product
    • All startups are badly broken at some point
    • Write code – talk to users
    • “It’s not your money”
    • Growth is the result of a great product, not the precursor
    • Don’t scale your team/product until you have built something people want
    • Valuation is not equal to success or even probability of success
    • Avoid long negotiated deals with big customers if you can
    • Avoid big company corporate development queries – they will only waste time
    • Avoid conferences unless they are the best way to get customers
    • Pre-product market fit – do things that don’t scale: remain small/nimble
    • Startups can only solve one problem well at any given time
    • Founder relationships matter more than you think
    • Sometimes you need to fire your customers (they might be killing you)
    • Ignore your competitors; you will more likely die of suicide than murder
    • Most companies don’t die because they run out of money
    • Be nice! Or at least don’t be a jerk
    • Get sleep and exercise – take care of yourself

    References

    Do Things That Don’t Scale by Paul Graham
    Don’t Talk to Corp Dev by Paul Graham
    How Not To Fail by Jessica Livingston
    The Post YC Slump by Sam Altman
    Users You Don’t Want by Michael Seibel
    The Real Product Market Fit by Michael Seibel
    Unit Economics by Sam Altman
    Startup Priorities by Geoff Ralston
    A Guide to Seed Fundraising by Geoff Ralston
    Fundraising Rounds are not Milestones by Michael Seibel
    Mean People Fail by Paul Graham

    Recommended Reading

    1. A Fundraising Survival Guide by Paul Graham
    2. How to Raise Money by Paul Graham
    3. Taking Advice by Aaron Harris
  • PixelNN – Example-Based Image Synthesis

    393 pentestercrab 10 hrs 118

    http://www.cs.cmu.edu/~aayushb/pixelNN/

    http://news.ycombinator.com/item?id=15328356

    We present a simple nearest-neighbor (NN) approach that synthesizes high-frequency photorealistic images from an "incomplete" signal such as a low-resolution image, a surface normal map, or edges. Current state-of-the-art deep generative models designed for such conditional image synthesis lack two important things: (1) they are unable to generate a large set of diverse outputs, due to the mode collapse problem; and (2) they are not interpretable, making it difficult to control the synthesized output. We demonstrate that NN approaches potentially address such limitations, but suffer in accuracy on small datasets. We design a simple pipeline that combines the best of both worlds: the first stage uses a convolutional neural network (CNN) to map the input to an (overly-smoothed) image, and the second stage uses a pixel-wise nearest-neighbor method to map the smoothed output to multiple high-quality, high-frequency outputs in a controllable manner. We demonstrate our approach for various input modalities, and for various domains ranging from human faces to cats-and-dogs to shoes and handbags.
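    As a rough illustration of the second stage (a sketch, not the authors' code), pixel-wise nearest neighbors can be written in a few lines of NumPy: for every pixel of the smoothed CNN output, find the closest pixel by feature distance among exemplar pixels and copy that exemplar's high-frequency value.

        import numpy as np

        def pixelwise_nn(smoothed, exemplar_feats, exemplar_pixels):
            """smoothed: (H, W, D) per-pixel features of the smoothed CNN output.
            exemplar_feats: (N, D) features of candidate exemplar pixels.
            exemplar_pixels: (N, 3) RGB values of those exemplar pixels."""
            H, W, D = smoothed.shape
            flat = smoothed.reshape(-1, D)                          # (H*W, D)
            # Squared Euclidean distance from each output pixel to each exemplar.
            d2 = ((flat[:, None, :] - exemplar_feats[None, :, :]) ** 2).sum(-1)
            nearest = d2.argmin(axis=1)                             # (H*W,)
            # Copy the matched exemplars' high-frequency pixels into the result.
            return exemplar_pixels[nearest].reshape(H, W, 3)

        # Toy usage with random data, just to show the shapes involved.
        out = pixelwise_nn(np.random.rand(8, 8, 4),
                           np.random.rand(100, 4),
                           np.random.rand(100, 3))
        print(out.shape)  # (8, 8, 3)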

    Paper

    PixelNN: Example-based Image Synthesis. A. Bansal, Y. Sheikh, and D. Ramanan

    arXiv | bibtex

    Comparison with Pix-to-Pix

    Multiple Outputs

    Edges-to-Faces

    Normals-to-Faces

    Edges-to-Cats-&-Dogs

    Normals-to-Cats-&-Dogs

    Frequency Analysis

    We performed frequency analysis via FFT to understand the frequency content in the output images.

    A. Bansal, B. Russell, and A. Gupta. Marr Revisited: 2D-3D Model Alignment via Surface Normal Prediction. In CVPR, 2016

    A. Bansal, X. Chen, B. Russell, A. Gupta, and D. Ramanan. PixelNet: Representation of the pixels, by the pixels, and for the pixels. In arXiv, 2017

    Comments, questions to Aayush Bansal.

  • How Web Pages Can Extend (or Drain) Mobile Device Battery Life

    Dr. Angela Nicoara on mobile browser energy consumption and ways developers can minimize energy use through design.

    by Jenn Webb | May 23, 2013

    According to recent Global Mobile Data Traffic Forecasts (PDF), the number of mobile-connected devices will surpass the world’s population this year, and by 2015, there will be 788 million mobile-only Internet users. A recent paper, “Who Killed My Battery: Analyzing Mobile Browser Energy Consumption” (PDF), pulled together by the Deutsche Telekom Innovation Center in Silicon Valley and Stanford University researchers and published in the ACM 21st International World Wide Web Conference (WWW 2012) proceedings (PDF), takes a look at the growing popularity of mobile web browsing and its effects on energy consumption.

    I reached out to Dr. Angela Nicoara, senior research scientist at the Deutsche Telekom Innovation Center in Silicon Valley who worked on the project, to find out why mobile browser energy consumption is a growing concern and what developers need to know going forward. Our interview follows. Dr. Nicoara will present the researchers’ findings in the “Who Killed My Battery: Analyzing Mobile Browser Energy Consumption” session at the Fluent 2013 conference next week in San Francisco, CA.

    Why is browser energy consumption becoming more of an issue with the growth of smartphones and mobile browsing?

    Dr. Angela Nicoara: Despite the explosive growth of smartphones and the growing popularity of mobile web browsing, their utility has been and will remain severely limited by battery life. Smartphones’ energy constraints are here to stay, and as such, optimizing the energy consumption of the phone browser while surfing the Web is of critical importance today and will remain so for the foreseeable future.

    Our research, “Who Killed My Battery: Analyzing Mobile Browser Energy Consumption,” has focused on solving two of the most important and difficult problems pertaining to energy consumption on smartphones: developing an infrastructure for measuring the precise energy used by a mobile browser to render web pages and developing techniques to offload browser-heavy computations to the cloud.

    A fundamental challenge arises from the power inefficiency of mobile web browsers at popular websites (e.g., financial, e-commerce, email, blogging, and news sites) and from how much energy is consumed to render a particular web page. Our work is the first of its kind to show how the structure of web pages can impact battery usage in mobile web browsers. Our research in this area has influenced, and will continue to influence, the computing industry through the design and implementation of an infrastructure for measuring the precise energy used by a mobile browser to render web pages.


    What tools and methods are used to measure mobile browser energy consumption?

    Dr. Nicoara: We developed novel techniques and tools to precisely measure the energy needed to render individual web elements, such as images, JavaScript, cascading style sheets (CSS), and plug-in objects, and we designed a system that has the potential to dramatically reduce the energy consumption of smartphones.

    Our results show that for popular websites, downloading and parsing CSS and JavaScript consumes a significant fraction of the total energy needed to render the web page. We also show that by redesigning websites, the energy needed to render web pages can be reduced substantially. Another fundamental challenge stems from estimating the point at which offloading browser computations to a remote server can save energy on the phone. Given the smartphone’s limited energy, there is a strong desire to minimize its work, which can be helped by performing expensive browser computations off the phone. We explored the possibilities that arise when offloading heavy computations to a server cloud to save energy.

    While researchers often assume that the energy consumed by a mobile operation can be measured through a high-level API that reports the battery level, we pioneered another approach: obtaining very precise, fine-grained measurements of the energy a mobile browser uses to render web pages by hooking an external high-precision digital power multimeter to an open mobile phone’s battery. Given the inadequate nature of the existing tools, we then advanced our research in this field, aiming to accurately model the power draw of the Android mobile platform.

    How can developers put this information to use?

    Dr. Nicoara: We measured the energy needed to render financial, e-commerce, email, blogging, news, and social networking sites. The tools are sufficiently precise to measure the energy needed to render individual web elements, such as cascading style sheets (CSS), JavaScript, images, and plug-in objects. Using the collected data, we make concrete recommendations on how to design web pages so as to minimize the energy needed to render the page.

    Our research findings can help developers overcome the resource limitations of smartphones, one of the biggest challenges faced by today’s mobile industry. They allow developers to design energy-efficient websites by following concrete guidelines and recommendations. The improved battery life these techniques make possible dramatically enhances the usability of mobile devices and benefits consumers’ daily lives.

  • Questions

    I host production Flask apps under uWSGI. It is said that Flask-SocketIO doesn't play well with uWSGI. What are good alternatives to both?

    Miguel Grinberg says that uWSGI is not a good choice for an application server for apps that incorporate Flask-SocketIO.

    My current stack includes nginx, uwsgi, and Flask.

    What would be a good long-term, well-supported alternative to either uWSGI or Flask-SocketIO for this situation?

    6 Comments

    u/miguelgrinberg • Jul 26, 2016, 9:37 PM The problem with uwsgi is that (a) it does not support eventlet; (b) it supports gevent, but with its own async loop that is incompatible with gevent's own loop; and (c) it supports websocket, but with its own implementation, incompatible with gevent and eventlet. So there are really a lot of cons to this approach. In the same way that I have code supporting eventlet and gevent, I intend at some point to write another batch of code for uwsgi (so basically you would say async_mode='uwsgi' or something like that).

    Something that a lot of people don't realize is that while app.run() is a development server that is almost never a good choice for running your application in production, socketio.run() is in fact a production-ready web server, as long as you use it alongside eventlet or gevent, and with app.debug=False.

    So you could drop uwsgi and run a Flask+Flask-SocketIO server using eventlet or gevent, and you would have a production level stack. And if you need to scale, run more than one, and let nginx load balance your http and websocket.
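    For example, a minimal sketch of the stack Miguel describes might look like this (the event name, host, and port are placeholders, not from the thread):

        # A minimal Flask + Flask-SocketIO server backed by eventlet.
        # Requires: pip install flask flask-socketio eventlet
        from flask import Flask
        from flask_socketio import SocketIO

        app = Flask(__name__)
        app.debug = False
        socketio = SocketIO(app)  # auto-selects eventlet when it is installed

        @socketio.on('message')   # placeholder event handler
        def handle_message(data):
            print('received:', data)

        if __name__ == '__main__':
            # Production-level when backed by eventlet; put nginx in front for
            # TLS and for load balancing across several such processes.
            socketio.run(app, host='127.0.0.1', port=5000)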

    I have nothing against gunicorn, by the way, that is also a good choice.

    u/APIglue • Jul 27, 2016, 3:07 AM Is there a noticeable difference in this setup between 2.7/3.3/3.4?

    u/miguelgrinberg • Jul 27, 2016, 9:19 AM If you are going to use Py3 or want to leave the door open to switch in the future, I recommend eventlet over gevent. Both packages have been ported to Py3, but gevent does not come with native websocket support, and the gevent-websocket package does not run on Py3 yet. Also eventlet has been running on Py3 for longer than gevent.

    In terms of performance, I have found eventlet to be marginally faster than gevent+gevent-websocket, both in Py2 and Py3.

    So basically, my recommendation is to go with eventlet, unless you have a good reason to use gevent.

    u/brian15co • Jul 28, 2016, 4:21 PM I was pointing to app.run() from within my uWSGI config and managing the uWSGI emperor from supervisord and serving the uWSGI app through nginx.

    Now I am running gunicorn with supervisord and trying to point nginx to a socket that gunicorn is running through. Here I am having trouble creating the websocket (details on Stackoverflow)

    Would that be a suitable production stack?

    The app object comes from a factory in webapp/__init__.py like:

        from flask import Flask
        from flask_socketio import SocketIO

        socketio = SocketIO(logger=True)

        def create_app():
            app = Flask(__name__)
            socketio.init_app(app)
            ...
            return app

    and in deploy_app.py:

        if __name__ == '__main__':
            socketio.run(myapp, debug=False)

    If I configure nginx the way it is in the Flask-SocketIO documentation and just run (env)$ python deploy_app.py then it works. But I was under the impression that this was not as production-ideal as the setup I previously mentioned.

    u/miguelgrinberg • Jul 28, 2016, 5:16 PM The problem is that you are running multiple workers on gunicorn. This is not a configuration that is currently supported, due to the very limited load balancer in gunicorn that does not support sticky sessions. Documentation reference: https://flask-socketio.readthedocs.io/en/latest/#gunicorn-web-server.

    Instead, run several gunicorn instances, each with one worker, and then set up nginx to do the load balancing, using the ip_hash method so that sessions are sticky.

    Also, in case you are not aware, if you run multiple servers you need to also run a message queue, so that the processes can coordinate. This is also covered in the documentation link above.
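    For reference, the application side of that multi-process setup can be sketched like this (the Redis URL and port are placeholders; see the documentation link above for the supported brokers):

        # Sketch: several of these processes run behind nginx (ip_hash for sticky
        # sessions); the message queue lets the processes coordinate broadcasts.
        from flask import Flask
        from flask_socketio import SocketIO

        app = Flask(__name__)
        socketio = SocketIO(app, message_queue='redis://127.0.0.1:6379/0')

        if __name__ == '__main__':
            # Start one instance per port (5000, 5001, ...); nginx balances them.
            socketio.run(app, port=5000)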

    u/Celeodor • Jul 26, 2016, 1:22 PM Swap uwsgi for gunicorn and use an eventlet worker or use the built-in socketio.run.


  • Relicensing React, Jest, Flow, and Immutable.js

    1183 dwwoelfel 5 hrs 252

    https://code.facebook.com/posts/300798627056246

    http://news.ycombinator.com/item?id=15316175

    Next week, we are going to relicense our open source projects React, Jest, Flow, and Immutable.js under the MIT license. We're relicensing these projects because React is the foundation of a broad ecosystem of open source software for the web, and we don't want to hold back forward progress for nontechnical reasons.

    This decision comes after several weeks of disappointment and uncertainty for our community. Although we still believe our BSD + Patents license provides some benefits to users of our projects, we acknowledge that we failed to decisively convince this community.

    In the wake of uncertainty about our license, we know that many teams went through the process of selecting an alternative library to React. We're sorry for the churn. We don't expect to win these teams back by making this change, but we do want to leave the door open. Friendly cooperation and competition in this space pushes us all forward, and we want to participate fully.

    This shift naturally raises questions about the rest of Facebook's open source projects. Many of our popular projects will keep the BSD + Patents license for now. We're evaluating those projects' licenses too, but each project is different and alternative licensing options will depend on a variety of factors.

    We'll include the license updates with React 16's release next week. We've been working on React 16 for over a year, and we've completely rewritten its internals in order to unlock powerful features that will benefit everyone building user interfaces at scale. We'll share more soon about how we rewrote React, and we hope that our work will inspire developers everywhere, whether they use React or not. We're looking forward to putting this license discussion behind us and getting back to what we care about most: shipping great products.