Hi HN! We are the co-founders of Lyrebird (https://lyrebird.ai/) and PhD students in AI at the University of Montreal. We are building speech synthesis technologies to improve the way we communicate with computers. Right now, our key innovation is that we can copy someone else's voice and make it say anything. The tech is still at an early stage, but we believe it will eventually make possible a wide range of new applications, such as:
reading text messages out loud with the voice of the sender,
reading audiobooks with the voice of your choice,
giving a personalized digital voice to people who lost their voice due to a disease,
allowing video game makers to generate more customized dialog on the fly, or voice avatars of their players,
allowing movie makers to freeze the voice of their actors so that they can still use it if the actor ages or dies.
Yesterday we launched a beta version of our voice-cloning software: anyone can record one minute of audio and get a digital voice that sounds like them.
We know that many on HN are concerned about potential misuses of these technologies, and we share your concern. We write more about our ethical stance on this page: https://lyrebird.ai/ethics/.
Our blog post about the launch (https://lyrebird.ai/blog/create-your-voice-avatar) features the first video combining generated audio with generated elements of the video.
There was a thread about us on HN when we launched our website four months ago (https://news.ycombinator.com/item?id=14182262), but at that time no one could test our software yet and we did not really answer the community's questions. So this time we are ready for questions and would love some feedback!
slackstation 4 hrs Their ethics don't seem to be something they take seriously, as the video they use to promote their own site is an impersonation itself.
It seems that, right out of the gate, they are breaking their own ethical guidelines as a cheap promotional tactic. If they care that little about themselves and a former president of the United States, how much do they care about your likeness?
It also doesn't help that you give them a universal perpetual license to do whatever they want (including selling your likeness for someone else's use) by uploading.
This just seems like a slimy team that put up an ethics page as a CYA.
I'm willing to eat my words if they had Barack Obama's consent to use his digitized voice for this, but it's highly doubtful, since there's also the coat of arms and seal of the President of the United States on the flag in the background, which would be a massive ethical breach for a former President just to promote a silly little startup.
faitswulff 3 hrs I'll agree. The ethics here seem very wishy-washy. It's not nearly as well-defined as, say, Twilio's Nine Values [0] or ThoughtWorks's Pillars [1], and those are generic values, not ethics. Hell, Google's "Do No Evil" is better than their two bullet points: 1) raise public awareness, 2) your voice is yours (except these Presidents), ???) Imagine if someone bad invented this first!
mattmaroon 4 hrs They have a giant banner at the bottom disclaiming it. I don't think that's unethical.
dantiberian 7 hrs While it's good that you have an ethics page: https://lyrebird.ai/ethics/, it only has two ethical guidelines:
Spread awareness of this technology
Your digital voice remains yours
I would feel a lot better about this if you also had explicit ethical boundaries, for example disallowing users from impersonating someone else, e.g. Donald Trump or Barack Obama. "Your digital voice remains yours" sort of sounds like you won't use/share my digital voice with others, but it doesn't directly address whether bad actors can maliciously impersonate someone who hasn't registered with Lyrebird.
jeffmould 7 hrs To build on that a bit. My bank uses voice recognition as a security measure and to authenticate me when I call in. Throw a bad actor in the mix and that becomes a security issue.
gervase 4 hrs Reminds me of the classic "My voice is my passport. Verify me." Hopefully that is not the only second-factor option they provide?
jeffmould 4 hrs Ha! Luckily it's opt-in/out and they do allow you to keep a PIN/code word as a backup.
woollysammoth 9 hrs This looks really great, congrats! Forgive me if I missed something, but I was wondering if you could clear up some confusion. From the terms: "Subject to the Biometric Data Agreement, you hereby grant to us a fully paid, royalty-free, perpetual, irrevocable, worldwide, non-exclusive and fully sublicensable right (including any moral rights) and license to use, license, distribute, reproduce, modify, adapt, publicly perform, and publicly display Your Voice, Digital Voice..."
Just to be clear, the license of the voice/digital voice is revoked upon deletion of the recordings? I understand it is subject to the biometric agreement, but the words perpetual and irrevocable still worried me. Thanks!
sotelo 8 hrs Yes! This is what our lawyers suggested to protect ourselves.
We delete all the recordings when you click delete, so we can't recreate the voice anymore. However, this is still necessary in case we share some generated sentences on social media (like we're doing on Twitter now).
JoshTriplett 7 hrs
However, this is still necessary in case we share some generated sentences in social media or so (like we're doing on twitter now).
This is something that you should only do with the permission of the user who provided the voice. You don't need generalized permission to do that for every user, and given the nature of the technology, you shouldn't ask for such permission.
Whitestrake 5 mins From the grandparent comment:
This is what our lawyers suggested to protect ourselves.
Generally speaking, a lawyer's advice is going to be optimized for maximum protection in possibly unforeseeable circumstances, not for what might actually be needed or even reasonable to request of every user.
Generally speaking, companies aren't going to go out of their way to rein in their lawyer. Most people won't even read that fine print, unfortunately.
Raynak 3 hrs I don't believe you.
rayalez 11 hrs Sounds amazing! Just to add a use case: for many people, creating a decent voiceover is one of the big sticking points for producing YouTube videos or educational courses. If I could write a script and have software generate a decent enough voiceover, it would be amazing.
It's not even necessary to copy anyone's voice, as long as there's a selection of the most comprehensible and human-sounding ones.
Then, you could even automatically generate a slideshow presentation from a few illustrations and headlines, and that would make "rendering" articles into videos very fast and easy. I'm sure a lot of people would pay for such a service.
By the way, I recently encountered Deep Voice 2, a similar research project by Baidu:
http://research.baidu.com/deep-voice-2-multi-speaker-neural-...
Results are very impressive.
pavel_lishin 9 hrs I made a joke video for work, featuring clips of Sir David Attenborough narrating a fake nature documentary I cut together from video I took at work.
It would have been an order of magnitude better if I could just generate arbitrary phrases in his voice.
(Or maybe not; maybe the constraint made the video better.)
joshschreuder 5 hrs I think the constraint makes the video funnier, take for example dinoflask's edits of Blizzard's Jeff Kaplan
https://www.youtube.com/watch?v=gXTrrTX7YuY
adbrebs 11 hrs Thank you for sharing that! We had not thought about this specific use case yet. It's quite difficult to figure out which use cases are going to become the most popular.
schemathings 4 hrs Sounds like the 100 speakers they used were Irish or Scottish.
aaronsnoswell 7 hrs +1 for this use case!
_lex 11 hrs "I'm using my voice as my password".
Vanguard allows voice authentication (https://investor.vanguard.com/account-conveniences/voice-ver...) - and who knows who else will roll something similar out in the future. Yeah, it's really really dumb, but it's happening in production now. I wouldn't use this product if I were you, but honestly you should also not use voice verification/authentication for anything.
tunetine 9 hrs Fidelity began verifying voice for telephone customer service a short while back. They recorded me during the call then at the end said they were going to use it to verify for future calls. No way to opt out.
mysterydip 9 hrs Did they say something to that effect before the call started, or only told you at the end? Or did they just use the "this call may be monitored for quality assurance and training purposes" blanket?
tunetine 1 hr I remember it too vaguely at this point, but something was mentioned in the beginning while I was waiting. I feel like it was worded along the lines of a promo, or I wouldn't have told the rep I wasn't interested multiple times. "Verifying is now easier and more secure with voice verification..."
plastroltech 10 hrs Will this technology be licensed for redistribution or only for online API use? I ask because in the video game scenario it would be great to have this in a library I could distribute instead of relying on the API to be available at all times.
adbrebs 9 hrs The first version will only be an online API. I agree with you that we should eventually think about licensing it for offline/embedded redistribution.
0x4f3759df 11 hrs The innovation I'm waiting for is
reading audiobooks with the voice of your choice, AND the speed of my choice.
bckmn 11 hrs Something similar (albeit not in your own voice, but in a wide range of premade voices) at https://www.narro.co
adbrebs 11 hrs Yes, definitely! This is also something we are working on.
S_A_P 8 hrs I guess I see a ton of upside here, but I also see that this could easily be abused and possibly become a tool to completely destroy someone's life. Imagine getting a phone call from your "partner" saying they cheated on you. I don't know how it would be usable (API?) and I do still detect a bit of artificialness to the voice, but as this gets better I worry about the downsides and potential for harm by copying someone's voice.
adbrebs 8 hrs Thank you for raising those concerns. We take those very seriously. You can read more about our ethical stance in this article: https://lyrebird.ai/ethics/
To recap:
we want to start by raising public awareness about the technology and we did demos with the voices of Trump/Obama for that,
your digital voice is yours, people can not use it without your authorization.
tekromancr 9 hrs Really fun stuff. I noticed that it seems to have problems starting sentences. Especially if I try to start a sentence with "hi,". Interesting nonetheless. This passage seems to be rendered fairly well: https://lyrebird.ai/g/LYoVuaZm
Also, https://lyrebird.ai/g/D3Fw328D
adbrebs 9 hrs Unfortunately, for certain voices our model has difficulty generating the very beginning of the sentence. We hope to fix this problem soon.
Some other people shared their voices on twitter if you want to compare: https://twitter.com/LyrebirdAi
kiddico 9 hrs I got a good giggle out of that first one, thank you haha.
Abundnce10 7 hrs I just tried to sign up with a Hotmail email address and I got this error message: This email cannot be used to create an account. It might be due to your email domain name.
I realize Hotmail isn't the sexiest email provider these days but it's one of the more commonly used. Do you have a list of email domains you allow?
sotelo 6 hrs We accept hotmail. It might be because of some special characters. Do you use + ?
sarreph 6 hrs Even if they did use '+', that should still definitely be allowed. I immediately get turned off when a service actively disallows a '+' because then I start to wonder why they don't want me to be able to filter their messages in my inbox.
It's the only sane and easy (but obviously not bullet-proof) way of catching spammers out.
Abundnce10 6 hrs Nope. Just letters and numbers. Same with my password.
I tried with my Gmail address and it worked fine. That address has no numbers in it. I used the same password. If you aren't prohibiting Hotmail addresses then it must be the numbers in the email address that are triggering some validation.
Regardless, I have access now. Looking forward to trying your product!
bitwize 3 hrs Don't reveal your powerlevel on HN, dude. Now you've not only reduced the search space for your Hotmail password, you've clued an attacker in that that's also your Gmail password!
songzme 10 hrs When the demo page was launched it seemed like Lyrebird was going to be an API. Will there still be an api?
adbrebs 10 hrs Yes, definitely. We are starting a private beta at the moment.
songzme 10 hrs awesome! I signed up back then but haven't heard anything since. Is there anything else I can do to try out the beta?
adbrebs 10 hrs Not yet. We are starting with a few developers/companies only and will expand it progressively.
What would be your use case?
songzme 7 hrs My wife built an app that teaches people (foreigners) how to speak English. Based on the words in their flashcards, we generate dynamic sentences, so during practice their flashcards are rarely the same. For example, if I have (happy, sad, run, write) in my backpack, then a sample flashcard would show up as: "When I run, I will be sad".
I see the Lyrebird API being very helpful in helping my users practice listening skills and adding a level of creative fun! If we had 10-20 different voices, the flashcards would be read a little differently each time. Right now (since our flashcards are dynamic), our audio feels very monotone. We would love to help you beta test your API and work something out.
StavrosK 7 hrs This is only tangentially on topic, but is there an API or some engine that I can feed short sentences into and get high-quality generation back?
I have an RC controller radio that supports voice prompts, and I would like to add some short phrases that are missing, such as "air mode on", "throttle warning", etc.
Is there anything on par in quality with Google's/Siri's voice? Not the Google TTS, but the voice they use in Now.
sova 9 hrs I assume you guys know about VocalID, which got an NSF SBIR grant for giving mute people a voice (through similar means): https://www.vocalid.co/
mipmap04 9 hrs This is incredible - recorded my voice and I'm blown away with the results.
One thing: I found that I was in such a hurry to record that I probably spoke faster than normal. It'd be nice if there was a way to tune a few parameters manually (tempo, pitch, etc).
If I ever lose my voice and have to have a TTS appliance speak for me, I'll be contacting you all to get my voice profile!
EDIT: For those interested, pretty impressive that it figured out the appropriate cadence for this: https://lyrebird.ai/g/v7MpYaUA
adbrebs 9 hrs Thank you for the feedback!
It'd be nice if there was a way to tune a few parameters manually (tempo, pitch, etc).
Yes we are currently exploring ways to control the generation: volume, pitch, tempo, speed but also intonation and emotion.
mipmap04 9 hrs Emotion would be a nice one - my wife's first comment was that it sounded too bored.
drusepth 9 hrs This looks awesome. I commented on the original post about how exciting this is for worldbuilding (and creating realistic voices for fictional characters, with all the uses that come there).
Random question: it's said that people think their own voices sound weird when they hear recordings of themselves played back. Do you have a way to measure that phenomenon? Have you seen people complaining about the accuracy when in fact it was just that effect making people sound "weird" (to themselves)?
gasbag 7 hrs The reason for the phenomenon is that some large percentage of how you hear your own voice comes from bone conduction. In addition, the higher harmonics of your voice are more directional, which is to say "aimed away from your ears", and tend to be diminished when reflected back to you by the objects around you.
The end result of this is that your own voice, when recorded and played back to you, will generally sound less bassy and more harmonically rich than you expect it to.
adbrebs 8 hrs Yes, it's actually quite interesting! It's a recurrent observation we have inside the team with our own voices. Friends of the person usually appreciate the quality of his/her digital voice better. You can also observe these reactions to some extent on Twitter: https://twitter.com/LyrebirdAi
Another interesting observation is the sentences that people generate for the first time with their digital voice...
Vermeulen 8 hrs Amazing - I can't wait to integrate this with our VR product. We previously used Amazon Polly attached to a chatbot: https://twitter.com/Alientrap/status/829032930626383873
First uses that come to mind are players adding themselves to a VR world - or maybe celebrities / public figures.
jonahx 6 hrs When I try to test my digital voice, after clicking "Generate," I get this error after about 10 seconds:
Something went wrong. Please try again!
I've tried about 5 times.
EDIT: I went back to the page a few minutes later, and the recordings were all there. So it looks like it works, but is giving a false error message.
sotelo 6 hrs Can you refresh and try again? Let me know if it works.
jonahx 6 hrs It's working now. Thanks. Very cool.
Small issue: Would be nice if you could delete recordings.
sotelo 6 hrs You can!
jonahx 6 hrs Sorry, I meant the test recordings, not the originals of my voice.
jlgosse 9 hrs This is probably going to be great, but I just tested out voice generation with the bare minimum of 30 recordings, and it really fell flat. When I tried playback with an input, all it could produce was a high-pitched buzzing sound and then maybe 1/4 of the words I typed in, which sounded nothing like me.
Perhaps you should increase the minimum from 30 recordings to 100?
sotelo 9 hrs Hi! Thanks for testing it! For many voices it works well with only 30 recordings. For some, you need a bit more. It seems that quality of the audio (no background noise, clear and loud voice, lots of intonation) is what matters the most.
webwanderings 8 hrs I am confused about the functionality. What is it that I will be able to do if I go through recording 30 sentences?
adbrebs 8 hrs You will be able to create a digital voice that sounds like you and generate any sentence from it.
And thanks, we are going to update the instructions to make them more clear.
webwanderings 7 hrs Thanks. Such an explanation on the website would be helpful. BTW, the Trump/Obama tweets do not add value. Using political figures to define a technical service is a mismatch in this context. It also doesn't help in explaining what this service provides (people wouldn't expect that Trump and Obama have given you consent to use their voices). Just an opinion.
gaius 8 hrs The owners of this service will be able to impersonate you at their whim. You're only populating their database for them.
sotelo 8 hrs This is just a beta version. In the future, we expect to integrate the tech with some other apps.
webwanderings 54 mins This technology didn't work out for me. After spending time and effort in providing what it needs, the results I got back in return were terrible for the time invested. In any case, it's good to see such an attempt at evolving what is potentially possible in the future.
capocannoniere 11 hrs Congrats on the launch! The tech is amazing
Quick q's (purely out of curiosity):
1) > We are [...] PhD students in AI at University of Montreal
Are you doing the startup on the side/planning on going back to school?
2) I don't recall reading about you guys in articles about YC S17 demo days. What are reasons why some companies might not participate in demo day or remain off-the-record? In your case, you seem to have had a working product long before demo day
adbrebs 10 hrs Thank you!
1) The research of the PhD and the startup are quite complementary in the end, so we hope we can continue doing both.
2) We didn't do demo day because we raised our seed round just before YC and did not want to raise again.
sjbase 9 hrs Cool stuff! Question from your FAQ:
Q: Will I be able to copy another person's voice?
A: Yes but only if you have the authorization of the person whose voice is being copied.
Perhaps you can unpack that answer a bit? What's the authorization process?
adbrebs 9 hrs Sure, good question!
There will be two scenarios:
you want to use the voice of someone who has a Lyrebird account: he or she has to give you their authorization.
you want to use the voice of someone who does not have an account. We have specific contracts for that. Say you want to copy the voice of Morgan Freeman: the contract will be between him/her, you, and Lyrebird. We will also probably explore alternative ways to do that.
pashabitz 8 hrs ...make possible a wide range of new applications such as
hacking voice-controlled interfaces
generating fake news
FTFY
don't @ me saying "sure any technology can be used for good and bad, stop being a Luddite" - yeah I know that, just messing with you
sotelo 8 hrs Yes, this is a tricky subject! We have thought a lot about it and we think we are doing the right thing for society.
We write more about it here: https://lyrebird.ai/ethics/
mbonzo 10 hrs I have a youtube channel (vimgirl) and before recording I have to write scripts for what I plan to say in the video. The digital voice doesn't seem to be working right now, but when it does it would cut down my screencast production time by at least half.
mindhash 11 hrs Hey.. how does Lyrebird handle accents? I work in the education space and, due to the accent of people in my country, the content doesn't work well with a global audience.
Are you open for beta? I would like to try out your API on education content.
adbrebs 11 hrs For now, it works better with American English accent but it is still able to adapt to other accents.
Our upcoming versions should be more robust to different accents and we also plan to extend it to other languages.
SimbaOnSteroids 8 hrs This is exciting! I've been following you guys since at least May. How do you plan on getting the voices out of the uncanny valley?
adbrebs 8 hrs This is going to be very tricky! There is no clear answer to that; we are putting a lot of effort into research, but our progress is quite difficult to predict.
tranv94 8 hrs How about Adobe Voice? This seems to share a lot of the same breakthroughs as Adobe Voice.
uoaei 9 hrs I wonder how it would work using training data from one language in generating voice in another language.
echan00 8 hrs Awesome idea. It was just a matter of time!
frag 11 hrs voice upload is not working :(
adbrebs 11 hrs Thanks for pointing this out, this was reported by a few others. We are investigating it. For now, just refresh the page and it should work.
adbrebs 9 hrs We've fixed the bug!
sixftmonster 6 hrs Getting failed upload after clicking validation... Chrome showing this in console: "VM291:1 POST https://lyrebird.ai/my/recordings/ 400 (Bad Request)"
kwitze 6 hrs Unfortunately still not working on iOS. Tried with both Safari and Chrome.
# Publishing with Apache Kafka at The New York Times
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
At The New York Times we have a number of different systems that are used for producing content. We have several Content Management Systems, and we use third-party data and wire stories. Furthermore, given 161 years of journalism and 21 years of publishing content online, we have huge archives of content that still need to be available online, that need to be searchable, and that generally need to be available to different services and applications.
These are all sources of what we call published content. This is content that has been written, edited, and that is considered ready for public consumption.
On the other side we have a wide range of services and applications that need access to this published content — there are search engines, personalization services, feed generators, as well as all the different front-end applications, like the website and the native apps. Whenever an asset is published, it should be made available to all these systems with very low latency — this is news, after all — and without data loss.
This article describes a new approach we developed to solving this problem, based on a log-based architecture powered by Apache Kafka™. We call it the Publishing Pipeline. The focus of the article will be on back-end systems. Specifically, we will cover how Kafka is used for storing all the articles ever published by The New York Times, and how Kafka and the Streams API are used to feed published content in real-time to the various applications and systems that make it available to our readers. The new architecture is summarized in the diagram below, and we will deep-dive into it in the remainder of this article.
Figure 1: The new New York Times log/Kafka-based publishing architecture.
The problem with API-based approaches
The different back-end systems that need access to published content have very different requirements:
- We have a service that provides live content for the website and the native applications. This service needs to make assets available immediately after they are published, but it only ever needs the latest version of each asset.
- We have different services that provide lists of content. Some of these lists are manually curated, some are query-based. For the query-based lists, whenever an asset is published that matches the query, requests for that list need to include the new asset. Similarly, if an update is published causing the asset no longer to match the query, it should be removed from the list. We also have to support changes to the query itself, and the creation of new lists, which requires accessing previously published content to (re)generate the lists.
- We have an Elasticsearch cluster powering site search. Here the latency requirements are less severe — if it takes a minute or two after an asset is published before it can be found by a search it is usually not a big deal. However, the search engine needs easy access to previously published content, since we need to reindex everything whenever the Elasticsearch schema definition changes, or when we alter the search ingestion pipeline.
- We have personalization systems that only care about recent content, but that need to reprocess this content whenever the personalization algorithms change.
Our previous approach to giving all those different consumers access to published content involved building APIs. The producers of content would provide APIs for accessing that content, and also feeds you could subscribe to for notifications for new assets being published. Other back-end systems, the consumers of content, would then call those APIs to get the content they needed.
Figure 2: A sketch of our previous API-based architecture that has since been replaced by the new log/Kafka-based architecture described in this article.
This approach, a fairly typical API-based architecture, had a number of issues.
Since the different APIs had been developed at different times by different teams, they typically worked in drastically different ways. The actual endpoints made available were different, they had different semantics, and they took different parameters. That could be fixed, of course, but it would require coordination between a number of teams.
More importantly, they all had their own, implicitly defined schemas. The names of fields in one CMS were different than the same fields in another CMS, and the same field name could mean different things in different systems.
This meant that every system that needed access to content had to know all these different APIs and their idiosyncrasies, and they would then need to handle normalization between the different schemas.
An additional problem was that it was difficult to get access to previously published content. Most systems did not provide a way to efficiently stream content archives, and the databases they were using for storage wouldn’t have supported it (more about this in the next section). Even if you have a list of all published assets, making an individual API call to retrieve each individual asset would take a very long time and put a lot of unpredictable load on the APIs.
Log-based architectures
The solution described in this article uses a log-based architecture. This is an idea that was first covered by Martin Kleppmann in Turning the database inside-out with Apache Samza[1], and is described in more detail in Designing Data-Intensive Applications[2]. The log as a generic data structure is covered in The Log: What every software engineer should know about real-time data’s unifying abstraction[3]. In our case the log is Kafka, and all published content is appended to a Kafka topic in chronological order. Other services access it by consuming the log.
Traditionally, databases have been used as the source of truth for many systems. Despite having a lot of obvious benefits, databases can be difficult to manage in the long run. First, it’s often tricky to change the schema of a database. Adding and removing fields is not too hard, but more fundamental schema changes can be difficult to organize without downtime. A deeper problem is that databases become hard to replace. Most database systems don’t have good APIs for streaming changes; you can take snapshots, but they will immediately become outdated. This means that it’s also hard to create derived stores, like the search indexes we use to power site search on nytimes.com and in the native apps — these indexes need to contain every article ever published, while also being up to date with new content as it is being published. The workaround often ends up being clients writing to multiple stores at the same time, leading to consistency issues when one of these writes succeeds and the other fails.
Because of this, databases, as long-term maintainers of state, tend to end up being complex monoliths that try to be everything to everyone.
Log-based architectures solve this problem by making the log the source of truth. Whereas a database typically stores the result of some event, the log stores the event itself — the log therefore becomes an ordered representation of all events that happened in the system. Using this log, you can then create any number of custom data stores. These stores become materialized views of the log — they contain derived, not original, content. If you want to change the schema in such a data store, you can just create a new one, have it consume the log from the beginning until it catches up, and then just throw away the old one.
With the log as the source of truth, there is no longer any need for a single database that all systems have to use. Instead, every system can create its own data store (database) – its own materialized view – representing only the data it needs, in the form that is the most useful for that system. This massively simplifies the role of databases in an architecture, and makes them more suited to the need of each application.
Furthermore, a log-based architecture simplifies accessing streams of content. In a traditional data store, accessing a full dump (i.e., as a snapshot) and accessing “live” data (i.e., as a feed) are distinct ways of operating. An important facet of consuming a log is that this distinction goes away. You start consuming the log at some specific offset – this can be the beginning, the end, or any point in-between — and then just keep going. This means that if you want to recreate a data store, you simply start consuming the log at the beginning of time. At some point you will catch up with live traffic, but this is transparent to the consumer of the log.
A log consumer is therefore “always replaying”.
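To make the "always replaying" idea concrete, here is a minimal sketch (not code from the article) of a plain Java consumer that rebuilds an in-memory materialized view by seeking to the beginning of the log. The topic name "monolog", the single partition 0, the broker address, and the use of string keys and values are assumptions for illustration only.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ReplayingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "materialized-view-rebuilder");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // The materialized view: latest version of each asset, keyed by its URI.
        Map<String, String> view = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition monolog = new TopicPartition("monolog", 0); // single-partition log
            consumer.assign(Collections.singletonList(monolog));
            consumer.seekToBeginning(Collections.singletonList(monolog)); // replay from the start

            // Keep polling: at some point the consumer catches up with live traffic,
            // but the code does not need to distinguish replay from live consumption.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    view.put(record.key(), record.value()); // later versions overwrite earlier ones
                }
            }
        }
    }
}
```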
Log-based architectures also provide a lot of benefits when it comes to deploying systems. Immutable deployments of stateless services have long been a common practice when deploying to VMs. By always redeploying a new instance from scratch instead of modifying a running one, a whole category of problems goes away. With the log as the source of truth, we can now do immutable deployments of stateful systems. Since any data store can be recreated from the log, we can create them from scratch every time we deploy changes, instead of changing things in-place — a practical example of this is given later in the article.
Why Google PubSub or AWS SNS/SQS/Kinesis don’t work for this problem
Apache Kafka is typically used to solve two very distinct use cases.
The most common one by far is where Apache Kafka is used as a message broker. This can cover both analytics and data integration cases. Kafka arguably has a lot of advantages in this area, but services like Google Pub/Sub, AWS SNS/AWS SQS, and AWS Kinesis have other approaches to solving the same problem. These services all let multiple consumers subscribe to messages published by multiple producers, keep track of which messages they have and haven't seen, and gracefully handle consumer downtime without data loss. For these use cases, the fact that Kafka is a log is an implementation detail.
Log-based architectures, like the one described in this article, are different. In these cases, the log is not an implementation detail, it is the central feature. The requirements are very different from what the other services offer:
- We need the log to retain all events forever, otherwise it is not possible to recreate a data store from scratch.
- We need log consumption to be ordered. If events with causal relationships are processed out of order, the result will be wrong.
Only Kafka supports both of these requirements.
The Monolog
The Monolog is our new source of truth for published content. Every system that creates content, when it’s ready to be published, will write it to the Monolog, where it is appended to the end. The actual write happens through a gateway service, which validates that the published asset is compliant with our schema.
Figure 3: The Monolog, containing all assets ever published by The New York Times.
The Monolog contains every asset published since 1851. They are totally ordered according to publication time. This means that a consumer can pick the point in time when it wants to start consuming. Consumers that need all of the content can start at the beginning of time (i.e., in 1851); other consumers may want only future updates, or to start at some point in between.
As an example, we have a service that provides lists of content — all assets published by specific authors, everything that should go on the science section, etc. This service starts consuming the Monolog at the beginning of time, and builds up its internal representation of these lists, ready to serve on request. We have another service that just provides a list of the latest published assets. This service does not need its own permanent store: instead it just goes a few hours back in time on the log when it starts up, and begins consuming there, while maintaining a list in memory.
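For that second kind of consumer, Kafka's offsetsForTimes API can translate a wall-clock timestamp (say, a few hours ago) into a log offset to start from. The snippet below is a hedged sketch of that pattern, not code from the system described here; the topic name "monolog" and the single partition are again assumptions.

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.Map;

public class RecentAssetsConsumer {
    /** Assign the consumer to the Monolog and position it a few hours back on the log. */
    static void seekHoursBack(KafkaConsumer<String, String> consumer, int hours) {
        TopicPartition monolog = new TopicPartition("monolog", 0);
        consumer.assign(Collections.singletonList(monolog));

        long startTimestamp = Instant.now().minus(Duration.ofHours(hours)).toEpochMilli();
        Map<TopicPartition, OffsetAndTimestamp> offsets =
            consumer.offsetsForTimes(Collections.singletonMap(monolog, startTimestamp));

        OffsetAndTimestamp found = offsets.get(monolog);
        if (found != null) {
            consumer.seek(monolog, found.offset()); // first record at or after the timestamp
        } else {
            consumer.seekToEnd(Collections.singletonList(monolog)); // nothing that recent: start at the end
        }
    }
}
```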
Assets are published to the Monolog in normalized form, that is, each independent piece of content is written to Kafka as a separate message. For example, an image is independent from an article, because several articles may include the same image.
The figure gives an example:
Figure 4: Normalized assets.
This is very similar to a normalized model in a relational database, with many-to-many relationships between the assets.
In the example we have two articles that reference other assets. For instance, the byline is published separately, and then referenced by the two articles. All assets are identified using URIs of the form nyt://article/577d0341-9a0a-46df-b454-ea0718026d30. We have a native asset browser that (using an OS-level scheme handler) lets us click on these URIs, see the asset in a JSON form, and follow references. The assets themselves are published to the Monolog as protobuf binaries.
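For illustration only: at the client level, appending a protobuf-serialized asset to a Kafka topic keyed by its URI looks roughly like the sketch below. In the actual setup described above, writes go through the gateway service (which validates the schema) rather than directly to Kafka, and the topic name, broker address, and placeholder asset bytes are assumptions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class AssetPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // don't lose published content

        // Key: the asset URI; value: the protobuf-encoded asset.
        String uri = "nyt://article/577d0341-9a0a-46df-b454-ea0718026d30";
        byte[] protobufBytes = new byte[0]; // placeholder; the real Asset schema is not shown in the article

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("monolog", uri, protobufBytes));
            producer.flush();
        }
    }
}
```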
In Apache Kafka, the Monolog is implemented as a single-partition topic. It’s single-partition because we want to maintain the total ordering — specifically, we want to ensure that when you are consuming the log, you always see a referenced asset before the asset doing the referencing. This ensures internal consistency for a top-level asset — if we add an image to an article while adding text referencing the image, we do not want the change to the article to be visible before the image is.
The above means that the assets are actually published to the log topologically sorted. For the example above, it looks like this:
Figure 5: Normalized assets in publishing order.
As a log consumer, you can then easily build your materialized view of the log, since you know that the version of an asset being referenced is always the last version of that asset that you saw on the log.
Because the topic is single-partition, it needs to be stored on a single disk, due to the way Kafka stores partitions. This is not a problem for us in practice, since all our content is text produced by humans — our total corpus right now is less than 100GB, and disks are growing bigger faster than our journalists can write.
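Below is a hedged sketch of how such a topic could be created with Kafka's AdminClient: one partition to preserve total ordering, and retention disabled by time and size so the log keeps every event forever. The topic name, replication factor, and broker address are assumptions, not the article's actual configuration.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateMonologTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // One partition => total ordering of all assets; a replication factor of 3 is an assumed value.
            NewTopic monolog = new NewTopic("monolog", 1, (short) 3);
            // Keep every event forever, both by time and by size.
            monolog.configs(Map.of(
                TopicConfig.RETENTION_MS_CONFIG, "-1",
                TopicConfig.RETENTION_BYTES_CONFIG, "-1"));

            admin.createTopics(Collections.singletonList(monolog)).all().get();
        }
    }
}
```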
The denormalized log and Kafka’s Streams API
The Monolog is great for consumers that want a normalized view of the data. For some consumers that is not the case. For instance, in order to index data in Elasticsearch you need a denormalized view of the data, since Elasticsearch does not support many-to-many relationships between objects. If you want to be able to search for articles by matching image captions, those image captions will have to be represented inside the article object.
In order to support this kind of view of the data, we also have a denormalized log. In the denormalized log, all the components making up a top-level asset are published together. For the example above, when Article 1 is published, we write a message to the denormalized log, containing the article and all its dependencies along with it in a single message:
Figure 6: The denormalized log after publishing Article 1.
The Kafka consumer that feeds Elasticsearch can just pick this message off the log, reorganize the assets into the desired shape, and push to the index. When Article 2 is published, again all the dependencies are bundled, including the ones that were already published for Article 1:
Figure 7: The denormalized log after publishing Article 2.
If a dependency is updated, the whole asset is republished. For instance, if Image 2 is updated, all of Article 1 goes on the log again:
Figure 8: The denormalized log after updating Image 2, used by Article 1.
A component called the Denormalizer actually creates the denormalized log.
The Denormalizer is a Java application that uses Kafka’s Streams API. It consumes the Monolog, and maintains a local store of the latest version of every asset, along with the references to that asset. This store is continuously updated when assets are published. When a top-level asset is published, the Denormalizer collects all the dependencies for this asset from local storage, and writes it as a bundle to the denormalized log. If an asset referenced by a top-level asset is published, the Denormalizer republishes all the top-level assets that reference it as a dependency.
Since this log is denormalized, it no longer needs total ordering. We now only need to make sure that the different versions of the same top-level asset come in the correct order. This means that we can use a partitioned log, and have multiple clients consume the log in parallel. We do this using Kafka Streams, and the ability to scale up the number of application instances reading from the denormalized log allows us to do a very fast replay of our entire publication history — the next section will show an example of this.
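The following is a minimal sketch, not the actual NYT Denormalizer, of how such a component can be written with a recent Kafka Streams API: it keeps the latest version of every asset in a local state store and, whenever a top-level asset arrives, emits a record to the denormalized topic. The topic names, the use of string values (the real assets are protobuf binaries), and the way top-level assets are recognized here are all assumptions; a real implementation would also resolve and bundle every referenced dependency from the store.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.kstream.TransformerSupplier;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

import java.util.Properties;

public class DenormalizerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "denormalizer-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Local store holding the latest version of every asset, keyed by URI.
        StoreBuilder<KeyValueStore<String, String>> assetStore =
            Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("assets"),
                Serdes.String(), Serdes.String());
        builder.addStateStore(assetStore);

        TransformerSupplier<String, String, KeyValue<String, String>> denormalizer =
            () -> new Transformer<String, String, KeyValue<String, String>>() {
                private KeyValueStore<String, String> store;

                @Override
                @SuppressWarnings("unchecked")
                public void init(ProcessorContext context) {
                    store = (KeyValueStore<String, String>) context.getStateStore("assets");
                }

                @Override
                public KeyValue<String, String> transform(String uri, String asset) {
                    store.put(uri, asset);                  // remember the latest version
                    if (uri.startsWith("nyt://article/")) { // assumed marker for top-level assets
                        // A real denormalizer would look up every referenced asset in the
                        // store and bundle it with the article; here we just forward it.
                        return KeyValue.pair(uri, asset);
                    }
                    return null;                            // dependencies are only stored
                }

                @Override
                public void close() {}
            };

        KStream<String, String> monolog = builder.stream("monolog");
        monolog.transform(denormalizer, "assets").to("denormalized-log");

        new KafkaStreams(builder.build(), props).start();
    }
}
```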
Elasticsearch example
The following sketch shows an example of how this setup works end-to-end for a backend search service. As mentioned above, we use Elasticsearch to power the site search on NYTimes.com:
Figure 9: A sketch showing how published assets flow through the system from the CMS to Elasticsearch.
The data flow is as follows:
- An asset is published or updated by the CMS.
- The asset is written to the Gateway as a protobuf binary.
- The Gateway validates the asset, and writes it to the Monolog.
- The Denormalizer consumes the asset from the Monolog. If this is a top-level asset, it collects all its dependencies from its local store and writes them together to the denormalized log. If this asset is a dependency of other top-level assets, all of those top-level assets are written to the denormalized log.
- The Kafka partitioner assigns assets to partitions based on the URI of the top-level asset.
- The search ingestion nodes all run an application that uses Kafka Streams to access the denormalized log. Each node reads a partition, creates the JSON objects we want to index in Elasticsearch, and writes them to specific Elasticsearch nodes (a sketch of this step follows the list). During replay we do this with Elasticsearch replication turned off, to make indexing faster. We turn replication back on when we catch up with live traffic before the new index goes live.
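A minimal sketch of the search-ingestion step under stated assumptions: the denormalized topic is named "denormalized-log", its values are already the JSON documents to index, and a single Elasticsearch node listens on localhost:9200 with an index named "articles". The real application uses Kafka Streams, one consumer per partition, and writes to specific Elasticsearch nodes; this sketch uses a plain consumer and Elasticsearch's REST indexing endpoint instead.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SearchIngestionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "search-ingestion");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("denormalized-log"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // The record key is the top-level asset URI; use it as the document id.
                    indexDocument(record.key(), record.value());
                }
            }
        }
    }

    // Index one JSON document via Elasticsearch's REST API (PUT /<index>/_doc/<id>).
    static void indexDocument(String id, String json) throws Exception {
        String docId = URLEncoder.encode(id, StandardCharsets.UTF_8.name());
        URL url = new URL("http://localhost:9200/articles/_doc/" + docId);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        if (conn.getResponseCode() >= 300) {
            throw new IllegalStateException("Indexing failed: HTTP " + conn.getResponseCode());
        }
        conn.disconnect();
    }
}
```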
Implementation
This Publishing Pipeline runs on Google Cloud Platform (GCP). The details of our setup are beyond the scope of this article, but the high-level architecture looks like the sketch below. We run Kafka and ZooKeeper on GCP Compute instances. All other processes (the Gateway, all Kafka replicators, the Denormalizer application built with Kafka's Streams API, etc.) run in containers on GKE/Kubernetes. We use gRPC/Cloud Endpoints for our APIs, and mutual SSL authentication/authorization for keeping Kafka itself secure.
Figure 10: Implementation on Google Cloud Platform.
Conclusion
We have been working on this new publishing architecture for a little over a year. We are now in production, but it’s still early days, and we have a good number of systems we still have to move over to the Publishing Pipeline.
We are already seeing a lot of advantages. The fact that all content is coming through the same pipeline is simplifying our software development processes, both for front-end applications and back-end systems. Deployments have also become simpler – for instance, we are now starting to do full replays into new Elasticsearch indexes when we make changes to analyzers or the schema, instead of trying to make in-place changes to the live index, which we have found to be error-prone. Furthermore, we are also in the process of building out a better system for monitoring how published assets progress through the stack. All assets published through the Gateway are assigned a unique message ID, and this ID is provided back to the publisher as well as passed along through Kafka and to the consuming applications, allowing us to track and monitor when each individual update is processed in each system, all the way out to the end-user applications. This is useful both for tracking performance and for pinpointing problems when something goes wrong.
Finally, this is a new way of building applications, and it requires a mental shift for developers who are used to working with databases and traditional pub/sub-models. In order to take full advantage of this setup, we need to build applications in such a way that it is easy to deploy new instances that use replay to recreate their materialized view of the log, and we are putting a lot of effort into providing tools and infrastructure that makes this easy.
I want to thank Martin Kleppmann, Michael Noll and Mike Kaminski for reviewing this article.
————————————————————————————————————————————–
[1] "Turning the database inside-out with Apache Samza" – Martin Kleppmann. 4 Mar. 2015. Accessed 14 Jul. 2017.
[2] "Designing Data-Intensive Applications." Accessed 14 Jul. 2017.
[3] "The Log: What every software engineer should know about real-time …" 16 Dec. 2013. Accessed 14 Jul. 2017.
# Optimizing web servers for high throughput and low latency
https://blogs.dropbox.com/tech/2017/09/optimizing-web-servers-for-high-throughput-and-low-latency/
This is an expanded version of my talk at NginxConf 2017 on September 6, 2017. As an SRE on the Dropbox Traffic Team, I'm responsible for our Edge network: its reliability, performance, and efficiency. The Dropbox edge network is an nginx-based proxy tier designed to handle both latency-sensitive metadata transactions and high-throughput data transfers. In a system that is handling tens of gigabits per second while simultaneously processing tens of thousands of latency-sensitive transactions, there are efficiency/performance optimizations throughout the proxy stack, from drivers and interrupts, through TCP/IP and kernel, to library and application level tunings.
Disclaimer
In this post we’ll be discussing lots of ways to tune web servers and proxies. Please do not cargo-cult them. For the sake of the scientific method, apply them one-by-one, measure their effect, and decide wether they are indeed useful in your environment.
This is not a Linux performance post, even though I will make lots of references to bcc tools, eBPF, and perf, this is by no means the comprehensive guide to using performance profiling tools. If you want to learn more about them you may want to read through Brendan Gregg’s blog.
This is not a browser-performance post either. I’ll be touching client-side performance when I cover latency-related optimizations, but only briefly. If you want to know more, you should read High Performance Browser Networking by Ilya Grigorik.
And this is also not a TLS best practices compilation. Though I'll be mentioning TLS libraries and their settings a bunch of times, you and your security team should evaluate the performance and security implications of each of them. You can use Qualys SSL Test to verify your endpoint against the current set of best practices, and if you want to know more about TLS in general, consider subscribing to the Feisty Duck Bulletproof TLS Newsletter.
Structure of the post
We are going to discuss efficiency/performance optimizations of different layers of the system, starting from the lowest levels like hardware and drivers: these tunings can be applied to pretty much any high-load server. Then we'll move to the Linux kernel and its TCP/IP stack: these are the knobs you want to try on any of your TCP-heavy boxes. Finally, we'll discuss library and application-level tunings, which are mostly applicable to web servers in general and nginx specifically.
For each potential area of optimization I’ll try to give some background on latency/throughput tradeoffs (if any), monitoring guidelines, and, finally, suggest tunings for different workloads.
Hardware
For good asymmetric RSA/EC performance you are looking for processors with at least AVX2 (avx2 in /proc/cpuinfo) support and preferably for ones with large integer arithmetic capable hardware (bmi and adx). For the symmetric cases you should look for AES-NI for AES ciphers and AVX512 for ChaCha+Poly. Intel has a performance comparison of different hardware generations with OpenSSL 1.0.2, that illustrates effect of these hardware offloads.
Latency sensitive use-cases, like routing, will benefit from fewer NUMA nodes and disabled HT. High-throughput tasks do better with more cores, and will benefit from Hyper-Threading (unless they are cache-bound), and generally won’t care about NUMA too much.
Specifically, if you go the Intel path, you are looking for at least Haswell/Broadwell and ideally Skylake CPUs. If you are going with AMD, EPYC has quite impressive performance.
NIC
Here you are looking for at least 10G, preferably even 25G. If you want to push more than that through a single server over TLS, the tuning described here will not be sufficient, and you may need to push TLS framing down to the kernel level (e.g. FreeBSD, Linux).
On the software side, you should look for open source drivers with active mailing lists and user communities. This will be very important if (but most likely, when) you’ll be debugging driver-related problems.
Memory
The rule of thumb here is that latency-sensitive tasks need faster memory, while throughput-sensitive tasks need more memory.
Hard Drive
It depends on your buffering/caching requirements, but if you are going to buffer or cache a lot you should go for flash-based storage. Some go as far as using a specialized flash-friendly filesystem (usually log-structured), but they do not always perform better than plain ext4/xfs.
Anyway, just be careful not to burn through your flash because you forgot to enable TRIM or update the firmware.
Operating systems: Low level
You should keep your firmware up-to-date to avoid painful and lengthy troubleshooting sessions. Try to stay recent with CPU microcode, motherboard, NIC, and SSD firmware. That does not mean you should always run bleeding edge: the rule of thumb here is to run the second-to-latest firmware, unless critical bugs are fixed in the latest version, but not to fall too far behind.
Drivers
The update rules here are pretty much the same as for firmware: try staying close to current. One caveat here is to try to decouple kernel upgrades from driver updates if possible. For example, you can pack your drivers with DKMS, or pre-compile drivers for all the kernel versions you use. That way, when you update the kernel and something does not work as expected, there is one less thing to troubleshoot.
CPU
Your best friend here is the kernel repo and the tools that come with it. In Ubuntu/Debian you can install the linux-tools package with a handful of utils, but for now we only use cpupower, turbostat, and x86_energy_perf_policy. To verify CPU-related optimizations, you can stress-test your software with your favorite load-generating tool (for example, Yandex uses Yandex.Tank). Here is a presentation from the last NginxConf by nginx developers about load-testing best practices: "NGINX Performance testing."
cpupower
Using this tool is way easier than crawling /proc/. To see info about your processor and its frequency governor you should run:
$ cpupower frequency-info
...
  driver: intel_pstate
...
  available cpufreq governors: performance powersave
...
  The governor "performance" may decide which speed to use
...
  boost state support:
    Supported: yes
    Active: yes
Check that Turbo Boost is enabled, and for Intel CPUs make sure that you are running with intel_pstate, not acpi-cpufreq, or even pcc-cpufreq. If you are still using acpi-cpufreq, you should upgrade the kernel or, if that's not possible, make sure you are using the performance governor. When running with intel_pstate, even the powersave governor should perform well, but you need to verify it yourself.
And speaking about idling, to see what is really happening with your CPU, you can use turbostat to directly look into processor’s MSRs and fetch Power, Frequency, and Idle State information:
Here you can see the actual CPU frequency (yes, /proc/cpuinfo is lying to you), and core/package idle states.
If even with the intel_pstate driver the CPU spends more time in idle than you think it should, you can:
- Set the governor to performance.
- Set x86_energy_perf_policy to performance.
Or, only for very latency-critical tasks, you can:
- Use the /dev/cpu_dma_latency interface.
- For UDP traffic, use busy-polling.
You can learn more about processor power management in general and P-states specifically in the Intel OpenSource Technology Center presentation "Balancing Power and Performance in the Linux Kernel" from LinuxCon Europe 2015.
CPU Affinity
You can additionally reduce latency by applying CPU affinity on each thread/process; e.g., nginx has a worker_cpu_affinity directive that can automatically bind each web server process to its own core. This should eliminate CPU migrations, reduce cache misses and page faults, and slightly increase instructions per cycle. All of this is verifiable through perf stat.
Sadly, enabling affinity can also negatively affect performance by increasing the amount of time a process spends waiting for a free CPU. This can be monitored by running runqlat on one of your nginx worker PIDs:
     usecs        : count     distribution
     0 -> 1       : 819      |                                        |
     2 -> 3       : 58888    |****                                    |
     4 -> 7       : 77984    |*******                                 |
     8 -> 15      : 10529    |                                        |
    16 -> 31      : 4853     |**                                      |
 ...
  4096 -> 8191    : 34       |                                        |
  8192 -> 16383   : 39       |                                        |
 16384 -> 32767   : 17       |                                        |
If you see multi-millisecond tail latencies there, then there is probably too much stuff going on on your servers besides nginx itself, and affinity will increase latency instead of decreasing it.
Memory
All mm/ tunings are usually very workflow specific; there are only a handful of things to recommend:
Modern CPUs are actually multiple separate cpu dies connected by very fast interconnect and sharing various resources, starting from L1 cache on the HT cores, through L3 cache within the package, to Memory and PCIe links within sockets. This is basically what NUMA is: multiple execution and storage units with a fast interconnect.
For the comprehensive overview of NUMA and its implications you can consult “NUMA Deep Dive Series” by Frank Denneman.
But, long story short, you have a choice of:
- Ignoring it, by disabling it in BIOS or running your software under numactl --interleave=all; you can get mediocre but somewhat consistent performance.
- Denying it, by using single-node servers, just like Facebook does with the OCP Yosemite platform.
- Embracing it, by optimizing CPU/memory placement in both user- and kernel-space.
Let's talk about the third option, since there is not much optimization needed for the first two.
To utilize NUMA properly you need to treat each NUMA node as a separate server. For that, you should first inspect the topology, which can be done with numactl --hardware:
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 16 17 18 19
node 0 size: 32149 MB
node 1 cpus: 4 5 6 7 20 21 22 23
node 1 size: 32213 MB
node 2 cpus: 8 9 10 11 24 25 26 27
node 2 size: 0 MB
node 3 cpus: 12 13 14 15 28 29 30 31
node 3 size: 0 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10
Things to look after:
- number of nodes.
- memory sizes for each node.
- number of CPUs for each node.
- distances between nodes.
This is a particularly bad example since it has 4 nodes as well as nodes without memory attached. It is impossible to treat each node here as a separate server without sacrificing half of the cores on the system.
We can verify that by using numastat:
$ numastat -n -c
                  Node 0   Node 1 Node 2 Node 3    Total
                --------  ------- ------ ------ --------
Numa_Hit        26833500 11885723      0      0 38719223
Numa_Miss          18672  8561876      0      0  8580548
Numa_Foreign     8561876    18672      0      0  8580548
Interleave_Hit    392066   553771      0      0   945836
Local_Node       8222745 11507968      0      0 19730712
Other_Node      18629427  8939632      0      0 27569060
You can also ask numastat to output per-node memory usage statistics in the /proc/meminfo format:
$ numastat -m -c
                 Node 0 Node 1 Node 2 Node 3 Total
                 ------ ------ ------ ------ -----
MemTotal          32150  32214      0      0 64363
MemFree             462   5793      0      0  6255
MemUsed           31688  26421      0      0 58109
Active            16021   8588      0      0 24608
Inactive          13436  16121      0      0 29557
Active(anon)       1193    970      0      0  2163
Inactive(anon)      121    108      0      0   229
Active(file)      14828   7618      0      0 22446
Inactive(file)    13315  16013      0      0 29327
...
FilePages         28498  23957      0      0 52454
Mapped              131    130      0      0   261
AnonPages           962    757      0      0  1718
Shmem               355    323      0      0   678
KernelStack          10      5      0      0    16
Now let's look at an example of a simpler topology.
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 46967 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 48355 MB
Since the nodes are mostly symmetrical, we can bind an instance of our application to each NUMA node with numactl --cpunodebind=X --membind=X and then expose it on a different port. That way you can get better throughput by utilizing both nodes and better latency by preserving memory locality.
You can verify NUMA placement efficiency by the latency of your memory operations, e.g. by using bcc's funclatency to measure the latency of a memory-heavy operation, e.g. memmove.
On the kernel side, you can observe efficiency by using perf stat and looking for corresponding memory and scheduler events:
The last bit of NUMA-related optimizations for network-heavy workloads comes from the fact that a network card is a PCIe device, and each device is bound to its own NUMA node; therefore some CPUs will have lower latency when talking to the network. We'll discuss optimizations that can be applied there when we discuss NIC→CPU affinity, but for now let's switch gears to PCI Express…
PCIe
Normally you do not need to go too deep into PCIe troubleshooting unless you have some kind of hardware malfunction. Therefore it’s usually worth spending minimal effort there by just creating “link width”, “link speed”, and possibly RxErr/BadTLP alerts for your PCIe devices. This should save you troubleshooting hours caused by broken hardware or failed PCIe negotiation. You can use lspci for that:
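For example, a sketch of what to look at (the bus address is a placeholder):

    # inspect the negotiated link speed/width and the AER error flags of a PCIe device
    sudo lspci -vvv -s 0000:01:00.0 | egrep 'LnkCap:|LnkSta:|RxErr|BadTLP'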
PCIe may become a bottleneck though if you have multiple high-speed devices competing for the bandwidth (e.g. when you combine fast network with fast storage), therefore you may need to physically shard your PCIe devices across CPUs to get maximum throughput.
Source: https://en.wikipedia.org/wiki/PCI_Express#History_and_revisions (PCIe revisions and their per-lane bandwidth).
Also see the article “Understanding PCIe Configuration for Maximum Performance” on the Mellanox website, which goes a bit deeper into PCIe configuration and may be helpful at higher speeds if you observe packet loss between the card and the OS.
Intel suggests that sometimes PCIe power management (ASPM) may lead to higher latencies and therefore higher packet loss. You can disable it by adding pcie_aspm=off to the kernel cmdline.
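A sketch of checking and changing that (the bootloader workflow is distro-dependent):

    # check the current ASPM policy, if the module is present
    cat /sys/module/pcie_aspm/parameters/policy
    # to disable ASPM, add pcie_aspm=off to GRUB_CMDLINE_LINUX in /etc/default/grub,
    # regenerate the grub config (e.g. update-grub or grub2-mkconfig), and reboot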
NIC
Before we start, it is worth mentioning that both Intel and Mellanox have their own performance tuning guides, and regardless of the vendor you pick it’s beneficial to read both of them. Also, drivers usually come with their own READMEs and a set of useful utilities.
The next place to check for guidelines is your operating system’s manuals, e.g. the Red Hat Enterprise Linux Network Performance Tuning Guide, which explains most of the optimizations mentioned below and more.
Cloudflare also has a good article about tuning that part of the network stack on their blog, though it is mostly aimed at low latency use-cases.
When optimizing NICs ethtool will be your best friend.
A small note here: if you are using a newer kernel (and you really should!), you should also bump some parts of your userland; e.g. for network operations you probably want newer versions of the ethtool, iproute2, and maybe iptables/nftables packages.
Valuable insight into what is happening with your network card can be obtained via ethtool -S:
    $ ethtool -S eth0 | egrep 'miss|over|drop|lost|fifo'
         rx_dropped: 0
         tx_dropped: 0
    port.rx_dropped: 0
    port.tx_dropped_link_down: 0
    port.rx_oversize: 0
    port.arq_overflows: 0

Consult your NIC manufacturer for detailed stats descriptions, e.g. Mellanox has a dedicated wiki page for them.
From the kernel side of things you’ll be looking at /proc/interrupts, /proc/softirqs, and /proc/net/softnet_stat. There are two useful bcc tools here: hardirqs and softirqs. Your goal in optimizing the network is to tune the system until you have minimal CPU usage while having no packet loss.
**Interrupt Affinity**

Tunings here usually start with spreading interrupts across the processors. How specifically you should do that depends on your workload:
* For maximum throughput you can distribute interrupts across all NUMA nodes in the system.
* To minimize latency you can limit interrupts to a single NUMA node. To do that you may need to reduce the number of queues to fit into a single node (this usually implies cutting their number in half with ethtool -L).

Vendors usually provide scripts to do that, e.g. Intel has set_irq_affinity.
**Ring buffer sizes**

Network cards need to exchange information with the kernel. This is usually done through a data structure called a “ring”; the current/maximum sizes of that ring can be viewed via ethtool -g:
    $ ethtool -g eth0
    Ring parameters for eth0:
    Pre-set maximums:
    RX: 4096
    TX: 4096
    Current hardware settings:
    RX: 4096
    TX: 4096

You can adjust these values within the pre-set maximums with -G. Generally bigger is better here (especially if you are using interrupt coalescing), since it gives you more protection against bursts and in-kernel hiccups, therefore reducing the number of packets dropped due to no buffer space or missed interrupts. But there are a couple of caveats:
* On older kernels, or drivers without BQL support, high values may contribute to higher bufferbloat on the tx side.
* Bigger buffers will also increase cache pressure, so if you are experiencing it, try lowering them.

**Coalescing**

Interrupt coalescing allows you to delay notifying the kernel about new events by aggregating multiple events into a single interrupt. The current setting can be viewed via ethtool -c:
    $ ethtool -c eth0
    Coalesce parameters for eth0:
    ...
    rx-usecs: 50
    tx-usecs: 50

You can either go with static limits, hard-limiting the maximum number of interrupts per second per core, or depend on the hardware to automatically adjust the interrupt rate based on the throughput.
Enabling coalescing (with -C) will increase latency and possibly introduce packet loss, so you may want to avoid it for latency-sensitive workloads. On the other hand, disabling it completely may lead to interrupt throttling and therefore limit your performance.
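For example, a sketch of both approaches (whether adaptive coalescing is supported depends on the driver, and the values are illustrative):

    # static limits: fire an interrupt at most roughly every 50 microseconds per direction
    ethtool -C eth0 rx-usecs 50 tx-usecs 50
    # or let the hardware adapt the interrupt rate to the load, if the driver supports it
    ethtool -C eth0 adaptive-rx on adaptive-tx on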
**Offloads**

Modern network cards are relatively smart and can offload a great deal of work to either hardware or emulate that offload in the drivers themselves.
All possible offloads can be obtained with ethtool -k:
    $ ethtool -k eth0
    Features for eth0:
    ...
    tcp-segmentation-offload: on
    generic-segmentation-offload: on
    generic-receive-offload: on
    large-receive-offload: off [fixed]

In the output all non-tunable offloads are marked with the [fixed] suffix.
There is a lot to say about all of them, but here are some rules of thumb:
* do not enable LRO, use GRO instead.
* be cautious about TSO, since it highly depends on the quality of your drivers/firmware.
* do not enable TSO/GSO on old kernels, since it may lead to excessive bufferbloat.

**Packet Steering**

All modern NICs are optimized for multi-core hardware, therefore they internally split packets into virtual queues, usually one per CPU. When it is done in hardware it is called RSS; when the OS is responsible for loadbalancing packets across CPUs it is called RPS (with its TX counterpart called XPS). When the OS also tries to be smart and route flows to the CPUs that are currently handling that socket, it is called RFS. When hardware does that it is called “Accelerated RFS” or aRFS for short. Here are a couple of best practices from our production:
* If you are using newer 25G+ hardware it probably has enough queues and a huge indirection table to be able to just RSS across all your cores. Some older NICs have limitations of only utilizing the first 16 CPUs.
* You can try enabling RPS if:
  * you have more CPUs than hardware queues and you want to sacrifice latency for throughput.
  * you are using internal tunneling (e.g. GRE/IPinIP) that the NIC can’t RSS.
* Do not enable RPS if your CPU is quite old and does not have x2APIC.
* Binding each CPU to its own TX queue through XPS is generally a good idea.
* Effectiveness of RFS is highly dependent on your workload and whether you apply CPU affinity to it.

**Flow Director and ATR**

When enabled, flow director (or fdir in Intel terminology) operates by default in an Application Targeting Routing mode, which implements aRFS by sampling packets and steering flows to the core where they presumably are being handled. Its stats are also accessible through ethtool -S:

    $ ethtool -S eth0 | egrep 'fdir'
    port.fdir_flush_cnt: 0
    ...

Though Intel claims that fdir increases performance in some cases, external research suggests that it can also introduce up to 1% of packet reordering, which can be quite damaging for TCP performance. Therefore test it for yourself and see if FD is useful for your workload, while keeping an eye on the TCPOFOQueue counter.
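As a quick way of checking that counter (nstat ships with iproute2):

    # absolute value of the TCP out-of-order queue counter, including zero
    nstat -az TcpExtTCPOFOQueue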
Operating systems: Networking stack
There are countless books, videos, and tutorials on tuning the Linux networking stack. And sadly, tons of “sysctl.conf cargo-culting” that comes with them. Even though recent kernel versions do not require as much tuning as they did 10 years ago, and most of the new TCP/IP features are enabled and well-tuned by default, people are still copy-pasting the old sysctl.conf they used to tune 2.6.18/2.6.32 kernels.
To verify effectiveness of network-related optimizations you should:
* Collect system-wide TCP metrics via /proc/net/snmp and /proc/net/netstat.
* Aggregate per-connection metrics obtained either from ss -n --extended --info, or from calling getsockopt([TCP_INFO](http://linuxgazette.net/136/pfeiffer.html))/getsockopt([TCP_CC_INFO](https://patchwork.ozlabs.org/patch/465806/)) inside your web server.
* Run tcptrace(1) over samples of TCP flows.
* Analyze RUM metrics from the app/browser.
For sources of information about network optimizations, I usually enjoy conference talks by CDN folks, since they generally know what they are doing, e.g. Fastly at LinuxCon Australia. Listening to what Linux kernel devs say about networking is quite enlightening too, for example netdevconf talks and NETCONF transcripts.
It is worth highlighting the good deep-dives into the Linux networking stack by PackageCloud, especially since they put an emphasis on monitoring instead of blindly tuning things.
Before we start, let me state it one more time: upgrade your kernel! There are tons of new network stack improvements, and I’m not even talking about IW10 (which is so 2010). I am talking about new hotness like: TSO autosizing, FQ, pacing, TLP, and RACK, but more on that later. As a bonus by upgrading to a new kernel you’ll get a bunch of scalability improvements, e.g.: removed routing cache, lockless listen sockets, SO_REUSEPORT, and many more.
Overview
Among recent Linux networking papers, the one that stands out is “Making Linux TCP Fast.” It manages to consolidate multiple years of Linux kernel improvements into 4 pages by breaking down the Linux sender-side TCP stack into functional pieces:
Fair Queueing and Pacing
Fair Queueing is responsible for improving fairness and reducing head-of-line blocking between TCP flows, which positively affects packet drop rates. Pacing schedules packets at a rate set by the congestion control, equally spaced over time, which reduces packet loss even further, therefore increasing throughput.
As a side note: Fair Queueing and Pacing are available in Linux via the fq qdisc. Some of you may know that these were a requirement for BBR (not anymore though), but both of them can be used with CUBIC, yielding up to a 15-20% reduction in packet loss and therefore better throughput on loss-based CCs. Just don’t use it on older kernels (< 3.19), since you will end up pacing pure ACKs and cripple your uploads/RPCs.
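A sketch of enabling it (the interface name is a placeholder, and newer distros may already default to fq):

    # attach the fq qdisc to a specific interface
    tc qdisc replace dev eth0 root fq
    # or make fq the default qdisc for newly created interfaces
    sysctl -w net.core.default_qdisc=fq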
TSO autosizing and TSQ
Both of these are responsible for limiting buffering inside the TCP stack and hence reducing latency, without sacrificing throughput.
Congestion Control
CC algorithms are a huge subject by themselves, and there was a lot of activity around them in recent years. Some of that activity was codified as: tcp_cdg (CAIA), tcp_nv (Facebook), and tcp_bbr (Google). We won’t go too deep into their inner workings; let’s just say that all of them rely more on delay increases than on packet drops as a congestion indication.
BBR is arguably the most well-documented, tested, and practical of all the new congestion controls. The basic idea is to create a model of the network path based on packet delivery rate and then execute control loops to maximize bandwidth while minimizing RTT. This is exactly what we are looking for in our proxy stack.
Preliminary data from BBR experiments on our Edge PoPs shows an increase of file download speeds:
6 hour TCP BBR experiment in Tokyo PoP: x-axis — time, y-axis — client download speed
Here I want to stress that we observe a speed increase across all percentiles. That is not the case for backend changes: those usually only benefit p90+ users (the ones with the fastest internet connectivity), since we consider everyone else to be bandwidth-limited already. Network-level tunings like changing the congestion control or enabling FQ/pacing show that users are not bandwidth-limited but, if I can say this, they are “TCP-limited.”
If you want to know more about BBR, APNIC has a good entry-level overview of BBR (and its comparison to loss-based congestion controls). For more in-depth information on BBR you probably want to read through the bbr-dev mailing list archives (it has a ton of useful links pinned at the top). For people interested in congestion control in general it may be fun to follow the Internet Congestion Control Research Group activity.
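If you want to experiment with BBR yourself, a minimal sketch (assumes a 4.9+ kernel built with tcp_bbr; as noted above, pairing it with fq is a good idea):

    modprobe tcp_bbr                                   # load the BBR module if it isn't built in
    sysctl net.ipv4.tcp_available_congestion_control   # verify that bbr is now listed
    sysctl -w net.ipv4.tcp_congestion_control=bbr      # switch the default congestion control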
ACK Processing and Loss Detection
But enough about congestion control, let’s talk about loss detection. Here, once again, running the latest kernel will help quite a bit. New heuristics like TLP and RACK are constantly being added to TCP, while old stuff like FACK and ER is being retired. Once added, they are enabled by default, so you do not need to tune any system settings after the upgrade.
Userspace prioritization and HOL
Userspace socket APIs provide implicit buffering and no way to re-order chunks once they are sent, therefore in multiplexed scenarios (e.g. HTTP/2) this may result in HOL blocking and inversion of h2 priorities. The TCP_NOTSENT_LOWAT socket option (and the corresponding net.ipv4.tcp_notsent_lowat sysctl) were designed to solve this problem by setting a threshold at which the socket considers itself writable (i.e. epoll will lie to your app). This can solve problems with HTTP/2 prioritization, but it can also potentially negatively affect throughput, so you know the drill—test it yourself.
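A sketch of the system-wide knob (the threshold value is only an illustration; the per-socket setsockopt(TCP_NOTSENT_LOWAT) gives finer control):

    # keep at most ~16KB of not-yet-sent data buffered per socket before epoll reports it writable
    sysctl -w net.ipv4.tcp_notsent_lowat=16384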
Sysctls
One does not simply give a networking optimization talk without mentioning sysctls that need to be tuned. But let me first note that there is plenty of stuff you don’t want to touch at all.
As for sysctls that you should be using:
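As an illustration, a few commonly cited examples (treat these as starting points to test against your own kernel and workload, not as a copy-paste-ready config):

    # illustrative examples only; verify each one yourself
    net.ipv4.tcp_slow_start_after_idle=0   # don't decay cwnd on idle persistent connections
    net.ipv4.tcp_mtu_probing=1             # enable PLPMTUD to cope with ICMP blackholes
    net.ipv4.tcp_notsent_lowat=16384       # see the TCP_NOTSENT_LOWAT discussion above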
It is also worth noting that there is an RFC draft (though a bit inactive) from the author of curl, Daniel Stenberg, named TCP Tuning for HTTP, which tries to aggregate all system tunings that may be beneficial to HTTP in a single place.
Application level: Midlevel
Just like with the kernel, having up-to-date userspace is very important. You should start with upgrading your tools, for example you can package newer versions of perf, bcc, etc.
Once you have new tooling you are ready to properly tune and observe the behavior of the system. Throughout this part of the post we’ll be mostly relying on on-CPU profiling with perf top, on-CPU flamegraphs, and ad hoc histograms from bcc’s funclatency.
Compiler Toolchain
Having a modern compiler toolchain is essential if you want to compile hardware-optimized assembly, which is present in many libraries commonly used by web servers.
Aside from the performance, newer compilers have new security features (e.g. -fstack-protector-strong or SafeStack) that you want to be applied on the edge. The other use case for modern toolchains is when you want to run your test harnesses against binaries compiled with sanitizers (e.g. AddressSanitizer, and friends).
System libraries
It’s also worth upgrading system libraries, like glibc, since otherwise you may be missing out on recent optimizations in low-level functions from -lc, -lm, -lrt, etc. Test-it-yourself warning also applies here, since occasional regressions creep in.
Zlib
Normally a web server would be responsible for compression. Depending on how much data is going through that proxy, you may occasionally see zlib’s symbols in perf top.
There are ways of optimizing that at the lowest levels: both Intel and Cloudflare, as well as the standalone zlib-ng project, have zlib forks which provide better performance by utilizing new instruction sets.
Malloc
We’ve been mostly CPU-oriented when discussing optimizations up until now, but let’s switch gears and discuss memory-related optimizations. If you use lots of Lua with FFI or heavy third party modules that do their own memory management, you may observe increased memory usage due to fragmentation. You can try solving that problem by switching to either jemalloc or tcmalloc.
Using custom malloc also has the following benefits:
* Separating your nginx binary from the environment, so that glibc version upgrades and OS migration will affect it less.
* Better introspection, profiling, and stats.

PCRE

If you use many complex regular expressions in your nginx configs or heavily rely on Lua, you may see pcre-related symbols in perf top. You can optimize that by compiling PCRE with JIT, and also enabling it in nginx via pcre_jit on;.
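A sketch of what that looks like when building nginx from source (paths and versions are placeholders):

    # build nginx against a JIT-capable PCRE
    ./configure --with-pcre=../pcre-8.42 --with-pcre-jit ... && make
    # then, in the main context of nginx.conf:
    #   pcre_jit on;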
You can check the result of the optimization by either looking at flame graphs or by using funclatency.
TLS
If you are terminating TLS on the edge without being fronted by a CDN, then TLS performance optimizations may be highly valuable. When discussing tunings we’ll be mostly focusing on server-side efficiency.
So, nowadays the first thing you need to decide is which TLS library to use: vanilla OpenSSL, OpenBSD’s LibreSSL, or Google’s BoringSSL. After picking the TLS library flavor, you need to properly build it: OpenSSL, for example, has a bunch of build-time heuristics that enable optimizations based on the build environment; BoringSSL has deterministic builds, but sadly is way more conservative and just disables some optimizations by default. Anyway, here is where choosing a modern CPU should finally pay off: most TLS libraries can utilize everything from AES-NI and SSE to ADX and AVX512. You can use the built-in performance tests that come with your TLS library, e.g. in BoringSSL’s case it’s bssl speed.
Most of the performance comes not from the hardware you have, but from the cipher suites you are going to use, so you have to optimize them carefully. Also know that changes here can (and will!) affect the security of your web server—the fastest cipher suites are not necessarily the best. If unsure what encryption settings to use, the Mozilla SSL Configuration Generator is a good place to start.
**Asymmetric Encryption**

If your service is on the edge, then you may observe a considerable amount of TLS handshakes and therefore have a good chunk of your CPU consumed by asymmetric crypto, making it an obvious target for optimization.
To optimize server-side CPU usage you can switch to ECDSA certs, which are generally 10x faster than RSA. Also, they are considerably smaller, so that may speed up the handshake in the presence of packet loss. But ECDSA is also heavily dependent on the quality of your system’s random number generator, so if you are using OpenSSL, be sure to have enough entropy (with BoringSSL you do not need to worry about that).
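For example, a sketch of generating a P-256 key and CSR with the openssl CLI (filenames are placeholders; your CA workflow may differ):

    # generate an ECDSA key on the P-256 curve and a certificate signing request for it
    openssl ecparam -genkey -name prime256v1 -noout -out ecdsa.key
    openssl req -new -key ecdsa.key -out ecdsa.csr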
As a side note, it is worth mentioning that bigger is not always better, e.g. using 4096-bit RSA certs will degrade your performance by 10x:
    $ bssl speed
    Did 1517 RSA 2048 signing ... (1507.3 ops/sec)
    Did 160 RSA 4096 signing ...  (153.4 ops/sec)

To make it worse, smaller isn’t necessarily the best choice either: by using the less common P-224 field for ECDSA you’ll get 60% worse performance compared to the more common P-256:
    $ bssl speed
    Did 7056 ECDSA P-224 signing ...  (6831.1 ops/sec)
    Did 17000 ECDSA P-256 signing ... (16885.3 ops/sec)

The rule of thumb here is that the most commonly used encryption is generally the most optimized one.
When running a properly optimized OpenSSL-based library with RSA certs, you should see the following traces in your perf top. AVX2-capable, but not ADX-capable, boxes (e.g. Haswell) should use the AVX2 codepath:
      6.42%  nginx  [.] rsaz_1024_sqr_avx2
      1.61%  nginx  [.] rsaz_1024_mul_avx2

While newer hardware should use a generic Montgomery multiplication with the ADX codepath:
      7.08%  nginx  [.] sqrx8x_internal
      2.30%  nginx  [.] mulx4x_internal

**Symmetric Encryption**

If you have lots of bulk transfers like videos, photos, or, more generically, files, then you may start observing symmetric encryption symbols in the profiler’s output. Here you just need to make sure that your CPU has AES-NI support and that you set your server-side preferences for AES-GCM ciphers. Properly tuned hardware should have the following in perf top:
      8.47%  nginx  [.] aesni_ctr32_ghash_6x

But it’s not only your servers that will need to deal with encryption/decryption—your clients will share the same burden on way less capable CPUs. Without hardware acceleration this may be quite challenging, therefore you may consider using an algorithm that was designed to be fast without hardware acceleration, e.g. ChaCha20-Poly1305. This will reduce TTLB for some of your mobile clients.
ChaCha20-Poly1305 is supported in BoringSSL out of the box, for OpenSSL 1.0.2 you may consider using Cloudflare patches. BoringSSL also supports “equal preference cipher groups,” so you may use the following config to let clients decide what ciphers to use based on their hardware capabilities (shamelessly stolen from cloudflare/sslconfig):
    ssl_ciphers '[ECDHE-ECDSA-AES128-GCM-SHA256|ECDHE-ECDSA-CHACHA20-POLY1305|ECDHE-RSA-AES128-GCM-SHA256|ECDHE-RSA-CHACHA20-POLY1305]:ECDHE+AES128:RSA+AES128:ECDHE+AES256:RSA+AES256:ECDHE+3DES:RSA+3DES';
    ssl_prefer_server_ciphers on;

Application level: Highlevel
To analyze the effectiveness of your optimizations at that level you will need to collect RUM data. In browsers you can use the Navigation Timing and Resource Timing APIs. Your main metrics are TTFB and TTV/TTI. Having that data in an easily queryable and graphable format will greatly simplify iteration.
Compression
Compression in nginx starts with the mime.types file, which defines the default correspondence between file extensions and response MIME types. Then you need to define what types you want to pass to your compressor, e.g. with gzip_types. If you want the complete list you can use mime-db to autogenerate your mime.types and add those with .compressible == true to gzip_types.
When enabling gzip, be careful about two aspects of it:
* Increased memory usage. This can be solved by limiting gzip_buffers.
* Increased TTFB due to the buffering. This can be solved by using gzip_no_buffer.

As a side note, HTTP compression is not limited to gzip exclusively: nginx has a third-party ngx_brotli module that can improve compression ratios by up to 30% compared to gzip.
As for compression settings themselves, let’s discuss two separate use-cases: static and dynamic data.
* For static data you can achieve maximum compression ratios by pre-compressing your static assets as a part of the build process. We discussed that in quite a bit of detail in the Deploying Brotli for static content post, for both gzip and brotli.
* For dynamic data you need to carefully balance the full roundtrip: time to compress the data + time to transfer it + time to decompress on the client. Therefore setting the highest possible compression level may be unwise, not only from a CPU usage perspective, but also from a TTFB one.

Buffering

Buffering inside the proxy can greatly affect web server performance, especially with respect to latency. The nginx proxy module has various buffering knobs that are togglable on a per-location basis, each of them useful for its own purpose. You can separately control buffering in both directions via proxy_request_buffering and proxy_buffering. If buffering is enabled, the upper limit on memory consumption is set by client_body_buffer_size and proxy_buffers; after hitting these thresholds the request/response is buffered to disk. For responses this can be disabled by setting proxy_max_temp_file_size to 0.
Most common approaches to buffering are:
* Buffer the request/response up to some threshold in memory and then overflow to disk. If request buffering is enabled, you only send a request to the backend once it is fully received, and with response buffering, you can instantaneously free a backend thread once it is ready with the response. This approach has the benefits of improved throughput and backend protection at the cost of increased latency and memory/IO usage (though if you use SSDs that may not be much of a problem).
* No buffering. Buffering may not be a good choice for latency-sensitive routes, especially ones that use streaming. For them you may want to disable it, but now your backend needs to deal with slow clients (incl. malicious slow-POST/slow-read kinds of attacks).
* Application-controlled response buffering through the X-Accel-Buffering header.

Whatever path you choose, do not forget to test its effect on both TTFB and TTLB. Also, as mentioned before, buffering can affect IO usage and even backend utilization, so keep an eye out for that too.
TLS
Now we are going to talk about high-level aspects of TLS and latency improvements that can be achieved by properly configuring nginx. Most of the optimizations I’ll be mentioning are covered in High Performance Browser Networking’s “Optimizing for TLS” section and the Making HTTPS Fast(er) talk at nginx.conf 2014. Tunings mentioned in this part will affect both the performance and security of your web server; if unsure, please consult Mozilla’s Server Side TLS Guide and/or your Security Team.
To verify the results of optimizations you can use:
**Session resumption**

As DBAs love to say, “the fastest query is the one you never make.” The same goes for TLS—you can reduce latency by one RTT if you cache the result of the handshake. There are two ways of doing that:
* You can ask the client to store all session parameters (in a signed and encrypted way), and send them to you during the next handshake (similar to a cookie). On the nginx side this is configured via the ssl_session_tickets directive. This does not consume any memory on the server side, but has a number of downsides:
  * You need the infrastructure to create, rotate, and distribute random encryption/signing keys for your TLS sessions. Just remember that you really shouldn’t 1) use source control to store ticket keys 2) generate these keys from other non-ephemeral material, e.g. date or cert.
  * PFS won’t be on a per-session basis but on a per-tls-ticket-key basis, so if an attacker gets a hold of the ticket key, they can potentially decrypt any captured traffic for the duration of the ticket.
  * Your encryption will be limited to the size of your ticket key. It does not make much sense to use AES256 if you are using a 128-bit ticket key. Nginx supports both 128-bit and 256-bit TLS ticket keys.
  * Not all clients support ticket keys (all modern browsers do support them though).
* Or you can store TLS session parameters on the server and only give a reference (an id) to the client. This is done via the ssl_session_cache directive. It has the benefit of preserving PFS between sessions and greatly limiting the attack surface. Though the session cache has its own downsides:
  * Each session consumes ~256 bytes of memory on the server, which means you can’t store many of them for too long.
  * Sessions can not be easily shared between servers. Therefore you either need a loadbalancer which will send the same client to the same server to preserve cache locality, or write a distributed TLS session storage on top of something like ngx_http_lua_module.

As a side note, if you go with the session ticket approach, then it’s worth using 3 keys instead of one, e.g.:
    ssl_session_tickets on;
    ssl_session_timeout 1h;
    ssl_session_ticket_key /run/nginx-ephemeral/nginx_session_ticket_curr;
    ssl_session_ticket_key /run/nginx-ephemeral/nginx_session_ticket_prev;
    ssl_session_ticket_key /run/nginx-ephemeral/nginx_session_ticket_next;

You will always be encrypting with the current key, but accepting sessions encrypted with both the next and previous keys.
**OCSP Stapling**

You should staple your OCSP responses, since otherwise:
* Your TLS handshake may take longer because the client will need to contact the certificate authority to fetch the OCSP status.
* An OCSP fetch failure may result in an availability hit.
* You may compromise users’ privacy, since their browser will contact a third-party service indicating that they want to connect to your site.

To staple the OCSP response you can periodically fetch it from your certificate authority, distribute the result to your web servers, and use it with the ssl_stapling_file directive:
    ssl_stapling_file /var/cache/nginx/ocsp/www.der;

**TLS record size**

TLS breaks data into chunks called records, which you can’t verify and decrypt until you receive them in their entirety. You can measure this latency as the difference between TTFB from the network stack and application points of view.
By default nginx uses 16k chunks, which do not even fit into the IW10 congestion window, and therefore require an additional roundtrip. Out of the box, nginx provides a way to set record sizes via the ssl_buffer_size directive:
* To optimize for low latency you should set it to something small, e.g. 4k. Decreasing it further will be more expensive from a CPU usage perspective.
* To optimize for high throughput you should leave it at 16k.

There are two problems with static tuning:
* You need to tune it manually.
* You can only set ssl_buffer_size on a per-nginx-config or per-server-block basis, therefore if you have a server with mixed latency/throughput workloads you’ll need to compromise.

There is an alternative approach: dynamic record size tuning. There is an nginx patch from Cloudflare that adds support for dynamic record sizes. It may be a pain to initially configure, but once it is set up, it works quite nicely.
**TLS 1.3**

TLS 1.3 features indeed sound very nice, but unless you have the resources to troubleshoot TLS full-time I would suggest not enabling it, since:
* It is still a draft.
* The 0-RTT handshake has some security implications, and your application needs to be ready for it.
* There are still middleboxes (antiviruses, DPIs, etc.) that block unknown TLS versions.

Avoid Eventloop Stalls

Nginx is an eventloop-based web server, which means it can only do one thing at a time. Even though it seems that it does all of these things simultaneously, as in time-division multiplexing, all nginx does is quickly switch between events, handling one after another. It all works because handling each event takes only a couple of microseconds. But if it starts taking too much time, e.g. because it requires going to a spinning disk, latency can skyrocket.
If you start noticing that nginx is spending too much time inside the ngx_process_events_and_timers function, and its latency distribution is bimodal, then you are probably affected by eventloop stalls.
**AIO and Threadpools**

Since the main source of eventloop stalls, especially on spinning disks, is IO, you should probably look there first. You can measure how much you are affected by it by running fileslower:
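For example, a sketch using the bcc tool (the install path and the threshold are illustrative):

    # trace synchronous file reads/writes that take longer than 10 ms, system-wide
    /usr/share/bcc/tools/fileslower 10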
To fix this, nginx has support for offloading IO to a threadpool (it also has support for AIO, but native AIO in Unixes has lots of quirks, so better to avoid it unless you know what you are doing). A basic setup consists of simply:
    aio threads;
    aio_write on;

For more complicated cases you can set up custom thread_pools, e.g. one per disk, so that if one drive becomes wonky, it won’t affect the rest of the requests. Thread pools can greatly reduce the number of nginx processes stuck in D state, improving both latency and throughput. But they won’t eliminate eventloop stalls fully, since not all IO operations are currently offloaded to them.
**Logging**

Writing logs can also take a considerable amount of time, since it hits the disks. You can check whether that’s the case by running ext4slower and looking for access/error log references:
It is possible to work around this by spooling access logs in memory before writing them, using the buffer parameter of the access_log directive. With the gzip parameter you can also compress the logs before writing them to disk, reducing IO pressure even more.
But to fully eliminate IO stalls on log writes you should just write logs via syslog; this way logs will be fully integrated with the nginx eventloop.
**Open file cache**

Since open(2) calls are inherently blocking and web servers routinely open/read/close files, it may be beneficial to have a cache of open files. You can see how much benefit there is by looking at ngx_open_cached_file function latency:
If you see that either there are too many open calls or there are some that take too much time, you can look at enabling the open file cache:
    open_file_cache max=10000;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;

After enabling open_file_cache you can observe all the cache misses with opensnoop and decide whether you need to tune the cache limits.
Wrapping up
All optimizations described in this post are local to a single web server box. Some of them improve scalability and performance. Others are relevant if you want to serve requests with minimal latency or deliver bytes faster to the client. But in our experience a huge chunk of user-visible performance comes from more high-level optimizations that affect the behavior of the Dropbox Edge Network as a whole, like ingress/egress traffic engineering and smarter internal load balancing. These problems are on the edge (pun intended) of knowledge, and the industry has only just started approaching them.
If you’ve read this far you probably want to work on solving these and other interesting problems! You’re in luck: Dropbox is looking for experienced SWEs, SREs, and Managers.