By Lucas Theis and Zehan Wang, Wednesday, 24 January 2018
The ability to share photos directly on Twitter has existed since 2011 and is now an integral part of the Twitter experience. Today, millions of images are uploaded to Twitter every day. However, they can come in all sorts of shapes and sizes, which presents a challenge for rendering a consistent UI experience. The photos in your timeline are cropped to improve consistency and to allow you to see more Tweets at a glance. How do we decide what to crop, that is, which part of the image do we show you?
Previously, we used face detection to focus the view on the most prominent face we could find. While this is not an unreasonable heuristic, the approach has obvious limitations since not all images contain faces. Additionally, our face detector often missed faces and sometimes mistakenly detected faces when there were none. If no faces were found, we would focus the view on the center of the image. This could lead to awkwardly cropped preview images.
A better way to crop is to focus on “salient” image regions. A region having high saliency means that a person is likely to look at it when freely viewing the image. Academics have studied and measured saliency by using eye trackers, which record the pixels people fixated with their eyes. In general, people tend to pay more attention to faces, text, animals, but also other objects and regions of high contrast. This data can be used to train neural networks and other algorithms to predict what people might want to look at. The basic idea is to use these predictions to center a crop around the most interesting region [1].
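The crop-centering idea can be sketched in a few lines. This is a hypothetical illustration, not Twitter's production code: the `crop_around_saliency` helper and the toy arrays are ours, and we assume a precomputed saliency map the same size as the image.

```python
import numpy as np

def crop_around_saliency(image, saliency, crop_h, crop_w):
    """Center a crop_h x crop_w window on the most salient pixel,
    clamping the window so it stays inside the image bounds."""
    cy, cx = np.unravel_index(np.argmax(saliency), saliency.shape)
    h, w = image.shape[:2]
    top = min(max(cy - crop_h // 2, 0), h - crop_h)
    left = min(max(cx - crop_w // 2, 0), w - crop_w)
    return image[top:top + crop_h, left:left + crop_w]

# Toy example: a 6x8 "image" whose saliency peaks at row 1, column 6.
img = np.arange(48).reshape(6, 8)
sal = np.zeros((6, 8))
sal[1, 6] = 1.0
crop = crop_around_saliency(img, sal, 4, 4)
```

A real saliency map would come from the neural network described below; the clamping logic is the only part the crop itself needs.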
Thanks to recent advances in machine learning, saliency prediction has gotten a lot better [2]. Unfortunately, the neural networks used to predict saliency are too slow to run in production, since we need to process every image uploaded to Twitter and enable cropping without impacting the ability to share in real-time. On the other hand, we don’t need fine-grained, pixel-level predictions, since we are only interested in roughly knowing where the most salient regions are. In addition to optimizing the neural network’s implementation, we used two techniques to reduce its size and computational requirements.
First, we used a technique called knowledge distillation to train a smaller network to imitate the slower but more powerful network [3]. With this, an ensemble of large networks is used to generate predictions on a set of images. These predictions, together with some third-party saliency data, are then used to train a smaller, faster network.
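The distillation recipe of [3] can be sketched with stand-in models. Here the "teacher" is a fixed linear map playing the role of the large ensemble, and the "student" is fit to the teacher's predictions rather than to ground-truth labels; all names, shapes, and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "teacher": a fixed linear map standing in for the
# large, slow ensemble (the real teacher is a deep network).
W_teacher = rng.normal(size=(4, 1))

def teacher(X):
    return X @ W_teacher

# "Student": a smaller model (here just another linear map) trained
# to imitate the teacher's predictions on unlabeled inputs.
X = rng.normal(size=(256, 4))
targets = teacher(X)            # teacher predictions act as soft labels

W_student = np.zeros((4, 1))
lr = 0.1
for _ in range(500):
    pred = X @ W_student
    grad = X.T @ (pred - targets) / len(X)   # gradient of MSE w.r.t. W
    W_student -= lr * grad

# After training, the student closely matches the teacher.
err = float(np.max(np.abs(W_student - W_teacher)))
```

The point of distillation is that the student trains against the teacher's outputs, so it can be much smaller and faster while approximating the same function.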
Second, we developed a pruning technique to iteratively remove feature maps of the neural network which were costly to compute but did not contribute much to the performance. To decide which feature maps to prune, we computed the number of floating point operations required for each feature map and combined it with an estimate of the performance loss that would be suffered by removing it. More details on our pruning approach can be found in our paper which has been released on arXiv [4].
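A toy sketch of the cost-aware idea (not the paper's actual Fisher-pruning criterion [4]): score each feature map by its estimated performance loss per floating point operation, and prune the lowest-scoring maps first. The map names and numbers below are made up.

```python
# Hypothetical per-feature-map statistics: FLOPs to compute the map,
# and an estimate of the loss increase if the map were removed.
feature_maps = {
    "conv1_map3": {"flops": 4.0e6, "loss_increase": 0.0010},
    "conv2_map7": {"flops": 9.0e6, "loss_increase": 0.0008},
    "conv3_map1": {"flops": 2.0e6, "loss_increase": 0.0200},
}

def prune_order(maps):
    """Return map names in pruning order: least benefit per FLOP first."""
    score = lambda name: maps[name]["loss_increase"] / maps[name]["flops"]
    return sorted(maps, key=score)

order = prune_order(feature_maps)
# conv2_map7 is costly but contributes little, so it goes first;
# conv3_map1 is cheap and important, so it survives longest.
```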
Together, these two methods allowed us to crop media 10x faster than a vanilla implementation of the model, before any implementation optimizations. This lets us perform saliency detection on all images as soon as they are uploaded and crop them in real-time.
These updates are currently in the process of being rolled out to everyone on twitter.com, iOS and Android. Below are some more examples of how this new algorithm affects image cropping on Twitter.
Before: [example image]
After: [example image]
We’d like to thank everyone involved at Twitter who worked with us on this new update. In particular, the Video Org leadership, Media Platform, Magic Pony, Comms and Legal teams, with special thanks to:
References
[1] E. Ardizzone, A. Bruno, G. Mazzola
Saliency Based Image Cropping
ICIAP, 2013
[2] M. Kümmerer, L. Theis, M. Bethge
Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet
ICLR Workshop, 2015
[3] G. Hinton, O. Vinyals, J. Dean
Distilling the Knowledge in a Neural Network
NIPS workshop, 2014
[4] L. Theis, I. Korshunova, A. Tejani, F. Huszar
Faster gaze prediction with dense networks and Fisher pruning
arXiv:1801.05787, 2018
- Ability to combine algorithms and progressively achieve desired results
- Interactive components used to tweak parameters
- Support for cross-platform executable scripts
- Libraries to link and extend existing C++ algorithms

Since Live CV is open source and free to use, the time spent on its development can vary, depending on the amount of time I can invest and the number of people interested in the project. Knowing that more people are interested, I'm happy to allocate more time to its development.
You can support the project by:
- contributing plugins
- following the project on GitHub
- following the project on Twitter
- subscribing to the newsletter for updates
- sharing it on Facebook
Supports hundreds of audio and video formats and codecs thanks to FFmpeg. No import required which means native editing, plus multi-format timelines, resolutions and frame-rates within a project. Frame accurate seeking supported for many video formats.
Blackmagic Design SDI and HDMI for input and preview monitoring. Screen, webcam and audio capture. Network stream playback. Supports resolutions up to 4k and capture from SDI, HDMI, webcam, JACK & Pulse audio, IP stream, X11 screen and Windows DirectShow devices.
Multiple dockable and undockable panels, including detailed media properties, recent files with search, playlist with thumbnail view, filter panel, history view, encoding panel, jobs queue, and melted server and playlist. Also supports drag-n-drop of assets from file manager.
By the time this column appears in print, the Fed's first FOMC meeting of the year (and the last with Janet Yellen in the chair) will have concluded. Barring surprises, the Fed will wait until March to deliver this year's first rate hike. The market broadly expects three hikes in 2018, for a cumulative 0.75 percentage points. This column has reservations, however, and expects hikes totaling a full percentage point or more within the year. This week we examine the question from several angles.
Judging from interest-rate futures, the market broadly expects no change at this meeting (futures imply only a 20% probability of a hike). For the full year, the market expects between two and three hikes (25 basis points each). Three hikes a year is also roughly in line with the median of the Fed's pre-meeting dot plot [Chart 1]. In other words, the Fed and the market have, for now, "reached a consensus": three hikes this year, lifting the target federal funds rate to the 2.00%-2.25% range.
Will it play out that way? This column expects the Fed to raise rates by a full percentage point or more this year, for three reasons.
1. The economy is overheating
Since mid-2016 the US (and indeed the global) economy has re-entered an expansion phase, and the momentum shows no clear sign of slowing. This is corroborated by the forward-looking aggregate earnings of the major US stock-index constituents, which continue to trend upward [Chart 2]. Moreover, the tax cuts taking effect in the US this year may add further fuel to an economy already growing at a decent clip.
Note that as US equities keep grinding higher (short-term consolidation or correction pressure notwithstanding), the positive wealth effect will only make the economy run hotter. The latest surveys show that retail investors' appetite for buying in has grown with the market's extended rally (see last week's column), making the wealth effect's pressure on the economy (and even on inflation) all the more acute (see below). The Fed therefore needs to quicken the pace of rate hikes to contain the risk of overheating.
2. Inflation is picking up
The latest data show US consumer prices up only 2.1% year on year, and the Fed's preferred core personal consumption expenditures measure up just 1.5%, suggesting little inflationary pressure for now. But with the growth upcycle now in its second year, whether inflation can remain "dormant" deserves scrutiny.
In fact, US nominal GDP growth has over the years shown a strong relationship with core consumer prices lagged by six quarters, i.e. 18 months (a correlation coefficient of 0.78; note: the closer the coefficient is to +/-1, the stronger the positive/negative relationship), with the two moving largely in step [Chart 3]. Since nominal growth bottomed and turned up in the second quarter of 2016, upward price pressure should gradually become apparent 18 months later, which is to say now.
Moreover, with global growth momentum continuing, demand for commodities including oil is rising, which has driven the recovery in commodity prices in recent years and is set to further stoke US (and global) inflation this year. In other words, as inflation picks up and becomes visible, the Fed will need to accelerate its hikes to keep it from getting out of hand. Indeed, the break-even rate, which to some degree reflects market inflation expectations, has spiked again since late last year, hitting its highest level since the first half of last year [Chart 4], showing that the market is starting to worry about rising inflation.
3. Stocks are rising too fast
Many factors have driven US equities ever higher since the 2008 financial crisis, and the ultra-loose monetary policy of the early years was clearly among the most important. Yet the Fed's ending of quantitative easing after 2014, and even the start of the hiking cycle in late 2015, seems to have done little to rein in this bull. Instead, supported by improving corporate earnings, a markedly weaker dollar, and fiscal measures such as the tax cuts, the bull market has lived on, and its ascent has if anything accelerated. In the barely two years since the current hiking cycle began on 17 December 2015, the Dow Jones Industrial Average, the S&P 500 and the Nasdaq have returned a cumulative 56.9%, 44.3% and 51.9% respectively (dividends included). In January alone, all three indices gained more than 5%.
As discussed above, ever-rising asset prices risk overheating the economy through the positive wealth effect, which in turn threatens higher inflation. Indeed, the two-year rate of change of the S&P 500 has likewise tracked changes in US GDP closely over the years [Chart 5], so if the equity rally accelerates further, the risk of the US economy overheating this year is no scaremongering. The Fed may therefore act pre-emptively, speeding up its hikes to cool the stock market and, by extension, the economy.
In sum, with the US economy running relatively hot, tax reform taking effect this year, and equities potentially continuing their rapid climb, the risk that the economy overheats this year and pushes up inflation cannot be dismissed. Furthermore, after nearly two years of sustained expansion, rising inflation should gradually surface within the year, all of which will push the Fed toward a more hawkish stance. For the full year, we expect no fewer than four rate hikes, for a cumulative increase of a full percentage point or more.
HKEJ Investment Research Department
An old joke about "Bill Gates's son-in-law" has long circulated online. At a party, a businessman walks up to Gates and says, "My son wants to marry your daughter." Gates says, "No." The businessman says, "But my son is a vice-president of the World Bank." "Oh, well, in that case..." The businessman then goes to see the World Bank's president and says, "I'd like my son to be a vice-president." The president says, "We already have too many vice-presidents." The businessman replies, "But my son is Bill Gates's son-in-law." In the end the businessman's son not only marries Gates's daughter but also becomes a World Bank vice-president. The moral of the story: "That's how deals get done."
Mainland China's online retail is growing 30% a year, so e-commerce platforms must be growing 30% too? After years of rapid growth, social media's growth must be about to slow? Whoever says so probably doesn't understand either business model.
E-commerce platforms are gradually becoming an oligopoly
The platform itself doesn't sell you products; the shops on the platform do, and the platform lives off the advertising fees those shops pay. Online retail growth and platform revenue are therefore not the same thing: if shops could do business without buying ads, platform revenue could be zero.
Ten years ago, when online shopping in mainland China was just getting started, everything was still to be built. A shop only had to buy a little advertising to get outsized results, and with low prices it quickly drew traffic and its business took off.
Seeing online retail grow fast, brands went online one after another, which in turn encouraged more people to shop online. Once online shoppers numbered 500 to 600 million and every major brand in the country was online, no one could do without the platforms any more. The platforms still charge consumers nothing, but they can charge the merchants. Every business has marketing costs; online shops used to earn so much precisely because they barely needed to advertise. Prices were low, customers plentiful, competitors few, and money came easily. But now prices have been cut about as far as they can go and competitors keep multiplying, so not advertising is no longer an option. Where online channels used to take 1 yuan of every 10-yuan ad budget, it may now be 2, gradually rising to 3 or 4, with part of the rent saved on physical stores thrown in as well. Isn't the e-commerce platform simply a shopping mall moved online?
The later the stage, the lower the marginal return on advertising; but online is where the growth is, so shops cannot stop buying ads. Companies' marketing budgets (and even rents) are much the same as before; the money that used to go to several ad agencies and shopping malls now all flows to one or two e-commerce platforms.
Social media works the same way. In the old days, a brand hired an ad agency to design a campaign, cast celebrities, and air it on television. With social media, anyone can become a KOL (key opinion leader): a young model gets paid by a cosmetics or luxury-handbag brand to show off the product on social media and draw in buyers.
Social media feeds used to be chronological: a post the model made at midnight would be seen by everyone in the morning. The marketing was unusually effective and cost nothing in ad fees, so for a while being a KOL was genuinely lucrative.
Attract users for free, then earn from merchants
But as netizens multiplied, so did influencers and KOLs, and the platforms could change how feeds are ranked: if an influencer doesn't pay, their posts sink out of sight.
On reflection this is only natural. Why should influencers get to use the platform for free, pay no marketing costs, and earn so much? It is simply a return to equilibrium. Social media now earns not only the brands' ad money; a share of what brands pay influencers also goes to the platform. How could the platforms not be delighted?
The business model of e-commerce platforms and social media alike is to attract users with free services first; once the user base is large enough, merchants dare not stay away, and the platform can then slowly monetize them. Habits are hard to change: once people are used to shopping and advertising online, the platforms can sit back and collect money every day.
hcl.hkej@gmail.com
(Editor's note: 郝承林's book《致富新世代2──科網君臨天下》is now on sale, in print and e-book editions.)
Terence Parr and Jeremy Howard
(We teach in University of San Francisco's MS in Data Science program and have other nefarious projects underway. You might know Terence as the creator of the ANTLR parser generator. For more material, see Jeremy's fast.ai courses and University of San Francisco's Data Institute in-person version of the deep learning course.)
Printable version (This HTML was generated from markup using bookish)
Abstract
This paper is an attempt to explain all the matrix calculus you need in order to understand the training of deep neural networks. We assume no math knowledge beyond what you learned in calculus 1, and provide links to help you refresh the necessary math where needed. Note that you do not need to understand this material before you start learning to train and use deep learning in practice; rather, this material is for those who are already familiar with the basics of neural networks, and wish to deepen their understanding of the underlying math. Don't worry if you get stuck at some point along the way---just go back and reread the previous section, and try writing down and working through some examples. And if you're still stuck, we're happy to answer your questions in the Theory category at forums.fast.ai. Note: There is a reference section at the end of the paper summarizing all the key matrix calculus rules and terminology discussed here.
Introduction
Most of us last saw calculus in school, but derivatives are a critical part of machine learning, particularly deep neural networks, which are trained by optimizing a loss function. Pick up a machine learning paper or the documentation of a library such as PyTorch and calculus comes screeching back into your life like distant relatives around the holidays. And it's not just any old scalar calculus that pops up---you need differential matrix calculus, the shotgun wedding of linear algebra and multivariate calculus.
Well... maybe need isn't the right word; Jeremy's courses show how to become a world-class deep learning practitioner with only a minimal level of scalar calculus, thanks to leveraging the automatic differentiation built in to modern deep learning libraries. But if you really want to understand what's going on under the hood of these libraries, and grok academic papers discussing the latest advances in model training techniques, you'll need to understand certain bits of the field of matrix calculus.
For example, the activation of a single computation unit in a neural network is typically calculated using the dot product (from linear algebra) of an edge weight vector $\mathbf{w}$ with an input vector $\mathbf{x}$ plus a scalar bias (threshold): $z(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b = \sum_{i=1}^{|\mathbf{x}|} w_i x_i + b$. Function $z(\mathbf{x})$ is called the unit's affine function and is followed by a rectified linear unit, which clips negative values to zero: $activation(\mathbf{x}) = \max(0, \mathbf{w} \cdot \mathbf{x} + b)$. Such a computational unit is sometimes referred to as an “artificial neuron” and looks like:
Neural networks consist of many of these units, organized into multiple collections of neurons called layers. The activation of one layer's units becomes the input to the next layer's units. The activation of the unit or units in the final layer is called the network output.
Training this neuron means choosing weights $\mathbf{w}$ and bias $b$ so that we get the desired output for all $N$ inputs $\mathbf{x}$. To do that, we minimize a loss function that compares the network's final $activation(\mathbf{x})$ with $target(\mathbf{x})$, the desired output of $\mathbf{x}$, for all input $\mathbf{x}$ vectors. To minimize the loss, we use some variation on gradient descent, such as plain stochastic gradient descent (SGD), SGD with momentum, or Adam. All of those require the partial derivative (the gradient) of the loss with respect to the model parameters $\mathbf{w}$ and $b$. Our goal is to gradually tweak $\mathbf{w}$ and $b$ so that the overall loss function keeps getting smaller across all $\mathbf{x}$ inputs.
If we're careful, we can derive the gradient by differentiating the scalar version of a common loss function (mean squared error):
$$\frac{1}{N}\sum_{\mathbf{x}} \big(target(\mathbf{x}) - activation(\mathbf{x})\big)^2 = \frac{1}{N}\sum_{\mathbf{x}} \big(target(\mathbf{x}) - \max(0, \mathbf{w} \cdot \mathbf{x} + b)\big)^2$$
But this is just one neuron, and neural networks must train the weights and biases of all neurons in all layers simultaneously. Because there are multiple inputs and (potentially) multiple network outputs, we really need general rules for the derivative of a function with respect to a vector and even rules for the derivative of a vector-valued function with respect to a vector.
This article walks through the derivation of some important rules for computing partial derivatives with respect to vectors, particularly those useful for training neural networks. This field is known as matrix calculus, and the good news is, we only need a small subset of that field, which we introduce here. While there is a lot of online material on multivariate calculus and linear algebra, they are typically taught as two separate undergraduate courses so most material treats them in isolation. The pages that do discuss matrix calculus often are really just lists of rules with minimal explanation or are just pieces of the story. They also tend to be quite obscure to all but a narrow audience of mathematicians, thanks to their use of dense notation and minimal discussion of foundational concepts. (See the annotated list of resources at the end.)
In contrast, we're going to rederive and rediscover some key matrix calculus rules in an effort to explain them. It turns out that matrix calculus is really not that hard! There aren't dozens of new rules to learn; just a couple of key concepts. Our hope is that this short paper will get you started quickly in the world of matrix calculus as it relates to training neural networks. We're assuming you're already familiar with the basics of neural network architecture and training. If you're not, head over to Jeremy's course and complete part 1 of that, then we'll see you back here when you're done. (Note that, unlike many more academic approaches, we strongly suggest first learning to train and use neural networks in practice and then study the underlying math. The math will be much more understandable with the context in place; besides, it's not necessary to grok all this calculus to become an effective practitioner.)
A note on notation: Jeremy's course exclusively uses code, instead of math notation, to explain concepts since unfamiliar functions in code are easy to search for and experiment with. In this paper, we do the opposite: there is a lot of math notation because one of the goals of this paper is to help you understand the notation that you'll see in deep learning papers and books. At the end of the paper, you'll find a brief table of the notation used, including a word or phrase you can use to search for more details.
Review: Scalar derivative rules
Hopefully you remember some of these main scalar derivative rules. If your memory is a bit fuzzy on this, have a look at the Khan Academy video on scalar derivative rules.
| Rule | $f(x)$ | Scalar derivative notation with respect to $x$ | Example |
|---|---|---|---|
| Constant | $c$ | $0$ | $\frac{d}{dx}99 = 0$ |
| Multiplication by constant | $cf$ | $c\frac{df}{dx}$ | $\frac{d}{dx}3x = 3$ |
| Power rule | $x^n$ | $nx^{n-1}$ | $\frac{d}{dx}x^3 = 3x^2$ |
| Sum rule | $f + g$ | $\frac{df}{dx} + \frac{dg}{dx}$ | $\frac{d}{dx}(x^2 + 3x) = 2x + 3$ |
| Difference rule | $f - g$ | $\frac{df}{dx} - \frac{dg}{dx}$ | $\frac{d}{dx}(x^2 - 3x) = 2x - 3$ |
| Product rule | $fg$ | $f\frac{dg}{dx} + \frac{df}{dx}g$ | $\frac{d}{dx}(x \cdot x^2) = x \cdot 2x + x^2 = 3x^2$ |
| Chain rule | $f(g(x))$ | $\frac{df(u)}{du}\frac{du}{dx}$, let $u = g(x)$ | $\frac{d}{dx}\ln(x^2) = \frac{1}{x^2} \cdot 2x = \frac{2}{x}$ |
There are other rules for trigonometry, exponentials, etc., which you can find at Khan Academy differential calculus course.
When a function has a single parameter, $f(x)$, you'll often see $f'$ and $f'(x)$ used as shorthands for $\frac{d}{dx}f(x)$. We recommend against this notation as it does not make clear the variable we're taking the derivative with respect to.
You can think of $\frac{d}{dx}$ as an operator that maps a function of one parameter to another function. That means that $\frac{d}{dx}f(x)$ maps $f(x)$ to its derivative with respect to $x$, which is the same thing as $f'(x)$. Also, if $y = f(x)$, then $\frac{dy}{dx} = \frac{df(x)}{dx} = f'(x)$. Thinking of the derivative as an operator helps to simplify complicated derivatives because the operator is distributive and lets us pull out constants. For example, in the following equation, we can pull out the constant 9 and distribute the derivative operator across the elements within the parentheses:
$$\frac{d}{dx}9(x + x^2) = 9\frac{d}{dx}(x + x^2) = 9\left(\frac{d}{dx}x + \frac{d}{dx}x^2\right) = 9(1 + 2x) = 9 + 18x$$
That procedure reduced the derivative of $9(x + x^2)$ to a bit of arithmetic and the derivatives of $x$ and $x^2$, which are much easier to solve than the original derivative.
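As a quick numerical sanity check, a centered finite difference agrees with the closed form $9 + 18x$ (a plain-Python sketch of ours, not part of the paper):

```python
# Verify d/dx 9(x + x^2) = 9 + 18x with a centered finite difference.
def f(x):
    return 9 * (x + x**2)

def deriv(fn, x, h=1e-6):
    # Centered difference approximates the derivative at x.
    return (fn(x + h) - fn(x - h)) / (2 * h)

d = deriv(f, 1.5)   # closed form gives 9 + 18*1.5 = 36
```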
Introduction to vector calculus and partial derivatives
Neural network layers are not single functions of a single parameter, $f(x)$. So, let's move on to functions of multiple parameters such as $f(x,y)$. For example, what is the derivative of $xy$ (i.e., the multiplication of $x$ and $y$)? In other words, how does the product $xy$ change when we wiggle the variables? Well, it depends on whether we are changing $x$ or $y$. We compute derivatives with respect to one variable (parameter) at a time, giving us two different partial derivatives for this two-parameter function (one for $x$ and one for $y$). Instead of using operator $\frac{d}{dx}$, the partial derivative operator is $\partial$ (a stylized $d$ and not the Greek letter $\delta$). So, $\frac{\partial(xy)}{\partial x}$ and $\frac{\partial(xy)}{\partial y}$ are the partial derivatives of $xy$; often, these are just called the partials. For functions of a single parameter, operator $\frac{\partial}{\partial x}$ is equivalent to $\frac{d}{dx}$ (for sufficiently smooth functions). However, it's better to use $\frac{d}{dx}$ to make it clear you're referring to a scalar derivative.
The partial derivative with respect to $x$ is just the usual scalar derivative, simply treating any other variable in the equation as a constant. Consider function $f(x,y) = 3x^2y$. The partial derivative with respect to $x$ is written $\frac{\partial}{\partial x}3x^2y$. There are three constants from the perspective of $\frac{\partial}{\partial x}$: 3, 2, and $y$. Therefore, $\frac{\partial}{\partial x}3x^2y = 3y\frac{\partial}{\partial x}x^2 = 3y \cdot 2x = 6xy$. The partial derivative with respect to $y$ treats $x$ like a constant: $\frac{\partial}{\partial y}3x^2y = 3x^2\frac{\partial}{\partial y}y = 3x^2$. It's a good idea to derive these yourself before continuing otherwise the rest of the article won't make sense. Here's the Khan Academy video on partials if you need help.
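The two partials can be confirmed numerically with a small finite-difference helper (our own illustrative code, not from the paper):

```python
# f(x, y) = 3x^2 y has partials 6xy (w.r.t. x) and 3x^2 (w.r.t. y).
def f(x, y):
    return 3 * x**2 * y

def partial(fn, args, i, h=1e-6):
    # Centered difference in the i-th argument only.
    hi, lo = list(args), list(args)
    hi[i] += h
    lo[i] -= h
    return (fn(*hi) - fn(*lo)) / (2 * h)

x, y = 2.0, 3.0
dfdx = partial(f, (x, y), 0)   # expect 6*2*3 = 36
dfdy = partial(f, (x, y), 1)   # expect 3*2^2 = 12
```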
To make it clear we are doing vector calculus and not just multivariate calculus, let's consider what we do with the partial derivatives $\frac{\partial f(x,y)}{\partial x}$ and $\frac{\partial f(x,y)}{\partial y}$ that we computed for $f(x,y) = 3x^2y$. Instead of having them just floating around and not organized in any way, let's organize them into a horizontal vector. We call this vector the gradient of $f(x,y)$ and write it as:
$$\nabla f(x,y) = \left[\frac{\partial f(x,y)}{\partial x}, \frac{\partial f(x,y)}{\partial y}\right] = [6yx, 3x^2]$$
So the gradient of $f(x,y)$ is simply a vector of its partials. Gradients are part of the vector calculus world, which deals with functions that map $n$ scalar parameters to a single scalar. Now, let's get crazy and consider derivatives of multiple functions simultaneously.
Matrix calculus
When we move from derivatives of one function to derivatives of many functions, we move from the world of vector calculus to matrix calculus. Let's compute partial derivatives for two functions, both of which take two parameters. We can keep the same $f(x,y) = 3x^2y$ from the last section, but let's also bring in $g(x,y) = 2x + y^8$. The gradient for $g$ has two entries, a partial derivative for each parameter:
$$\frac{\partial g(x,y)}{\partial x} = \frac{\partial}{\partial x}2x + \frac{\partial}{\partial x}y^8 = 2 + 0 = 2$$
and
$$\frac{\partial g(x,y)}{\partial y} = \frac{\partial}{\partial y}2x + \frac{\partial}{\partial y}y^8 = 0 + 8y^7 = 8y^7$$
giving us gradient $\nabla g(x,y) = [2, 8y^7]$.
Gradient vectors organize all of the partial derivatives for a specific scalar function. If we have two functions, we can also organize their gradients into a matrix by stacking the gradients. When we do so, we get the Jacobian matrix (or just the Jacobian) where the gradients are rows:
$$J = \begin{bmatrix} \nabla f(x,y) \\ \nabla g(x,y) \end{bmatrix} = \begin{bmatrix} \frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y} \\ \frac{\partial g(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial y} \end{bmatrix} = \begin{bmatrix} 6yx & 3x^2 \\ 2 & 8y^7 \end{bmatrix}$$
Welcome to matrix calculus!
Note that there are multiple ways to represent the Jacobian. We are using the so-called numerator layout but many papers and software will use the denominator layout. This is just the transpose of the numerator layout Jacobian (flip it around its diagonal):
$$\begin{bmatrix} 6yx & 2 \\ 3x^2 & 8y^7 \end{bmatrix}$$
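Numerically, the numerator-layout Jacobian is just a stack of finite-difference gradients, one row per function (an illustrative helper of ours, not library code):

```python
def f(x, y): return 3 * x**2 * y
def g(x, y): return 2 * x + y**8

def jacobian(funcs, args, h=1e-6):
    """Rows are numerical gradients of each scalar function."""
    J = []
    for fn in funcs:
        row = []
        for i in range(len(args)):
            hi, lo = list(args), list(args)
            hi[i] += h
            lo[i] -= h
            row.append((fn(*hi) - fn(*lo)) / (2 * h))
        J.append(row)
    return J

# At (x, y) = (2, 1) the analytic Jacobian is
# [[6xy, 3x^2], [2, 8y^7]] = [[12, 12], [2, 8]].
J = jacobian([f, g], [2.0, 1.0])
```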
Generalization of the Jacobian
So far, we've looked at a specific example of a Jacobian matrix. To define the Jacobian matrix more generally, let's combine multiple parameters into a single vector argument: $f(x,y,z) \Rightarrow f(\mathbf{x})$. (You will sometimes see notation $\vec{x}$ for vectors in the literature as well.) Lowercase letters in bold font such as $\mathbf{x}$ are vectors and those in italics font like $x$ are scalars. $x_i$ is the $i^{th}$ element of vector $\mathbf{x}$ and is in italics because a single vector element is a scalar. We also have to define an orientation for vector $\mathbf{x}$. We'll assume that all vectors are vertical by default of size $n \times 1$:
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
With multiple scalar-valued functions, we can combine them all into a vector just like we did with the parameters. Let $\mathbf{y} = \mathbf{f}(\mathbf{x})$ be a vector of $m$ scalar-valued functions that each take a vector $\mathbf{x}$ of length $n = |\mathbf{x}|$, where $|\mathbf{x}|$ is the cardinality (count) of elements in $\mathbf{x}$. Each $f_i$ function within $\mathbf{f}$ returns a scalar just as in the previous section:
$$y_i = f_i(\mathbf{x})$$
For instance, we'd represent $f(x,y) = 3x^2y$ and $g(x,y) = 2x + y^8$ from the last section as
$$y_1 = f_1(\mathbf{x}) = 3x_1^2 x_2 \qquad y_2 = f_2(\mathbf{x}) = 2x_1 + x_2^8$$
(substituting $x_1$ for $x$ and $x_2$ for $y$).
It's very often the case that $m = n$ because we will have a scalar function result for each element of the $\mathbf{x}$ vector. For example, consider the identity function $\mathbf{y} = \mathbf{f}(\mathbf{x}) = \mathbf{x}$:
$$y_i = f_i(\mathbf{x}) = x_i$$
So we have $m = n$ functions and parameters, in this case. Generally speaking, though, the Jacobian matrix is the collection of all $m \times n$ possible partial derivatives ($m$ rows and $n$ columns), which is the stack of $m$ gradients with respect to $\mathbf{x}$:
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \nabla f_1(\mathbf{x}) \\ \nabla f_2(\mathbf{x}) \\ \vdots \\ \nabla f_m(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & \cdots & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(\mathbf{x})}{\partial x_1} & \cdots & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix}$$
Each $\nabla f_i(\mathbf{x})$ is a horizontal $n$-vector because the partial derivative is with respect to a vector, $\mathbf{x}$, whose length is $n = |\mathbf{x}|$. The width of the Jacobian is $n$ if we're taking the partial derivative with respect to $\mathbf{x}$ because there are $n$ parameters we can wiggle, each potentially changing the function's value. Therefore, the Jacobian is always $m$ rows for $m$ equations. It helps to think about the possible Jacobian shapes visually:
[Diagram of the four possible Jacobian shapes, depending on whether $f$ and $x$ are scalars or vectors: scalar $\frac{\partial f}{\partial x}$, horizontal vector $\frac{\partial f}{\partial \mathbf{x}}$, vertical vector $\frac{\partial \mathbf{f}}{\partial x}$, and matrix $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$.]
The Jacobian of the identity function $\mathbf{f}(\mathbf{x}) = \mathbf{x}$, with $f_i(\mathbf{x}) = x_i$, has $n$ functions and each function has $n$ parameters held in a single vector $\mathbf{x}$. The Jacobian is, therefore, a square matrix since $m = n$:
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial}{\partial \mathbf{x}} f_1(\mathbf{x}) \\ \vdots \\ \frac{\partial}{\partial \mathbf{x}} f_n(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial x_1}{\partial x_1} & \cdots & \frac{\partial x_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial x_n}{\partial x_1} & \cdots & \frac{\partial x_n}{\partial x_n} \end{bmatrix}$$
and since $\frac{\partial}{\partial x_j} x_i = 0$ for $j \neq i$, the off-diagonal entries vanish:
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = I \qquad (I \text{ is the identity matrix with ones down the diagonal})$$
Make sure that you can derive each step above before moving on. If you get stuck, just consider each element of the matrix in isolation and apply the usual scalar derivative rules.
That is a generally useful trick: Reduce vector expressions down to a set of scalar expressions and then take all of the partials, combining the results appropriately into vectors and matrices at the end.
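Applying that trick to the identity function: compute every scalar partial $\frac{\partial x_i}{\partial x_j}$ numerically and reassemble them, and the result is the identity matrix (a small sketch of ours):

```python
# The Jacobian of f(x) = x, built one scalar partial at a time.
def identity(xs):
    return list(xs)

def jacobian(fn, xs, h=1e-6):
    n = len(xs)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):             # wiggle parameter x_j
        hi, lo = list(xs), list(xs)
        hi[j] += h
        lo[j] -= h
        fhi, flo = fn(hi), fn(lo)
        for i in range(n):         # observe output y_i
            J[i][j] = (fhi[i] - flo[i]) / (2 * h)
    return J

J = jacobian(identity, [1.0, 2.0, 3.0])   # expect the 3x3 identity
```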
Also be careful to track whether a vector is vertical, $\mathbf{x}$, or horizontal, $\mathbf{x}^T$, where $\mathbf{x}^T$ means the transpose of $\mathbf{x}$. Also make sure you pay attention to whether something is a scalar-valued function, $y = f(\mathbf{x})$, or a vector of functions (or a vector-valued function), $\mathbf{y} = \mathbf{f}(\mathbf{x})$.
Derivatives of vector element-wise binary operators
Element-wise binary operations on vectors, such as vector addition $\mathbf{w} + \mathbf{x}$, are important because we can express many common vector operations, such as the multiplication of a vector by a scalar, as element-wise binary operations. By “element-wise binary operations” we simply mean applying an operator to the first item of each vector to get the first item of the output, then to the second items of the inputs for the second item of the output, and so forth. This is how all the basic math operators are applied by default in numpy or tensorflow, for example. Examples that often crop up in deep learning are $max(\mathbf{w}, \mathbf{x})$ and $\mathbf{w} > \mathbf{x}$ (returns a vector of ones and zeros).
We can generalize the element-wise binary operations with notation $\mathbf{y} = \mathbf{f}(\mathbf{w}) \bigcirc \mathbf{g}(\mathbf{x})$ where $m = n = |\mathbf{y}| = |\mathbf{w}| = |\mathbf{x}|$. (Reminder: $|\mathbf{x}|$ is the number of items in $\mathbf{x}$.) The symbol $\bigcirc$ represents any element-wise operator (such as $+$) and not the function composition operator. Here's what the equation $\mathbf{y} = \mathbf{f}(\mathbf{w}) \bigcirc \mathbf{g}(\mathbf{x})$ looks like when we zoom in to examine the scalar equations:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x}) \\ f_2(\mathbf{w}) \bigcirc g_2(\mathbf{x}) \\ \vdots \\ f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x}) \end{bmatrix}$$
where we write $n$ (not $m$) equations vertically to emphasize the fact that the result of element-wise operators gives $n \times 1$ sized vector results.
Using the ideas from the last section, we can see that the general case for the Jacobian with respect to $\mathbf{w}$ is the square matrix:
$$J_\mathbf{w} = \frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \begin{bmatrix} \frac{\partial}{\partial w_1}(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})) & \cdots & \frac{\partial}{\partial w_n}(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})) \\ \vdots & & \vdots \\ \frac{\partial}{\partial w_1}(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})) & \cdots & \frac{\partial}{\partial w_n}(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})) \end{bmatrix}$$
and the Jacobian with respect to $\mathbf{x}$ is:
$$J_\mathbf{x} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial}{\partial x_1}(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})) & \cdots & \frac{\partial}{\partial x_n}(f_1(\mathbf{w}) \bigcirc g_1(\mathbf{x})) \\ \vdots & & \vdots \\ \frac{\partial}{\partial x_1}(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})) & \cdots & \frac{\partial}{\partial x_n}(f_n(\mathbf{w}) \bigcirc g_n(\mathbf{x})) \end{bmatrix}$$
That's quite a furball, but fortunately the Jacobian is very often a diagonal matrix, a matrix that is zero everywhere but the diagonal. Because this greatly simplifies the Jacobian, let's examine in detail when the Jacobian reduces to a diagonal matrix for element-wise operations.
In a diagonal Jacobian, all elements off the diagonal are zero, $\frac{\partial}{\partial w_j}(f_i(\mathbf{w}) \bigcirc g_i(\mathbf{x})) = 0$ where $j \neq i$. (Notice that we are taking the partial derivative with respect to $w_j$ not $w_i$.) Under what conditions are those off-diagonal elements zero? Precisely when $f_i$ and $g_i$ are constants with respect to $w_j$: $\frac{\partial}{\partial w_j}f_i(\mathbf{w}) = \frac{\partial}{\partial w_j}g_i(\mathbf{x}) = 0$. Regardless of the operator, if those partial derivatives go to zero, the operation goes to zero no matter what, and the partial derivative of a constant is zero.
Those partials go to zero when $f_i$ and $g_i$ are not functions of $w_j$. We know that element-wise operations imply that $f_i$ is purely a function of $w_i$ and $g_i$ is purely a function of $x_i$. For example, $\mathbf{w} + \mathbf{x}$ sums $w_i + x_i$. Consequently, $f_i(\mathbf{w}) \bigcirc g_i(\mathbf{x})$ reduces to $f_i(w_i) \bigcirc g_i(x_i)$ and the goal becomes $\frac{\partial}{\partial w_j}(f_i(w_i) \bigcirc g_i(x_i))$. $f_i(w_i)$ and $g_i(x_i)$ look like constants to the partial differentiation operator with respect to $w_j$ when $j \neq i$, so the partials are zero off the diagonal. (Notation $f_i(w_i)$ is technically an abuse of our notation because $f_i$ and $g_i$ are functions of vectors not individual elements. We should really write something like $\hat{f}_i(w_i) = f_i(\mathbf{w})$, but that would muddy the equations further, and programmers are comfortable overloading functions, so we'll proceed with the notation anyway.)
We'll take advantage of this simplification later and refer to the constraint that $f_i(\mathbf{w})$ and $g_i(\mathbf{x})$ access at most $w_i$ and $x_i$, respectively, as the element-wise diagonal condition.
Under this condition, the elements along the diagonal of the Jacobian are $\frac{\partial}{\partial w_i}(f_i(w_i) \bigcirc g_i(x_i))$:
$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \begin{bmatrix} \frac{\partial}{\partial w_1}(f_1(w_1) \bigcirc g_1(x_1)) & & 0 \\ & \ddots & \\ 0 & & \frac{\partial}{\partial w_n}(f_n(w_n) \bigcirc g_n(x_n)) \end{bmatrix}$$
(The large “0”s are a shorthand indicating all of the off-diagonal elements are 0.)
More succinctly, we can write:
$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}} = diag\left(\frac{\partial}{\partial w_1}(f_1(w_1) \bigcirc g_1(x_1)),\ \ldots,\ \frac{\partial}{\partial w_n}(f_n(w_n) \bigcirc g_n(x_n))\right)$$
and
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = diag\left(\frac{\partial}{\partial x_1}(f_1(w_1) \bigcirc g_1(x_1)),\ \ldots,\ \frac{\partial}{\partial x_n}(f_n(w_n) \bigcirc g_n(x_n))\right)$$
where $diag(\mathbf{x})$ constructs a matrix whose diagonal elements are taken from vector $\mathbf{x}$. $I$ represents the square identity matrix of appropriate dimensions that is zero everywhere but the diagonal, which contains all ones. The $T$ exponent of $\mathbf{x}^T$ represents the transpose of the indicated vector. In this case, it flips a vertical vector to a horizontal vector.
Because we do lots of simple vector arithmetic, the general function $\mathbf{f}(\mathbf{w})$ in the binary element-wise operation is often just the vector $\mathbf{w}$. Any time the general function is a vector, we know that $f_i(\mathbf{w})$ reduces to $f_i(w_i) = w_i$. For example, vector addition $\mathbf{w} + \mathbf{x}$ fits our element-wise diagonal condition because $\mathbf{f}(\mathbf{w}) + \mathbf{g}(\mathbf{x})$ has scalar equations $y_i = f_i(\mathbf{w}) + g_i(\mathbf{x})$ that reduce to just $y_i = w_i + x_i$ with partial derivatives:
$$\frac{\partial}{\partial w_i}(w_i + x_i) = 1 + 0 = 1 \qquad \frac{\partial}{\partial x_i}(w_i + x_i) = 0 + 1 = 1$$
That gives us $\frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = I$, the identity matrix, because every element along the diagonal is 1.
Given the simplicity of this special case, $f_i(\mathbf{w})$ reducing to $f_i(w_i) = w_i$, you should be able to derive the Jacobians for the common element-wise binary operations on vectors:
$$\frac{\partial}{\partial \mathbf{w}}(\mathbf{w} + \mathbf{x}) = I \qquad \frac{\partial}{\partial \mathbf{w}}(\mathbf{w} - \mathbf{x}) = I \qquad \frac{\partial}{\partial \mathbf{w}}(\mathbf{w} \otimes \mathbf{x}) = diag(\mathbf{x}) \qquad \frac{\partial}{\partial \mathbf{w}}(\mathbf{w} \oslash \mathbf{x}) = diag\left(\ldots \frac{1}{x_i} \ldots\right)$$
The $\otimes$ and $\oslash$ operators are element-wise multiplication and division; $\otimes$ is sometimes called the Hadamard product. There isn't a standard notation for element-wise multiplication and division so we're using an approach consistent with our general binary operation notation.
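One of those Jacobians, checked numerically: for the element-wise (Hadamard) product $\mathbf{w} \otimes \mathbf{x}$, the Jacobian with respect to $\mathbf{w}$ comes out as $diag(\mathbf{x})$ (our illustrative code, not library code):

```python
# Jacobian of y = w (*) x with respect to w, via finite differences.
x = [2.0, 5.0, 7.0]

def hadamard(w):
    return [wi * xi for wi, xi in zip(w, x)]

def jacobian(fn, ws, h=1e-6):
    n = len(ws)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        hi, lo = list(ws), list(ws)
        hi[j] += h
        lo[j] -= h
        fhi, flo = fn(hi), fn(lo)
        for i in range(n):
            J[i][j] = (fhi[i] - flo[i]) / (2 * h)
    return J

J = jacobian(hadamard, [1.0, 1.0, 1.0])   # expect diag(2, 5, 7)
```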
Derivatives involving scalar expansion
When we multiply or add scalars to vectors, we're implicitly expanding the scalar to a vector and then performing an element-wise binary operation. For example, adding scalar $z$ to vector $\mathbf{x}$, $\mathbf{y} = \mathbf{x} + z$, is really $\mathbf{y} = \mathbf{f}(\mathbf{x}) + \mathbf{g}(z)$ where $\mathbf{f}(\mathbf{x}) = \mathbf{x}$ and $\mathbf{g}(z) = \vec{1}z$. (The notation $\vec{1}$ represents a vector of ones of appropriate length.) $z$ is any scalar that doesn't depend on $\mathbf{x}$, which is useful because then $\frac{\partial z}{\partial x_i} = 0$ for any $x_i$ and that will simplify our partial derivative computations. (It's okay to think of variable $z$ as a constant for our discussion here.) Similarly, multiplying by a scalar, $\mathbf{y} = \mathbf{x}z$, is really $\mathbf{y} = \mathbf{f}(\mathbf{x}) \otimes \mathbf{g}(z) = \mathbf{x} \otimes \vec{1}z$ where $\otimes$ is the element-wise multiplication (Hadamard product) of the two vectors.
The partial derivatives of vector-scalar addition and multiplication with respect to vector $\mathbf{x}$ use our element-wise rule:
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = diag\left(\ldots \frac{\partial}{\partial x_i}(f_i(x_i) \bigcirc g_i(z)) \ldots\right)$$
This follows because functions $\mathbf{f}(\mathbf{x}) = \mathbf{x}$ and $\mathbf{g}(z) = \vec{1}z$ clearly satisfy our element-wise diagonal condition for the Jacobian (that $f_i(\mathbf{x})$ refer at most to $x_i$ and $g_i(z)$ refers to the $i^{th}$ value of the $\vec{1}z$ vector).
Using the usual rules for scalar partial derivatives, we arrive at the following diagonal elements of the Jacobian for vector-scalar addition:
$$\frac{\partial}{\partial x_i}(x_i + z) = \frac{\partial x_i}{\partial x_i} + \frac{\partial z}{\partial x_i} = 1 + 0 = 1$$
So, $\frac{\partial}{\partial \mathbf{x}}(\mathbf{x} + z) = diag(\vec{1}) = I$.
Computing the partial derivative with respect to the scalar parameter $z$, however, results in a vertical vector, not a diagonal matrix. The elements of the vector are:
$$\frac{\partial}{\partial z}(x_i + z) = \frac{\partial x_i}{\partial z} + \frac{\partial z}{\partial z} = 0 + 1 = 1$$
Therefore, $\frac{\partial}{\partial z}(\mathbf{x} + z) = \vec{1}$.
The diagonal elements of the Jacobian for vector-scalar multiplication involve the product rule for scalar derivatives:
$$\frac{\partial}{\partial x_i}(x_i z) = x_i \frac{\partial z}{\partial x_i} + z \frac{\partial x_i}{\partial x_i} = 0 + z = z$$
So, $\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}z) = diag(\vec{1}z) = Iz$.
The partial derivative with respect to scalar parameter $z$ is a vertical vector whose elements are:
$$\frac{\partial}{\partial z}(x_i z) = x_i \frac{\partial z}{\partial z} + z \frac{\partial x_i}{\partial z} = x_i + 0 = x_i$$
This gives us $\frac{\partial}{\partial z}(\mathbf{x}z) = \mathbf{x}$.
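Both scalar-expansion results are easy to confirm with finite differences (illustrative code of ours):

```python
# For y = x*z: each diagonal entry of dy/dx is z, and dy/dz is x.
h = 1e-6
z = 3.0
x = [1.0, 4.0]

# Diagonal of dy/dx: wiggle x_i and watch y_i = x_i * z; expect z.
diag = []
for i in range(len(x)):
    d = ((x[i] + h) * z - (x[i] - h) * z) / (2 * h)
    diag.append(d)

# dy/dz: wiggle z and watch each y_i; expect x_i.
dz = [(xi * (z + h) - xi * (z - h)) / (2 * h) for xi in x]
```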
Vector sum reduction
Summing up the elements of a vector is an important operation in deep learning, such as the network loss function, but we can also use it as a way to simplify computing the derivative of vector dot product and other operations that reduce vectors to scalars.
Let $y = sum(\mathbf{f}(\mathbf{x})) = \sum_{i=1}^{n} f_i(\mathbf{x})$. Notice we were careful here to leave the parameter as a vector $\mathbf{x}$ because each function $f_i$ could use all values in the vector, not just $x_i$. The sum is over the results of the function and not the parameter. The gradient ($1 \times n$ Jacobian) of vector summation is:
$$\frac{\partial y}{\partial \mathbf{x}} = \left[\frac{\partial y}{\partial x_1}, \ldots, \frac{\partial y}{\partial x_n}\right] = \left[\sum_i \frac{\partial f_i(\mathbf{x})}{\partial x_1}, \ldots, \sum_i \frac{\partial f_i(\mathbf{x})}{\partial x_n}\right]$$
(The summation inside the gradient elements can be tricky so make sure to keep your notation consistent.)
Let's look at the gradient of the simple $y = sum(\mathbf{x})$. The function inside the summation is just $f_i(\mathbf{x}) = x_i$ and the gradient is then:
$$\nabla y = \left[\sum_i \frac{\partial x_i}{\partial x_1}, \ldots, \sum_i \frac{\partial x_i}{\partial x_n}\right]$$
Because $\frac{\partial x_i}{\partial x_j} = 0$ for $j \neq i$, we can simplify to:
$$\nabla y = \left[\frac{\partial x_1}{\partial x_1}, \ldots, \frac{\partial x_n}{\partial x_n}\right] = [1, \ldots, 1] = \vec{1}^T$$
Notice that the result is a horizontal vector full of 1s, not a vertical vector, and so the gradient is $\vec{1}^T$. It's very important to keep the shape of all of your vectors and matrices in order otherwise it's impossible to compute the derivatives of complex functions.
As another example, let's sum the result of multiplying a vector by a constant scalar. If then . The gradient is:
The derivative with respect to scalar variable z is :
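A quick numeric check of the sum-reduction results, with arbitrary illustrative values for x and z:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = 4.0

# y = sum(x*z): the gradient w.r.t. x is the horizontal vector [z, ..., z];
# the derivative w.r.t. z is sum(x).
grad_x = np.full((1, len(x)), z)
dy_dz = x.sum()

# Finite-difference check.
eps = 1e-6
y = np.sum(x * z)
for i in range(len(x)):
    dx = np.zeros_like(x)
    dx[i] = eps
    assert abs((np.sum((x + dx) * z) - y) / eps - grad_x[0, i]) < 1e-4
assert abs((np.sum(x * (z + eps)) - y) / eps - dy_dz) < 1e-4
```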
The Chain Rules
We can't compute partial derivatives of very complicated functions using just the basic matrix calculus rules we've seen so far. For example, we can't take the derivative of nested expressions like $\operatorname{sum}(\mathbf{w} + \mathbf{x})$ directly without reducing them to their scalar equivalents. We need to be able to combine our basic vector rules using what we can call the vector chain rule. Unfortunately, there are a number of rules for differentiation that fall under the name "chain rule" so we have to be careful which chain rule we're talking about. Part of our goal here is to clearly define and name three different chain rules and indicate in which situation they are appropriate. To get warmed up, we'll start with what we'll call the single-variable chain rule, where we want the derivative of a scalar function with respect to a scalar. Then we'll move on to an important concept called the total derivative and use it to define what we'll pedantically call the single-variable total-derivative chain rule. Then, we'll be ready for the vector chain rule in its full glory as needed for neural networks.
The chain rule is conceptually a divide and conquer strategy (like Quicksort) that breaks complicated expressions into subexpressions whose derivatives are easier to compute. Its power derives from the fact that we can process each simple subexpression in isolation yet still combine the intermediate results to get the correct overall result.
The chain rule comes into play when we need the derivative of an expression composed of nested subexpressions. For example, we need the chain rule when confronted with expressions like $\frac{d}{dx}\sin(x^2)$. The outermost expression takes the sin of an intermediate result, a nested subexpression that squares $x$. Specifically, we need the single-variable chain rule, so let's start by digging into that in more detail.
Single-variable chain rule
Let's start with the solution to the derivative of our nested expression: $\frac{d}{dx}\sin(x^2) = 2x\cos(x^2)$. It doesn't take a mathematical genius to recognize components of the solution that smack of scalar differentiation rules, $\frac{d}{dx}x^2 = 2x$ and $\frac{d}{du}\sin(u) = \cos(u)$. It looks like the solution is to multiply the derivative of the outer expression by the derivative of the inner expression or "chain the pieces together," which is exactly right. In this section, we'll explore the general principle at work and provide a process that works for highly-nested expressions of a single variable.
Chain rules are typically defined in terms of nested functions, such as $y = f(g(x))$ for single-variable chain rules. (You will also see the chain rule defined using function composition $(f \circ g)(x) = f(g(x))$, which is the same thing.) Some sources write the derivative using shorthand notation $y' = f'(g(x))g'(x)$, but that hides the fact that we are introducing an intermediate variable: $u = g(x)$, which we'll see shortly. It's better to define the single-variable chain rule of $f(g(x))$ explicitly so we never take the derivative with respect to the wrong variable. Here is the formulation of the single-variable chain rule we recommend:

$$\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$$
To deploy the single-variable chain rule, follow these steps:
1. Introduce intermediate variables for nested subexpressions and subexpressions for both binary and unary operators; e.g., $\times$ is binary, $\sin(x)$ and other trigonometric functions are usually unary because there is a single operand. This step normalizes all equations to single operators or function applications.
2. Compute derivatives of the intermediate variables with respect to their parameters.
3. Combine all derivatives of intermediate variables by multiplying them together to get the overall result.
4. Substitute intermediate variables back in if any are referenced in the derivative equation.

The third step puts the "chain" in "chain rule" because it chains together intermediate results. Multiplying the intermediate derivatives together is the common theme among all variations of the chain rule.
Let's try this process on $y = \sin(x^2)$:

Introduce intermediate variables. Let $u = x^2$ represent subexpression $x^2$ (shorthand for $u(x) = x^2$). This gives us:

$$u = x^2, \qquad y = \sin(u)$$
The order of these subexpressions does not affect the answer, but we recommend working in the reverse order of operations dictated by the nesting (innermost to outermost). That way, expressions and derivatives are always functions of previously-computed elements.
Compute derivatives.

$$\frac{du}{dx} = 2x, \qquad \frac{dy}{du} = \cos(u)$$

Combine.

$$\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} = \cos(u)\,2x$$

Substitute.

$$\frac{dy}{dx} = \cos(x^2)\,2x = 2x\cos(x^2)$$

Notice how easy it is to compute the derivatives of the intermediate variables in isolation! The chain rule says it's legal to do that and tells us how to combine the intermediate results to get $\frac{dy}{dx}$.
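The same four-step process can be carried out mechanically in code. This sketch computes the derivative of $\sin(x^2)$ via the intermediate variable $u$ and verifies it with a central finite difference (the sample point 1.3 is arbitrary):

```python
import math

# y = sin(x**2): introduce intermediate variable u = x**2, so y = sin(u).
def f(x):
    return math.sin(x ** 2)

def df(x):
    u = x ** 2           # intermediate variable
    du_dx = 2 * x        # derivative of the inner expression
    dy_du = math.cos(u)  # derivative of the outer expression
    return dy_du * du_dx # combine: dy/dx = dy/du * du/dx

# Finite-difference check at an arbitrary sample point.
x, eps = 1.3, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
assert abs(numeric - df(x)) < 1e-6
```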
You can think of the combining step of the chain rule in terms of units canceling. If we let $y$ be miles, $x$ be the gallons in a gas tank, and $u$ be gallons, we can interpret $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$ as $\frac{\mathit{miles}}{\mathit{tank}} = \frac{\mathit{miles}}{\mathit{gallon}} \times \frac{\mathit{gallons}}{\mathit{tank}}$. The gallon denominator and numerator cancel.

Another way to think about the single-variable chain rule is to visualize the overall expression as a dataflow diagram or chain of operations (or abstract syntax tree for compiler people):
[Figure: sin-square.png — dataflow diagram for $y = \sin(x^2)$]

Changes to function parameter $x$ bubble up through a squaring operation then through a sin operation to change result $y$. You can think of $\frac{du}{dx}$ as "getting changes from $x$ to $u$" and $\frac{dy}{du}$ as "getting changes from $u$ to $y$." Getting from $x$ to $y$ requires an intermediate hop. The chain rule is, by convention, usually written from the output variable down to the parameter(s), $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$. But, the $x$-to-$y$ perspective would be more clear if we reversed the flow and used the equivalent $\frac{dy}{dx} = \frac{du}{dx}\frac{dy}{du}$.
Conditions under which the single-variable chain rule applies. Notice that there is a single dataflow path from $x$ to the root $y$. Changes in $x$ can influence output $y$ in only one way. That is the condition under which we can apply the single-variable chain rule. An easier condition to remember, though one that's a bit looser, is that none of the intermediate subexpression functions, $u(x)$ and $y(u)$, have more than one parameter. Consider $y(x) = x + x^2$, which would become $y(x, u) = x + u$ after introducing intermediate variable $u = x^2$. As we'll see in the next section, $y(x, u)$ has multiple paths from $x$ to $y$. To handle that situation, we'll deploy the single-variable total-derivative chain rule.
As an aside for those interested in automatic differentiation, papers and library documentation use terminology forward differentiation and backward differentiation (for use in the back-propagation algorithm). From a dataflow perspective, we are computing a forward differentiation because it follows the normal data flow direction. Backward differentiation, naturally, goes the other direction and we're asking how a change in the output would affect function parameter x. Because backward differentiation can determine changes in all function parameters at once, it turns out to be much more efficient for computing the derivative of functions with lots of parameters. Forward differentiation, on the other hand, must consider how a change in each parameter, in turn, affects the function output y. The following table emphasizes the order in which partial derivatives are computed for the two techniques.
Forward differentiation from $x$ to $y$: compute $\frac{du}{dx}$ before $\frac{dy}{du}$ in $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$.

Backward differentiation from $y$ to $x$: compute $\frac{dy}{du}$ before $\frac{du}{dx}$ in $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$.
Automatic differentiation is beyond the scope of this article, but we're setting the stage for a future article.
Many readers can solve $\frac{d}{dx}\sin(x^2)$ in their heads, but our goal is a process that will work even for very complicated expressions. This process is also how automatic differentiation works in libraries like PyTorch. So, by solving derivatives manually in this way, you're also learning how to define functions for custom neural networks in PyTorch.
With deeply nested expressions, it helps to think about deploying the chain rule the way a compiler unravels nested function calls like $f_4(f_3(f_2(f_1(x))))$ into a sequence (chain) of calls. The result of calling function $f_i$ is saved to a temporary variable called a register, which is then passed as a parameter to $f_{i+1}$. Let's see how that looks in practice by using our process on a highly-nested equation like $y = f(x) = \ln(\sin(x^3)^2)$:
Introduce intermediate variables.

$$u_1 = f_1(x) = x^3, \quad u_2 = f_2(u_1) = \sin(u_1), \quad u_3 = f_3(u_2) = u_2^2, \quad u_4 = f_4(u_3) = \ln(u_3) \quad (y = u_4)$$

Compute derivatives.

$$\frac{du_1}{dx} = 3x^2, \quad \frac{du_2}{du_1} = \cos(u_1), \quad \frac{du_3}{du_2} = 2u_2, \quad \frac{du_4}{du_3} = \frac{1}{u_3}$$

Combine four intermediate values.

$$\frac{dy}{dx} = \frac{du_4}{du_3}\frac{du_3}{du_2}\frac{du_2}{du_1}\frac{du_1}{dx} = \frac{1}{u_3}\,2u_2\cos(u_1)\,3x^2 = \frac{6 u_2 x^2 \cos(u_1)}{u_3}$$

Substitute.

$$\frac{dy}{dx} = \frac{6 \sin(x^3)\,x^2 \cos(x^3)}{\sin(x^3)^2} = \frac{6 x^2 \cos(x^3)}{\sin(x^3)}$$
Here is a visualization of the data flow through the chain of operations from x to y:
[Figure: chain-tree.png — dataflow through the chain of operations from $x$ to $y$]

At this point, we can handle derivatives of nested expressions of a single variable, $x$, using the chain rule but only if $x$ can affect $y$ through a single data flow path. To handle more complicated expressions, we need to extend our technique, which we'll do next.
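To make the register analogy concrete, here is a sketch of such a chain in Python, using $y = \ln(\sin(x^3)^2)$ as a stand-in deeply-nested expression. Each intermediate value plays the role of a register, and the derivative multiplies the chain of intermediate derivatives:

```python
import math

def f(x):
    # Chain of "registers," innermost to outermost.
    u1 = x ** 3
    u2 = math.sin(u1)
    u3 = u2 ** 2
    return math.log(u3)

def df(x):
    u1 = x ** 3
    u2 = math.sin(u1)
    u3 = u2 ** 2
    du1 = 3 * x ** 2    # du1/dx
    du2 = math.cos(u1)  # du2/du1
    du3 = 2 * u2        # du3/du2
    dy = 1 / u3         # dy/du3
    return dy * du3 * du2 * du1  # multiply the chain of derivatives

# Finite-difference check at an arbitrary sample point.
x, eps = 1.1, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
assert abs(numeric - df(x)) < 1e-4
```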
Single-variable total-derivative chain rule
Our single-variable chain rule has limited applicability because all intermediate variables must be functions of single variables. But, it demonstrates the core mechanism of the chain rule, that of multiplying out all derivatives of intermediate subexpressions. To handle more general expressions such as $y = f(x) = x + x^2$, however, we need to augment that basic chain rule.

Of course, we immediately see $\frac{dy}{dx} = 1 + 2x$, but that is using the scalar addition derivative rule, not the chain rule. If we tried to apply the single-variable chain rule, we'd get the wrong answer. In fact, the previous chain rule is meaningless in this case because the derivative operator $\frac{d}{dx}$ does not apply to multivariate functions, such as $u_2$ among our intermediate variables:

$$u_1(x) = x^2, \qquad u_2(x, u_1) = x + u_1 \quad (y = u_2)$$

Let's try it anyway to see what happens. If we pretend that $\frac{du_2}{du_1} = 1$ and $\frac{du_1}{dx} = 2x$, then $\frac{dy}{dx} = \frac{du_2}{du_1}\frac{du_1}{dx} = 2x$ instead of the right answer $1 + 2x$.
Because $u_2(x, u_1)$ has multiple parameters, partial derivatives come into play. Let's blindly apply the partial derivative operator to all of our equations and see what we get:

$$\frac{\partial u_1(x)}{\partial x} = 2x, \qquad \frac{\partial u_2(x, u_1)}{\partial x} = \frac{\partial}{\partial x}(x + u_1) = 1 + 0 = 1$$

Ooops! The partial $\frac{\partial u_2}{\partial x}$ is wrong because it violates a key assumption for partial derivatives. When taking the partial derivative with respect to $x$, the other variables must not vary as $x$ varies. Otherwise, we could not act as if the other variables were constants. Clearly, though, $u_1(x)$ is a function of $x$ and therefore varies with $x$: $\frac{\partial u_1(x)}{\partial x} \neq 0$ because $u_1(x) = x^2$. A quick look at the data flow diagram for $y = u_2(x, u_1)$ shows multiple paths from $x$ to $y$, thus, making it clear we need to consider direct and indirect (through $u_1(x)$) dependencies on $x$:
[Figure: plus-square.png — dataflow diagram for $y = x + x^2$ showing two paths from $x$ to $y$]

A change in $x$ affects $y$ both as an operand of the addition and as the operand of the square operator. Here's an equation that describes how tweaks to $x$ affect the output:

$$\hat{y} = (x + \Delta x) + (x + \Delta x)^2$$

Then, $\Delta y = \hat{y} - y$, which we can read as "the change in $y$ is the difference between the original $y$ and $y$ at a tweaked $x$."

If we let $x = 1$, then $y = 1 + 1^2 = 2$. If we bump $x$ by 1, $\Delta x = 1$, then $\hat{y} = (1 + 1) + (1 + 1)^2 = 2 + 4 = 6$. The change in $y$ is not 1, as $\frac{\partial u_2}{\partial x} = 1$ would lead us to believe, but $6 - 2 = 4$!
Enter the "law" of total derivatives, which basically says that to compute $\frac{dy}{dx}$, we need to sum up all possible contributions from changes in $x$ to the change in $y$. The total derivative with respect to $x$ assumes all variables, such as $u_1$ in this case, are functions of $x$ and potentially vary as $x$ varies. The total derivative of $f(x) = u_2(x, u_1)$ that depends on $x$ directly and indirectly via intermediate variable $u_1(x)$ is given by:

$$\frac{dy}{dx} = \frac{\partial f(x)}{\partial x} = \frac{\partial u_2}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}$$

Using this formula, we get the proper answer:

$$\frac{dy}{dx} = \frac{\partial u_2}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = 1 + 1 \times 2x = 1 + 2x$$

That is an application of what we can call the single-variable total-derivative chain rule:

$$\frac{\partial f(x, u_1, \ldots, u_n)}{\partial x} = \frac{\partial f}{\partial x} + \sum_{i=1}^{n} \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$
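A tiny sketch of the total derivative for $y = x + x^2$, keeping the direct and indirect contributions separate (the sample point is arbitrary):

```python
# Total derivative of y = x + x**2 via intermediate variable u = x**2:
# dy/dx = (direct partial w.r.t. x) + (partial w.r.t. u) * (du/dx)
def f(x):
    u = x ** 2
    return x + u

def df(x):
    direct = 1       # partial of (x + u) w.r.t. x, holding u fixed
    dy_du = 1        # partial of (x + u) w.r.t. u
    du_dx = 2 * x    # u = x**2 varies with x
    return direct + dy_du * du_dx

# Finite-difference check at an arbitrary sample point.
x, eps = 3.0, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
assert abs(numeric - df(x)) < 1e-5
```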
The total derivative assumes all variables are potentially codependent whereas the partial derivative assumes all variables but x are constants.
There is something subtle going on here with the notation. All of the derivatives are shown as partial derivatives because $f$ and the $u_i$ are functions of multiple variables. This notation mirrors MathWorld's but differs from Wikipedia's, which uses $\frac{df}{dx}$ instead (possibly to emphasize the total-derivative nature of the equation). We'll stick with the partial derivative notation so that it's consistent with our discussion of the vector chain rule in the next section.
In practice, just keep in mind that when you take the total derivative with respect to x, other variables might also be functions of x so add in their contributions as well. The left side of the equation looks like a typical partial derivative but the right-hand side is actually the total derivative. It's common, however, that many temporary variables are functions of a single parameter, which means that the single-variable total-derivative chain rule degenerates to the single-variable chain rule.
Let's look at a nested subexpression, such as $\sin(x + x^2)$. We introduce three intermediate variables:

$$u_1(x) = x^2, \qquad u_2(x, u_1) = x + u_1, \qquad u_3(u_2) = \sin(u_2) \quad (y = u_3)$$

and partials:

$$\frac{\partial u_1}{\partial x} = 2x, \qquad \frac{\partial u_2}{\partial x} = \frac{\partial x}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = 1 + 2x, \qquad \frac{\partial u_3}{\partial x} = \frac{\partial u_3}{\partial u_2}\frac{\partial u_2}{\partial x} = \cos(u_2)(1 + 2x) = \cos(x + x^2)(1 + 2x)$$

where both $\frac{\partial u_2}{\partial x}$ and $\frac{\partial u_3}{\partial x}$ have terms that take into account the total derivative.
Also notice that the total derivative formula always sums versus, say, multiplies terms $\frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$. It's tempting to think that summing up terms in the derivative makes sense because, for example, $y = x + x^2$ adds two terms. Nope. The total derivative is adding terms because it represents a weighted sum of all $x$ contributions to the change in $y$. For example, given $y = x \times x^2$ instead of $y = x + x^2$, the total-derivative chain rule formula still adds partial derivative terms. ($x \times x^2$ simplifies to $x^3$ but, for this demonstration, let's not combine the terms.) Here are the intermediate variables and partial derivatives:

$$u_1(x) = x^2, \qquad u_2(x, u_1) = x u_1 \quad (y = u_2)$$

$$\frac{\partial u_1}{\partial x} = 2x, \qquad \frac{\partial u_2}{\partial x} = u_1, \qquad \frac{\partial u_2}{\partial u_1} = x$$

The form of the total derivative remains the same, however:

$$\frac{dy}{dx} = \frac{\partial u_2}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = u_1 + x\,2x = x^2 + 2x^2 = 3x^2$$
It's the partials (weights) that change, not the formula, when the intermediate variable operators change.
Those readers with a strong calculus background might wonder why we aggressively introduce intermediate variables even for the non-nested subexpressions such as $x^2$ in $x + x^2$. We use this process for three reasons: (i) computing the derivatives for the simplified subexpressions is usually trivial, (ii) we can simplify the chain rule, and (iii) the process mirrors how automatic differentiation works in neural network libraries.
Using the intermediate variables even more aggressively, let's see how we can simplify our single-variable total-derivative chain rule to its final form. The goal is to get rid of the $\frac{\partial f}{\partial x}$ sticking out on the front like a sore thumb:

$$\frac{\partial f(x, u_1, \ldots, u_n)}{\partial x} = \frac{\partial f}{\partial x} + \sum_{i=1}^{n} \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$

We can achieve that by simply introducing a new temporary variable as an alias for $x$: $u_{n+1} = x$. Then, the formula reduces to our final form:

$$\frac{\partial f(u_1, \ldots, u_{n+1})}{\partial x} = \sum_{i=1}^{n+1} \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$

This chain rule that takes into consideration the total derivative degenerates to the single-variable chain rule when all intermediate variables are functions of a single variable. Consequently, you can remember this more general formula to cover both cases. As a bit of dramatic foreshadowing, notice that the summation sure looks like a vector dot product, $\frac{\partial f}{\partial \mathbf{u}} \cdot \frac{\partial \mathbf{u}}{\partial x}$, or a vector multiply $\frac{\partial f}{\partial \mathbf{u}}\frac{\partial \mathbf{u}}{\partial x}$.
Before we move on, a word of caution about terminology on the web. Unfortunately, the chain rule given in this section, based upon the total derivative, is universally called the "multivariable chain rule" in calculus discussions, which is highly misleading! Only the intermediate variables are multivariate functions. The overall function, say, $f(x) = x + x^2$, is a scalar function that accepts a single parameter $x$. The derivative and parameter are scalars, not vectors, as one would expect with a so-called multivariate chain rule. (Within the context of a non-matrix calculus class, "multivariable chain rule" is likely unambiguous.) To reduce confusion, we use "single-variable total-derivative chain rule" to spell out the distinguishing feature between the simple single-variable chain rule, $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$, and this one.
Vector chain rule
Now that we've got a good handle on the total-derivative chain rule, we're ready to tackle the chain rule for vectors of functions and vector variables. Surprisingly, this more general chain rule is just as simple looking as the single-variable chain rule for scalars. Rather than just presenting the vector chain rule, let's rediscover it ourselves so we get a firm grip on it. We can start by computing the derivative of a sample vector function with respect to a scalar, $\mathbf{y} = \mathbf{f}(x)$, to see if we can abstract a general formula:

$$\begin{bmatrix} y_1(x) \\ y_2(x) \end{bmatrix} = \begin{bmatrix} f_1(x) \\ f_2(x) \end{bmatrix} = \begin{bmatrix} \ln(x^2) \\ \sin(3x) \end{bmatrix}$$

Let's introduce two intermediate variables, $g_1$ and $g_2$, one for each $f_i$, so that $\mathbf{y}$ looks more like $\mathbf{y} = \mathbf{f}(\mathbf{g}(x))$:

$$\begin{bmatrix} g_1(x) \\ g_2(x) \end{bmatrix} = \begin{bmatrix} x^2 \\ 3x \end{bmatrix}, \qquad \begin{bmatrix} f_1(\mathbf{g}) \\ f_2(\mathbf{g}) \end{bmatrix} = \begin{bmatrix} \ln(g_1) \\ \sin(g_2) \end{bmatrix}$$

The derivative of vector $\mathbf{y}$ with respect to scalar $x$ is a vertical vector with elements computed using the single-variable total-derivative chain rule:

$$\frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_1}{\partial g_2}\frac{\partial g_2}{\partial x} \\[4pt] \frac{\partial f_2}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_2}{\partial g_2}\frac{\partial g_2}{\partial x} \end{bmatrix} = \begin{bmatrix} \frac{1}{g_1}2x + 0 \\[4pt] 0 + \cos(g_2)\,3 \end{bmatrix} = \begin{bmatrix} \frac{2x}{x^2} \\[4pt] 3\cos(3x) \end{bmatrix} = \begin{bmatrix} \frac{2}{x} \\[4pt] 3\cos(3x) \end{bmatrix}$$
Ok, so now we have the answer using just the scalar rules, albeit with the derivatives grouped into a vector. Let's try to abstract from that result what it looks like in vector form. The goal is to convert the following vector of scalar operations to a vector operation.
If we split the terms, isolating the $\frac{\partial g_i}{\partial x}$ terms into a vector, we get a matrix by vector multiplication:

$$\frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2} \\[4pt] \frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2} \end{bmatrix} \begin{bmatrix} \frac{\partial g_1}{\partial x} \\[4pt] \frac{\partial g_2}{\partial x} \end{bmatrix} = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial x}$$

That means that the Jacobian is the multiplication of two other Jacobians, which is kinda cool. Let's check our results:

$$\frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial x} = \begin{bmatrix} \frac{1}{g_1} & 0 \\ 0 & \cos(g_2) \end{bmatrix} \begin{bmatrix} 2x \\ 3 \end{bmatrix} = \begin{bmatrix} \frac{2x}{g_1} \\[4pt] 3\cos(g_2) \end{bmatrix} = \begin{bmatrix} \frac{2}{x} \\[4pt] 3\cos(3x) \end{bmatrix}$$
Whew! We get the same answer as the scalar approach. This vector chain rule for vectors of functions and a single parameter appears to be correct and, indeed, mirrors the single-variable chain rule. Compare the vector rule:

$$\frac{\partial}{\partial x}\mathbf{f}(\mathbf{g}(x)) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial x}$$

with the single-variable chain rule:

$$\frac{d}{dx}f(g(x)) = \frac{df}{dg}\frac{dg}{dx}$$

To make this formula work for multiple parameters or vector $\mathbf{x}$, we just have to change $x$ to vector $\mathbf{x}$ in the equation. The effect is that $\frac{\partial \mathbf{g}}{\partial \mathbf{x}}$ and the resulting Jacobian, $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$, are now matrices instead of vertical vectors. Our complete vector chain rule is:

$$\frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}}$$
The beauty of the vector formula over the single-variable chain rule is that it automatically takes into consideration the total derivative while maintaining the same notational simplicity. The Jacobian $\frac{\partial \mathbf{f}}{\partial \mathbf{g}}$ contains all possible combinations of $f_i$ with respect to $g_j$, and $\frac{\partial \mathbf{g}}{\partial \mathbf{x}}$ of $g_i$ with respect to $x_j$. For completeness, here are the two Jacobian components in their full glory:

$$\frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = \begin{bmatrix} \frac{\partial f_1}{\partial g_1} & \cdots & \frac{\partial f_1}{\partial g_k} \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial g_1} & \cdots & \frac{\partial f_m}{\partial g_k} \end{bmatrix} \begin{bmatrix} \frac{\partial g_1}{\partial x_1} & \cdots & \frac{\partial g_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial g_k}{\partial x_1} & \cdots & \frac{\partial g_k}{\partial x_n} \end{bmatrix}$$

where $m = |\mathbf{f}|$, $n = |\mathbf{x}|$, and $k = |\mathbf{g}|$. The resulting Jacobian is $m \times n$ (an $m \times k$ matrix multiplied by a $k \times n$ matrix).
Even within this formula, we can simplify further because, for many applications, the Jacobians are square ($m = n = k$) and the off-diagonal entries are zero. It is the nature of neural networks that the associated mathematics deals with functions of vectors not vectors of functions. For example, the neuron affine function has term $\operatorname{sum}(\mathbf{w} \otimes \mathbf{x})$ and the activation function is $\max(0, \mathbf{w} \cdot \mathbf{x} + b)$; we'll consider derivatives of these functions in the next section.

As we saw in a previous section, element-wise operations on vectors $\mathbf{w}$ and $\mathbf{x}$ yield diagonal Jacobians because the $i$-th result is a function purely of $x_i$ but not $x_j$ for $j \neq i$. The same thing happens here when $f_i$ is purely a function of $g_i$ and $g_i$ is purely a function of $x_i$:

$$\frac{\partial \mathbf{f}}{\partial \mathbf{g}} = \operatorname{diag}\!\left(\frac{\partial f_i}{\partial g_i}\right), \qquad \frac{\partial \mathbf{g}}{\partial \mathbf{x}} = \operatorname{diag}\!\left(\frac{\partial g_i}{\partial x_i}\right)$$

In this situation, the vector chain rule simplifies to:

$$\frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = \operatorname{diag}\!\left(\frac{\partial f_i}{\partial g_i}\right)\operatorname{diag}\!\left(\frac{\partial g_i}{\partial x_i}\right) = \operatorname{diag}\!\left(\frac{\partial f_i}{\partial g_i}\frac{\partial g_i}{\partial x_i}\right)$$

Therefore, the Jacobian reduces to a diagonal matrix whose elements are the single-variable chain rule values.
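As an illustration of the diagonal case, here is a sketch using the element-wise composition $g_i(\mathbf{x}) = x_i^2$ and $f_i(\mathbf{g}) = \sin(g_i)$ (an arbitrary illustrative choice): both Jacobians are diagonal, and their product matches a finite-difference Jacobian:

```python
import numpy as np

x = np.array([0.5, 1.0, 1.5])

# Element-wise composition: g_i(x) = x_i**2, f_i(g) = sin(g_i).
g = x ** 2
J_f = np.diag(np.cos(g))  # df/dg, diagonal
J_g = np.diag(2 * x)      # dg/dx, diagonal
J = J_f @ J_g             # product is diag(cos(x_i**2) * 2*x_i)

# Finite-difference check of the full Jacobian, column by column.
def F(v):
    return np.sin(v ** 2)

eps = 1e-6
for j in range(len(x)):
    dx = np.zeros_like(x)
    dx[j] = eps
    col = (F(x + dx) - F(x - dx)) / (2 * eps)
    assert np.allclose(col, J[:, j], atol=1e-6)
```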
After slogging through all of that mathematics, here's the payoff. All you need is the vector chain rule because the single-variable formulas are special cases of the vector chain rule. The following table summarizes the appropriate components to multiply in order to get the Jacobian.
For $\frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = \frac{\partial \mathbf{f}}{\partial \mathbf{u}}\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$, the appropriate components by shape are:

- scalar $f$, scalar $u$, scalar $x$: $\frac{\partial f}{\partial u}$ (scalar) times $\frac{\partial u}{\partial x}$ (scalar)
- scalar $f$, vector $\mathbf{u}$, scalar $x$: $\frac{\partial f}{\partial \mathbf{u}}$ (horizontal vector) times $\frac{\partial \mathbf{u}}{\partial x}$ (vertical vector)
- scalar $f$, vector $\mathbf{u}$, vector $\mathbf{x}$: $\frac{\partial f}{\partial \mathbf{u}}$ (horizontal vector) times $\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$ (matrix)
- vector $\mathbf{f}$, scalar $u$, scalar $x$: $\frac{\partial \mathbf{f}}{\partial u}$ (vertical vector) times $\frac{\partial u}{\partial x}$ (scalar)
- vector $\mathbf{f}$, vector $\mathbf{u}$, scalar $x$: $\frac{\partial \mathbf{f}}{\partial \mathbf{u}}$ (matrix) times $\frac{\partial \mathbf{u}}{\partial x}$ (vertical vector)
- vector $\mathbf{f}$, vector $\mathbf{u}$, vector $\mathbf{x}$: $\frac{\partial \mathbf{f}}{\partial \mathbf{u}}$ (matrix) times $\frac{\partial \mathbf{u}}{\partial \mathbf{x}}$ (matrix)

The gradient of neuron activation
We now have all of the pieces needed to compute the derivative of a typical neuron activation for a single neural network computation unit with respect to the model parameters, $\mathbf{w}$ and $b$:

$$\operatorname{activation}(\mathbf{x}) = \max(0, \mathbf{w} \cdot \mathbf{x} + b)$$
(This represents a neuron with fully connected weights and rectified linear unit activation. There are, however, other affine functions such as convolution and other activation functions, such as exponential linear units, that follow similar logic.)
Let's worry about max later and focus on computing $\frac{\partial}{\partial \mathbf{w}}(\mathbf{w} \cdot \mathbf{x} + b)$ and $\frac{\partial}{\partial b}(\mathbf{w} \cdot \mathbf{x} + b)$. (Recall that neural networks learn through optimization of their weights and biases.) We haven't discussed the derivative of the dot product yet, $y = \mathbf{w} \cdot \mathbf{x}$, but we can use the chain rule to avoid having to memorize yet another rule. (Note notation $y$ not $\mathbf{y}$ as the result is a scalar not a vector.)

The dot product $\mathbf{w} \cdot \mathbf{x}$ is just the summation of the element-wise multiplication of the elements: $y = \sum_{i=1}^{n} w_i x_i = \operatorname{sum}(\mathbf{w} \otimes \mathbf{x})$. (You might also find it useful to remember the linear algebra notation $\mathbf{w} \cdot \mathbf{x} = \mathbf{w}^T \mathbf{x}$.) We know how to compute the partial derivatives of $\operatorname{sum}(\mathbf{x})$ and $\mathbf{w} \otimes \mathbf{x}$ but haven't looked at partial derivatives for $\operatorname{sum}(\mathbf{w} \otimes \mathbf{x})$. We need the chain rule for that and so we can introduce an intermediate vector variable $\mathbf{u}$ just as we did using the single-variable chain rule:

$$\mathbf{u} = \mathbf{w} \otimes \mathbf{x}, \qquad y = \operatorname{sum}(\mathbf{u})$$

Once we've rephrased $y$, we recognize two subexpressions for which we already know the partial derivatives:

$$\frac{\partial \mathbf{u}}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}}(\mathbf{w} \otimes \mathbf{x}) = \operatorname{diag}(\mathbf{x}), \qquad \frac{\partial y}{\partial \mathbf{u}} = \frac{\partial}{\partial \mathbf{u}}\operatorname{sum}(\mathbf{u}) = \vec{1}^T$$

The vector chain rule says to multiply the partials:

$$\frac{\partial y}{\partial \mathbf{w}} = \frac{\partial y}{\partial \mathbf{u}}\frac{\partial \mathbf{u}}{\partial \mathbf{w}} = \vec{1}^T \operatorname{diag}(\mathbf{x}) = \mathbf{x}^T$$
To check our results, we can grind the dot product down into a pure scalar function:

$$y = \mathbf{w} \cdot \mathbf{x} = \sum_{i=1}^{n} w_i x_i$$

Then:

$$\frac{\partial y}{\partial w_j} = \sum_i \frac{\partial}{\partial w_j} w_i x_i = x_j, \qquad \text{giving } \frac{\partial y}{\partial \mathbf{w}} = [x_1, \ldots, x_n] = \mathbf{x}^T$$
Hooray! Our scalar results match the vector chain rule results.
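The dot-product gradients, $\mathbf{x}^T$ with respect to $\mathbf{w}$ and (by symmetry) $\mathbf{w}^T$ with respect to $\mathbf{x}$, are easy to verify numerically; the vectors below are arbitrary illustrative values:

```python
import numpy as np

w = np.array([1.0, -2.0, 3.0])
x = np.array([0.5, 4.0, -1.0])

y = w @ x  # y = sum_i w_i * x_i

# Gradients are horizontal vectors: dy/dw = x^T and dy/dx = w^T.
grad_w = x.reshape(1, -1)
grad_x = w.reshape(1, -1)

# Finite-difference check, one coordinate at a time.
eps = 1e-6
for i in range(len(w)):
    dw = np.zeros_like(w)
    dw[i] = eps
    assert abs(((w + dw) @ x - y) / eps - grad_w[0, i]) < 1e-4
    dx = np.zeros_like(x)
    dx[i] = eps
    assert abs((w @ (x + dx) - y) / eps - grad_x[0, i]) < 1e-4
```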
Now, let $z(\mathbf{w}, b, \mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$, the full expression within the max activation function call. We have two different partials to compute, but we don't need the chain rule:

$$\frac{\partial z}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}}(\mathbf{w} \cdot \mathbf{x}) + \frac{\partial b}{\partial \mathbf{w}} = \mathbf{x}^T + \vec{0}^T = \mathbf{x}^T, \qquad \frac{\partial z}{\partial b} = \frac{\partial}{\partial b}(\mathbf{w} \cdot \mathbf{x}) + \frac{\partial b}{\partial b} = 0 + 1 = 1$$

Let's tackle the partials of the neuron activation, $\max(0, z)$. The use of the $\max(0, z)$ function call on scalar $z$ just says to treat all negative $z$ values as 0. The derivative of the max function is a piecewise function. When $z \leq 0$, the derivative is 0 because the max yields the constant 0. When $z > 0$, the derivative of the max function is just the derivative of $z$, which is 1:

$$\frac{\partial}{\partial z}\max(0, z) = \begin{cases} 0 & z \leq 0 \\ 1 & z > 0 \end{cases}$$
An aside on broadcasting functions across scalars. When one or both of the max arguments are vectors, such as $\max(0, \mathbf{x})$, we broadcast the single-variable function max across the elements. This is an example of an element-wise unary operator. Just to be clear:

$$\max(0, \mathbf{x}) = \begin{bmatrix} \max(0, x_1) \\ \max(0, x_2) \\ \vdots \\ \max(0, x_n) \end{bmatrix}$$

For the derivative of the broadcast version then, we get a vector of zeros and ones where:

$$\frac{\partial}{\partial x_i}\max(0, x_i) = \begin{cases} 0 & x_i \leq 0 \\ 1 & x_i > 0 \end{cases}$$
To get the derivative of the $\max(0, \mathbf{w} \cdot \mathbf{x} + b)$ function, we need the chain rule because of the nested subexpression, $\mathbf{w} \cdot \mathbf{x} + b$. Following our process, let's introduce intermediate scalar variable $z$ to represent the affine function, giving:

$$z(\mathbf{w}, b, \mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b, \qquad \operatorname{activation}(z) = \max(0, z)$$

The vector chain rule tells us:

$$\frac{\partial \operatorname{activation}}{\partial \mathbf{w}} = \frac{\partial \operatorname{activation}}{\partial z}\frac{\partial z}{\partial \mathbf{w}}$$

which we can rewrite as follows:

$$\frac{\partial \operatorname{activation}}{\partial \mathbf{w}} = \begin{cases} 0\,\frac{\partial z}{\partial \mathbf{w}} = \vec{0}^T & z \leq 0 \\ 1\,\frac{\partial z}{\partial \mathbf{w}} = \frac{\partial z}{\partial \mathbf{w}} = \mathbf{x}^T & z > 0 \end{cases}$$

and then substitute $z = \mathbf{w} \cdot \mathbf{x} + b$ back in:

$$\frac{\partial \operatorname{activation}}{\partial \mathbf{w}} = \begin{cases} \vec{0}^T & \mathbf{w} \cdot \mathbf{x} + b \leq 0 \\ \mathbf{x}^T & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases}$$
That equation matches our intuition. When the activation function clips affine function output $z$ to 0, the derivative is zero with respect to any weight $w_i$. When $z > 0$, it's as if the max function disappears and we get just the derivative of $z$ with respect to the weights.

Turning now to the derivative of the neuron activation with respect to $b$, we get:

$$\frac{\partial \operatorname{activation}}{\partial b} = \frac{\partial \operatorname{activation}}{\partial z}\frac{\partial z}{\partial b} = \begin{cases} 0 \cdot 1 = 0 & \mathbf{w} \cdot \mathbf{x} + b \leq 0 \\ 1 \cdot 1 = 1 & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases}$$
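Both piecewise partials can be checked numerically at an active point; the weights, input, and bias below are arbitrary illustrative values:

```python
import numpy as np

def activation(w, x, b):
    return max(0.0, w @ x + b)

# Piecewise partials of the ReLU neuron activation:
#   d/dw = 0^T if w.x + b <= 0, else x^T
#   d/db = 0   if w.x + b <= 0, else 1
def grad_w(w, x, b):
    return x.reshape(1, -1) if w @ x + b > 0 else np.zeros((1, len(w)))

def grad_b(w, x, b):
    return 1.0 if w @ x + b > 0 else 0.0

w = np.array([1.0, 2.0])
x = np.array([3.0, -0.5])
b = 0.25  # w.x + b = 2.25 > 0, so the neuron is active

# Finite-difference check at this active point.
eps = 1e-6
for i in range(len(w)):
    dw = np.zeros_like(w)
    dw[i] = eps
    numeric = (activation(w + dw, x, b) - activation(w, x, b)) / eps
    assert abs(numeric - grad_w(w, x, b)[0, i]) < 1e-4
numeric_b = (activation(w, x, b + eps) - activation(w, x, b)) / eps
assert abs(numeric_b - grad_b(w, x, b)) < 1e-4
```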
Let's use these partial derivatives now to handle the entire loss function.
The gradient of the neural network loss function
Training a neuron requires that we take the derivative of our loss or "cost" function with respect to the parameters of our model, $\mathbf{w}$ and $b$. Because we train with multiple vector inputs (e.g., multiple images) and scalar targets (e.g., one classification per image), we need some more notation. Let

$$X = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]^T$$

where $N = |X|$, and then let

$$\mathbf{y} = [\operatorname{target}(\mathbf{x}_1), \operatorname{target}(\mathbf{x}_2), \ldots, \operatorname{target}(\mathbf{x}_N)]^T$$

where $y_i$ is a scalar. Then the cost equation becomes:

$$C(\mathbf{w}, b, X, \mathbf{y}) = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \operatorname{activation}(\mathbf{x}_i)\big)^2 = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \max(0, \mathbf{w} \cdot \mathbf{x}_i + b)\big)^2$$

Following our chain rule process introduces these intermediate variables:

$$u(\mathbf{w}, b, \mathbf{x}) = \max(0, \mathbf{w} \cdot \mathbf{x} + b), \qquad v(y, u) = y - u, \qquad C(v) = \frac{1}{N}\sum_{i=1}^{N} v^2$$
Let's compute the gradient with respect to w first.
The gradient with respect to the weights
From before, we know:

$$\frac{\partial}{\partial \mathbf{w}} u(\mathbf{w}, b, \mathbf{x}) = \begin{cases} \vec{0}^T & \mathbf{w} \cdot \mathbf{x} + b \leq 0 \\ \mathbf{x}^T & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases}$$

and

$$\frac{\partial v(y, u)}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}}(y - u) = \vec{0}^T - \frac{\partial u}{\partial \mathbf{w}} = \begin{cases} \vec{0}^T & \mathbf{w} \cdot \mathbf{x} + b \leq 0 \\ -\mathbf{x}^T & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases}$$
Then, for the overall gradient, we get:
\begin{eqnarray}
\frac{\partial C(v)}{\partial \mathbf{w}} &=& \frac{\partial}{\partial \mathbf{w}} \frac{1}{N}\sum_{i=1}^N v^2\\
&=& \frac{1}{N}\sum_{i=1}^N \frac{\partial}{\partial \mathbf{w}} v^2\\
&=& \frac{1}{N}\sum_{i=1}^N \frac{\partial v^2}{\partial v}\frac{\partial v}{\partial \mathbf{w}}\\
&=& \frac{1}{N}\sum_{i=1}^N 2v\frac{\partial v}{\partial \mathbf{w}}\\
&=& \frac{1}{N}\sum_{i=1}^N \begin{cases} 2v\vec{0}^T = \vec{0}^T & \mathbf{w}\cdot\mathbf{x}_i + b \leq 0\\ -2v\mathbf{x}_i^T & \mathbf{w}\cdot\mathbf{x}_i + b > 0 \end{cases}\\
&=& \frac{1}{N}\sum_{i=1}^N \begin{cases} \vec{0}^T & \mathbf{w}\cdot\mathbf{x}_i + b \leq 0\\ -2(y_i - u)\mathbf{x}_i^T & \mathbf{w}\cdot\mathbf{x}_i + b > 0 \end{cases}\\
&=& \frac{1}{N}\sum_{i=1}^N \begin{cases} \vec{0}^T & \mathbf{w}\cdot\mathbf{x}_i + b \leq 0\\ -2\big(y_i - \max(0, \mathbf{w}\cdot\mathbf{x}_i + b)\big)\mathbf{x}_i^T & \mathbf{w}\cdot\mathbf{x}_i + b > 0 \end{cases}\\
&=& \frac{1}{N}\sum_{i=1}^N \begin{cases} \vec{0}^T & \mathbf{w}\cdot\mathbf{x}_i + b \leq 0\\ -2\big(y_i - (\mathbf{w}\cdot\mathbf{x}_i + b)\big)\mathbf{x}_i^T & \mathbf{w}\cdot\mathbf{x}_i + b > 0 \end{cases}\\
&=& \begin{cases} \vec{0}^T & \mathbf{w}\cdot\mathbf{x}_i + b \leq 0\\ \frac{-2}{N}\sum_{i=1}^N \big(y_i - (\mathbf{w}\cdot\mathbf{x}_i + b)\big)\mathbf{x}_i^T & \mathbf{w}\cdot\mathbf{x}_i + b > 0 \end{cases}\\
&=& \begin{cases} \vec{0}^T & \mathbf{w}\cdot\mathbf{x}_i + b \leq 0\\ \frac{2}{N}\sum_{i=1}^N (\mathbf{w}\cdot\mathbf{x}_i + b - y_i)\mathbf{x}_i^T & \mathbf{w}\cdot\mathbf{x}_i + b > 0 \end{cases}
\end{eqnarray}

To interpret that equation, we can substitute an error term $e_i = \mathbf{w}\cdot\mathbf{x}_i + b - y_i$ yielding:

$$\frac{\partial C}{\partial \mathbf{w}} = \begin{cases} \vec{0}^T & \mathbf{w}\cdot\mathbf{x}_i + b \leq 0\\ \frac{2}{N}\sum_{i=1}^N e_i \mathbf{x}_i^T & \mathbf{w}\cdot\mathbf{x}_i + b > 0 \end{cases}$$
From there, notice that this computation is a weighted average across all $\mathbf{x}_i$ in $X$. The weights are the error terms $e_i = \mathbf{w} \cdot \mathbf{x}_i + b - y_i$, the difference between the target output and the actual neuron output for each $\mathbf{x}_i$ input. The resulting gradient will, on average, point in the direction of higher cost or loss because large $e_i$ emphasize their associated $\mathbf{x}_i$. Imagine we only had one input vector, $N = 1$; then the gradient is just $2 e_1 \mathbf{x}_1^T$. If the error is 0, then the gradient is zero and we have arrived at the minimum loss. If $e_1$ is some small positive difference, the gradient is a small step in the direction of $\mathbf{x}_1$. If $e_1$ is large, the gradient is a large step in that direction. If $e_1$ is negative, the gradient is reversed, meaning the highest cost is in the negative direction.
Of course, we want to reduce, not increase, the loss, which is why the gradient descent recurrence relation takes the negative of the gradient to update the current position (for scalar learning rate $\eta$):

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \frac{\partial C}{\partial \mathbf{w}}$$

Because the gradient indicates the direction of higher cost, we want to update $\mathbf{w}$ in the opposite direction.
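Here is a sketch of the whole recipe in NumPy: the piecewise gradients derived above, a finite-difference check against the cost, and one descent step. The random data, initial parameters, and learning rate are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))  # N=10 input vectors (rows are x_i^T)
y = rng.normal(size=10)       # scalar target for each input
w = rng.normal(size=3)        # arbitrary starting weights
b = 0.1                       # arbitrary starting bias

def cost(w, b):
    a = np.maximum(0.0, X @ w + b)  # neuron output per example
    return np.mean((y - a) ** 2)

def grads(w, b):
    z = X @ w + b
    active = (z > 0).astype(float)  # derivative of max(0, z) per example
    e = active * (z - y)            # error terms for active examples
    grad_w = (2 / len(y)) * (e @ X) # (2/N) * sum_i e_i * x_i^T
    grad_b = (2 / len(y)) * e.sum() # (2/N) * sum_i e_i
    return grad_w, grad_b

gw, gb = grads(w, b)

# Check the analytic gradients against finite differences of the cost.
eps = 1e-6
for i in range(len(w)):
    dw = np.zeros_like(w)
    dw[i] = eps
    numeric = (cost(w + dw, b) - cost(w - dw, b)) / (2 * eps)
    assert abs(numeric - gw[i]) < 1e-4
assert abs((cost(w, b + eps) - cost(w, b - eps)) / (2 * eps) - gb) < 1e-4

# One gradient-descent step should not increase the cost.
eta = 0.05
assert cost(w - eta * gw, b - eta * gb) <= cost(w, b)
```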
The derivative with respect to the bias
To optimize the bias, $b$, we also need the partial with respect to $b$. Here are the intermediate variables again:

$$u(\mathbf{w}, b, \mathbf{x}) = \max(0, \mathbf{w} \cdot \mathbf{x} + b), \qquad v(y, u) = y - u, \qquad C(v) = \frac{1}{N}\sum_{i=1}^{N} v^2$$

We computed the partial with respect to the bias for equation $u(\mathbf{w}, b, \mathbf{x})$ previously:

$$\frac{\partial u}{\partial b} = \begin{cases} 0 & \mathbf{w} \cdot \mathbf{x} + b \leq 0 \\ 1 & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases}$$

For $v$, the partial is:

$$\frac{\partial v(y, u)}{\partial b} = \frac{\partial}{\partial b}(y - u) = 0 - \frac{\partial u}{\partial b} = \begin{cases} 0 & \mathbf{w} \cdot \mathbf{x} + b \leq 0 \\ -1 & \mathbf{w} \cdot \mathbf{x} + b > 0 \end{cases}$$
And for the partial of the cost function itself we get:
\begin{eqnarray}
\frac{\partial C(v)}{\partial b} &=& \frac{\partial}{\partial b}\frac{1}{N}\sum_{i=1}^N v^2\\
&=& \frac{1}{N}\sum_{i=1}^N \frac{\partial}{\partial b} v^2\\
&=& \frac{1}{N}\sum_{i=1}^N \frac{\partial v^2}{\partial v}\frac{\partial v}{\partial b}\\
&=& \frac{1}{N}\sum_{i=1}^N 2v\frac{\partial v}{\partial b}\\
&=& \frac{1}{N}\sum_{i=1}^N \begin{cases} 0 & \mathbf{w}\cdot\mathbf{x}_i + b \leq 0\\ -2v & \mathbf{w}\cdot\mathbf{x}_i + b > 0 \end{cases}\\
&=& \frac{1}{N}\sum_{i=1}^N \begin{cases} 0 & \mathbf{w}\cdot\mathbf{x}_i + b \leq 0\\ -2\big(y_i - \max(0, \mathbf{w}\cdot\mathbf{x}_i + b)\big) & \mathbf{w}\cdot\mathbf{x}_i + b > 0 \end{cases}\\
&=& \frac{1}{N}\sum_{i=1}^N \begin{cases} 0 & \mathbf{w}\cdot\mathbf{x}_i + b \leq 0\\ 2(\mathbf{w}\cdot\mathbf{x}_i + b - y_i) & \mathbf{w}\cdot\mathbf{x}_i + b > 0 \end{cases}\\
&=& \begin{cases} 0 & \mathbf{w}\cdot\mathbf{x}_i + b \leq 0\\ \frac{2}{N}\sum_{i=1}^N (\mathbf{w}\cdot\mathbf{x}_i + b - y_i) & \mathbf{w}\cdot\mathbf{x}_i + b > 0 \end{cases}
\end{eqnarray}

As before, we can substitute an error term $e_i = \mathbf{w}\cdot\mathbf{x}_i + b - y_i$:

$$\frac{\partial C}{\partial b} = \begin{cases} 0 & \mathbf{w}\cdot\mathbf{x}_i + b \leq 0\\ \frac{2}{N}\sum_{i=1}^N e_i & \mathbf{w}\cdot\mathbf{x}_i + b > 0 \end{cases}$$

The partial derivative is then just the average error or zero, according to the activation level. To update the neuron bias, we nudge it in the opposite direction of increased cost:

$$b_{t+1} = b_t - \eta \frac{\partial C}{\partial b}$$
In practice, it is convenient to combine $\mathbf{w}$ and $b$ into a single vector parameter rather than having to deal with two different partials: $\hat{\mathbf{w}} = [\mathbf{w}^T, b]^T$. This requires a tweak to the input vector $\mathbf{x}$ as well but simplifies the activation function. By tacking a 1 onto the end of $\mathbf{x}$, $\hat{\mathbf{x}} = [\mathbf{x}^T, 1]^T$, $\mathbf{w} \cdot \mathbf{x} + b$ becomes $\hat{\mathbf{w}} \cdot \hat{\mathbf{x}}$.
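A one-liner check of this "bias trick" (the values below are arbitrary):

```python
import numpy as np

w = np.array([1.0, -2.0, 3.0])
b = 0.5
x = np.array([0.25, 4.0, -1.0])

w_hat = np.append(w, b)    # [w_1, ..., w_n, b]
x_hat = np.append(x, 1.0)  # [x_1, ..., x_n, 1]

# The combined dot product reproduces the affine function w.x + b.
assert np.isclose(w_hat @ x_hat, w @ x + b)
```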
This finishes off the optimization of the neural network loss function because we have the two partials necessary to perform a gradient descent.
Summary
Hopefully you've made it all the way through to this point. You're well on your way to understanding matrix calculus! We've included a reference that summarizes all of the rules from this article in the next section. Also check out the annotated resource link below.
Your next step would be to learn about the partial derivatives of matrices not just vectors. For example, you can take a look at the matrix differentiation section of Matrix calculus.
Acknowledgements. We thank Yannet Interian (Faculty in MS data science program at University of San Francisco) and David Uminsky (Faculty/director of MS data science) for their help with the notation presented here.
Matrix Calculus Reference
Gradients and Jacobians
The gradient of a function of two variables is a horizontal 2-vector:

$$\nabla f(x, y) = \left[\frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y}\right]$$

The Jacobian of a vector-valued function $\mathbf{f}(\mathbf{x})$ that is a function of a vector is an $m \times n$ ($m = |\mathbf{f}|$ and $n = |\mathbf{x}|$) matrix containing all possible scalar partial derivatives:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \nabla f_1(\mathbf{x}) \\ \vdots \\ \nabla f_m(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
The Jacobian of the identity function is I.
Element-wise operations on vectors
Define generic element-wise operations on vectors $\mathbf{w}$ and $\mathbf{x}$ using operator $\bigcirc$ such as $+$:

$$\mathbf{y} = \mathbf{f}(\mathbf{w}) \bigcirc \mathbf{g}(\mathbf{x})$$

The Jacobian with respect to $\mathbf{w}$ (similar for $\mathbf{x}$) is:

$$J_{\mathbf{w}} = \frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \left[\frac{\partial}{\partial w_j}\big(f_i(\mathbf{w}) \bigcirc g_i(\mathbf{x})\big)\right]_{ij}$$

Given the constraint (element-wise diagonal condition) that $f_i$ and $g_i$ access at most $w_i$ and $x_i$, respectively, the Jacobian simplifies to a diagonal matrix:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{w}} = \operatorname{diag}\!\left(\frac{\partial\big(f_1(w_1) \bigcirc g_1(x_1)\big)}{\partial w_1}, \ldots, \frac{\partial\big(f_n(w_n) \bigcirc g_n(x_n)\big)}{\partial w_n}\right)$$

Here are some sample element-wise operators:

$$\frac{\partial(\mathbf{w} + \mathbf{x})}{\partial \mathbf{w}} = I, \qquad \frac{\partial(\mathbf{w} - \mathbf{x})}{\partial \mathbf{w}} = I, \qquad \frac{\partial(\mathbf{w} \otimes \mathbf{x})}{\partial \mathbf{w}} = \operatorname{diag}(\mathbf{x})$$
Scalar expansion
Adding scalar $z$ to vector $\mathbf{x}$, $\mathbf{y} = \mathbf{x} + z$, is really $\mathbf{y} = \mathbf{f}(\mathbf{x}) + \mathbf{g}(z)$ where $\mathbf{f}(\mathbf{x}) = \mathbf{x}$ and $\mathbf{g}(z) = \vec{1}z$:

$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x} + z) = I, \qquad \frac{\partial}{\partial z}(\mathbf{x} + z) = \vec{1}$$

Scalar multiplication yields:

$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x} z) = Iz, \qquad \frac{\partial}{\partial z}(\mathbf{x} z) = \mathbf{x}$$
Vector reductions
The partial derivative of a vector sum with respect to the vector, for $y = \operatorname{sum}(\mathbf{f}(\mathbf{x}))$, is:

$$\nabla y = \left[\sum_i \frac{\partial f_i(\mathbf{x})}{\partial x_1}, \ldots, \sum_i \frac{\partial f_i(\mathbf{x})}{\partial x_n}\right]$$

For $y = \operatorname{sum}(\mathbf{x})$:

$$\nabla y = \vec{1}^T$$

For $y = \operatorname{sum}(\mathbf{x} z)$ and $n = |\mathbf{x}|$, we get:

$$\frac{\partial y}{\partial \mathbf{x}} = [z, z, \ldots, z], \qquad \frac{\partial y}{\partial z} = \operatorname{sum}(\mathbf{x})$$

Vector dot product $y = \mathbf{w} \cdot \mathbf{x} = \operatorname{sum}(\mathbf{w} \otimes \mathbf{x})$. Substituting $\mathbf{u} = \mathbf{w} \otimes \mathbf{x}$ and using the vector chain rule, we get:

$$\frac{\partial \mathbf{u}}{\partial \mathbf{x}} = \operatorname{diag}(\mathbf{w}), \qquad \frac{\partial y}{\partial \mathbf{u}} = \vec{1}^T, \qquad \frac{\partial y}{\partial \mathbf{x}} = \vec{1}^T \operatorname{diag}(\mathbf{w}) = \mathbf{w}^T$$

Similarly, $\frac{\partial y}{\partial \mathbf{w}} = \mathbf{x}^T$.
Chain rules
The vector chain rule is the general form as it degenerates to the others. When f is a function of a single variable x and all intermediate variables u are functions of a single variable, the single-variable chain rule applies. When some or all of the intermediate variables are functions of multiple variables, the single-variable total-derivative chain rule applies. In all other cases, the vector chain rule applies.
Single-variable rule: $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$

Single-variable total-derivative rule: $\frac{\partial f(x, u_1, \ldots, u_n)}{\partial x} = \frac{\partial f}{\partial x} + \sum_{i=1}^{n} \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$

Vector rule: $\frac{\partial}{\partial \mathbf{x}}\mathbf{f}(\mathbf{g}(\mathbf{x})) = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}}$
Notation
Lowercase letters in bold font such as $\mathbf{x}$ are vectors and those in italics font like $x$ are scalars. $x_i$ is the $i$-th element of vector $\mathbf{x}$ and is in italics because a single vector element is a scalar. $|\mathbf{x}|$ means "length of vector $\mathbf{x}$."

The $T$ exponent of $\mathbf{v}^T$ represents the transpose of the indicated vector.

$\sum_{i=a}^{b} x_i$ is just a for-loop that iterates $i$ from $a$ to $b$, summing all the $x_i$.

Notation $f(x)$ refers to a function called $f$ with an argument of $x$.
I represents the square “identity matrix” of appropriate dimensions that is zero everywhere but the diagonal, which contains all ones.
$\text{diag}(\mathbf{x})$ constructs a matrix whose diagonal elements are taken from vector $\mathbf{x}$: $\text{diag}(\mathbf{x})_{ii} = x_i$.
The dot product $\mathbf{x} \cdot \mathbf{y}$ is the summation of the element-wise multiplication of the elements: $\sum_i^n (x_i y_i)$. Or, you can look at it as $\mathbf{x}^T \mathbf{y}$.
Differentiation $\frac{d}{dx}$ is an operator that maps a function of one parameter to another function. That means that $\frac{d}{dx}f(x)$ maps $f(x)$ to its derivative with respect to $x$, which is the same thing as $\frac{df(x)}{dx}$. Also, if $y = f(x)$, then $\frac{dy}{dx} = \frac{df(x)}{dx} = \frac{d}{dx}f(x)$.
The partial derivative of the function with respect to $x$, $\frac{\partial f(x)}{\partial x}$, performs the usual scalar derivative holding all other variables constant.
The gradient of $f$ with respect to vector $\mathbf{x}$, $\nabla f(\mathbf{x})$, organizes all of the partial derivatives for a specific scalar function.
The Jacobian organizes the gradients of multiple functions into a matrix by stacking them: $J = \begin{bmatrix} \nabla f_1(\mathbf{x}) \\ \nabla f_2(\mathbf{x}) \\ \vdots \end{bmatrix}$.
The following notation $y = \begin{cases} a & \text{condition}_1 \\ b & \text{condition}_2 \end{cases}$ means that $y$ has the value $a$ upon $\text{condition}_1$ and value $b$ upon $\text{condition}_2$.
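As a small concrete illustration of the stacking idea (the function here is an example of ours, not from the original text), take $\mathbf{f}(\mathbf{x}) = \begin{bmatrix} x_1 x_2 \\ x_1 + x_2 \end{bmatrix}$:

```latex
J = \begin{bmatrix} \nabla f_1(\mathbf{x}) \\ \nabla f_2(\mathbf{x}) \end{bmatrix}
  = \begin{bmatrix}
      \frac{\partial (x_1 x_2)}{\partial x_1} & \frac{\partial (x_1 x_2)}{\partial x_2} \\[4pt]
      \frac{\partial (x_1 + x_2)}{\partial x_1} & \frac{\partial (x_1 + x_2)}{\partial x_2}
    \end{bmatrix}
  = \begin{bmatrix} x_2 & x_1 \\ 1 & 1 \end{bmatrix}
```

Each row is the gradient of one scalar component function, which is what makes the Jacobian the natural generalization of the gradient.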
Resources
When looking for resources on the web, search for “matrix calculus” not “vector calculus.” Here are some comments on the top links that come up from a Google search:
To learn more about neural networks and the mathematics behind optimization and back propagation, we highly recommend Michael Nielsen's book.
We reference the law of total derivatives, an important concept that just means that a derivative with respect to x must take into consideration the derivative with respect to x of all variables that are a function of x.
New York crude for March delivery fell US$1.06, or 1.5%, to settle at US$64.50 a barrel, its first two-day losing streak since January.
Brent crude for March delivery slipped 44 US cents, or 0.6%, to settle at US$69.02 a barrel.
Ehsan Ul-Haq, director at Resource Economist, pointed out that the market is focused on rising US output and a serious supply glut, which has dragged oil prices down.
At the same time, the recent strength of the US dollar has also weighed on oil prices.
The weekly crude inventory data released by the US Energy Information Administration on Wednesday was another focus of market attention.
Voice Recording
Spelling
Browser (Size / Font / Colour)
Reading
Mind Map
Voice to Text / Text to Speech (save, edit) #
Vocabulary #
Writing (Grammar / Spell Check / Inspire Vocabs) #
Flashcards / Vocab
Notes (Storage / Taking / Highlight / Hints) #
Whiteboard (Notes)
Reward System
Coach to Students #
Tips to Parents / Teachers
Dictation
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/managing-software.html
To automate the EC2 Backup, you will need to write a script to automate the above steps by using AWS’ API.
Below is the step by step process which should be followed in the script:
Get the list of instances.
Connect to AWS through the API to list the Amazon EBS volumes that are attached locally to each instance.
List the snapshots of each volume.
Assign a retention period to the snapshots.
Create a snapshot of each volume.
Delete any snapshot that is older than the retention period.
Using the AWS Command Line Interface (AWS CLI), you can write a shell script that automates the EBS volume backup. Install the AWS CLI if it has not already been installed; you can refer to this resource for details: AWS CLI Installation.
Commands to install the AWS CLI:
curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
unzip awscli-bundle.zip
./awscli-bundle/install -b ~/bin/aws
After installing the AWS CLI, configure it using the aws configure command:
aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: ENTER
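The backup script below reads its list of volumes from a file (its VOLUMES_LIST variable points at /var/log/volumes-list). Judging from the awk -F":" parsing in the script, the expected format is one volume-id:volume-name entry per line. A minimal sketch of that format, using /tmp and made-up volume IDs:

```shell
# Create a sample volumes-list file (the volume IDs and names are made up)
printf 'vol-0abc1234:webserver\nvol-0def5678:database\n' > /tmp/volumes-list

# The backup script splits each entry on ":" to recover the ID and the name
VOL_INFO=$(head -1 /tmp/volumes-list)
VOL_ID=$(echo "$VOL_INFO" | awk -F":" '{print $1}')
VOL_NAME=$(echo "$VOL_INFO" | awk -F":" '{print $2}')
echo "$VOL_ID $VOL_NAME"
```

In production the file would live at /var/log/volumes-list and contain the IDs of your real EBS volumes.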
Creating the Shell Script
Copy the code below to snapshot.sh and set it as a cron job for automatic, timely backups. You can find an explanation of the script in its comments.
#!/bin/bash
# File containing volume-id:volume-name entries, one per line
VOLUMES_LIST=/var/log/volumes-list
SNAPSHOT_INFO=/var/log/snapshot_info
DATE=$(date +%Y-%m-%d)
REGION="eu-west-1"
# Delete snapshots older than this many days
RETENTION=6
SNAP_CREATION=/var/log/snap_creation
SNAP_DELETION=/var/log/snap_deletion
EMAIL_LIST=abc@domain.com

echo "List of Snapshots Creation Status" > $SNAP_CREATION
echo "List of Snapshots Deletion Status" > $SNAP_DELETION

# Create a snapshot of each volume in the list
if [ -f $VOLUMES_LIST ]; then
    for VOL_INFO in $(cat $VOLUMES_LIST)
    do
        VOL_ID=$(echo $VOL_INFO | awk -F":" '{print $1}')
        VOL_NAME=$(echo $VOL_INFO | awk -F":" '{print $2}')
        DESCRIPTION="${VOL_NAME}_${DATE}"
        /usr/local/bin/aws ec2 create-snapshot --volume-id $VOL_ID --description "$DESCRIPTION" --region $REGION &>> $SNAP_CREATION
    done
else
    echo "Volumes list file is not available : $VOLUMES_LIST Exiting." | mail -s "Snapshots Creation Status" $EMAIL_LIST
    exit 1
fi

echo >> $SNAP_CREATION
echo >> $SNAP_CREATION

# For each volume, delete completed snapshots older than the retention period
for VOL_INFO in $(cat $VOLUMES_LIST)
do
    VOL_ID=$(echo $VOL_INFO | awk -F":" '{print $1}')
    VOL_NAME=$(echo $VOL_INFO | awk -F":" '{print $2}')
    /usr/local/bin/aws ec2 describe-snapshots --query "Snapshots[*].[SnapshotId,VolumeId,Description,StartTime]" --output text --filters "Name=status,Values=completed" "Name=volume-id,Values=$VOL_ID" | grep -v "CreateImage" > $SNAPSHOT_INFO
    while read SNAP_INFO
    do
        SNAP_ID=$(echo $SNAP_INFO | awk '{print $1}')
        echo $SNAP_ID
        SNAP_DATE=$(echo $SNAP_INFO | awk '{print $4}' | awk -F"T" '{print $1}')
        echo $SNAP_DATE
        # Snapshot age in days
        RETENTION_DIFF=$(( ( $(date -d "$DATE" +%s) - $(date -d "$SNAP_DATE" +%s) ) / 86400 ))
        echo $RETENTION_DIFF
        if [ $RETENTION -lt $RETENTION_DIFF ]; then
            /usr/local/bin/aws ec2 delete-snapshot --snapshot-id $SNAP_ID --region $REGION --output text > /tmp/snap_del
            echo DELETING $SNAP_INFO >> $SNAP_DELETION
        fi
    done < $SNAPSHOT_INFO
done

echo >> $SNAP_DELETION
cat $SNAP_CREATION $SNAP_DELETION > /var/log/mail_report
cat /var/log/mail_report | mail -s "Volume Snapshots Status" $EMAIL_LIST
Follow the steps below to create and run the shell script:
Create a script named snapshot.sh with the contents above, make it executable, and register it as a cron job in crontab:
crontab -e
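As a sketch, assuming the script was saved as /root/snapshot.sh and made executable (the path and schedule here are illustrative assumptions, not from the original article), a crontab entry to run the backup daily at 02:00 could look like:

```shell
# minute hour day-of-month month day-of-week command
# Run the EBS snapshot script every day at 02:00, logging its output
0 2 * * * /root/snapshot.sh >> /var/log/snapshot_cron.log 2>&1
```

Redirecting stdout and stderr to a log file makes it easier to debug failed runs, since cron itself gives little feedback.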
Hope you liked the article. Taking backups of your infrastructure resources frequently is very important in order to be able to recover from a disaster. It's important to schedule AWS backups on a regular basis, such as weekly or monthly, across different availability zones. This is one of the best practices followed by devops teams all over the world.
Automating Instance Backup Using CPM
While in-house scripts can provide a basic backup solution, it rarely makes business sense for organizations to invest in building a fully-featured backup solution themselves rather than focusing on their business-critical tasks. Cloud Protection Manager (CPM) is an enterprise-class backup, recovery, and disaster-recovery solution designed for AWS EC2. It covers all the essential backup and recovery features needed for a robust backup and DR setup while simplifying processes and saving precious devops time. CPM is available as a service model that allows users to manage multiple AWS accounts and configure policies and schedules to take automated snapshot backups. It also has a Windows agent to consistently back up Windows applications, and it allows you to recover a volume from a snapshot, increase its size, and switch it with an existing attached volume in a single step.
Furthermore, in a dynamic cloud environment you need to be able to keep consistent backup policy across all your instances at any point in time. To be most effective, your solution needs to be dynamic and automated when a server is terminated and a new instance needs to be launched. Using EC2 tags, CPM can automatically assign each one of these new instances the appropriate backup policy based on their purpose and your initial configuration. For additional information, see our previous article about tag-based continuous AWS cloud backup.