Backpropagation

One of the main steps in a machine learning algorithm is setting the best weights for the model — that is, finding them by training the model on training examples. Undoubtedly, the first optimizer (a method for finding the best weights) we come across is Gradient Descent. Let me briefly recall it for you.

Gradient Descent is the most basic but most widely used optimization algorithm. It is used heavily in linear regression and classification. Mathematically speaking, the weights are updated in the direction of the negative gradient of the loss: w ← w − α · ∂L/∂w, where α is the learning rate.

Weights are altered so that the loss function reaches its minimum as fast as possible. The update itself is very basic and easy to compute, and the weights are updated once per batch. If the batch is very big, it can take a very long time to reach the minimum, or the optimizer may get trapped in a local minimum. Another hurdle is that when there are very many features or parameters (as in deep network models), computing the gradient of the function with respect to every parameter directly becomes impractical.
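As a minimal sketch of this update rule, here is gradient descent on a simple one-dimensional loss. The loss L(w) = (w − 3)² and the learning rate are arbitrary choices for illustration, not anything from a real model:

```python
# Gradient descent on the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)   # dL/dw

w = 0.0          # initial weight
alpha = 0.1      # learning rate
for _ in range(100):
    w -= alpha * grad(w)     # step against the gradient

print(w)         # converges toward 3.0
```

Each step moves w opposite to the slope of the loss, which is exactly the "alter weights so the loss reaches its minimum" behavior described above.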

So we use backpropagation to compute the gradients of all parameters efficiently. In a deep network model, we first forward-feed all the training examples, and then, through backward propagation, we alter the weights of each node depending on its contribution to the overall loss of the model.

Before diving in depth, there is some mathematical background needed to understand its nature.

Recall that the gradient of a function is the vector of all partial derivatives of that function. A partial derivative is simply the derivative with respect to one variable, holding the others fixed.

This is how a gradient looks: ∇f = (∂f/∂x₁, …, ∂f/∂xₙ). Remember, it's a vector!

For example, if the derivative at some point with respect to x is −3, this means that if we were to increase x by some small amount h, the whole function or expression would decrease by about three times that amount (3h).
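That claim is easy to check numerically. As a sketch, assume the function is f(x, y) = x·y with y = −3 (so that ∂f/∂x = y = −3; the function and point are arbitrary choices for illustration) and compare the analytic derivative with a finite-difference estimate:

```python
# Finite-difference check of a partial derivative.
def f(x, y):
    return x * y

x, y = 4.0, -3.0
h = 1e-5                                  # small step
analytic = y                              # d(x*y)/dx = y = -3
numeric = (f(x + h, y) - f(x, y)) / h     # ~ -3: f drops by about 3h when x grows by h
```

The numeric estimate matches the analytic value −3 up to rounding, confirming the "decrease by 3h" reading of the derivative.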

We also know basic derivatives of common functions like addition, multiplication, and max, and many complex loss functions are made up of these small basic pieces. Let us introduce some terminology: we call these nodes in the model "gates". Each gate performs two operations: in the forward pass it computes its output from its inputs, and in the backward pass it multiplies the gradient arriving at its output by its local gradients and passes the result back to its inputs.

It is interesting to note that in many cases the backward-flowing gradient can be interpreted on an intuitive level. For example, the three most commonly used gates in neural networks (add, mul, max) all have very simple interpretations in terms of how they act during backpropagation.

The add gate always takes the gradient on its output and distributes it equally to all of its inputs, no matter what their values were during the forward pass. This follows from the fact that the local gradient of the add operation is simply +1.0, so the gradients on all inputs exactly equal the gradient on the output (it is multiplied by 1.0 and remains unchanged). In the example, the + gate routed the gradient of 2.00 to both of its inputs, equally and unchanged.

The max gate, in contrast, routes the gradient (unchanged) to exactly one of its inputs: the input that had the highest value during the forward pass. This is because the local gradient of a max gate is 1.0 for the highest value and 0.0 for all other values. Here, the max operation routed the gradient of 2.00 to the variable z, which had a higher value than w, so the gradient on w remains zero.

The multiply gate is a little less easy to interpret. Its local gradients are the input values, except switched, and these are multiplied by the gradient on the output, as the chain rule dictates. In the example, the gradient on x is −8.00, which is −4.00 × 2.00.
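The numbers quoted for the three gates can be reproduced with a tiny sketch. The circuit f = 2 · (x·y + max(z, w)) with x = 3, y = −4, z = 2, w = −1 is an assumption chosen to be consistent with the gradients mentioned (2.00 through the add and max gates, −8.00 on x):

```python
# Hypothetical circuit: f = 2 * (x*y + max(z, w))
x, y, z, w = 3.0, -4.0, 2.0, -1.0

# forward pass, gate by gate
a = x * y          # mul gate:  -12.0
b = max(z, w)      # max gate:    2.0
c = a + b          # add gate:  -10.0
f = 2.0 * c        # output:    -20.0

# backward pass (chain rule)
dc = 2.0                     # local gradient of 2*c w.r.t. c
da, db = dc, dc              # add gate: distributes the gradient unchanged
dz = dc if z > w else 0.0    # max gate: routes only to the larger input
dw = dc if w > z else 0.0
dx = y * da                  # mul gate: local gradients are the switched inputs
dy = x * da
```

Here dx = −4.00 × 2.00 = −8.00, dz = 2.00, and dw = 0.0, matching the gate-by-gate intuitions above.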

Now we will show an example of how this extends to a full model.

Let's assume a hypothetical model built from these basic functions. After a forward feed of the features, the loss is calculated, and backpropagation proceeds recursively.

Initially, we are at the output node/gate 'f', where the gradient is 1. At this node the two operations discussed above are carried out, and the incoming gradient is multiplied by the local gradients of node 'f'.

In the above figure, we calculated the local gradients with respect to the inputs z (2.1) and q (2.2) and multiplied them by the node's output gradient. These results are again sent backwards until they reach the final input nodes.

This is the first node in our model, where we again perform those two operations.
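The recursion above can be sketched end to end on a small hypothetical expression. Assuming f = (x + y) · z with intermediate q = x + y (values chosen arbitrarily), the backward pass starts with gradient 1 at 'f' and repeats "multiply by local gradients, send back" at every node:

```python
# Staged forward pass for the hypothetical expression f = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y          # first stage (add gate):  3.0
f = q * z          # output stage (mul gate): -12.0

# backward pass, starting at the output where the gradient is 1
df = 1.0
dq = z * df        # mul gate: local gradient w.r.t. q is z
dz = q * df        # mul gate: local gradient w.r.t. z is q
dx = 1.0 * dq      # add gate: local gradient +1 routes dq to both inputs
dy = 1.0 * dq
```

Every line of the backward pass is the same two-operation pattern: take the gradient flowing in from above and multiply it by the node's local gradient.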

Finally, notice that if one of the inputs to the multiply gate is very small and the other is very big, the multiply gate does something slightly unintuitive: it assigns a huge gradient to the tiny input and a tiny gradient to the huge input. Note that in linear classifiers, the weights are dot-producted (multiplied) with the inputs.

This means that the scale of the data has an impact on the magnitude of the gradient for the weights. For example, if you multiplied all input data examples xi by 1000 during preprocessing, then the gradient on the weights would be 1000 times larger, and you would have to lower the learning rate by that factor to compensate.

This is why preprocessing matters a lot, sometimes in subtle ways! And having an intuitive understanding of how the gradients flow can help you debug some of these cases: if a function's input is multiplied by a constant factor, its derivative with respect to the weights is multiplied by that same constant.
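A minimal sketch of this scaling effect, assuming a single linear score s = w·x (one weight, one feature, chosen for illustration): the gradient of s with respect to w is just x, so scaling the input scales the weight gradient by the same factor.

```python
# Gradient of the linear score s = w * x with respect to w is simply x.
def grad_w(x, w):
    return x   # ds/dw = x

x, w = 0.5, 2.0
g = grad_w(x, w)                # gradient with the original input
g_scaled = grad_w(1000 * x, w)  # gradient after scaling the input by 1000
ratio = g_scaled / g            # the gradient grew by exactly the scale factor
```

To keep the same effective step size w − α·g after such scaling, the learning rate α would have to shrink by the same factor of 1000.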

Finally, instead of updating the weights at full scale, we use the learning rate alpha to control the step size. Another important point to note: through backpropagation we can understand the importance of regularizing the features.

We developed intuition for what the gradients mean, how they flow backwards through the circuit, and how they communicate which parts of the circuit should increase or decrease, and with what force, to make the final output higher. We discussed the importance of staged computation for practical implementations of backpropagation. You always want to break your function up into modules for which you can easily derive local gradients, and then chain them together with the chain rule.

Crucially, you almost never want to write these expressions out on paper and differentiate them symbolically in full, because you never need an explicit mathematical equation for the gradient of the input variables. Hence, decompose your expressions into stages such that you can differentiate every stage independently (the stages will be matrix-vector multiplies, or max operations, or sum operations, etc.), and then backprop through the variables one step at a time.
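As a final sketch of staged computation, consider a hypothetical neuron f(w, x) = 1 / (1 + exp(−(w0·x0 + w1·x1 + w2))) (the values of w and x are arbitrary). Rather than differentiating the whole expression symbolically, we name the intermediate stages and backprop through them one at a time:

```python
import math

w = [2.0, -3.0, -3.0]
x = [-1.0, -2.0]

# forward pass, stage by stage
dot = w[0]*x[0] + w[1]*x[1] + w[2]    # stage 1: weighted sum = 1.0
f = 1.0 / (1.0 + math.exp(-dot))      # stage 2: sigmoid

# backward pass: the sigmoid stage has the convenient local gradient f * (1 - f)
ddot = (1.0 - f) * f                  # gradient on the weighted sum
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]   # mul/add gates inside stage 1
dx = [w[0] * ddot, w[1] * ddot]
```

Each stage is easy to differentiate on its own; the chain rule glues the pieces together, and the full symbolic derivative of f never has to be written down.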
