On the Expressive Power of Deep Neural Networks
Introduction
Findings:
- The complexity of the computed function grows exponentially with depth
- Not all weights are equal (initial layers matter more): trained networks are far more sensitive to their lower (initial) layer weights. They are much less robust to noise in these weights, and they perform better when these weights are well optimized (a toy illustration follows this list).
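The following is a hedged toy sketch of that sensitivity check, not the paper's experiment: train a small MLP on a toy dataset, then add Gaussian noise to one layer's weights at a time and compare the accuracy drop. The dataset, architecture, and noise scale are arbitrary choices, and results on such a small task need not mirror the paper's large-scale findings.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_moons(n_samples=2000, noise=0.2, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(32, 32, 32, 32),
                    max_iter=2000, random_state=0).fit(X, y)
baseline = clf.score(X, y)
original = clf.coefs_                      # list of per-layer weight matrices

for layer, W in enumerate(original):
    noisy = [w.copy() for w in original]
    # Perturb only this layer's weights, with noise scaled to the layer's std.
    noisy[layer] += rng.normal(0.0, 0.5 * np.std(W), size=W.shape)
    clf.coefs_ = noisy                     # swap in the perturbed weights
    print(f"noise on layer {layer}: accuracy {clf.score(X, y):.3f} "
          f"(baseline {baseline:.3f})")

clf.coefs_ = original                      # restore the trained weights
```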
Neural Network Expressivity: how the architectural properties of a neural network (depth, width, layer type) affect the resulting functions it can compute, and its ensuing performance.
Contributions:
- Propose easily computable measures of neural network expressivity, which capture the expressive power of the NN architecture independently of specific weight values.
- We show how these results on trajectory length suggest that optimizing weights in lower layers of the network is particularly important.
Model
Goal: understand how the computed function F_A(x; W) changes as the architecture A changes.
Studying the entire input space is too expensive, so the paper focuses on simple one-dimensional trajectories through input space.
- Trajectory: x(t) is a trajectory between two points x_0 and x_1 if x(0) = x_0 and x(1) = x_1, e.g. the straight line x(t) = t x_1 + (1 - t) x_0.
- Transitions: a neuron with a piecewise-linear activation transitions between inputs x_1 and x_2 if its activation state switches between them (for ReLU, from inactive to active or vice versa).
- For a trajectory x(t), we can thus define T(F_A(x(t); W)) to be the number of transitions (or number of linear regions) the output neurons go through as x changes along the trajectory x(t).
- A(F_A(x(t); W)) represents the number of distinct activation patterns as we move along the trajectory x(t) (a small numerical sketch of both quantities follows this list).
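The snippet below is that sketch: it samples the straight-line trajectory x(t) = t x_1 + (1 - t) x_0, pushes it through a small random ReLU network, and counts transitions and distinct activation patterns along it. The widths, depth, weight scale, and sampling resolution are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(m, width, depth, sigma_w=2.0):
    """Random ReLU MLP: `depth` hidden layers of `width` units, input dim m."""
    dims = [m] + [width] * depth
    return [rng.normal(0.0, sigma_w / np.sqrt(d_in), size=(d_in, d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def activation_patterns(weights, X):
    """Boolean on/off pattern of every hidden ReLU for each input row of X."""
    h, patterns = X, []
    for W in weights:
        pre = h @ W
        patterns.append(pre > 0)              # active / inactive per neuron
        h = relu(pre)
    return np.concatenate(patterns, axis=1)   # shape: (num points, total neurons)

# Straight-line trajectory x(t) = t*x1 + (1-t)*x0, finely sampled in t.
m = 4
x0, x1 = rng.normal(size=m), rng.normal(size=m)
t = np.linspace(0.0, 1.0, 10_000)[:, None]
X = t * x1 + (1.0 - t) * x0

P = activation_patterns(init_mlp(m, width=8, depth=4), X)

# T: transitions along the trajectory = sample steps where some neuron flips state.
T = int(np.sum(np.any(P[1:] != P[:-1], axis=1)))
# A: number of distinct activation patterns encountered along the trajectory.
A = len({row.tobytes() for row in P})
print(f"transitions T = {T}, distinct activation patterns A = {A}")
```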
The paper then states a theorem: for a network A(n, k) with n hidden layers of width k and inputs in R^m, the number of distinct activation patterns over the input space is upper bounded by a quantity U(n, k, m) that depends only on the architecture:
- A(F_{A(n,k)}(R^m; W)) <= U(n, k, m)
- U(1, N, m) < U(2, N/2, m) < ... < U(n, k, m), where N = nk is the total number of hidden neurons; i.e. for a fixed neuron budget, deeper architectures admit more activation patterns (a small numeric illustration follows).
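To get a feel for the second inequality, here is a tiny numeric illustration under the assumption (my paraphrase, hedged; the exact bound is in the paper's theorem) that for ReLU networks U(n, k, m) scales roughly like k^(mn). Spreading a fixed budget of N = nk neurons over more layers then inflates the bound:

```python
# Assumed (hedged) ReLU-style bound: U(n, k, m) ~ k**(m*n). With a fixed
# neuron budget N = n*k, more depth gives a larger bound for these settings.
N, m = 24, 3                       # total hidden neurons, input dimension (arbitrary)
for n in (1, 2, 3, 4, 6, 8):
    k = N // n                     # width when the budget is split over n layers
    print(f"n={n:>2} layers of width k={k:>2}: k**(m*n) = {k ** (m * n):.3e}")
```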
Trajectory Length:
- Given a trajectory x(t), its trajectory length l(x(t)) is defined as the standard arc length: l(x(t)) = ∫_t || dx(t)/dt || dt.
- z^(d)(t) denotes the image of the trajectory at hidden layer d, i.e. the output of layer d as the input moves along x(t). The paper's bound shows that, in expectation, this image gets longer from layer to layer (each layer stretches the trajectory increments δz^(d)(t) by a multiplicative factor), so l(z^(d)(t)) grows exponentially with depth. A rough numerical sketch follows.
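The sketch below is again illustrative only, not the paper's experiment: it propagates a finely sampled straight-line trajectory through a random ReLU network and measures the arc length of its image after each hidden layer. The widths, weight scale sigma_w, and sampling density are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

def layer_images(x_t, widths, sigma_w=2.0):
    """Images z^(d)(t) of a sampled trajectory x(t) after each hidden layer
    of a random ReLU network with weights ~ N(0, sigma_w^2 / fan_in)."""
    images, h, d_in = [], x_t, x_t.shape[1]
    for k in widths:
        W = rng.normal(0.0, sigma_w / np.sqrt(d_in), size=(d_in, k))
        h = relu(h @ W)
        images.append(h)
        d_in = k
    return images

def arc_length(z_t):
    """Discrete arc-length approximation: sum of ||z(t_{i+1}) - z(t_i)||."""
    return float(np.sum(np.linalg.norm(np.diff(z_t, axis=0), axis=1)))

# Finely sampled straight-line trajectory between two random inputs.
m = 4
x0, x1 = rng.normal(size=m), rng.normal(size=m)
t = np.linspace(0.0, 1.0, 5_000)[:, None]
x_t = t * x1 + (1.0 - t) * x0

print(f"input trajectory length l(x(t)) = {arc_length(x_t):.2f}")
for d, z in enumerate(layer_images(x_t, widths=[32] * 8), start=1):
    print(f"after hidden layer {d}: l(z^({d})(t)) = {arc_length(z):.2f}")
```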