On the Expressive Power of Deep Neural Networks
Introduction
Findings:
- The complexity of the computed function grows exponentially with depth
- Not all weights are equal (initial layers matter more): trained networks are far more sensitive to their lower (initial) layer weights. They are much less robust to noise in these weights, and they perform better when these weights are well optimized (a toy illustration follows this list).
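The following is a hedged toy sketch of that sensitivity check, not the paper's experiment: train a small MLP on a toy dataset, then add Gaussian noise to one layer's weights at a time and compare the accuracy drop. The dataset, architecture, and noise scale are arbitrary choices, and results on such a small task need not mirror the paper's large-scale findings.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_moons(n_samples=2000, noise=0.2, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(32, 32, 32, 32),
                    max_iter=2000, random_state=0).fit(X, y)
baseline = clf.score(X, y)
original = clf.coefs_                      # list of per-layer weight matrices

for layer, W in enumerate(original):
    noisy = [w.copy() for w in original]
    # Perturb only this layer's weights, with noise scaled to the layer's std.
    noisy[layer] += rng.normal(0.0, 0.5 * np.std(W), size=W.shape)
    clf.coefs_ = noisy                     # swap in the perturbed weights
    print(f"noise on layer {layer}: accuracy {clf.score(X, y):.3f} "
          f"(baseline {baseline:.3f})")

clf.coefs_ = original                      # restore the trained weights
```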
Neural Network Expressivity: how the architectural properties of a neural network (depth, width, layer type) affect the resulting functions it can compute, and its ensuing performance.
Contributions:
- Propose easily computable measures of neural network expressivity, which capture the expressive power of the NN architecture independently of specific weight values.
- We show how these results on trajectory length suggest that optimizing weights in lower layers of the network is particularly important.
Model
Goal: understand how the computed function F_A(x; W) changes as the architecture A changes.
Studying the entire input space is too expensive, so the paper focuses on simple one-dimensional trajectories through input space.
- Trajectory: x(t) is a trajectory between two points x_0 and x_1 if x(0) = x_0 and x(1) = x_1, e.g. the straight line x(t) = t x_1 + (1 - t) x_0.
- Transitions: a neuron with a piecewise-linear activation transitions between inputs x_1 and x_2 if its activation state switches between them (for ReLU, from inactive to active or vice versa).
- For a trajectory x(t), we can thus define T(F_A(x(t); W)) to be the number of transitions (or number of linear regions) the output neurons go through as x changes along the trajectory x(t).
- A(F_A(x(t); W)) represents the number of distinct activation patterns as we move along the trajectory x(t) (a small numerical sketch of both quantities follows this list).
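The snippet below is that sketch: it samples the straight-line trajectory x(t) = t x_1 + (1 - t) x_0, pushes it through a small random ReLU network, and counts transitions and distinct activation patterns along it. The widths, depth, weight scale, and sampling resolution are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(m, width, depth, sigma_w=2.0):
    """Random ReLU MLP: `depth` hidden layers of `width` units, input dim m."""
    dims = [m] + [width] * depth
    return [rng.normal(0.0, sigma_w / np.sqrt(d_in), size=(d_in, d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def activation_patterns(weights, X):
    """Boolean on/off pattern of every hidden ReLU for each input row of X."""
    h, patterns = X, []
    for W in weights:
        pre = h @ W
        patterns.append(pre > 0)              # active / inactive per neuron
        h = relu(pre)
    return np.concatenate(patterns, axis=1)   # shape: (num points, total neurons)

# Straight-line trajectory x(t) = t*x1 + (1-t)*x0, finely sampled in t.
m = 4
x0, x1 = rng.normal(size=m), rng.normal(size=m)
t = np.linspace(0.0, 1.0, 10_000)[:, None]
X = t * x1 + (1.0 - t) * x0

P = activation_patterns(init_mlp(m, width=8, depth=4), X)

# T: transitions along the trajectory = sample steps where some neuron flips state.
T = int(np.sum(np.any(P[1:] != P[:-1], axis=1)))
# A: number of distinct activation patterns encountered along the trajectory.
A = len({row.tobytes() for row in P})
print(f"transitions T = {T}, distinct activation patterns A = {A}")
```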
The paper then states a theorem: for a network A(n, k) with n hidden layers of width k and inputs in R^m, the number of distinct activation patterns over the input space is upper bounded by a quantity U(n, k, m) that depends only on the architecture:
- A(F_{A(n,k)}(R^m; W)) <= U(n, k, m)
- U(1, N, m) < U(2, N/2, m) < ... < U(n, k, m), where N = nk is the total number of hidden neurons; i.e. for a fixed neuron budget, deeper architectures admit more activation patterns (a small numeric illustration follows).
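To get a feel for the second inequality, here is a tiny numeric illustration under the assumption (my paraphrase, hedged; the exact bound is in the paper's theorem) that for ReLU networks U(n, k, m) scales roughly like k^(mn). Spreading a fixed budget of N = nk neurons over more layers then inflates the bound:

```python
# Assumed (hedged) ReLU-style bound: U(n, k, m) ~ k**(m*n). With a fixed
# neuron budget N = n*k, more depth gives a larger bound for these settings.
N, m = 24, 3                       # total hidden neurons, input dimension (arbitrary)
for n in (1, 2, 3, 4, 6, 8):
    k = N // n                     # width when the budget is split over n layers
    print(f"n={n:>2} layers of width k={k:>2}: k**(m*n) = {k ** (m * n):.3e}")
```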
Trajectory Length:
- Given a trajectory x(t), its trajectory length l(x(t)) is defined as the standard arc length: l(x(t)) = ∫_t || dx(t)/dt || dt.
- z^(d)(t) denotes the image of the trajectory at hidden layer d, i.e. the output of layer d as the input moves along x(t). The paper's bound shows that, in expectation, this image gets longer from layer to layer (each layer stretches the trajectory increments δz^(d)(t) by a multiplicative factor), so l(z^(d)(t)) grows exponentially with depth. A rough numerical sketch follows.
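The sketch below is again illustrative only, not the paper's experiment: it propagates a finely sampled straight-line trajectory through a random ReLU network and measures the arc length of its image after each hidden layer. The widths, weight scale sigma_w, and sampling density are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

def layer_images(x_t, widths, sigma_w=2.0):
    """Images z^(d)(t) of a sampled trajectory x(t) after each hidden layer
    of a random ReLU network with weights ~ N(0, sigma_w^2 / fan_in)."""
    images, h, d_in = [], x_t, x_t.shape[1]
    for k in widths:
        W = rng.normal(0.0, sigma_w / np.sqrt(d_in), size=(d_in, k))
        h = relu(h @ W)
        images.append(h)
        d_in = k
    return images

def arc_length(z_t):
    """Discrete arc-length approximation: sum of ||z(t_{i+1}) - z(t_i)||."""
    return float(np.sum(np.linalg.norm(np.diff(z_t, axis=0), axis=1)))

# Finely sampled straight-line trajectory between two random inputs.
m = 4
x0, x1 = rng.normal(size=m), rng.normal(size=m)
t = np.linspace(0.0, 1.0, 5_000)[:, None]
x_t = t * x1 + (1.0 - t) * x0

print(f"input trajectory length l(x(t)) = {arc_length(x_t):.2f}")
for d, z in enumerate(layer_images(x_t, widths=[32] * 8), start=1):
    print(f"after hidden layer {d}: l(z^({d})(t)) = {arc_length(z):.2f}")
```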