| dbp:mathStatement
|
- Let $X$ be a compact subset of $\mathbb{R}^n$. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be any non-affine continuous function which is continuously differentiable at at least one point, with nonzero derivative at that point. Let $\mathcal{N}_{n,m:n+m+2}^{\sigma}$ denote the space of feed-forward neural networks with $n$ input neurons, $m$ output neurons, and an arbitrary number of hidden layers each with $n + m + 2$ neurons, such that every hidden neuron has activation function $\sigma$ and every output neuron has the identity as its activation function, with input layer $\phi$ and output layer $\rho$. Then given any $\varepsilon > 0$ and any $f \in C(X, \mathbb{R}^m)$, there exists $\hat{f} \in \mathcal{N}_{n,m:n+m+2}^{\sigma}$ such that
$$\sup_{x \in X} \left\| \hat{f}(x) - f(x) \right\| < \varepsilon.$$
In other words, $\mathcal{N}_{n,m:n+m+2}^{\sigma}$ is dense in $C(X; \mathbb{R}^m)$ with respect to the topology of uniform convergence.
Quantitative refinement: The number of layers and the width of each layer required to approximate $f$ to precision $\varepsilon$ are known; moreover, the result holds true when $X$ and $\mathbb{R}^m$ are replaced with any non-positively curved Riemannian manifold. (en)
- There exists an activation function $\sigma$ which is analytic, strictly increasing and sigmoidal and has the following property: for any $f \in C[0,1]^d$ and $\varepsilon > 0$ there exist constants $d_i$, $c_{ij}$, $\theta_{ij}$, $\gamma_i$, and vectors $\mathbf{w}^{ij} \in \mathbb{R}^d$ for which
$$\left| f(\mathbf{x}) - \sum_{i=1}^{6d+3} d_i \, \sigma\!\left( \sum_{j=1}^{3d} c_{ij} \, \sigma(\mathbf{w}^{ij} \cdot \mathbf{x} - \theta_{ij}) - \gamma_i \right) \right| < \varepsilon$$
for all $\mathbf{x} = (x_1, \dots, x_d) \in [0,1]^d$. (en)
- Let $[a,b]$ be a finite segment of the real line, $s = b - a$, and $\lambda$ be any positive number. Then one can algorithmically construct a computable sigmoidal activation function $\sigma : \mathbb{R} \to \mathbb{R}$, which is infinitely differentiable, strictly increasing on $(-\infty, s)$, $\lambda$-strictly increasing on $[s, +\infty)$, and satisfies the following properties:
# For any $f \in C[a,b]$ and $\varepsilon > 0$ there exist numbers $c_1$, $c_2$, $\theta_1$ and $\theta_2$ such that for all $x \in [a,b]$
$$|f(x) - c_1 \sigma(x - \theta_1) - c_2 \sigma(x - \theta_2)| < \varepsilon.$$
# For any continuous function $F$ on the $d$-dimensional box $[a,b]^d$ and $\varepsilon > 0$, there exist constants $e_p$, $c_{pq}$, $\theta_{pq}$ and $\zeta_p$ such that the inequality
$$\left| F(\mathbf{x}) - \sum_{p=1}^{2d+2} e_p \, \sigma\!\left( \sum_{q=1}^{d} c_{pq} \, \sigma(\mathbf{w}^{q} \cdot \mathbf{x} - \theta_{pq}) - \zeta_p \right) \right| < \varepsilon$$
holds for all $\mathbf{x} = (x_1, \dots, x_d) \in [a,b]^d$. Here the weights $\mathbf{w}^{q}$, $q = 1, \dots, d$, are fixed as follows:
$$\mathbf{w}^{1} = (1, 0, \dots, 0), \quad \mathbf{w}^{2} = (0, 1, \dots, 0), \quad \dots, \quad \mathbf{w}^{d} = (0, 0, \dots, 1).$$
In addition, all the coefficients $e_p$, except one, are equal. (en)
- For any Bochner–Lebesgue $p$-integrable function $f : \mathbb{R}^n \to \mathbb{R}^m$ and any $\varepsilon > 0$, there exists a fully connected ReLU network $\mathcal{F}$ of width exactly $w = \max\{n+1, m\}$, satisfying
$$\int_{\mathbb{R}^n} \left\| f(x) - \mathcal{F}(x) \right\|^p \, \mathrm{d}x < \varepsilon^p.$$
Moreover, there exist a function $f \in L^p(\mathbb{R}^n, \mathbb{R}^m)$ and some $\varepsilon > 0$ for which there is no fully connected ReLU network of width less than $w = \max\{n+1, m\}$ satisfying the above approximation bound.
Remark: If the activation is replaced by leaky-ReLU, and the input is restricted to a compact domain, then the exact minimum width is $w_{\min} = \max\{n, m, 2\}$.
Quantitative refinement: In the case where $f : [0,1]^n \to \mathbb{R}$ and $\sigma$ is the ReLU activation function, the exact depth and width for a ReLU network to achieve error $\varepsilon$ are also known. If, moreover, the target function $f$ is smooth, then the required number of layers and their width can be exponentially smaller. Even if $f$ is not smooth, the curse of dimensionality can be broken if $f$ admits additional "compositional structure". (A toy numerical sketch of a narrow, deep ReLU network is given after this list.) (en)
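The following is a minimal numpy sketch of a narrow, deep ReLU network of the kind discussed above. It is not the construction behind the minimum-width theorem: for simplicity it uses width 3 rather than the minimal width $\max\{n+1, m\} = 2$ (here $n = m = 1$), it measures error in the sup norm on $[0,1]$ rather than in $L^p$, and the target function, number of knots and offset constant are arbitrary choices made for this illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def target(x):
    return np.sin(2 * np.pi * x)          # toy target on [0, 1] (example choice)

# Knots of a piecewise-linear interpolant g of the target on [0, 1].
K = 20
knots = np.linspace(0.0, 1.0, K + 1)
vals = target(knots)
slopes = np.diff(vals) / np.diff(knots)
# g(x) = vals[0] + sum_j c[j] * relu(x - t[j]), with slope increments c[j].
c = np.concatenate(([slopes[0]], np.diff(slopes)))
t = knots[:-1]

M = 10.0                                   # offset keeping the accumulator nonnegative

def deep_narrow_relu(x):
    """Width-3 deep ReLU network computing g(x) for x in [0, 1].

    Hidden state per layer: (x, M + partial sum, relu(x - t[k])).
    """
    # First hidden layer: carry x, initialise the offset accumulator, first hinge.
    a = relu(np.stack([x,
                       np.full_like(x, M + vals[0]),
                       x - t[0]]))
    # One extra hidden layer per remaining knot: fold the previous hinge
    # into the accumulator, then create the next hinge.
    for k in range(1, len(t)):
        z = np.stack([a[0],
                      a[1] + c[k - 1] * a[2],
                      a[0] - t[k]])
        a = relu(z)
    # Output layer (identity activation): fold the last hinge, remove the offset.
    return a[1] + c[-1] * a[2] - M

xs = np.linspace(0.0, 1.0, 1001)
err = np.max(np.abs(deep_narrow_relu(xs) - target(xs)))
print(f"sup-norm error of the width-3 ReLU network with {len(t)} hidden layers: {err:.4f}")
```

Each hidden layer carries the input through one channel, accumulates the partial piecewise-linear sum in a second channel (offset by a constant so that ReLU acts on it as the identity), and produces the next ReLU hinge in the third; the output layer removes the offset, so the network reproduces the piecewise-linear interpolant exactly, and the error shrinks as the number of knots, and hence the depth, grows.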
|
| dbp:proof
|
- It suffices to prove the case where $m = 1$, since uniform convergence in $\mathbb{R}^m$ is just uniform convergence in each coordinate.
Let $\mathcal{F}_\sigma$ be the set of all one-hidden-layer neural networks constructed with activation function $\sigma$. Let $C_0(\mathbb{R}^n, \mathbb{R})$ be the set of all continuous functions $\mathbb{R}^n \to \mathbb{R}$ with compact support.
If $\sigma$ is a polynomial of degree $d$, then $\mathcal{F}_\sigma$ is contained in the closed subspace of all polynomials of degree $d$, so its closure is also contained in it, which is not all of $C_0(\mathbb{R}^n, \mathbb{R})$.
Otherwise, we show that the closure of $\mathcal{F}_\sigma$ is all of $C_0(\mathbb{R}^n, \mathbb{R})$. Suppose we can construct arbitrarily good approximations of the ramp function
$$\operatorname{ramp}(x) = \begin{cases} -1 & x \le -1, \\ x & |x| \le 1, \\ 1 & x \ge 1; \end{cases}$$
then these can be combined to construct an arbitrary compactly supported continuous function to arbitrary precision. It therefore remains to approximate the ramp function.
Any of the activation functions commonly used in machine learning can obviously be used to approximate the ramp function, either directly or by first approximating the ReLU and then the ramp function.
If $\sigma$ is "squashing", that is, it has finite limits $\sigma(-\infty) < \sigma(+\infty)$, then one can first affinely scale down its x-axis so that its graph looks like a step function with two sharp "overshoots", then take a linear sum of enough of them to form a "staircase" approximation of the ramp function. With more steps of the staircase, the overshoots smooth out and we obtain an arbitrarily good approximation of the ramp function (a numerical sketch of this staircase construction is given below).
The case where $\sigma$ is a generic non-polynomial function is harder, and the reader is directed to the cited reference. (en)
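To make the staircase construction in the squashing case concrete, here is a minimal numpy sketch, assuming the logistic sigmoid as the squashing activation; the number of steps `K`, the x-axis scaling factor `N`, and all names are choices made for this illustration rather than part of the original argument.

```python
import numpy as np

def sigmoid(z):
    # A "squashing" activation: limits 0 at -infinity and 1 at +infinity.
    return 1.0 / (1.0 + np.exp(-z))

def ramp(x):
    # The ramp function from the proof sketch: -1 below -1, x on [-1, 1], 1 above 1.
    return np.clip(x, -1.0, 1.0)

def staircase_ramp(x, K=50, N=500.0):
    """One-hidden-layer 'staircase' approximation of the ramp function.

    Each of the K sigmoid neurons is squeezed along the x-axis by the factor N,
    so it acts like a step of height 2/K placed at the midpoint s_k of one of
    the K subintervals of [-1, 1]; summing the steps gives a staircase that
    converges uniformly to the ramp as K and N grow.
    """
    s = -1.0 + (2.0 * np.arange(1, K + 1) - 1.0) / K      # step locations
    steps = sigmoid(N * (x[:, None] - s[None, :]))        # shape (len(x), K)
    return -1.0 + (2.0 / K) * steps.sum(axis=1)

xs = np.linspace(-2.0, 2.0, 4001)
for K in (10, 50, 250):
    err = np.max(np.abs(staircase_ramp(xs, K=K, N=10.0 * K) - ramp(xs)))
    print(f"K = {K:4d} sigmoid neurons: sup-norm error = {err:.4f}")
```

The printed sup-norm error shrinks roughly like $1/K$, matching the informal claim that taking more (and sharper) steps yields an arbitrarily good approximation of the ramp function.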
|