Why Neural Networks Can Approximate Any Function


Universal Approximation Theorem

Theorem

Let $f: \mathbb{R}^n \to \mathbb{R}$ be a continuous function defined on a compact domain $D \subseteq \mathbb{R}^n$. For any $\epsilon > 0$, there exists a feedforward neural network with a single hidden layer, using a continuous, non-polynomial activation function $\phi: \mathbb{R} \to \mathbb{R}$, such that:

$$
\sup_{x \in D} |f(x) - \hat{f}(x)| < \epsilon,
$$

where $\hat{f}(x)$ is the function implemented by the neural network.


Proof

  1. Form of the Neural Network Approximation
    A single-hidden-layer neural network implements functions of the form:

    $$
    \hat{f}(x) = \sum_{i=1}^N w_i \phi(a_i \cdot x + b_i),
    $$

    where:

    • $w_i$ are the output-layer weights,
    • $a_i$ and $b_i$ are the input weights and biases of the hidden units,
    • $\phi$ is the nonlinear activation function.
  2. Constructing the Subalgebra
    The set of finite linear combinations $\sum_i w_i \phi(a_i \cdot x + b_i)$ is treated as a subalgebra of $C(D)$ (the space of continuous functions on $D$) because it is:

    • Closed under addition: Adding two such functions gives another function of the same form.
    • Closed under scalar multiplication: Scaling by a constant keeps the function in the set.
    • Non-constant: The activation function $\phi$ is nonlinear and non-constant, so the set contains non-constant functions.
    • Separates points: For any $x_1 \neq x_2$, there exist parameters $a$ and $b$ such that $\phi(a \cdot x_1 + b) \neq \phi(a \cdot x_2 + b)$.
    Strictly speaking, a subalgebra must also be closed under multiplication; this holds for particular activation choices (for example $\phi = \exp$, where a product of two units is again a unit of the same form), and for general activations the same density conclusion follows from closely related arguments (Cybenko 1989; Hornik 1991).
  3. Application of the Stone-Weierstrass Theorem
    By the Stone-Weierstrass Theorem:

    • Any subalgebra of $C(D)$ that separates points and contains the constant functions is dense in $C(D)$.
    • Therefore, the set of functions generated by the neural network is dense in $C(D)$.
  4. Approximation of the Target Function
    Since the set of neural network functions is dense in $C(D)$, for any continuous function $f$ and any $\epsilon > 0$, there exists a function $\hat{f}$ (realizable by the neural network; a numerical sketch is given after the proof) such that:

$$
\sup_{x \in D} |f(x) - \hat{f}(x)| < \epsilon.
$$
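To make step 4 concrete, here is a minimal numerical sketch (illustrative only, not part of the proof). It builds $\hat{f}(x) = \sum_{i=1}^N w_i \phi(a_i x + b_i)$ with $\phi = \tanh$, randomly drawn hidden parameters $a_i, b_i$, and output weights $w_i$ fitted by least squares, and reports the sup-norm error against $f(x) = \sin(2\pi x)$ on $[0, 1]$. The target function, the parameter scales, and the random-feature fitting procedure are all illustrative assumptions, not something the theorem prescribes.

```python
# Minimal sketch: approximate f(x) = sin(2*pi*x) on [0, 1] with
# f_hat(x) = sum_i w_i * phi(a_i * x + b_i), using phi = tanh.
# Hidden parameters (a_i, b_i) are drawn at random; only the output
# weights w_i are fitted, by ordinary least squares on a dense grid.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)      # target continuous function (illustrative)

def phi(z):
    return np.tanh(z)                 # nonlinear activation

x = np.linspace(0.0, 1.0, 2001)       # dense grid on the compact domain [0, 1]

for N in (5, 20, 100):                # number of hidden units
    a = rng.normal(scale=10.0, size=N)        # hidden weights (illustrative scale)
    b = rng.uniform(-10.0, 10.0, size=N)      # hidden biases
    H = phi(np.outer(x, a) + b)               # hidden activations, shape (len(x), N)
    w, *_ = np.linalg.lstsq(H, f(x), rcond=None)   # fit output weights
    sup_err = np.max(np.abs(f(x) - H @ w))
    print(f"N = {N:3d}   sup-norm error on grid = {sup_err:.4f}")
```

On a typical run the reported sup-norm error shrinks as $N$ grows, which is the qualitative behaviour the density argument promises; the theorem itself only guarantees that some choice of parameters achieves any prescribed $\epsilon$.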


Conclusion

Thus, a feedforward neural network with a single hidden layer and a suitable non-polynomial activation function can approximate any continuous function on a compact domain to arbitrary accuracy.


Appendix 1: Stone-Weierstrass Theorem

The Stone-Weierstrass Theorem is a fundamental result in functional analysis that ensures the density of certain subsets of continuous functions in $C(D)$, where $D$ is a compact domain. Below is the detailed explanation.

Statement of the Theorem:

Let $C(D)$ be the space of all continuous real-valued functions defined on a compact domain $D \subseteq \mathbb{R}^n$. A subalgebra $A \subseteq C(D)$ is dense in $C(D)$ if:

  1. $A$ contains the constant functions.
  2. $A$ separates points, meaning that for any two distinct points $x_1, x_2 \in D$, there exists a function $g \in A$ such that $g(x_1) \neq g(x_2)$.

Intuition:

  • A subalgebra $A$ is “rich enough” to approximate any continuous function if it can distinguish between points (separates points) and contains the constant functions.
  • Density means that for any $f \in C(D)$ and $\epsilon > 0$, there exists $g \in A$ such that:

$$
\sup_{x \in D} |f(x) - g(x)| < \epsilon.
$$
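A familiar special case may help fix intuition: taking $A$ to be the algebra of polynomials on $[0, 1]$ (it contains the constants and separates points, e.g. via $g(x) = x$) recovers the classical Weierstrass approximation theorem:

$$
\forall f \in C([0,1]),\ \forall \epsilon > 0,\ \exists \text{ a polynomial } p:\ \sup_{x \in [0,1]} |f(x) - p(x)| < \epsilon.
$$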

Stone-Weierstrass Theorem Applied to Neural Networks:

  1. The Set of Neural Network Functions:
    Neural networks with a single hidden layer and activation function $\phi$ generate functions of the form:

    $$
    \hat{f}(x) = \sum_{i=1}^N w_i \phi(a_i \cdot x + b_i).
    $$

  2. Subalgebra Properties:

    • The set is closed under addition and scalar multiplication (and, for suitable activation choices, under multiplication, as a subalgebra requires).
    • The activation function $\phi$ is nonlinear and non-constant.
    • Neural network functions can separate points thanks to the freedom in choosing the parameters $a_i$ and $b_i$ (an explicit choice is sketched below).
  3. By Stone-Weierstrass:
    This set of functions is dense in $C(D)$, meaning that any continuous function $f$ can be approximated arbitrarily well by a neural network.
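As a concrete check of the point-separation claim (assuming a strictly monotone activation such as the sigmoid or $\tanh$; strict monotonicity is an extra assumption used only for this check), one explicit choice of parameters works: given $x_1 \neq x_2$, take $a = x_1 - x_2$ and $b = -\tfrac{1}{2}\, a \cdot (x_1 + x_2)$. Then

$$
a \cdot x_1 + b = \tfrac{1}{2}\|x_1 - x_2\|^2 > 0 > -\tfrac{1}{2}\|x_1 - x_2\|^2 = a \cdot x_2 + b,
$$

so $\phi(a \cdot x_1 + b) \neq \phi(a \cdot x_2 + b)$, since a strictly monotone $\phi$ takes different values at different arguments.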


Appendix 2: Non-Constancy vs. Separation of Points

Non-Constancy

Non-constancy refers to the ability to produce varying outputs.

  • At least one function $g$ in the set changes its output as $x$ changes.
  • This ensures that the set has variability and is not restricted to producing constant outputs.

Example:

  • $g(x) = c$ (a constant function) gives the same output for all $x$.
  • A non-constant function like $g(x) = x$ produces outputs that vary with $x$.

Intuition:
Without non-constancy, the set of functions cannot approximate any function that varies, such as curves or increasing trends.


Separation of Points

Separation of points refers to the ability to distinguish differences between distinct inputs $x_1$ and $x_2$:

  • For any two points $x_1 \neq x_2$, there exists at least one function $g$ in the set such that $g(x_1) \neq g(x_2)$.
  • This ensures the set has enough resolution to "see" the difference between inputs.

Example:

  • If every function in the set satisfies $g(x_1) = g(x_2)$, the set cannot separate the points $x_1$ and $x_2$.
  • A function like $g(x) = x$ does separate points, because $g(x_1) \neq g(x_2)$ whenever $x_1 \neq x_2$.

Intuition:
Without the ability to separate points, the set cannot resolve the details of the domain and distinguish inputs, making it impossible to approximate general continuous functions.
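A minimal runnable check of this idea (an illustrative NumPy sketch; the specific points and the sigmoid are arbitrary choices): a constant function returns identical outputs at two distinct inputs, while a single sigmoid unit $\sigma(a \cdot x + b)$ with $a = x_1 - x_2$ and $b = -\tfrac{1}{2}\, a \cdot (x_1 + x_2)$ does not.

```python
# Illustrative check: a constant function cannot separate two distinct
# inputs, while one sigmoid unit sigma(a . x + b) can, using the
# explicit (non-unique) choice a = x1 - x2, b = -a . (x1 + x2) / 2.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1 = np.array([0.3, -1.2])    # two distinct points in R^2 (arbitrary example)
x2 = np.array([0.3,  0.7])

const = lambda x: 5.0         # constant function: cannot tell x1 and x2 apart
a = x1 - x2                   # parameters chosen to separate x1 and x2
b = -0.5 * a @ (x1 + x2)

print("constant g:", const(x1), const(x2))                      # identical outputs
print("sigmoid g :", sigmoid(a @ x1 + b), sigmoid(a @ x2 + b))  # different outputs
```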


Key Difference

  • Non-constancy: Ensures there is some variability in the outputs of the functions.
  • Separation of points: Ensures the set of functions can distinguish between different inputs.

Analogy

Imagine you're trying to draw a detailed map:

  • Non-constancy is like having a tool that can produce different elevations (not just a flat surface).
  • Separation of points is like being able to "label" or differentiate between different locations uniquely on the map.

Both tools are needed to create a map that accurately approximates the true terrain.


Conclusion

  • Non-constancy refers to the ability to produce varying outputs.
  • Separation of points refers to the ability to distinguish between different inputs.

Together, these conditions ensure that the set of functions is "rich enough" to approximate any continuous function, as required by the Stone-Weierstrass Theorem.
