
Cross-Entropy Loss: Detailed Explanation and Derivation (D2P)


1. What is Cross-Entropy Loss?

Cross-entropy loss is a commonly used loss function in classification tasks. It measures the dissimilarity between the true labels $y$ and the predicted probabilities $\hat{y}$ output by the model. The goal is to minimize the cross-entropy loss to ensure the predicted probabilities match the true labels as closely as possible.

It is widely used for:

  • Binary classification, e.g., spam detection.
  • Multi-class classification, e.g., image recognition.

2. Mathematical Definition

Binary Classification

For binary classification, where $y_i \in \{0, 1\}$ and $\hat{y}_i$ is the predicted probability of the positive class for sample $i$:

$$
L_{\text{BinaryCE}} = -\frac{1}{N} \sum_{i=1}^N \Big( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \Big)
$$
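As a quick sanity check of this formula, here is a minimal NumPy sketch (the labels and probabilities below are made up for illustration):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy over N samples; y_hat are probabilities in (0, 1)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)          # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])                     # true labels
y_hat = np.array([0.8, 0.2, 0.6])                 # predicted probabilities
print(binary_cross_entropy(y, y_hat))             # ~0.319
```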

Multi-Class Classification

For multi-class classification, where $y_i$ is one-hot encoded and $\hat{y}_{ik}$ is the predicted probability of class $k$ for sample $i$:

$$
L_{\text{CategoricalCE}} = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K y_{ik} \log(\hat{y}_{ik})
$$
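The categorical version can be sketched the same way, assuming one-hot labels and rows of $\hat{y}$ that already sum to 1 (values below are made up):

```python
import numpy as np

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    """Mean categorical cross-entropy; y_onehot and y_hat have shape (N, K)."""
    y_hat = np.clip(y_hat, eps, 1.0)              # guard against log(0)
    return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

y = np.array([[1, 0, 0],
              [0, 1, 0]])                         # one-hot labels, N=2, K=3
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])               # predicted class probabilities
print(categorical_cross_entropy(y, y_hat))        # -(log 0.7 + log 0.8) / 2 ≈ 0.290
```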


3. Derivation of Cross-Entropy Loss

Step 1: Likelihood Function

For a classification model, the output is a predicted probability distribution $\hat{y}$. The likelihood of the observed labels under these predictions is:

Binary Classification

$$
\text{Likelihood} = \prod_{i=1}^N \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}
$$

Multi-Class Classification

$$
\text{Likelihood} = \prod_{i=1}^N \prod_{k=1}^K \hat{y}_{ik}^{y_{ik}}
$$


Step 2: Log-Likelihood

To simplify the calculations, take the logarithm of the likelihood:

Binary Classification

$$
\log(\text{Likelihood}) = \sum_{i=1}^N \Big( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \Big)
$$

Multi-Class Classification

$$
\log(\text{Likelihood}) = \sum_{i=1}^N \sum_{k=1}^K y_{ik} \log(\hat{y}_{ik})
$$


Step 3: Negative Log-Likelihood

The negative log-likelihood (NLL) is what training minimizes; negating the log-likelihood above and averaging over the $N$ samples gives the cross-entropy loss:

Binary Cross-Entropy Loss

$$
L_{\text{BinaryCE}} = -\frac{1}{N} \sum_{i=1}^N \Big( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \Big)
$$

Categorical Cross-Entropy Loss

$$
L_{\text{CategoricalCE}} = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K y_{ik} \log(\hat{y}_{ik})
$$


4. Why Cross-Entropy Loss Works

Negative Log-Likelihood as a Metric

  • The negative log-likelihood $-\log(\hat{y}_k)$ penalizes incorrect predictions by assigning a large loss when the predicted probability $\hat{y}_k$ for the true class is low.
  • When $\hat{y}_k$ is close to 1, the penalty is minimal.

Intuition

  • If the true label is $y_i = 1$ and the predicted probability is $\hat{y}_i = 0.9$, the loss is:
    $$
    -\log(0.9) \approx 0.105
    $$
  • If the predicted probability is $\hat{y}_i = 0.1$, the loss is much higher:
    $$
    -\log(0.1) \approx 2.302
    $$
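The same effect can be printed directly; this small snippet just evaluates $-\log(p)$ for a few probabilities of the true class:

```python
import math

# The penalty -log(p) explodes as the predicted probability of the true class drops.
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"p = {p:4.2f}  ->  -log(p) = {-math.log(p):.3f}")
```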

5. The Role of Softmax in Categorical Cross-Entropy

In multi-class classification, the probability $\hat{y}_k$ inside the $\log$ term of the loss comes from the softmax function, which converts logits into probabilities:

$$
\hat{y}_k = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)}
$$

Here:

  • $z_k$ is the logit (raw output) for class $k$.
  • $\hat{y}_k$ is the normalized probability for class $k$.
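A minimal NumPy sketch of this mapping (the logits are made up; subtracting the maximum logit is a standard trick to keep the exponentials numerically stable):

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits into a probability distribution."""
    z = z - np.max(z)                  # numerical stability; does not change the result
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = np.array([2.0, 0.5, -1.0])         # example logits for K = 3 classes
probs = softmax(z)
print(probs, probs.sum())              # all entries >= 0, and they sum to 1
```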

Why Use Softmax?

  1. Probability Distribution:
    Softmax ensures the logits are converted into a valid probability distribution:

    • $\hat{y}_k \geq 0$ for all $k$.
    • $\sum_{k=1}^K \hat{y}_k = 1$.
  2. Log of Softmax in Cross-Entropy:
    The categorical cross-entropy loss combines the softmax and the negative log-likelihood into a single operation. Because the labels are one-hot, only the true-class term survives, so for sample $i$ with true class $c_i$ (a runnable check follows this list):
    $$
    L_{\text{CategoricalCE}} = -\frac{1}{N} \sum_{i=1}^N \log\left(\frac{\exp(z_{i,c_i})}{\sum_{j=1}^K \exp(z_{i,j})}\right)
    $$

  3. Gradient-Friendly:
    The softmax function is differentiable, making it suitable for backpropagation.
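As a concrete illustration, the PyTorch sketch below (logits and labels are made up) checks that `torch.nn.functional.cross_entropy`, which applies log-softmax internally, matches the manual softmax-then-negative-log computation:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])          # raw scores, N=2 samples, K=3 classes
labels = torch.tensor([0, 1])                      # true class indices

# Built-in: cross_entropy takes raw logits and class indices.
loss_builtin = F.cross_entropy(logits, labels)

# Manual: softmax -> probability of the true class -> mean negative log.
probs = torch.softmax(logits, dim=1)
loss_manual = -torch.log(probs[torch.arange(len(labels)), labels]).mean()

print(loss_builtin.item(), loss_manual.item())     # the two values agree
```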


6. Toy Example with Binary Classification

Suppose we have:

  • Two samples.
  • True labels $y = [1, 0]$.
  • Predicted probabilities $\hat{y} = [0.9, 0.3]$.

The binary cross-entropy loss is:

$$
L_{\text{BinaryCE}} = -\frac{1}{2} \Big( y_1 \log(\hat{y}_1) + (1 - y_1) \log(1 - \hat{y}_1) + y_2 \log(\hat{y}_2) + (1 - y_2) \log(1 - \hat{y}_2) \Big)
$$

Substitute values:
$$
L_{\text{BinaryCE}} = -\frac{1}{2} \Big( \log(0.9) + \log(0.7) \Big)
$$

Numerical values:
$$
L_{\text{BinaryCE}} = -\frac{1}{2} \Big( -0.105 - 0.357 \Big) \approx 0.231
$$
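The same number falls out of PyTorch's built-in binary cross-entropy, which expects probabilities and float targets:

```python
import torch
import torch.nn.functional as F

y_true = torch.tensor([1.0, 0.0])
y_pred = torch.tensor([0.9, 0.3])

# binary_cross_entropy averages over the samples by default.
print(F.binary_cross_entropy(y_pred, y_true).item())   # ≈ 0.2310
```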


7. Summary

  1. Binary Cross-Entropy Loss:
    Used for binary classification tasks.
    $$
    L_{\text{BinaryCE}} = -\frac{1}{N} \sum \Big( y \log(\hat{y}) + (1-y) \log(1-\hat{y}) \Big)
    $$

  2. Categorical Cross-Entropy Loss:
    Used for multi-class classification tasks.
    $$
    L_{\text{CategoricalCE}} = -\frac{1}{N} \sum \sum y_k \log(\hat{y}_k)
    $$

  3. Softmax:
    In multi-class classification, $\hat{y}_k$ is produced by the softmax, which ensures the probabilities over all classes are normalized.

  4. Why It Works:
    Cross-entropy measures the "distance" between the true label distribution $y$ and the predicted probability distribution $\hat{y}$. By minimizing cross-entropy loss, the model learns to predict probabilities close to the true labels.


Appendix: Binary Cross-Entropy for Multi-Label Classification (BCEWithLogitsLoss)


1. What is Multi-Label Classification?

In multi-label classification:

  • Each sample can have multiple labels (classes).
  • Each label is treated as an independent binary classification problem.
  • For example:
    • An image might be tagged as both "cat" and "dog" ([1, 0, 1] for [cat, bird, dog]).

2. BCE Formula for Multi-Label Classification

The loss for a single sample with $K$ classes is:

$$
L = -\frac{1}{K} \sum_{k=1}^K \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
$$

Where:

  • $y_k$: Ground truth label (1 for positive, 0 for negative) for class $k$.
  • $\hat{y}_k$: Predicted probability for class $k$.
  • $K$: Total number of classes.

3. Derivation of Binary Cross-Entropy Loss

Step 1: Likelihood for a Single Class

For a single binary label $y_k$, the predicted probability $\hat{y}_k$ represents:

  • $P(y_k = 1) = \hat{y}_k$
  • $P(y_k = 0) = 1 - \hat{y}_k$

The likelihood of the prediction being correct is:

$$
\text{Likelihood} = \hat{y}_k^{y_k} (1 - \hat{y}_k)^{1 - y_k}
$$

Step 2: Log-Likelihood

Take the logarithm to simplify:

$$
\log(\text{Likelihood}) = y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k)
$$

Step 3: Negative Log-Likelihood

The negative log-likelihood (NLL) gives the binary cross-entropy loss for a single label:

$$
L_k = - \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
$$

Step 4: Extend to Multi-Label Classification

For multi-label classification, the total loss is obtained by averaging the per-label losses over all $K$ classes:

$$
L = -\frac{1}{K} \sum_{k=1}^K \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
$$


4. Why Use BCE for Multi-Label Classification?

  1. Independent Binary Tasks:
    Each label is treated as an independent binary classification problem:

    • $\hat{y}_k$: Probability of label $k$ being 1 (positive).
    • $1 - \hat{y}_k$: Probability of label $k$ being 0 (negative).
  2. Sigmoid Activation:
    Logits (raw model outputs) are passed through the sigmoid function, which maps each of them to a probability between 0 and 1:

    $$
    \hat{y}_k = \frac{1}{1 + e^{-z_k}}
    $$

    where $z_k$ is the logit for class $k$ (a short PyTorch sketch follows this list).
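This is exactly what PyTorch's `BCEWithLogitsLoss` (and its functional form) does: it applies the sigmoid and the binary cross-entropy in one numerically stable step. A minimal check with made-up logits and labels:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.2, -0.7, 0.3])     # raw model outputs for K = 3 labels
labels = torch.tensor([1.0, 0.0, 1.0])      # multi-label ground truth

# Applying sigmoid manually, then plain BCE ...
loss_two_step = F.binary_cross_entropy(torch.sigmoid(logits), labels)

# ... matches the fused, numerically stable version.
loss_fused = F.binary_cross_entropy_with_logits(logits, labels)

print(loss_two_step.item(), loss_fused.item())   # the two values agree
```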


5. Toy Example

Dataset:

  • Classes: [cat, bird, dog]
  • True labels (ground truth): [1, 0, 1] (the sample is tagged as "cat" and "dog").
  • Predicted logits (raw model outputs): [2.0, -1.0, 0.5].

Step 1: Compute Sigmoid Probabilities

Convert logits to probabilities using the sigmoid function:

$$
\hat{y}_k = \frac{1}{1 + e^{-z_k}}
$$

For each class:

  • $\hat{y}_1 = \frac{1}{1 + e^{-2.0}} \approx 0.88$ (cat)
  • $\hat{y}_2 = \frac{1}{1 + e^{1.0}} \approx 0.27$ (bird)
  • $\hat{y}_3 = \frac{1}{1 + e^{-0.5}} \approx 0.62$ (dog)

Step 2: Compute BCE for Each Class

Using the formula:

$$
L_k = - \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
$$

For each class:

  1. Cat ($y_1 = 1$, $\hat{y}_1 = 0.88$):
    $$
    L_1 = - \log(0.88) \approx 0.13
    $$

  2. Bird ($y_2 = 0$, $\hat{y}_2 = 0.27$):
    $$
    L_2 = - \log(0.73) \approx 0.31
    $$

  3. Dog ($y_3 = 1$, $\hat{y}_3 = 0.62$):
    $$
    L_3 = - \log(0.62) \approx 0.48
    $$

Step 3: Compute Total Loss

The total loss is the average BCE over all classes:

$$
L = \frac{1}{3} \Big( L_1 + L_2 + L_3 \Big) \approx \frac{1}{3} (0.13 + 0.31 + 0.48) \approx 0.31
$$
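The same computation with `binary_cross_entropy_with_logits`, using the toy logits and labels above, confirms the result up to rounding:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5])     # [cat, bird, dog] raw scores
labels = torch.tensor([1.0, 0.0, 1.0])      # ground truth tags

# Sigmoid + per-label BCE + mean over labels, all in one call.
print(F.binary_cross_entropy_with_logits(logits, labels).item())
# ≈ 0.30 (0.31 when using the rounded intermediate values above)
```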


6. Summary

  1. Binary Cross-Entropy Formula:
    For multi-label classification:

    $$
    L = -\frac{1}{K} \sum_{k=1}^K \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
    $$

  2. Why It Works:

    • Each label is treated as an independent binary classification problem.
    • Sigmoid activation normalizes logits into probabilities for each class.
  3. Application:
    Used in tasks like:

    • Multi-label image tagging (e.g., "cat", "dog").
    • Text classification with multiple categories.
  4. Example Loss Calculation:
    For logits $[2.0, -1.0, 0.5]$ and true labels $[1, 0, 1]$, the total BCE loss is approximately $0.31$.