
Cross-Entropy Loss: Detailed Explanation and Derivation (D2P)


1. What is Cross-Entropy Loss?

Cross-entropy loss is a commonly used loss function in classification tasks. It measures the dissimilarity between the true labels $y$ and the predicted probabilities $\hat{y}$ output by the model. The goal is to minimize the cross-entropy loss to ensure the predicted probabilities match the true labels as closely as possible.

It is widely used for:

  • Binary classification, e.g., spam detection.
  • Multi-class classification, e.g., image recognition.

2. Mathematical Definition

Binary Classification

For binary classification, where $y_i \in \{0, 1\}$ and $\hat{y}_i$ is the predicted probability of the positive class for sample $i$:

$$
L_{\text{BinaryCE}} = -\frac{1}{N} \sum_{i=1}^N \Big( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \Big)
$$
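As a quick sanity check of this formula, here is a minimal NumPy sketch (the labels and probabilities below are made up for illustration):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy over N samples; y_hat are probabilities in (0, 1)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)          # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])                     # true labels
y_hat = np.array([0.8, 0.2, 0.6])                 # predicted probabilities
print(binary_cross_entropy(y, y_hat))             # ~0.319
```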

Multi-Class Classification

For multi-class classification, where $y_i$ is one-hot encoded and $\hat{y}_{ik}$ is the predicted probability of class $k$ for sample $i$:

$$
L_{\text{CategoricalCE}} = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K y_{ik} \log(\hat{y}_{ik})
$$
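The categorical version can be sketched the same way, assuming one-hot labels and rows of $\hat{y}$ that already sum to 1 (values below are made up):

```python
import numpy as np

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    """Mean categorical cross-entropy; y_onehot and y_hat have shape (N, K)."""
    y_hat = np.clip(y_hat, eps, 1.0)              # guard against log(0)
    return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

y = np.array([[1, 0, 0],
              [0, 1, 0]])                         # one-hot labels, N=2, K=3
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])               # predicted class probabilities
print(categorical_cross_entropy(y, y_hat))        # -(log 0.7 + log 0.8) / 2 ≈ 0.290
```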


3. Derivation of Cross-Entropy Loss

Step 1: Likelihood Function

For a classification model, the output is a predicted probability distribution $\hat{y}$. The likelihood of the observed labels under these predictions is:

Binary Classification

$$
\text{Likelihood} = \prod_{i=1}^N \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}
$$

Multi-Class Classification

$$
\text{Likelihood} = \prod_{i=1}^N \prod_{k=1}^K \hat{y}_{ik}^{y_{ik}}
$$


Step 2: Log-Likelihood

To simplify the calculations, take the logarithm of the likelihood:

Binary Classification

$$
\log(\text{Likelihood}) = \sum_{i=1}^N \Big( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \Big)
$$

Multi-Class Classification

$$
\log(\text{Likelihood}) = \sum_{i=1}^N \sum_{k=1}^K y_{ik} \log(\hat{y}_{ik})
$$


Step 3: Negative Log-Likelihood

The negative log-likelihood (NLL) is what training minimizes; negating the log-likelihood above and averaging over the $N$ samples gives the cross-entropy loss:

Binary Cross-Entropy Loss

$$
L_{\text{BinaryCE}} = -\frac{1}{N} \sum_{i=1}^N \Big( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \Big)
$$

Categorical Cross-Entropy Loss

$$
L_{\text{CategoricalCE}} = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K y_{ik} \log(\hat{y}_{ik})
$$


4. Why Cross-Entropy Loss Works

Negative Log-Likelihood as a Metric

  • The negative log-likelihood $-\log(\hat{y}_k)$ penalizes incorrect predictions by assigning a large loss when the predicted probability $\hat{y}_k$ for the true class is low.
  • When $\hat{y}_k$ is close to 1, the penalty is minimal.

Intuition

  • If the true label is $y_i = 1$ and the predicted probability is $\hat{y}_i = 0.9$, the loss is:
    $$
    -\log(0.9) \approx 0.105
    $$
  • If the predicted probability is $\hat{y}_i = 0.1$, the loss is much higher:
    $$
    -\log(0.1) \approx 2.302
    $$
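The same effect can be printed directly; this small snippet just evaluates $-\log(p)$ for a few probabilities of the true class:

```python
import math

# The penalty -log(p) explodes as the predicted probability of the true class drops.
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"p = {p:4.2f}  ->  -log(p) = {-math.log(p):.3f}")
```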

5. The Role of Softmax in Categorical Cross-Entropy

In multi-class classification, the probability $\hat{y}_k$ inside the $\log$ term of the loss comes from the softmax function, which converts logits into probabilities:

$$
\hat{y}_k = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)}
$$

Here:

  • $z_k$ is the logit (raw output) for class $k$.
  • $\hat{y}_k$ is the normalized probability for class $k$.
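A minimal NumPy sketch of this mapping (the logits are made up; subtracting the maximum logit is a standard trick to keep the exponentials numerically stable):

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits into a probability distribution."""
    z = z - np.max(z)                  # numerical stability; does not change the result
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = np.array([2.0, 0.5, -1.0])         # example logits for K = 3 classes
probs = softmax(z)
print(probs, probs.sum())              # all entries >= 0, and they sum to 1
```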

Why Use Softmax?

  1. Probability Distribution:
    Softmax ensures the logits are converted into a valid probability distribution:

    • $\hat{y}_k \geq 0$ for all $k$.
    • $\sum_{k=1}^K \hat{y}_k = 1$.
  2. Log of Softmax in Cross-Entropy:
    The categorical cross-entropy loss combines the softmax and the negative log-likelihood into a single operation. Because the labels are one-hot, only the true-class term survives, so for sample $i$ with true class $c_i$ (a runnable check follows this list):
    $$
    L_{\text{CategoricalCE}} = -\frac{1}{N} \sum_{i=1}^N \log\left(\frac{\exp(z_{i,c_i})}{\sum_{j=1}^K \exp(z_{i,j})}\right)
    $$

  3. Gradient-Friendly:
    The softmax function is differentiable, making it suitable for backpropagation.
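As a concrete illustration, the PyTorch sketch below (logits and labels are made up) checks that `torch.nn.functional.cross_entropy`, which applies log-softmax internally, matches the manual softmax-then-negative-log computation:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])          # raw scores, N=2 samples, K=3 classes
labels = torch.tensor([0, 1])                      # true class indices

# Built-in: cross_entropy takes raw logits and class indices.
loss_builtin = F.cross_entropy(logits, labels)

# Manual: softmax -> probability of the true class -> mean negative log.
probs = torch.softmax(logits, dim=1)
loss_manual = -torch.log(probs[torch.arange(len(labels)), labels]).mean()

print(loss_builtin.item(), loss_manual.item())     # the two values agree
```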


6. Toy Example with Binary Classification

Suppose we have:

  • Two samples.
  • True labels $y = [1, 0]$.
  • Predicted probabilities $\hat{y} = [0.9, 0.3]$.

The binary cross-entropy loss is:

$$
L_{\text{BinaryCE}} = -\frac{1}{2} \Big( y_1 \log(\hat{y}_1) + (1 - y_1) \log(1 - \hat{y}_1) + y_2 \log(\hat{y}_2) + (1 - y_2) \log(1 - \hat{y}_2) \Big)
$$

Substitute values:
$$
L_{\text{BinaryCE}} = -\frac{1}{2} \Big( \log(0.9) + \log(0.7) \Big)
$$

Numerical values:
$$
L_{\text{BinaryCE}} = -\frac{1}{2} \Big( -0.105 - 0.357 \Big) \approx 0.231
$$
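The same number falls out of PyTorch's built-in binary cross-entropy, which expects probabilities and float targets:

```python
import torch
import torch.nn.functional as F

y_true = torch.tensor([1.0, 0.0])
y_pred = torch.tensor([0.9, 0.3])

# binary_cross_entropy averages over the samples by default.
print(F.binary_cross_entropy(y_pred, y_true).item())   # ≈ 0.2310
```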


7. Summary

  1. Binary Cross-Entropy Loss:
    Used for binary classification tasks.
    $$
    L_{\text{BinaryCE}} = -\frac{1}{N} \sum \Big( y \log(\hat{y}) + (1-y) \log(1-\hat{y}) \Big)
    $$

  2. Categorical Cross-Entropy Loss:
    Used for multi-class classification tasks.
    $$
    L_{\text{CategoricalCE}} = -\frac{1}{N} \sum \sum y_k \log(\hat{y}_k)
    $$

  3. Softmax:
    In multi-class classification, $\hat{y}_k$ is produced by the softmax, which ensures the probabilities over all classes are normalized.

  4. Why It Works:
    Cross-entropy measures the "distance" between the true label distribution $y$ and the predicted probability distribution $\hat{y}$. By minimizing cross-entropy loss, the model learns to predict probabilities close to the true labels.


Appendix: Binary Cross-Entropy for Multi-Label Classification (BCEWithLogitsLoss)


1. What is Multi-Label Classification?

In multi-label classification:

  • Each sample can have multiple labels (classes).
  • Each label is treated as an independent binary classification problem.
  • For example:
    • An image might be tagged as both "cat" and "dog" ([1, 0, 1] for [cat, bird, dog]).

2. BCE Formula for Multi-Label Classification

The loss for a single sample with $K$ classes is:

$$
L = -\frac{1}{K} \sum_{k=1}^K \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
$$

Where:

  • $y_k$: Ground truth label (1 for positive, 0 for negative) for class $k$.
  • $\hat{y}_k$: Predicted probability for class $k$.
  • $K$: Total number of classes.

3. Derivation of Binary Cross-Entropy Loss

Step 1: Likelihood for a Single Class

For a single binary label $y_k$, the predicted probability $\hat{y}_k$ represents:

  • $P(y_k = 1) = \hat{y}_k$
  • $P(y_k = 0) = 1 - \hat{y}_k$

The likelihood of the prediction being correct is:

$$
\text{Likelihood} = \hat{y}_k^{y_k} (1 - \hat{y}_k)^{1 - y_k}
$$

Step 2: Log-Likelihood

Take the logarithm to simplify:

$$
\log(\text{Likelihood}) = y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k)
$$

Step 3: Negative Log-Likelihood

The negative log-likelihood (NLL) gives the binary cross-entropy loss for a single label:

$$
L_k = - \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
$$

Step 4: Extend to Multi-Label Classification

For multi-label classification, the total loss is obtained by averaging the per-label losses over all $K$ classes:

$$
L = -\frac{1}{K} \sum_{k=1}^K \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
$$


4. Why Use BCE for Multi-Label Classification?

  1. Independent Binary Tasks:
    Each label is treated as an independent binary classification problem:

    • $\hat{y}_k$: Probability of label $k$ being 1 (positive).
    • $1 - \hat{y}_k$: Probability of label $k$ being 0 (negative).
  2. Sigmoid Activation:
    Logits (raw model outputs) are passed through the sigmoid function, which maps each of them to a probability between 0 and 1:

    $$
    \hat{y}_k = \frac{1}{1 + e^{-z_k}}
    $$

    where $z_k$ is the logit for class $k$ (a short PyTorch sketch follows this list).
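This is exactly what PyTorch's `BCEWithLogitsLoss` (and its functional form) does: it applies the sigmoid and the binary cross-entropy in one numerically stable step. A minimal check with made-up logits and labels:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.2, -0.7, 0.3])     # raw model outputs for K = 3 labels
labels = torch.tensor([1.0, 0.0, 1.0])      # multi-label ground truth

# Applying sigmoid manually, then plain BCE ...
loss_two_step = F.binary_cross_entropy(torch.sigmoid(logits), labels)

# ... matches the fused, numerically stable version.
loss_fused = F.binary_cross_entropy_with_logits(logits, labels)

print(loss_two_step.item(), loss_fused.item())   # the two values agree
```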


5. Toy Example

Dataset:

  • Classes: [cat, bird, dog]
  • True labels (ground truth): [1, 0, 1] (the sample is tagged as "cat" and "dog").
  • Predicted logits (raw model outputs): [2.0, -1.0, 0.5].

Step 1: Compute Sigmoid Probabilities

Convert logits to probabilities using the sigmoid function:

$$
\hat{y}_k = \frac{1}{1 + e^{-z_k}}
$$

For each class:

  • $\hat{y}_1 = \frac{1}{1 + e^{-2.0}} \approx 0.88$ (cat)
  • $\hat{y}_2 = \frac{1}{1 + e^{1.0}} \approx 0.27$ (bird)
  • $\hat{y}_3 = \frac{1}{1 + e^{-0.5}} \approx 0.62$ (dog)

Step 2: Compute BCE for Each Class

Using the formula:

$$
L_k = - \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
$$

For each class:

  1. Cat ($y_1 = 1$, $\hat{y}_1 = 0.88$):
    $$
    L_1 = - \log(0.88) \approx 0.13
    $$

  2. Bird ($y_2 = 0$, $\hat{y}_2 = 0.27$):
    $$
    L_2 = - \log(0.73) \approx 0.31
    $$

  3. Dog ($y_3 = 1$, $\hat{y}_3 = 0.62$):
    $$
    L_3 = - \log(0.62) \approx 0.48
    $$

Step 3: Compute Total Loss

The total loss is the average BCE over all classes:

$$
L = \frac{1}{3} \Big( L_1 + L_2 + L_3 \Big) \approx \frac{1}{3} (0.13 + 0.31 + 0.48) \approx 0.31
$$
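The same computation with `binary_cross_entropy_with_logits`, using the toy logits and labels above, confirms the result up to rounding:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5])     # [cat, bird, dog] raw scores
labels = torch.tensor([1.0, 0.0, 1.0])      # ground truth tags

# Sigmoid + per-label BCE + mean over labels, all in one call.
print(F.binary_cross_entropy_with_logits(logits, labels).item())
# ≈ 0.30 (0.31 when using the rounded intermediate values above)
```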


6. Summary

  1. Binary Cross-Entropy Formula:
    For multi-label classification:

    $$
    L = -\frac{1}{K} \sum_{k=1}^K \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
    $$

  2. Why It Works:

    • Each label is treated as an independent binary classification problem.
    • Sigmoid activation normalizes logits into probabilities for each class.
  3. Application:
    Used in tasks like:

    • Multi-label image tagging (e.g., "cat", "dog").
    • Text classification with multiple categories.
  4. Example Loss Calculation:
    For logits $[2.0, -1.0, 0.5]$ and true labels $[1, 0, 1]$, the total BCE loss is approximately $0.31$.