Cross-Entropy Loss: Detailed Explanation and Derivation (D2P)
1. What is Cross-Entropy Loss?
Cross-entropy loss is a commonly used loss function in classification tasks. It measures the dissimilarity between the true labels $y$ and the predicted probabilities $\hat{y}$ output by the model. The goal is to minimize the cross-entropy loss to ensure the predicted probabilities match the true labels as closely as possible.
It is widely used for:
- Binary classification (e.g., spam detection).
- Multi-class classification (e.g., image recognition).
2. Mathematical Definition
Binary Classification
For binary classification, where $y_i \in \{0, 1\}$ and $\hat{y}_i$ is the predicted probability of the positive class for sample $i$:
$$
L_{\text{BinaryCE}} = -\frac{1}{N} \sum_{i=1}^N \Big( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \Big)
$$
Multi-Class Classification
For multi-class classification, where $y_i$ is one-hot encoded, and $\hat{y}_{ik}$ is the predicted probability for class $k$:
$$
L_{\text{CategoricalCE}} = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K y_{ik} \log(\hat{y}_{ik})
$$
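As a concrete reference, here is a minimal NumPy sketch of both formulas (the function names and the small clipping epsilon are my own additions to avoid $\log(0)$, not part of the definitions):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over N samples (y_true, y_pred: shape (N,))."""
    y_pred = np.clip(y_pred, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true_onehot, y_pred, eps=1e-12):
    """Mean categorical cross-entropy (y_true_onehot, y_pred: shape (N, K))."""
    y_pred = np.clip(y_pred, eps, 1.0)              # avoid log(0)
    return -np.mean(np.sum(y_true_onehot * np.log(y_pred), axis=1))
```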
3. Derivation of Cross-Entropy Loss
Step 1: Likelihood Function
For a classification model, the output is the predicted probability distribution $\hat{y}$. The likelihood of predicting the correct labels is:
Binary Classification
$$
\text{Likelihood} = \prod_{i=1}^N \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}
$$
Multi-Class Classification
$$
\text{Likelihood} = \prod_{i=1}^N \prod_{k=1}^K \hat{y}_{ik}^{y_{ik}}
$$
Step 2: Log-Likelihood
To simplify the calculations, take the logarithm of the likelihood:
Binary Classification
$$
\log(\text{Likelihood}) = \sum_{i=1}^N \Big( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \Big)
$$
Multi-Class Classification
$$
\log(\text{Likelihood}) = \sum_{i=1}^N \sum_{k=1}^K y_{ik} \log(\hat{y}_{ik})
$$
Step 3: Negative Log-Likelihood
The negative log-likelihood (NLL) is minimized during training. This gives the cross-entropy loss:
Binary Cross-Entropy Loss
$$
L_{\text{BinaryCE}} = -\frac{1}{N} \sum_{i=1}^N \Big( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \Big)
$$
Categorical Cross-Entropy Loss
$$
L_{\text{CategoricalCE}} = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K y_{ik} \log(\hat{y}_{ik})
$$
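The relationship derived in Steps 1-3 is easy to verify numerically for the binary case; a quick sketch (the example values are arbitrary):

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0])       # arbitrary true labels
y_hat = np.array([0.8, 0.2, 0.6])   # arbitrary predicted probabilities

likelihood = np.prod(y_hat**y * (1 - y_hat)**(1 - y))
loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# The binary cross-entropy loss is exactly the negative log-likelihood scaled by 1/N
print(np.isclose(loss, -np.log(likelihood) / len(y)))  # True
```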
4. Why Cross-Entropy Loss Works
Negative Log-Likelihood as a Metric
- The negative log-likelihood $-\log(\hat{y}_k)$ penalizes incorrect predictions by assigning a large loss when the predicted probability $\hat{y}_k$ for the true class is low.
- When $\hat{y}_k$ is close to 1, the penalty is minimal.
Intuition
- If the true label is $y_i = 1$ and the predicted probability is $\hat{y}_i = 0.9$, the loss is:
$$
-\log(0.9) \approx 0.105
$$
- If the predicted probability is $\hat{y}_i = 0.1$, the loss is much higher:
$$
-\log(0.1) \approx 2.302
$$
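These two penalty values are easy to reproduce (natural logarithm, as used throughout):

```python
import numpy as np

print(-np.log(0.9))  # ≈ 0.105, confident and correct: small penalty
print(-np.log(0.1))  # ≈ 2.302, confident and wrong: large penalty
```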
5. The Role of Softmax in Categorical Cross-Entropy
In multi-class classification, the term $\log(\hat{y}_k)$ in the loss function is derived from the softmax function, which converts logits into probabilities:
$$
\hat{y}_k = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)}
$$
Here:
- $z_k$ is the logit (raw output) for class $k$.
- $\hat{y}_k$ is the normalized probability for class $k$.
Why Use Softmax?
Probability Distribution:
Softmax ensures the logits are converted into a valid probability distribution:
- $\hat{y}_k \geq 0$ for all $k$.
- $\sum_{k=1}^K \hat{y}_k = 1$.
Log of Softmax in Cross-Entropy:
The categorical cross-entropy loss combines the softmax and the negative log-likelihood into a single operation:
$$
L_{\text{CategoricalCE}} = -\frac{1}{N} \sum_{i=1}^N \log\left(\frac{\exp(z_{y_i})}{\sum_{j=1}^K \exp(z_j)}\right)
$$
where $z_{y_i}$ denotes the logit of the true class for sample $i$.
Gradient-Friendly:
The softmax function is differentiable, making it suitable for backpropagation.
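To make the fused softmax-plus-NLL operation concrete, here is a minimal NumPy sketch that computes categorical cross-entropy directly from logits using a numerically stable log-softmax (the max subtraction is a standard stability trick, not part of the definition; PyTorch's `nn.CrossEntropyLoss` fuses the two steps in the same way):

```python
import numpy as np

def cross_entropy_from_logits(logits, targets):
    """logits: (N, K) raw scores; targets: (N,) integer class indices."""
    z = logits - logits.max(axis=1, keepdims=True)                  # stability shift
    log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log of softmax
    # average negative log-probability assigned to the true class
    return -log_softmax[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.2, 0.3]])
targets = np.array([0, 1])
print(cross_entropy_from_logits(logits, targets))
```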
6. Toy Example with Binary Classification
Suppose we have:
- Two samples.
- True labels $y = [1, 0]$.
- Predicted probabilities $\hat{y} = [0.9, 0.3]$.
The binary cross-entropy loss is:
$$
L_{\text{BinaryCE}} = -\frac{1}{2} \Big( y_1 \log(\hat{y}_1) + (1 - y_1) \log(1 - \hat{y}_1) + y_2 \log(\hat{y}_2) + (1 - y_2) \log(1 - \hat{y}_2) \Big)
$$
Substitute values:
$$
L_{\text{BinaryCE}} = -\frac{1}{2} \Big( \log(0.9) + \log(0.7) \Big)
$$
Numerical values:
$$
L_{\text{BinaryCE}} = -\frac{1}{2} \Big( -0.105 - 0.357 \Big) = 0.231
$$
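The same figure can be reproduced with a few lines of NumPy (a quick check of the arithmetic above):

```python
import numpy as np

y = np.array([1.0, 0.0])
y_hat = np.array([0.9, 0.3])

loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(loss)  # ≈ 0.231
```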
7. Summary
Binary Cross-Entropy Loss:
Used for binary classification tasks.
$$
L_{\text{BinaryCE}} = -\frac{1}{N} \sum \Big( y \log(\hat{y}) + (1-y) \log(1-\hat{y}) \Big)
$$
Categorical Cross-Entropy Loss:
Used for multi-class classification tasks.
$$
L_{\text{CategoricalCE}} = -\frac{1}{N} \sum \sum y_k \log(\hat{y}_k)
$$
Softmax:
In multi-class classification, $\log(\hat{y}_k)$ is derived from softmax, ensuring probabilities for all classes are normalized.
Why It Works:
Cross-entropy measures the "distance" between the true label distribution $y$ and the predicted probability distribution $\hat{y}$. By minimizing cross-entropy loss, the model learns to predict probabilities close to the true labels.
Appendix: Binary Cross-Entropy for Multi-Label Classification (BCEWithLogitsLoss)
1. What is Multi-Label Classification?
In multi-label classification:
- Each sample can have multiple labels (classes).
- Each label is treated as an independent binary classification problem.
- For example, an image might be tagged as both "cat" and "dog" (`[1, 0, 1]` for `[cat, bird, dog]`).
2. BCE Formula for Multi-Label Classification
The loss for a single sample with $K$ classes is:
$$
L = -\frac{1}{K} \sum_{k=1}^K \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
$$
Where:
- $y_k$: Ground truth label (1 for positive, 0 for negative) for class $k$.
- $\hat{y}_k$: Predicted probability for class $k$.
- $K$: Total number of classes.
3. Derivation of Binary Cross-Entropy Loss
Step 1: Likelihood for a Single Class
For a single binary label $y_k$, the predicted probability $\hat{y}_k$ represents:
- $P(y_k = 1) = \hat{y}_k$
- $P(y_k = 0) = 1 - \hat{y}_k$
The likelihood of the prediction being correct is:
$$
\text{Likelihood} = \hat{y}_k^{y_k} (1 - \hat{y}_k)^{1 - y_k}
$$
Step 2: Log-Likelihood
Take the logarithm to simplify:
$$
\log(\text{Likelihood}) = y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k)
$$
Step 3: Negative Log-Likelihood
The negative log-likelihood (NLL) gives the binary cross-entropy loss for a single label:
$$
L_k = - \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
$$
Step 4: Extend to Multi-Label Classification
For multi-label classification, the total loss is computed by summing over all $K$ classes:
$$
L = -\frac{1}{K} \sum_{k=1}^K \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
$$
4. Why Use BCE for Multi-Label Classification?
Independent Binary Tasks:
Each label is treated as an independent binary classification problem:
- $\hat{y}_k$: Probability of label $k$ being 1 (positive).
- $1 - \hat{y}_k$: Probability of label $k$ being 0 (negative).
Sigmoid Activation:
Logits (raw model outputs) are passed through the sigmoid function to map them to probabilities between 0 and 1:
$$
\hat{y}_k = \frac{1}{1 + e^{-z_k}}
$$
where $z_k$ is the logit for class $k$.
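A minimal sketch of this sigmoid-then-BCE pipeline in NumPy (the function name and the clipping epsilon are my own additions):

```python
import numpy as np

def multilabel_bce_from_logits(logits, targets, eps=1e-12):
    """logits, targets: shape (K,) for one sample; targets are 0/1 per label."""
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid, applied per class
    probs = np.clip(probs, eps, 1 - eps)    # avoid log(0)
    return -np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
```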
5. Toy Example
Dataset:
- Classes: `[cat, bird, dog]`
- True labels (ground truth): `[1, 0, 1]` (the sample is tagged as "cat" and "dog").
- Predicted logits (raw model outputs): `[2.0, -1.0, 0.5]`.
Step 1: Compute Sigmoid Probabilities
Convert logits to probabilities using the sigmoid function:
$$
\hat{y}_k = \frac{1}{1 + e^{-z_k}}
$$
For each class:
- $\hat{y}_1 = \frac{1}{1 + e^{-2.0}} \approx 0.88$ (cat)
- $\hat{y}_2 = \frac{1}{1 + e^{1.0}} \approx 0.27$ (bird)
- $\hat{y}_3 = \frac{1}{1 + e^{-0.5}} \approx 0.62$ (dog)
Step 2: Compute BCE for Each Class
Using the formula:
$$
L_k = - \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
$$
For each class:
Cat $y_1 = 1, \hat{y}_1 = 0.88$:
$$
L_1 = - \log(0.88) \approx 0.13
$$
Bird $y_2 = 0, \hat{y}_2 = 0.27$:
$$
L_2 = - \log(0.73) \approx 0.31
$$
Dog $y_3 = 1, \hat{y}_3 = 0.62$:
$$
L_3 = - \log(0.62) \approx 0.48
$$
Step 3: Compute Total Loss
The total loss is the average BCE over all classes:
$$
L = \frac{1}{3} \Big( L_1 + L_2 + L_3 \Big) \approx \frac{1}{3} (0.13 + 0.31 + 0.48) \approx 0.31
$$
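For reference, the same toy example can be checked with PyTorch's `BCEWithLogitsLoss`, which applies the sigmoid internally; the exact value is about $0.305$, and the $0.31$ above comes from rounding each per-class term to two decimals:

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])

criterion = nn.BCEWithLogitsLoss()   # sigmoid + BCE, averaged over the 3 labels
print(criterion(logits, targets))    # ≈ 0.305
```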
6. Summary
Binary Cross-Entropy Formula:
For multi-label classification:
$$
L = -\frac{1}{K} \sum_{k=1}^K \Big( y_k \log(\hat{y}_k) + (1 - y_k) \log(1 - \hat{y}_k) \Big)
$$
Why It Works:
- Each label is treated as an independent binary classification problem.
- Sigmoid activation normalizes logits into probabilities for each class.
Application:
Used in tasks like:
- Multi-label image tagging (e.g., "cat", "dog").
- Text classification with multiple categories.
Example Loss Calculation:
For logits $[2.0, -1.0, 0.5]$ and true labels $[1, 0, 1]$, the total BCE loss is approximately $0.31$.