1 Introduction
In the previous blog post, "Learning Theory: Predictor-Rejector Multi-Classification Abstaining Learning", we introduced the basic concepts and methods of learning with abstention, including the following single-stage predictor-rejector abstention loss \(L_{\text{abst}}\) for multi-class classification:
where \((x, y)\in \mathcal{X}\times \mathcal{Y}\) (with label space \(\mathcal{Y} = \{1, \cdots, n\}\), \(n\geqslant 2\)), \((h, r)\in \mathcal{H}\times\mathcal{R}\) is a predictor-rejector pair (\(\mathcal{H}\) and \(\mathcal{R}\) are two families of functions from \(\mathcal{X}\) to \(\mathbb{R}\)), and \(\text{h}(x) = \arg\max_{y\in \mathcal{Y}} {h(x)}_y\) directly outputs the predicted label for instance \(x\). To simplify the discussion, we assume the cost function is a constant \(c\in (0, 1)\).
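To make the target loss concrete, here is a minimal NumPy sketch of the abstention behavior just described: predict \(\text{h}(x) = \arg\max_{y} h(x)_y\) when \(r(x) > 0\), and abstain at cost \(c\) otherwise. The function name, the 0-indexed labels, and the value \(c = 0.3\) are illustrative assumptions of mine, not from the original post.

```python
# Sketch of the single-stage predictor-rejector abstention loss L_abst:
# 0-1 loss on the predicted label if we predict, constant cost c if we abstain.
import numpy as np

def abstention_loss(h_scores, r_value, y, c=0.3):
    """L_abst for one instance.

    h_scores : array of shape (n,), the scores h(x)_1, ..., h(x)_n
    r_value  : the rejector output r(x); r(x) <= 0 means abstain
    y        : true label in {0, ..., n-1}
    """
    if r_value > 0:                        # predict
        y_pred = int(np.argmax(h_scores))  # h(x) = argmax_y h(x)_y
        return float(y_pred != y)
    return c                               # abstain at cost c

# A correct prediction costs 0, a wrong one costs 1, abstention costs c.
print(abstention_loss(np.array([0.2, 1.5, -0.3]), r_value=1.0, y=1))   # 0.0
print(abstention_loss(np.array([0.2, 1.5, -0.3]), r_value=1.0, y=0))   # 1.0
print(abstention_loss(np.array([0.2, 1.5, -0.3]), r_value=-1.0, y=0))  # 0.3
```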
Let \(\mathcal{l}\) be a surrogate for the 0-1 multi-class loss defined on the label set \(\mathcal{Y}\); on this basis we can further define the abstention surrogate loss \(L\):
where \(\psi\) is a non-decreasing function, \(\phi\) is a non-increasing auxiliary function (an upper bound on \(z \mapsto \mathbb{I}_{z \leqslant 0}\)), and \(\alpha\), \(\beta\) are positive constants. Below, for simplicity, we mainly analyze the case \(\phi(z) = \exp(-z)\), although a similar analysis applies to other choices of \(\phi\).
In the previous blog post, we also mentioned that the single-stage surrogate loss satisfies the following \((\mathcal{H}, \mathcal{R})\)-consistency bound:
Theorem 1 (\((\mathcal{H}, \mathcal{R})\)-consistency of the single-stage surrogate loss) Assume \(\mathcal{H}\) is symmetric and complete. Then for \(\alpha=\beta\) and either \(\mathcal{l} = \mathcal{l}_{\text{mae}}\) with \(\psi(z) = z\), or \(\mathcal{l} = \mathcal{l}_{\rho}\) with \(\psi(z) = z\), or \(\mathcal{l} = \mathcal{l}_{\rho - \text{hinge}}\) with \(\psi(z) = nz\), the following \((\mathcal{H}, \mathcal{R})\)-consistency bound holds for all \(h\in \mathcal{H}, r\in \mathcal{R}\) and any distribution:
Here for \(\mathcal{l} = \mathcal{l}_{\text{mae}}\) we take \(\Gamma (z) = \max\{2n\sqrt{z}, nz\}\); for \(\mathcal{l}=\mathcal{l}_{\rho}\) we take \(\Gamma (z) = \max\{2\sqrt{z}, z\}\); for \(\mathcal{l} = \mathcal{l}_{\rho - \text{hinge}}\) we take \(\Gamma (z) = \max\{2\sqrt{nz}, z\}\).
However, in the previous blog post we did not present the detailed proof of the \((\mathcal{H}, \mathcal{R})\)-consistency bound for the single-stage surrogate loss. In this article, let's see how to prove this theorem (my supervisor also asked me to look closely at the relevant analysis in these papers, hoping I can master the proof techniques of the single-stage method).
2 Preliminaries for the analysis
We assume the labeled sample \(S=((x_1, y_1), \cdots, (x_m, y_m))\) is drawn i.i.d. from \(p(x, y)\). For the target loss \(L_{\text{abst}}\) and the surrogate loss \(L\), we can define the expected abstention loss \(R_{L_{\text{abst}}}(h, r)\) (i.e., the generalization error of the target loss) and the expected abstention surrogate loss \(R_{L}(h, r)\) (i.e., the generalization error of the surrogate loss) as follows:
Let \(R^{*}_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}) = \inf_{h\in \mathcal{H}, r\in \mathcal{R}}R_{L_{\text{abst}}}(h, r)\) and \(R_{L}^{*}(\mathcal{H}, \mathcal{R}) = \inf_{h\in \mathcal{H}, r\in \mathcal{R}}R_{L}(h, r)\) denote the infima of \(R_{L_{\text{abst}}}\) and \(R_L\) over \(\mathcal{H}\times \mathcal{R}\), respectively.
To simplify the subsequent analysis, we use the product rule of probability to rewrite \(R_L(h, r)\) as:
We call the inner conditional expectation the conditional risk of the surrogate loss \(L\) (also called the pointwise risk of \(L\)). Since the expectation over \(y\) has already been taken, this term depends only on \(h\), \(r\), and \(x\), so we write it as \(C_L(h, r, x)\):
We use \(C^*_L(\mathcal{H}, \mathcal{R}, x) = \inf_{h\in \mathcal{H}, r\in \mathcal{R}} C_L(h, r, x)\) to denote the best-in-class conditional risk of \(L\). Similarly, we use \(C_{L_{\text{abst}}}\) to denote the conditional risk of the target loss \(L_{\text{abst}}\), and \(C^*_{L_{\text{abst}}}\) to denote its best-in-class value.
In terms of \(R_{L}^*(\mathcal{H}, \mathcal{R})\) and \(C^*_L(\mathcal{H}, \mathcal{R}, x)\), we can express the minimizability gap:
The expression for \(M_{L_{\text{abst}}}\) is analogous.
Thus, the \((\mathcal{H}, \mathcal{R})\)-consistency bound to be proven can be rewritten as:
Since \(R_{L_{\text{abst}}}(h, r)\) and \(R_L(h, r)\) equal \(\mathbb{E}_{p(x)}C_{L_{\text{abst}}}(h, r, x)\) and \(\mathbb{E}_{p(x)}C_{L}(h, r, x)\) respectively, the above inequality becomes
We abbreviate the terms inside the expectations on the two sides of the above inequality as \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) and \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\), where \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\) is called the calibration gap. Since \(\Gamma(\cdot)\) is concave by definition, Jensen's inequality gives:
Therefore, if we can prove the following inequality, the original inequality follows:
As we will see, a key step in the proof of \((\mathcal{H}, \mathcal{R})\)-consistency is showing that \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) can be upper-bounded by \(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\).
3 The expression for \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\)
Let us first see how to express \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = C_{L_{\text{abst}}}(h, r, x) - C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x)\). By definition, we have:
Since taking the conditional expectation over \(y\) in the last line above only requires a weighted sum of \(\mathbb{I}_{\text{h}(x) \neq y}\), to evaluate \(C_{L_{\text{abst}}}(h, r, x)\) further we discuss cases according to the sign of \(r(x)\):
- \(r(x) > 0\): in this case \(C_{L_{\text{abst}}}(h, r, x) = \sum_{y\in \mathcal{Y}}p(y\mid x) \mathbb{I}_{\text{h}(x) \neq y} = 1 - p(\text{h}(x)\mid x)\).
- \(r(x) \leqslant 0\): in this case \(C_{L_{\text{abst}}}(h, r, x) = c\).
Next, let us see how to express \(C^*_{L_{\text{abst}}}\). We assume the rejector family \(\mathcal{R}\) is complete (i.e., for any \(x\in \mathcal{X}\), \(\{r(x): r\in \mathcal{R}\} = \mathbb{R}\)), so that \(\mathcal{R}\) can realize both acceptance and rejection (i.e., for any \(x\in \mathcal{X}\), there exist \(r_1, r_2\in \mathcal{R}\) with \(r_1(x) > 0\) and \(r_2(x) \leqslant 0\)). Therefore we have
We assume \(\mathcal{H}\) is symmetric and complete (see the blog post "Learning Theory: Predictor-Rejector Multi-Classification Abstaining Learning" for details); then \(\{\text{h}(x): h\in \mathcal{H}\} = \mathcal{Y}\), and hence
To evaluate \(C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x)\) further, we discuss cases according to the comparison between \(\max_{y\in \mathcal{Y}}p(y\mid x)\) and \(1 - c\):
- \(\max_{y\in \mathcal{Y}}p(y\mid x) > 1 - c\): in this case \(C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x) = 1 - \max_{y\in \mathcal{Y}}p(y\mid x)\).
- \(\max_{y\in \mathcal{Y}}p(y\mid x) \leqslant 1 - c\): in this case \(C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x) = c\).

Therefore, we have:
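The two case analyses above can be combined into a small numeric sketch of the calibration gap of the target loss (the helper name, 0-indexed labels, and \(c = 0.3\) are my own illustrative choices): \(C_{L_{\text{abst}}}\) is \(1 - p(\text{h}(x)\mid x)\) or \(c\) depending on the sign of \(r(x)\), and the two cases for \(C^*_{L_{\text{abst}}}\) collapse to \(\min\{c,\, 1 - \max_{y}p(y\mid x)\}\).

```python
# Numeric sketch of the conditional abstention risk and its best-in-class value.
import numpy as np

def delta_C_abst(p, h_scores, r_value, c=0.3):
    """Calibration gap of the target loss L_abst at one point x."""
    y_pred = int(np.argmax(h_scores))
    C = 1.0 - p[y_pred] if r_value > 0 else c            # conditional risk of (h, r)
    C_star = min(c, 1.0 - np.max(p))                     # best of "predict Bayes label" vs "abstain"
    return C - C_star

# max_y p(y|x) = 0.8 > 1 - c = 0.7: predicting the Bayes label is optimal.
print(delta_C_abst(np.array([0.8, 0.1, 0.1]), np.array([2., 0., 0.]), r_value=1.0))   # 0.0
# Abstaining on the same confident point incurs an extra cost of c - (1 - 0.8).
print(delta_C_abst(np.array([0.8, 0.1, 0.1]), np.array([2., 0., 0.]), r_value=-1.0))  # ≈ 0.1
# max_y p(y|x) = 0.4 <= 1 - c: abstaining is optimal, so the gap vanishes.
print(delta_C_abst(np.array([0.4, 0.3, 0.3]), np.array([2., 0., 0.]), r_value=-1.0))  # 0.0
```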
4 The expression for \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\)
4.1 Setting up the case analysis
Next, let us see how to express \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) = C_L(h, r, x) - C^*_L(\mathcal{H}, \mathcal{R}, x)\). By definition, if \(\alpha = \beta\) and \(\phi(z) = \exp(-z)\), we have:
Since taking the conditional expectation over \(y\) in the last line above only requires a weighted sum of \(\mathcal{l}(h, x, y)\), in what follows we discuss \(C_L(h, r, x)\) for the following three combinations of the loss \(\mathcal{l}\) and the function \(\psi(z)\):
- \(\mathcal{l} = \mathcal{l}_{\text{mae}}\),\(\psi(z) = z\);
- \(\mathcal{l} = \mathcal{l}_{\rho}\),\(\psi(z) = z\);
- \(\mathcal{l} = \mathcal{l}_{\rho-\text{hinge}}\),\(\psi(z) = nz\)。
Note: for the definitions of these three losses \(\mathcal{l}\), see the blog post "Learning Theory: Predictor-Rejector Multi-Classification Abstaining Learning"; I restate them here:
- Mean absolute error loss: \(\mathcal{l}_{\text{mae}}(h, x, y) = 1 - \frac{e^{{h(x)}_y}}{\sum_{y^{\prime}\in \mathcal{Y}}e^{{h(x)}_{y^{\prime}}}}\);
- Constrained \(\rho\)-hinge loss: \(\mathcal{l}_{\rho-\text{hinge}}(h, x, y) = \sum_{y^{\prime}\neq y}\phi_{\rho-\text{hinge}}(-{h(x)}_{y^{\prime}})\) with \(\rho > 0\), where \(\phi_{\rho-\text{hinge}}(z) = \max\{0, 1 - \frac{z}{\rho}\}\) is the \(\rho\)-hinge loss and \(h\) satisfies the constraint \(\sum_{y\in \mathcal{Y}}{h(x)}_y=0\);
- \(\rho\)-margin loss: \(\mathcal{l}_{\rho}(h, x, y) = \phi_{\rho}({\rho_h (x, y)})\), where \(\rho_{h}(x, y) = h(x)_y - \max_{y^{\prime} \neq y}h(x)_{y^{\prime}}\) is the confidence margin and \(\phi_{\rho}(z) = \min\{\max\{0, 1 - \frac{z}{\rho}\}, 1\}\) with \(\rho > 0\) is the \(\rho\)-margin loss.
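For reference, the three losses above can be transcribed directly into NumPy (0-indexed labels; the function names are my own):

```python
# Direct transcriptions of the three multi-class surrogate losses
# (rho > 0 is the margin parameter).
import numpy as np

def l_mae(h, y):
    """Mean absolute error loss: 1 - softmax probability of the true label."""
    s = np.exp(h - h.max())      # numerically stable softmax scores s_h(x, .)
    s /= s.sum()
    return 1.0 - s[y]

def l_rho(h, y, rho=1.0):
    """rho-margin loss: phi_rho applied to the margin rho_h(x, y)."""
    margin = h[y] - np.max(np.delete(h, y))  # h(x)_y - max_{y' != y} h(x)_{y'}
    return float(np.clip(1.0 - margin / rho, 0.0, 1.0))

def l_rho_hinge(h, y, rho=1.0):
    """Constrained rho-hinge loss; h is assumed to satisfy sum_y h(x)_y = 0."""
    others = np.delete(h, y)
    # phi_{rho-hinge}(-h_{y'}) = max{0, 1 + h_{y'}/rho}
    return float(np.sum(np.maximum(0.0, 1.0 + others / rho)))

h = np.array([1.0, -0.5, -0.5])       # note: sums to 0, as l_rho_hinge requires
print(l_rho(h, 0))        # 0.0: margin 1.5 exceeds rho = 1
print(l_rho(h, 1))        # 1.0: a negative margin is fully penalized
print(l_rho_hinge(h, 0))  # 1.0
```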
4.2 \(\mathcal{l} = \mathcal{l}_{\text{mae}}\),\(\psi(z) = z\)
In this case, \(C_L(h, r, x)\) can be expressed as:
where \(s_h(x, y) = \frac{e^{{h(x)}_y}}{\sum_{y^{\prime}\in \mathcal{Y}}e^{{h(x)}_{y^{\prime}}}}\) is the softmax score of label \(y\).
Then
Since \(\mathcal{H}\) is assumed to be symmetric and complete, we have
Note: in fact, for any \(h\in \mathcal{H}\), we have:
\[\begin{aligned} &\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right) \\ &= \max_{y\in \mathcal{Y}} p(y\mid x) - \sum_{y\in \mathcal{Y}}p(y\mid x)s_h(x, y) \\ &= \max_{y\in \mathcal{Y}} p(y\mid x) - \left(p\left(\text{h}(x)\mid x\right)s_h\left(x, \text{h}(x)\right) + \sum_{y\neq \text{h}(x)}p(y\mid x)s_h(x, y)\right) \\ &\geqslant \max_{y\in \mathcal{Y}} p(y\mid x) - \left(p\left(\text{h}(x)\mid x\right)s_h\left(x, \text{h}(x)\right) + \max_{y\in \mathcal{Y}}p(y\mid x)\left(1 - s_h\left(x, \text{h}(x)\right)\right)\right) \\ &= s_h\left(x, \text{h}(x)\right)\left(\max_{y\in \mathcal{Y}}p(y\mid x) - p\left(\text{h}(x)\mid x\right)\right) \\ &\geqslant \frac{1}{n} \left(\max_{y\in \mathcal{Y}}p(y\mid x) - p\left(\text{h}(x)\mid x\right)\right) \end{aligned} \]
We will use this conclusion several times in the subsequent proofs. One way to read it: since the difference above is nonnegative, \(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\) lower-bounds \(\mathbb{E}_{p(y\mid x)}\left[\mathcal{l}_{\text{mae}}\right]\), and the closer \(h\) is to the Bayes-optimal classifier \(h^*\) (i.e., the one with \(p(\text{h}^*(x)\mid x) = \max_{y\in \mathcal{Y}} p(y\mid x)\)), the closer \(\mathbb{E}_{p(y\mid x)}\left[\mathcal{l}_{\text{mae}}\right]\) gets to this lower bound.
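The inequality in the note can be spot-checked numerically on random draws (a sketch; the dimension, seed, and number of trials are arbitrary choices of mine):

```python
# Random numerical check of the key inequality in the note above:
# sum_y p(y|x)(1 - s_h(x,y)) - (1 - max_y p(y|x)) >= (1/n)(max_y p(y|x) - p(h(x)|x)).
import numpy as np

rng = np.random.default_rng(0)
n = 5
for _ in range(10_000):
    p = rng.random(n); p /= p.sum()            # conditional distribution p(.|x)
    h = rng.normal(size=n)                     # arbitrary scores h(x)
    s = np.exp(h - h.max()); s /= s.sum()      # softmax scores s_h(x, .)
    y_pred = int(np.argmax(h))                 # predicted label h(x)
    lhs = float(p @ (1.0 - s)) - (1.0 - p.max())
    rhs = (p.max() - p[y_pred]) / n
    assert lhs >= rhs - 1e-12
print("inequality holds on all random draws")
```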
Then
The part of the above expression that needs to be minimized is a functional \(F(r)\), whose functional derivative is
Setting \(\frac{\delta F}{\delta r(x)} = 0\) (for all \(x\in \mathcal{X}\)) yields the solution \(r^*(x) = -\frac{1}{2\alpha}\log \left(\frac{1 - \max_{y\in \mathcal{Y}}p(y\mid x)}{c}\right)\). Substituting it into \(F(r)\) gives:
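This stationary point can be verified numerically: with \(A = 1 - \max_{y}p(y\mid x)\), the pointwise objective \(A e^{\alpha r} + c e^{-\alpha r}\) is minimized at \(r^*\) with minimum value \(2\sqrt{cA}\) (the particular values of \(\alpha\), \(c\), and \(\max_y p(y\mid x)\) below are arbitrary choices of mine):

```python
# Sanity check that r*(x) = -(1/(2 alpha)) log((1 - max_y p(y|x)) / c)
# minimizes A e^{alpha r} + c e^{-alpha r}, with minimum value 2 sqrt(cA).
import numpy as np

alpha, c = 1.5, 0.3
p_max = 0.85                       # max_y p(y|x)
A = 1.0 - p_max
f = lambda r: A * np.exp(alpha * r) + c * np.exp(-alpha * r)

r_star = -np.log(A / c) / (2 * alpha)
rs = np.linspace(r_star - 3, r_star + 3, 2001)
assert np.all(f(rs) >= f(r_star) - 1e-12)           # r* is the pointwise minimizer
assert abs(f(r_star) - 2 * np.sqrt(c * A)) < 1e-12  # minimum value is 2 sqrt(cA)
print(r_star, f(r_star))
```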
Then
To establish the inequality between \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) and \(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), we follow the approach of Section 3 and discuss \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\) case by case, according to the comparison between \(\max_{y\in \mathcal{Y}} p(y\mid x)\) and \(1 - c\) and the sign of \(r(x)\):
- \(\max_{y\in \mathcal{Y}} p(y\mid x) > 1 - c\), \(r(x) > 0\):
In this case \[\begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ &\geqslant \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - ce^{-\alpha r(x)} - \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)e^{\alpha r(x)} \\ &= \left(\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)\right)e^{\alpha r(x)} \\ &\geqslant \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)\\ &\geqslant \frac{1}{n} \left(\max_{y\in \mathcal{Y}}p(y\mid x) - p\left(\text{h}(x)\mid x\right)\right) \\ &= \frac{1}{n} \Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \end{aligned} \](the first inequality is the arithmetic-geometric mean (AM-GM) inequality \(c e^{-\alpha r(x)} + \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)e^{\alpha r(x)} \geqslant 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)}\); the second uses \(e^{\alpha r(x)} > 1\) for \(r(x) > 0\) together with the nonnegativity of the bracketed term, shown in the note above; the third is exactly the note's inequality)
Picking \(\Gamma_1 (z) = nz\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_1 \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), which proves this case.
- \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant 1 - c\), \(r(x) > 0\):
\[ \begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ & \geqslant \underbrace{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h\left(x, y\right)\right)}_{\geqslant c}e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(\sum_{y\in \mathcal{Y}}p(y\mid x)\left(1 - s_h(x, y)\right)\right)} \\ & \geqslant \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) + c - 2\sqrt{c\left(\sum_{y\in \mathcal{Y}}p(y\mid x)\left(1 - s_h(x, y)\right)\right)} \\ &= \left(\sqrt{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)} - \sqrt{c}\right)^2 \\ &= \left(\frac{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - c}{\sqrt{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)} + \sqrt{c}}\right)^2 \\ &\geqslant \left(\frac{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right) + \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x) - c\right)}{2}\right)^2 \\ &\geqslant \left(\frac{\frac{1}{n} \left(\max_{y\in \mathcal{Y}}p(y\mid x) - p\left(\text{h}(x)\mid x\right)\right) + \frac{1}{n}\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x) - c\right)}{2}\right)^2 \\ &= \frac{1}{4n^2}\left(1 - c - p\left(\text{h}(x)\mid x\right)\right)^2 \\ &= \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4n^2} \end{aligned} \]Picking \(\Gamma_2 (z) = 2n\sqrt{z}\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_2 \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), which proves this case.
- \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant 1 - c\), \(r(x) \leqslant 0\):
Since in this case \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = 0\), the inequality \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma\left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\) holds for any \(\Gamma \geqslant 0\).
- \(\max_{y\in \mathcal{Y}} p(y\mid x) > 1 - c\), \(r(x) \leqslant 0\):
\[ \begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ &\geqslant \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)\underbrace{e^{\alpha r(x)}}_{\leqslant 1} + c \underbrace{e^{-\alpha r(x)}}_{\geqslant 1} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ &\geqslant 1 - \max_{y\in \mathcal{Y}}p(y\mid x) + c - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ &= \left(\sqrt{1 - \max_{y\in \mathcal{Y}}p(y\mid x)} - \sqrt{c}\right)^2 \\ &= \left(\frac{1 - \max_{y\in \mathcal{Y}}p(y\mid x) - c}{\sqrt{1 - \max_{y\in \mathcal{Y}}p(y\mid x)} + \sqrt{c}}\right)^2 \\ &\geqslant \left(\frac{\max_{y\in \mathcal{Y}}p(y\mid x) - 1 + c}{2}\right)^2 \\ &= \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4} \end{aligned} \]Picking \(\Gamma_3 (z) = 2\sqrt{z}\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_3 \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), which proves this case.
In summary, taking \(\Gamma(z) = \max\{\Gamma_1(z), \Gamma_2(z), \Gamma_3(z)\} = \max\{2n\sqrt{z}, nz\}\), we always have \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\). This proves the \((\mathcal{H}, \mathcal{R})\)-consistency bound for the single-stage surrogate loss with \(\mathcal{l} = \mathcal{l}_{\text{mae}}\) and \(\psi(z) = z\).
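The bound just proven can be spot-checked on random \((p, h, r)\) triples (a sketch; the dimension, constants, and seed are arbitrary choices of mine): compute \(\Delta C_L\) from the conditional risk \(\sum_y p(y\mid x)(1 - s_h(x,y))e^{\alpha r(x)} + ce^{-\alpha r(x)}\) and \(C^*_L = 2\sqrt{c(1 - \max_y p(y\mid x))}\) as derived above, compute \(\Delta C_{L_{\text{abst}}}\) as in Section 3, and check \(\Delta C_{L_{\text{abst}}} \leqslant \max\{2n\sqrt{\Delta C_L},\, n\Delta C_L\}\).

```python
# Random spot check of the (H, R)-consistency bound for l = l_mae, psi(z) = z.
import numpy as np

rng = np.random.default_rng(1)
n, alpha, c = 4, 1.0, 0.25
for _ in range(10_000):
    p = rng.random(n); p /= p.sum()
    h = rng.normal(size=n)
    r = rng.normal()
    s = np.exp(h - h.max()); s /= s.sum()
    y_pred = int(np.argmax(h))

    dC_L = float(p @ (1 - s)) * np.exp(alpha * r) + c * np.exp(-alpha * r) \
           - 2 * np.sqrt(c * (1 - p.max()))
    dC_L = max(dC_L, 0.0)          # dC_L >= 0 analytically; guard against FP noise
    C_abst = 1 - p[y_pred] if r > 0 else c
    dC_abst = C_abst - min(c, 1 - p.max())

    assert dC_abst <= max(2 * n * np.sqrt(dC_L), n * dC_L) + 1e-9
print("consistency bound verified on all random draws")
```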
4.3 \(\mathcal{l} = \mathcal{l}_{\rho}\),\(\psi(z) = z\)
In this case, \(C_L(h, r, x)\) can be expressed as:
where \(\rho_h(x, y) = h(x)_y - \max_{y^{\prime}\neq y}h(x)_{y^{\prime}}\) is the margin.
Since \(\mathcal{H}\) is assumed to be symmetric and complete, we have
Note: in fact, for any \(h\in \mathcal{H}\), we have:
\[\begin{aligned} &\left(1 - \sum_{y\in \mathcal{Y}} p (y\mid x)\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\}\right) - \left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right) \\ &= \max_{y\in \mathcal{Y}} p(y\mid x) - \sum_{y\in \mathcal{Y}} p (y\mid x)\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\} \\ &= \max_{y\in \mathcal{Y}} p(y\mid x) - \min \left\{1, \frac{\rho_h\left(x, \text{h}(x)\right)}{\rho}\right\}p\left(\text{h}(x)\mid x\right) \\ &\geqslant \max_{y\in \mathcal{Y}}p\left(y\mid x\right) - p\left(\text{h}(x)\mid x\right) \end{aligned} \]
As with the earlier proof for \(\mathcal{l}_{\text{mae}}\), we will use this conclusion several times in the subsequent proofs.
So, in parallel with the \(\mathcal{l}_{\text{mae}}\) case, we have
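This inequality can also be spot-checked numerically (a sketch with arbitrary dimension, \(\rho\), and seed):

```python
# Random check of the inequality in the note above for the rho-margin loss:
# (1 - sum_y p(y|x) min{1, rho_h(x,y)/rho}) - (1 - max_y p(y|x))
#   >= max_y p(y|x) - p(h(x)|x).
import numpy as np

rng = np.random.default_rng(2)
n, rho = 4, 1.0
for _ in range(10_000):
    p = rng.random(n); p /= p.sum()
    h = rng.normal(size=n)
    y_pred = int(np.argmax(h))
    # margins rho_h(x, y) = h(x)_y - max_{y' != y} h(x)_{y'}
    margins = np.array([h[y] - np.max(np.delete(h, y)) for y in range(n)])
    lhs = (1 - float(p @ np.minimum(1.0, margins / rho))) - (1 - p.max())
    rhs = p.max() - p[y_pred]
    assert lhs >= rhs - 1e-12
print("inequality holds on all random draws")
```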
Then
To establish the inequality between \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) and \(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), we proceed as in the proof for \(\mathcal{l}_{\text{mae}}\) and discuss \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\) case by case, according to the comparison between \(\max_{y\in \mathcal{Y}} p(y\mid x)\) and \(1 - c\) and the sign of \(r(x)\):
- \(\max_{y\in \mathcal{Y}} p(y\mid x) > 1 - c\), \(r(x) > 0\):

In this case \[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant \max_{y\in \mathcal{Y}}p(y\mid x) - p(\text{h}(x)\mid x) = \Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \](since the steps parallel those for \(\mathcal{l}_{\text{mae}}\), the intermediate steps are abbreviated here and below). Picking \(\Gamma_1 (z) = z\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_1 (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), which proves this case.
- \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant 1 - c\), \(r(x) > 0\):

\[\begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= \left(1 - \sum_{y\in \mathcal{Y}} p (y\mid x)\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\}\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right)} \\ &\geqslant \frac{1}{4}\left(1 - c - p\left(\text{h}\left(x\right)\mid x\right)\right)^2 \\ &= \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4} \end{aligned} \]Picking \(\Gamma_2 (z) = 2\sqrt{z}\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_2 (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), which proves this case.
- \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant 1 - c\), \(r(x) \leqslant 0\):
Since in this case \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = 0\), the inequality \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\) holds for any \(\Gamma \geqslant 0\).
- \(\max_{y\in \mathcal{Y}} p(y\mid x) > 1 - c\), \(r(x) \leqslant 0\):
\[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant \left(\frac{\max_{y\in \mathcal{Y}}p(y\mid x) - 1 + c}{2}\right)^2 = \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4} \]Picking \(\Gamma_3 (z) = 2\sqrt{z}\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_3 (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), which proves this case.

In summary, taking \(\Gamma(z) = \max\{\Gamma_1(z), \Gamma_2(z), \Gamma_3(z)\} = \max\{2\sqrt{z}, z\}\), we always have \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\). This proves the \((\mathcal{H}, \mathcal{R})\)-consistency bound for the single-stage surrogate loss with \(\mathcal{l} = \mathcal{l}_{\rho}\) and \(\psi(z) = z\).
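As in the \(\mathcal{l}_{\text{mae}}\) case, this bound can be spot-checked on random \((p, h, r)\) triples (a sketch with arbitrary constants and seed), computing \(\mathbb{E}_y[\mathcal{l}_{\rho}]\) directly from the definition of \(\phi_{\rho}\) and using \(C^*_L = 2\sqrt{c(1 - \max_y p(y\mid x))}\) as above:

```python
# Random spot check of the (H, R)-consistency bound for l = l_rho, psi(z) = z:
# Delta C_abst <= max{2 sqrt(Delta C_L), Delta C_L}.
import numpy as np

rng = np.random.default_rng(3)
n, rho, alpha, c = 4, 1.0, 1.0, 0.25
for _ in range(10_000):
    p = rng.random(n); p /= p.sum()
    h = rng.normal(size=n)
    r = rng.normal()
    y_pred = int(np.argmax(h))
    margins = np.array([h[y] - np.max(np.delete(h, y)) for y in range(n)])
    exp_loss = float(p @ np.clip(1 - margins / rho, 0.0, 1.0))  # E_y[l_rho]

    dC_L = exp_loss * np.exp(alpha * r) + c * np.exp(-alpha * r) \
           - 2 * np.sqrt(c * (1 - p.max()))
    dC_L = max(dC_L, 0.0)          # dC_L >= 0 analytically; guard against FP noise
    C_abst = 1 - p[y_pred] if r > 0 else c
    dC_abst = C_abst - min(c, 1 - p.max())

    assert dC_abst <= max(2 * np.sqrt(dC_L), dC_L) + 1e-9
print("consistency bound verified on all random draws")
```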
4.4 \(\mathcal{l} = \mathcal{l}_{\rho-\text{hinge}}\),\(\psi(z) = nz\)
In this case, \(C_L(h, r, x)\) can be expressed as:
Since \(\mathcal{H}\) is assumed to be symmetric and complete, we have
Note: in fact, define \(h_{\rho}\) by \(h_{\rho}(x)_y = \left\{\begin{aligned} &h(x)_y\quad &\text{if } y\notin \left\{y_{\max}, \text{h}(x)\right\} \\ &-\rho \quad &\text{if } y = \text{h}(x) \\ &h\left(x\right)_{y_{\text{max}}} + h\left(x\right)_{\text{h}(x)} + \rho \quad &\text{if } y = y_{\text{max}} \\ \end{aligned}\right.\), which satisfies the constraint \(\sum_{y\in \mathcal{Y}}h_{\rho}(x)_y=0\), where \(y_{\max} = \arg\max_{y\in \mathcal{Y}}p(y\mid x)\); then for any \(h\in \mathcal{H}\) we have:
\[\begin{aligned} &\sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\max\left\{0, 1 + \frac{h(x)_y}{\rho}\right\} - n\left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right) \\ &\geqslant \sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\min\left\{n, \max\left\{0, 1 + \frac{h(x)_y}{\rho}\right\}\right\} - n\left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right) \\ &\geqslant \sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\min\left\{n, \max\left\{0, 1 + \frac{h(x)_y}{\rho}\right\}\right\} \\ &\quad - \sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\min\left\{n, \max\left\{0, 1 + \frac{h_{\rho}(x)_y}{\rho}\right\}\right\} \\ &= \left(p(y_{\text{max}}\mid x) - p(\text{h}(x)\mid x)\right)\min\left\{n, 1 + \frac{h(x)_{\text{h}(x)}}{\rho}\right\} \\ &\geqslant \max_{y\in \mathcal{Y}}p\left(y\mid x\right) - p\left(\text{h}\left(x\right)\mid x\right) \end{aligned} \]
As with the earlier proofs for \(\mathcal{l}_{\text{mae}}\) and \(\mathcal{l}_{\rho}\), we will use this conclusion several times in the subsequent proofs.
So, in parallel with the \(\mathcal{l}_{\text{mae}}\) and \(\mathcal{l}_{\rho}\) cases, we have
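The final inequality of the note can likewise be spot-checked numerically, with the scores centered so that the constraint \(\sum_{y\in \mathcal{Y}}h(x)_y = 0\) holds (dimension, \(\rho\), and seed are arbitrary choices of mine):

```python
# Random check of the inequality in the note above for the constrained
# rho-hinge loss: sum_y (1 - p(y|x)) max{0, 1 + h(x)_y/rho} - n(1 - max_y p(y|x))
#   >= max_y p(y|x) - p(h(x)|x),   subject to sum_y h(x)_y = 0.
import numpy as np

rng = np.random.default_rng(4)
n, rho = 4, 1.0
for _ in range(10_000):
    p = rng.random(n); p /= p.sum()
    h = rng.normal(size=n)
    h -= h.mean()                  # enforce the constraint sum_y h(x)_y = 0
    y_pred = int(np.argmax(h))
    lhs = float((1 - p) @ np.maximum(0.0, 1.0 + h / rho)) - n * (1 - p.max())
    rhs = p.max() - p[y_pred]
    assert lhs >= rhs - 1e-12
print("inequality holds on all random draws")
```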
Then
To establish the inequality between \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) and \(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), we proceed as in the proofs for \(\mathcal{l}_{\text{mae}}\) and \(\mathcal{l}_{\rho}\) and discuss \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\) case by case, according to the comparison between \(\max_{y\in \mathcal{Y}} p(y\mid x)\) and \(1 - c\) and the sign of \(r(x)\):
- \(\max_{y\in \mathcal{Y}} p(y\mid x) > 1 - c\), \(r(x) > 0\):

In this case \[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant \max_{y\in \mathcal{Y}}p(y\mid x) - p(\text{h}(x)\mid x) = \Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \]Picking \(\Gamma_1 (z) = z\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_1 (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), which proves this case.
- \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant 1 - c\), \(r(x) > 0\):

\[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant \frac{1}{4n}\left(1 - c - p\left(\text{h}\left(x\right)\mid x\right)\right)^2 = \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4n} \]Picking \(\Gamma_2 (z) = 2\sqrt{nz}\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_2 (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), which proves this case.
- \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant 1 - c\), \(r(x) \leqslant 0\):
Since in this case \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = 0\), the inequality \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\) holds for any \(\Gamma \geqslant 0\).
- \(\max_{y\in \mathcal{Y}} p(y\mid x) > 1 - c\), \(r(x) \leqslant 0\):
\[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant n\left(\frac{\max_{y\in \mathcal{Y}}p(y\mid x) - 1 + c}{2}\right)^2 = \frac{n\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4} \]Picking \(\Gamma_3 (z) = 2\sqrt{z/n}\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_3 (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), which proves this case.

In summary, taking \(\Gamma(z) = \max\{\Gamma_1(z), \Gamma_2(z), \Gamma_3(z)\} = \max\{2\sqrt{nz}, z\}\), we always have \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\). This proves the \((\mathcal{H}, \mathcal{R})\)-consistency bound for the single-stage surrogate loss with \(\mathcal{l} = \mathcal{l}_{\rho-\text{hinge}}\) and \(\psi(z) = nz\).
References
- [1] Mao A, Mohri M, Zhong Y. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms[C]//International Conference on Algorithmic Learning Theory. PMLR, 2024: 822-867.
- [2] Cortes C, DeSalvo G, Mohri M. Boosting with abstention[J]. Advances in Neural Information Processing Systems, 2016, 29.
- [3] Ni C, Charoenphakdee N, Honda J, et al. On the calibration of multiclass classification with rejection[J]. Advances in Neural Information Processing Systems, 2019, 32.
- [4] Han Bao: Learning Theory Bridges Loss Functions
- [5] Crammer K, Singer Y. On the algorithmic implementation of multiclass kernel-based vector machines[J]. Journal of Machine Learning Research, 2001, 2(Dec): 265-292.
- [6] Awasthi P, Mao A, Mohri M, et al. Multi-class $H$-consistency bounds[J]. Advances in Neural Information Processing Systems, 2022, 35: 782-795.