1 Introduction
In the previous blog post, "Learning Theory: Predictor-Rejector Multi-Classification Abstaining Learning", we introduced the basic concepts and methods of learning with abstention, including the following single-stage predictor-rejector abstention loss \(L_{\text{abst}}\) for multi-class classification:
where \((x, y)\in \mathcal{X}\times \mathcal{Y}\) (with label space \(\mathcal{Y} = \{1, \cdots, n\}\), \(n\geqslant 2\)), \((h, r)\in \mathcal{H}\times\mathcal{R}\) is a predictor-rejector pair (\(\mathcal{H}\) and \(\mathcal{R}\) are two families of functions from \(\mathcal{X}\) to \(\mathbb{R}\)), and \(\text{h}(x) = \arg\max_{y\in \mathcal{Y}} {h(x)}_y\) directly outputs the predicted label for instance \(x\). To simplify the discussion, we assume the cost function is a constant \(c\in (0, 1)\).
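To make the target loss concrete, here is a minimal NumPy sketch of the abstention behavior just described: predict \(\text{h}(x) = \arg\max_{y} h(x)_y\) when \(r(x) > 0\), and abstain at cost \(c\) otherwise. The function name, the 0-indexed labels, and the value \(c = 0.3\) are illustrative assumptions of mine, not from the original post.

```python
# Sketch of the single-stage predictor-rejector abstention loss L_abst:
# 0-1 loss on the predicted label if we predict, constant cost c if we abstain.
import numpy as np

def abstention_loss(h_scores, r_value, y, c=0.3):
    """L_abst for one instance.

    h_scores : array of shape (n,), the scores h(x)_1, ..., h(x)_n
    r_value  : the rejector output r(x); r(x) <= 0 means abstain
    y        : true label in {0, ..., n-1}
    """
    if r_value > 0:                        # predict
        y_pred = int(np.argmax(h_scores))  # h(x) = argmax_y h(x)_y
        return float(y_pred != y)
    return c                               # abstain at cost c

# A correct prediction costs 0, a wrong one costs 1, abstention costs c.
print(abstention_loss(np.array([0.2, 1.5, -0.3]), r_value=1.0, y=1))   # 0.0
print(abstention_loss(np.array([0.2, 1.5, -0.3]), r_value=1.0, y=0))   # 1.0
print(abstention_loss(np.array([0.2, 1.5, -0.3]), r_value=-1.0, y=0))  # 0.3
```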
Let \(\mathcal{l}\) be a surrogate for the 0-1 multi-class loss defined on the label set \(\mathcal{Y}\); on this basis we can further define the abstention surrogate loss \(L\):
where \(\psi\) is a non-decreasing function, \(\phi\) is a non-increasing auxiliary function (an upper bound on \(z \mapsto \mathbb{I}_{z \leqslant 0}\)), and \(\alpha\), \(\beta\) are positive constants. Below, for simplicity, we mainly analyze the case \(\phi(z) = \exp(-z)\), although a similar analysis applies to other choices of \(\phi\).
In the previous blog post, we also mentioned that the single-stage surrogate loss satisfies the following \((\mathcal{H}, \mathcal{R})\)-consistency bound:
Theorem 1 (\((\mathcal{H}, \mathcal{R})\)-consistency of the single-stage surrogate loss) Assume \(\mathcal{H}\) is symmetric and complete. Then for \(\alpha=\beta\) and either \(\mathcal{l} = \mathcal{l}_{\text{mae}}\) with \(\psi(z) = z\), or \(\mathcal{l} = \mathcal{l}_{\rho}\) with \(\psi(z) = z\), or \(\mathcal{l} = \mathcal{l}_{\rho - \text{hinge}}\) with \(\psi(z) = nz\), the following \((\mathcal{H}, \mathcal{R})\)-consistency bound holds for all \(h\in \mathcal{H}, r\in \mathcal{R}\) and any distribution:
Here for \(\mathcal{l} = \mathcal{l}_{\text{mae}}\) we take \(\Gamma (z) = \max\{2n\sqrt{z}, nz\}\); for \(\mathcal{l}=\mathcal{l}_{\rho}\) we take \(\Gamma (z) = \max\{2\sqrt{z}, z\}\); for \(\mathcal{l} = \mathcal{l}_{\rho - \text{hinge}}\) we take \(\Gamma (z) = \max\{2\sqrt{nz}, z\}\).
However, in the previous blog post we did not present the detailed proof of the \((\mathcal{H}, \mathcal{R})\)-consistency bound for the single-stage surrogate loss. In this article, let's see how to prove this theorem (my supervisor also asked me to look closely at the relevant analysis in these papers, hoping I can master the proof techniques of the single-stage method).
2 Preliminaries for the analysis
We assume the labeled sample \(S=((x_1, y_1), \cdots, (x_m, y_m))\) is drawn i.i.d. from \(p(x, y)\). For the target loss \(L_{\text{abst}}\) and the surrogate loss \(L\), we can define the expected abstention loss \(R_{L_{\text{abst}}}(h, r)\) (i.e., the generalization error of the target loss) and the expected abstention surrogate loss \(R_{L}(h, r)\) (i.e., the generalization error of the surrogate loss) as follows:
Let \(R^{*}_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}) = \inf_{h\in \mathcal{H}, r\in \mathcal{R}}R_{L_{\text{abst}}}(h, r)\) and \(R_{L}^{*}(\mathcal{H}, \mathcal{R}) = \inf_{h\in \mathcal{H}, r\in \mathcal{R}}R_{L}(h, r)\) denote the infima of \(R_{L_{\text{abst}}}\) and \(R_L\) over \(\mathcal{H}\times \mathcal{R}\), respectively.
To simplify the subsequent analysis, we use the product rule of probability to rewrite \(R_L(h, r)\) as:
We call the inner conditional expectation the conditional risk of the surrogate loss \(L\) (also called the pointwise risk of \(L\)). Since the expectation over \(y\) has already been taken, this term depends only on \(h\), \(r\), and \(x\), so we write it as \(C_L(h, r, x)\):
We use \(C^*_L(\mathcal{H}, \mathcal{R}, x) = \inf_{h\in \mathcal{H}, r\in \mathcal{R}} C_L(h, r, x)\) to denote the best-in-class conditional risk of \(L\). Similarly, we use \(C_{L_{\text{abst}}}\) to denote the conditional risk of the target loss \(L_{\text{abst}}\), and \(C^*_{L_{\text{abst}}}\) to denote its best-in-class value.
In terms of \(R_{L}^*(\mathcal{H}, \mathcal{R})\) and \(C^*_L(\mathcal{H}, \mathcal{R}, x)\), we can express the minimizability gap:
The expression for \(M_{L_{\text{abst}}}\) is analogous.
Thus, the \((\mathcal{H}, \mathcal{R})\)-consistency bound to be proven can be rewritten as:
Since \(R_{L_{\text{abst}}}(h, r)\) and \(R_L(h, r)\) equal \(\mathbb{E}_{p(x)}C_{L_{\text{abst}}}(h, r, x)\) and \(\mathbb{E}_{p(x)}C_{L}(h, r, x)\) respectively, the above inequality becomes
We abbreviate the terms inside the expectations on the two sides of the above inequality as \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) and \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\), where \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\) is called the calibration gap. Since \(\Gamma(\cdot)\) is concave by definition, Jensen's inequality gives:
Therefore, if we can prove the following inequality, the original inequality follows:
As we will see, a key step in the proof of \((\mathcal{H}, \mathcal{R})\)-consistency is showing that \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) can be upper-bounded by \(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\).
3 The expression for \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\)
Let us first see how to express \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = C_{L_{\text{abst}}}(h, r, x) - C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x)\). By definition, we have:
Since taking the conditional expectation over \(y\) in the last line above only requires a weighted sum of \(\mathbb{I}_{\text{h}(x) \neq y}\), to evaluate \(C_{L_{\text{abst}}}(h, r, x)\) further we discuss cases according to the sign of \(r(x)\):
- \(r(x) > 0\): in this case \(C_{L_{\text{abst}}}(h, r, x) = \sum_{y\in \mathcal{Y}}p(y\mid x) \mathbb{I}_{\text{h}(x) \neq y} = 1 - p(\text{h}(x)\mid x)\).
- \(r(x) \leqslant 0\): in this case \(C_{L_{\text{abst}}}(h, r, x) = c\).
Next, let us see how to express \(C^*_{L_{\text{abst}}}\). We assume the rejector family \(\mathcal{R}\) is complete (i.e., for any \(x\in \mathcal{X}\), \(\{r(x): r\in \mathcal{R}\} = \mathbb{R}\)), so that \(\mathcal{R}\) can realize both acceptance and rejection (i.e., for any \(x\in \mathcal{X}\), there exist \(r_1, r_2\in \mathcal{R}\) with \(r_1(x) > 0\) and \(r_2(x) \leqslant 0\)). Therefore we have
We assume \(\mathcal{H}\) is symmetric and complete (see the blog post "Learning Theory: Predictor-Rejector Multi-Classification Abstaining Learning" for details); then \(\{\text{h}(x): h\in \mathcal{H}\} = \mathcal{Y}\), and hence
To evaluate \(C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x)\) further, we discuss cases according to the comparison between \(\max_{y\in \mathcal{Y}}p(y\mid x)\) and \(1 - c\):
- \(\max_{y\in \mathcal{Y}}p(y\mid x) > 1 - c\): in this case \(C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x) = 1 - \max_{y\in \mathcal{Y}}p(y\mid x)\).
- \(\max_{y\in \mathcal{Y}}p(y\mid x) \leqslant 1 - c\): in this case \(C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x) = c\).

Therefore, we have:
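The two case analyses above can be combined into a small numeric sketch of the calibration gap of the target loss (the helper name, 0-indexed labels, and \(c = 0.3\) are my own illustrative choices): \(C_{L_{\text{abst}}}\) is \(1 - p(\text{h}(x)\mid x)\) or \(c\) depending on the sign of \(r(x)\), and the two cases for \(C^*_{L_{\text{abst}}}\) collapse to \(\min\{c,\, 1 - \max_{y}p(y\mid x)\}\).

```python
# Numeric sketch of the conditional abstention risk and its best-in-class value.
import numpy as np

def delta_C_abst(p, h_scores, r_value, c=0.3):
    """Calibration gap of the target loss L_abst at one point x."""
    y_pred = int(np.argmax(h_scores))
    C = 1.0 - p[y_pred] if r_value > 0 else c            # conditional risk of (h, r)
    C_star = min(c, 1.0 - np.max(p))                     # best of "predict Bayes label" vs "abstain"
    return C - C_star

# max_y p(y|x) = 0.8 > 1 - c = 0.7: predicting the Bayes label is optimal.
print(delta_C_abst(np.array([0.8, 0.1, 0.1]), np.array([2., 0., 0.]), r_value=1.0))   # 0.0
# Abstaining on the same confident point incurs an extra cost of c - (1 - 0.8).
print(delta_C_abst(np.array([0.8, 0.1, 0.1]), np.array([2., 0., 0.]), r_value=-1.0))  # ≈ 0.1
# max_y p(y|x) = 0.4 <= 1 - c: abstaining is optimal, so the gap vanishes.
print(delta_C_abst(np.array([0.4, 0.3, 0.3]), np.array([2., 0., 0.]), r_value=-1.0))  # 0.0
```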
4 The expression for \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\)
4.1 Setting up the case analysis
Next, let us see how to express \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) = C_L(h, r, x) - C^*_L(\mathcal{H}, \mathcal{R}, x)\). By definition, if \(\alpha = \beta\) and \(\phi(z) = \exp(-z)\), we have:
Since taking the conditional expectation over \(y\) in the last line above only requires a weighted sum of \(\mathcal{l}(h, x, y)\), in what follows we discuss \(C_L(h, r, x)\) for the following three combinations of the loss \(\mathcal{l}\) and the function \(\psi(z)\):
- \(\mathcal{l} = \mathcal{l}_{\text{mae}}\),\(\psi(z) = z\);
- \(\mathcal{l} = \mathcal{l}_{\rho}\),\(\psi(z) = z\);
- \(\mathcal{l} = \mathcal{l}_{\rho-\text{hinge}}\),\(\psi(z) = nz\)。
Note: for the definitions of these three losses \(\mathcal{l}\), see the blog post "Learning Theory: Predictor-Rejector Multi-Classification Abstaining Learning"; I restate them here:
- Mean absolute error loss: \(\mathcal{l}_{\text{mae}}(h, x, y) = 1 - \frac{e^{{h(x)}_y}}{\sum_{y^{\prime}\in \mathcal{Y}}e^{{h(x)}_{y^{\prime}}}}\);
- Constrained \(\rho\)-hinge loss: \(\mathcal{l}_{\rho-\text{hinge}}(h, x, y) = \sum_{y^{\prime}\neq y}\phi_{\rho-\text{hinge}}(-{h(x)}_{y^{\prime}})\) with \(\rho > 0\), where \(\phi_{\rho-\text{hinge}}(z) = \max\{0, 1 - \frac{z}{\rho}\}\) is the \(\rho\)-hinge loss and \(h\) satisfies the constraint \(\sum_{y\in \mathcal{Y}}{h(x)}_y=0\);
- \(\rho\)-margin loss: \(\mathcal{l}_{\rho}(h, x, y) = \phi_{\rho}({\rho_h (x, y)})\), where \(\rho_{h}(x, y) = h(x)_y - \max_{y^{\prime} \neq y}h(x)_{y^{\prime}}\) is the confidence margin and \(\phi_{\rho}(z) = \min\{\max\{0, 1 - \frac{z}{\rho}\}, 1\}\) with \(\rho > 0\) is the \(\rho\)-margin loss.
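For reference, the three losses above can be transcribed directly into NumPy (0-indexed labels; the function names are my own):

```python
# Direct transcriptions of the three multi-class surrogate losses
# (rho > 0 is the margin parameter).
import numpy as np

def l_mae(h, y):
    """Mean absolute error loss: 1 - softmax probability of the true label."""
    s = np.exp(h - h.max())      # numerically stable softmax scores s_h(x, .)
    s /= s.sum()
    return 1.0 - s[y]

def l_rho(h, y, rho=1.0):
    """rho-margin loss: phi_rho applied to the margin rho_h(x, y)."""
    margin = h[y] - np.max(np.delete(h, y))  # h(x)_y - max_{y' != y} h(x)_{y'}
    return float(np.clip(1.0 - margin / rho, 0.0, 1.0))

def l_rho_hinge(h, y, rho=1.0):
    """Constrained rho-hinge loss; h is assumed to satisfy sum_y h(x)_y = 0."""
    others = np.delete(h, y)
    # phi_{rho-hinge}(-h_{y'}) = max{0, 1 + h_{y'}/rho}
    return float(np.sum(np.maximum(0.0, 1.0 + others / rho)))

h = np.array([1.0, -0.5, -0.5])       # note: sums to 0, as l_rho_hinge requires
print(l_rho(h, 0))        # 0.0: margin 1.5 exceeds rho = 1
print(l_rho(h, 1))        # 1.0: a negative margin is fully penalized
print(l_rho_hinge(h, 0))  # 1.0
```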
4.2 \(\mathcal{l} = \mathcal{l}_{\text{mae}}\),\(\psi(z) = z\)
In this case, \(C_L(h, r, x)\) can be expressed as:
where \(s_h(x, y) = \frac{e^{{h(x)}_y}}{\sum_{y^{\prime}\in \mathcal{Y}}e^{{h(x)}_{y^{\prime}}}}\) is the softmax score of label \(y\).
Then
Since \(\mathcal{H}\) is assumed to be symmetric and complete, we have
Note: in fact, for any \(h\in \mathcal{H}\), we have:
\[\begin{aligned} &\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right) \\ &= \max_{y\in \mathcal{Y}} p(y\mid x) - \sum_{y\in \mathcal{Y}}p(y\mid x)s_h(x, y) \\ &= \max_{y\in \mathcal{Y}} p(y\mid x) - \left(p\left(\text{h}(x)\mid x\right)s_h\left(x, \text{h}(x)\right) + \sum_{y\neq \text{h}(x)}p(y\mid x)s_h(x, y)\right) \\ &\geqslant \max_{y\in \mathcal{Y}} p(y\mid x) - \left(p\left(\text{h}(x)\mid x\right)s_h\left(x, \text{h}(x)\right) + \max_{y\in \mathcal{Y}}p(y\mid x)\left(1 - s_h\left(x, \text{h}(x)\right)\right)\right) \\ &= s_h\left(x, \text{h}(x)\right)\left(\max_{y\in \mathcal{Y}}p(y\mid x) - p\left(\text{h}(x)\mid x\right)\right) \\ &\geqslant \frac{1}{n} \left(\max_{y\in \mathcal{Y}}p(y\mid x) - p\left(\text{h}(x)\mid x\right)\right) \end{aligned} \]
We will use this conclusion several times in the subsequent proofs. One way to read it: since the difference above is nonnegative, \(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\) lower-bounds \(\mathbb{E}_{p(y\mid x)}\left[\mathcal{l}_{\text{mae}}\right]\), and the closer \(h\) is to the Bayes-optimal classifier \(h^*\) (i.e., the one with \(p(\text{h}^*(x)\mid x) = \max_{y\in \mathcal{Y}} p(y\mid x)\)), the closer \(\mathbb{E}_{p(y\mid x)}\left[\mathcal{l}_{\text{mae}}\right]\) gets to this lower bound.
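The inequality in the note can be spot-checked numerically on random draws (a sketch; the dimension, seed, and number of trials are arbitrary choices of mine):

```python
# Random numerical check of the key inequality in the note above:
# sum_y p(y|x)(1 - s_h(x,y)) - (1 - max_y p(y|x)) >= (1/n)(max_y p(y|x) - p(h(x)|x)).
import numpy as np

rng = np.random.default_rng(0)
n = 5
for _ in range(10_000):
    p = rng.random(n); p /= p.sum()            # conditional distribution p(.|x)
    h = rng.normal(size=n)                     # arbitrary scores h(x)
    s = np.exp(h - h.max()); s /= s.sum()      # softmax scores s_h(x, .)
    y_pred = int(np.argmax(h))                 # predicted label h(x)
    lhs = float(p @ (1.0 - s)) - (1.0 - p.max())
    rhs = (p.max() - p[y_pred]) / n
    assert lhs >= rhs - 1e-12
print("inequality holds on all random draws")
```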
Then
The part of the above expression that needs to be minimized is a functional \(F(r)\), whose functional derivative is
Setting \(\frac{\delta F}{\delta r(x)} = 0\) (for all \(x\in \mathcal{X}\)) yields the solution \(r^*(x) = -\frac{1}{2\alpha}\log \left(\frac{1 - \max_{y\in \mathcal{Y}}p(y\mid x)}{c}\right)\). Substituting it into \(F(r)\) gives:
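This stationary point can be verified numerically: with \(A = 1 - \max_{y}p(y\mid x)\), the pointwise objective \(A e^{\alpha r} + c e^{-\alpha r}\) is minimized at \(r^*\) with minimum value \(2\sqrt{cA}\) (the particular values of \(\alpha\), \(c\), and \(\max_y p(y\mid x)\) below are arbitrary choices of mine):

```python
# Sanity check that r*(x) = -(1/(2 alpha)) log((1 - max_y p(y|x)) / c)
# minimizes A e^{alpha r} + c e^{-alpha r}, with minimum value 2 sqrt(cA).
import numpy as np

alpha, c = 1.5, 0.3
p_max = 0.85                       # max_y p(y|x)
A = 1.0 - p_max
f = lambda r: A * np.exp(alpha * r) + c * np.exp(-alpha * r)

r_star = -np.log(A / c) / (2 * alpha)
rs = np.linspace(r_star - 3, r_star + 3, 2001)
assert np.all(f(rs) >= f(r_star) - 1e-12)           # r* is the pointwise minimizer
assert abs(f(r_star) - 2 * np.sqrt(c * A)) < 1e-12  # minimum value is 2 sqrt(cA)
print(r_star, f(r_star))
```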
Then
To establish the inequality between \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) and \(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), we follow the approach of Section 3 and discuss \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\) case by case, according to the comparison between \(\max_{y\in \mathcal{Y}} p(y\mid x)\) and \(1 - c\) and the sign of \(r(x)\):
- \(\max_{y\in \mathcal{Y}} p(y\mid x) > 1 - c\), \(r(x) > 0\):
In this case \[\begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ &\geqslant \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - ce^{-\alpha r(x)} - \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)e^{\alpha r(x)} \\ &= \left(\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)\right)e^{\alpha r(x)} \\ &\geqslant \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)\\ &\geqslant \frac{1}{n} \left(\max_{y\in \mathcal{Y}}p(y\mid x) - p\left(\text{h}(x)\mid x\right)\right) \\ &= \frac{1}{n} \Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \end{aligned} \](the first inequality is the arithmetic-geometric mean (AM-GM) inequality \(c e^{-\alpha r(x)} + \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)e^{\alpha r(x)} \geqslant 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)}\); the second uses \(e^{\alpha r(x)} > 1\) for \(r(x) > 0\) together with the nonnegativity of the bracketed term, shown in the note above; the third is exactly the note's inequality)
Picking \(\Gamma_1 (z) = nz\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_1 \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), which proves this case.
- \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant 1 - c\), \(r(x) > 0\):
\[ \begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ & \geqslant \underbrace{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h\left(x, y\right)\right)}_{\geqslant c}e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(\sum_{y\in \mathcal{Y}}p(y\mid x)\left(1 - s_h(x, y)\right)\right)} \\ & \geqslant \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) + c - 2\sqrt{c\left(\sum_{y\in \mathcal{Y}}p(y\mid x)\left(1 - s_h(x, y)\right)\right)} \\ &= \left(\sqrt{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)} - \sqrt{c}\right)^2 \\ &= \left(\frac{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - c}{\sqrt{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)} + \sqrt{c}}\right)^2 \\ &\geqslant \left(\frac{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right) + \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x) - c\right)}{2}\right)^2 \\ &\geqslant \left(\frac{\frac{1}{n} \left(\max_{y\in \mathcal{Y}}p(y\mid x) - p\left(\text{h}(x)\mid x\right)\right) + \frac{1}{n}\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x) - c\right)}{2}\right)^2 \\ &= \frac{1}{4n^2}\left(1 - c - p\left(\text{h}(x)\mid x\right)\right)^2 \\ &= \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4n^2} \end{aligned} \]Picking \(\Gamma_2 (z) = 2n\sqrt{z}\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_2 \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), which proves this case.
- \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant 1 - c\), \(r(x) \leqslant 0\):
Since in this case \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = 0\), the inequality \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma\left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\) holds for any \(\Gamma \geqslant 0\).
- \(\max_{y\in \mathcal{Y}} p(y\mid x) > 1 - c\), \(r(x) \leqslant 0\):
\[ \begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ &\geqslant \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)\underbrace{e^{\alpha r(x)}}_{\leqslant 1} + c \underbrace{e^{-\alpha r(x)}}_{\geqslant 1} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ &\geqslant 1 - \max_{y\in \mathcal{Y}}p(y\mid x) + c - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ &= \left(\sqrt{1 - \max_{y\in \mathcal{Y}}p(y\mid x)} - \sqrt{c}\right)^2 \\ &= \left(\frac{1 - \max_{y\in \mathcal{Y}}p(y\mid x) - c}{\sqrt{1 - \max_{y\in \mathcal{Y}}p(y\mid x)} + \sqrt{c}}\right)^2 \\ &\geqslant \left(\frac{\max_{y\in \mathcal{Y}}p(y\mid x) - 1 + c}{2}\right)^2 \\ &= \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4} \end{aligned} \]Picking \(\Gamma_3 (z) = 2\sqrt{z}\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_3 \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), which proves this case.
In summary, taking \(\Gamma(z) = \max\{\Gamma_1(z), \Gamma_2(z), \Gamma_3(z)\} = \max\{2n\sqrt{z}, nz\}\), we always have \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\). This proves the \((\mathcal{H}, \mathcal{R})\)-consistency bound for the single-stage surrogate loss with \(\mathcal{l} = \mathcal{l}_{\text{mae}}\) and \(\psi(z) = z\).
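The bound just proven can be spot-checked on random \((p, h, r)\) triples (a sketch; the dimension, constants, and seed are arbitrary choices of mine): compute \(\Delta C_L\) from the conditional risk \(\sum_y p(y\mid x)(1 - s_h(x,y))e^{\alpha r(x)} + ce^{-\alpha r(x)}\) and \(C^*_L = 2\sqrt{c(1 - \max_y p(y\mid x))}\) as derived above, compute \(\Delta C_{L_{\text{abst}}}\) as in Section 3, and check \(\Delta C_{L_{\text{abst}}} \leqslant \max\{2n\sqrt{\Delta C_L},\, n\Delta C_L\}\).

```python
# Random spot check of the (H, R)-consistency bound for l = l_mae, psi(z) = z.
import numpy as np

rng = np.random.default_rng(1)
n, alpha, c = 4, 1.0, 0.25
for _ in range(10_000):
    p = rng.random(n); p /= p.sum()
    h = rng.normal(size=n)
    r = rng.normal()
    s = np.exp(h - h.max()); s /= s.sum()
    y_pred = int(np.argmax(h))

    dC_L = float(p @ (1 - s)) * np.exp(alpha * r) + c * np.exp(-alpha * r) \
           - 2 * np.sqrt(c * (1 - p.max()))
    dC_L = max(dC_L, 0.0)          # dC_L >= 0 analytically; guard against FP noise
    C_abst = 1 - p[y_pred] if r > 0 else c
    dC_abst = C_abst - min(c, 1 - p.max())

    assert dC_abst <= max(2 * n * np.sqrt(dC_L), n * dC_L) + 1e-9
print("consistency bound verified on all random draws")
```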
4.3 \(\mathcal{l} = \mathcal{l}_{\rho}\),\(\psi(z) = z\)
In this case, \(C_L(h, r, x)\) can be expressed as:
where \(\rho_h(x, y) = h(x)_y - \max_{y^{\prime}\neq y}h(x)_{y^{\prime}}\) is the margin.
Since \(\mathcal{H}\) is assumed to be symmetric and complete, we have
Note: in fact, for any \(h\in \mathcal{H}\), we have:
\[\begin{aligned} &\left(1 - \sum_{y\in \mathcal{Y}} p (y\mid x)\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\}\right) - \left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right) \\ &= \max_{y\in \mathcal{Y}} p(y\mid x) - \sum_{y\in \mathcal{Y}} p (y\mid x)\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\} \\ &= \max_{y\in \mathcal{Y}} p(y\mid x) - \min \left\{1, \frac{\rho_h\left(x, \text{h}(x)\right)}{\rho}\right\}p\left(\text{h}(x)\mid x\right) \\ &\geqslant \max_{y\in \mathcal{Y}}p\left(y\mid x\right) - p\left(\text{h}(x)\mid x\right) \end{aligned} \]
As with the earlier proof for \(\mathcal{l}_{\text{mae}}\), we will use this conclusion several times in the subsequent proofs.
So, in parallel with the \(\mathcal{l}_{\text{mae}}\) case, we have
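This inequality can also be spot-checked numerically (a sketch with arbitrary dimension, \(\rho\), and seed):

```python
# Random check of the inequality in the note above for the rho-margin loss:
# (1 - sum_y p(y|x) min{1, rho_h(x,y)/rho}) - (1 - max_y p(y|x))
#   >= max_y p(y|x) - p(h(x)|x).
import numpy as np

rng = np.random.default_rng(2)
n, rho = 4, 1.0
for _ in range(10_000):
    p = rng.random(n); p /= p.sum()
    h = rng.normal(size=n)
    y_pred = int(np.argmax(h))
    # margins rho_h(x, y) = h(x)_y - max_{y' != y} h(x)_{y'}
    margins = np.array([h[y] - np.max(np.delete(h, y)) for y in range(n)])
    lhs = (1 - float(p @ np.minimum(1.0, margins / rho))) - (1 - p.max())
    rhs = p.max() - p[y_pred]
    assert lhs >= rhs - 1e-12
print("inequality holds on all random draws")
```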
Then
To establish the inequality between \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) and \(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), we proceed as in the proof for \(\mathcal{l}_{\text{mae}}\) and discuss \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\) case by case, according to the comparison between \(\max_{y\in \mathcal{Y}} p(y\mid x)\) and \(1 - c\) and the sign of \(r(x)\):
- \(\max_{y\in \mathcal{Y}} p(y\mid x) > 1 - c\), \(r(x) > 0\):

In this case \[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant \max_{y\in \mathcal{Y}}p(y\mid x) - p(\text{h}(x)\mid x) = \Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \](since the steps parallel those for \(\mathcal{l}_{\text{mae}}\), the intermediate steps are abbreviated here and below). Picking \(\Gamma_1 (z) = z\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_1 (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), which proves this case.
- \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant 1 - c\), \(r(x) > 0\):

\[\begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= \left(1 - \sum_{y\in \mathcal{Y}} p (y\mid x)\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\}\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right)} \\ &\geqslant \frac{1}{4}\left(1 - c - p\left(\text{h}\left(x\right)\mid x\right)\right)^2 \\ &= \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4} \end{aligned} \]Picking \(\Gamma_2 (z) = 2\sqrt{z}\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_2 (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), which proves this case.
- \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant 1 - c\), \(r(x) \leqslant 0\):
Since in this case \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = 0\), the inequality \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\) holds for any \(\Gamma \geqslant 0\).
- \(\max_{y\in \mathcal{Y}} p(y\mid x) > 1 - c\), \(r(x) \leqslant 0\):
\[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant \left(\frac{\max_{y\in \mathcal{Y}}p(y\mid x) - 1 + c}{2}\right)^2 = \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4} \]Picking \(\Gamma_3 (z) = 2\sqrt{z}\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_3 (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), which proves this case.

In summary, taking \(\Gamma(z) = \max\{\Gamma_1(z), \Gamma_2(z), \Gamma_3(z)\} = \max\{2\sqrt{z}, z\}\), we always have \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\). This proves the \((\mathcal{H}, \mathcal{R})\)-consistency bound for the single-stage surrogate loss with \(\mathcal{l} = \mathcal{l}_{\rho}\) and \(\psi(z) = z\).
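As in the \(\mathcal{l}_{\text{mae}}\) case, this bound can be spot-checked on random \((p, h, r)\) triples (a sketch with arbitrary constants and seed), computing \(\mathbb{E}_y[\mathcal{l}_{\rho}]\) directly from the definition of \(\phi_{\rho}\) and using \(C^*_L = 2\sqrt{c(1 - \max_y p(y\mid x))}\) as above:

```python
# Random spot check of the (H, R)-consistency bound for l = l_rho, psi(z) = z:
# Delta C_abst <= max{2 sqrt(Delta C_L), Delta C_L}.
import numpy as np

rng = np.random.default_rng(3)
n, rho, alpha, c = 4, 1.0, 1.0, 0.25
for _ in range(10_000):
    p = rng.random(n); p /= p.sum()
    h = rng.normal(size=n)
    r = rng.normal()
    y_pred = int(np.argmax(h))
    margins = np.array([h[y] - np.max(np.delete(h, y)) for y in range(n)])
    exp_loss = float(p @ np.clip(1 - margins / rho, 0.0, 1.0))  # E_y[l_rho]

    dC_L = exp_loss * np.exp(alpha * r) + c * np.exp(-alpha * r) \
           - 2 * np.sqrt(c * (1 - p.max()))
    dC_L = max(dC_L, 0.0)          # dC_L >= 0 analytically; guard against FP noise
    C_abst = 1 - p[y_pred] if r > 0 else c
    dC_abst = C_abst - min(c, 1 - p.max())

    assert dC_abst <= max(2 * np.sqrt(dC_L), dC_L) + 1e-9
print("consistency bound verified on all random draws")
```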
4.4 \(\mathcal{l} = \mathcal{l}_{\rho-\text{hinge}}\),\(\psi(z) = nz\)
In this case, \(C_L(h, r, x)\) can be expressed as:
Since \(\mathcal{H}\) is assumed to be symmetric and complete, we have
Note: in fact, define \(h_{\rho}\) by \(h_{\rho}(x)_y = \left\{\begin{aligned} &h(x)_y\quad &\text{if } y\notin \left\{y_{\max}, \text{h}(x)\right\} \\ &-\rho \quad &\text{if } y = \text{h}(x) \\ &h\left(x\right)_{y_{\text{max}}} + h\left(x\right)_{\text{h}(x)} + \rho \quad &\text{if } y = y_{\text{max}} \\ \end{aligned}\right.\), which satisfies the constraint \(\sum_{y\in \mathcal{Y}}h_{\rho}(x)_y=0\), where \(y_{\max} = \arg\max_{y\in \mathcal{Y}}p(y\mid x)\); then for any \(h\in \mathcal{H}\) we have:
\[\begin{aligned} &\sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\max\left\{0, 1 + \frac{h(x)_y}{\rho}\right\} - n\left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right) \\ &\geqslant \sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\min\left\{n, \max\left\{0, 1 + \frac{h(x)_y}{\rho}\right\}\right\} - n\left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right) \\ &\geqslant \sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\min\left\{n, \max\left\{0, 1 + \frac{h(x)_y}{\rho}\right\}\right\} \\ &\quad - \sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\min\left\{n, \max\left\{0, 1 + \frac{h_{\rho}(x)_y}{\rho}\right\}\right\} \\ &= \left(p(y_{\text{max}}\mid x) - p(\text{h}(x)\mid x)\right)\min\left\{n, 1 + \frac{h(x)_{\text{h}(x)}}{\rho}\right\} \\ &\geqslant \max_{y\in \mathcal{Y}}p\left(y\mid x\right) - p\left(\text{h}\left(x\right)\mid x\right) \end{aligned} \]
As with the earlier proofs for \(\mathcal{l}_{\text{mae}}\) and \(\mathcal{l}_{\rho}\), we will use this conclusion several times in the subsequent proofs.
So, in parallel with the \(\mathcal{l}_{\text{mae}}\) and \(\mathcal{l}_{\rho}\) cases, we have
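The final inequality of the note can likewise be spot-checked numerically, with the scores centered so that the constraint \(\sum_{y\in \mathcal{Y}}h(x)_y = 0\) holds (dimension, \(\rho\), and seed are arbitrary choices of mine):

```python
# Random check of the inequality in the note above for the constrained
# rho-hinge loss: sum_y (1 - p(y|x)) max{0, 1 + h(x)_y/rho} - n(1 - max_y p(y|x))
#   >= max_y p(y|x) - p(h(x)|x),   subject to sum_y h(x)_y = 0.
import numpy as np

rng = np.random.default_rng(4)
n, rho = 4, 1.0
for _ in range(10_000):
    p = rng.random(n); p /= p.sum()
    h = rng.normal(size=n)
    h -= h.mean()                  # enforce the constraint sum_y h(x)_y = 0
    y_pred = int(np.argmax(h))
    lhs = float((1 - p) @ np.maximum(0.0, 1.0 + h / rho)) - n * (1 - p.max())
    rhs = p.max() - p[y_pred]
    assert lhs >= rhs - 1e-12
print("inequality holds on all random draws")
```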
Then
To establish the inequality between \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) and \(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), we proceed as in the proofs for \(\mathcal{l}_{\text{mae}}\) and \(\mathcal{l}_{\rho}\) and discuss \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\) case by case, according to the comparison between \(\max_{y\in \mathcal{Y}} p(y\mid x)\) and \(1 - c\) and the sign of \(r(x)\):
- \(\max_{y\in \mathcal{Y}} p(y\mid x) > 1 - c\), \(r(x) > 0\):

In this case \[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant \max_{y\in \mathcal{Y}}p(y\mid x) - p(\text{h}(x)\mid x) = \Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \]Picking \(\Gamma_1 (z) = z\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_1 (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), which proves this case.
- \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant 1 - c\), \(r(x) > 0\):

\[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant \frac{1}{4n}\left(1 - c - p\left(\text{h}\left(x\right)\mid x\right)\right)^2 = \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4n} \]Picking \(\Gamma_2 (z) = 2\sqrt{nz}\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_2 (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), which proves this case.
- \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant 1 - c\), \(r(x) \leqslant 0\):
Since in this case \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = 0\), the inequality \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\) holds for any \(\Gamma \geqslant 0\).
- \(\max_{y\in \mathcal{Y}} p(y\mid x) > 1 - c\), \(r(x) \leqslant 0\):
\[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant n\left(\frac{\max_{y\in \mathcal{Y}}p(y\mid x) - 1 + c}{2}\right)^2 = \frac{n\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4} \]Picking \(\Gamma_3 (z) = 2\sqrt{z/n}\) gives \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma_3 (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), which proves this case.

In summary, taking \(\Gamma(z) = \max\{\Gamma_1(z), \Gamma_2(z), \Gamma_3(z)\} = \max\{2\sqrt{nz}, z\}\), we always have \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\). This proves the \((\mathcal{H}, \mathcal{R})\)-consistency bound for the single-stage surrogate loss with \(\mathcal{l} = \mathcal{l}_{\rho-\text{hinge}}\) and \(\psi(z) = nz\).
References
- [1] Mao A, Mohri M, Zhong Y. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms[C]//International Conference on Algorithmic Learning Theory. PMLR, 2024: 822-867.
- [2] Cortes C, DeSalvo G, Mohri M. Boosting with abstention[J]. Advances in Neural Information Processing Systems, 2016, 29.
- [3] Ni C, Charoenphakdee N, Honda J, et al. On the calibration of multiclass classification with rejection[J]. Advances in Neural Information Processing Systems, 2019, 32.
- [4] Han Bao: Learning Theory Bridges Loss Functions
- [5] Crammer K, Singer Y. On the algorithmic implementation of multiclass kernel-based vector machines[J]. Journal of Machine Learning Research, 2001, 2(Dec): 265-292.
- [6] Awasthi P, Mao A, Mohri M, et al. Multi-class $H$-consistency bounds[J]. Advances in Neural Information Processing Systems, 2022, 35: 782-795.