
Learning Theory: Proof of the \((\mathcal{H}, \mathcal{R})\)-Consistency Bound for the Single-Stage Surrogate Loss

Posted on 2025-03-05

1 Introduction

In the previous blog post, "Learning Theory: Predictor-Rejector Multi-Class Abstention Learning", we introduced the basic concepts and methods of abstention learning, including the following single-stage predictor-rejector abstention loss \(L_{\text{abst}}\) for multi-class problems:

\[L_{\text{abst}}(h, r, x, y) = \underbrace{\mathbb{I}_{\text{h}(x) \neq y}\mathbb{I}_{r(x) > 0}}_{\text{not abstained}} + \underbrace{c(x) \mathbb{I}_{r(x)\leqslant 0}}_{\text{abstained}} \]

where \((x, y)\in \mathcal{X}\times \mathcal{Y}\) (with label set \(\mathcal{Y} = \{1, \cdots, n\}\), \(n\geqslant 2\)), \((h, r)\in \mathcal{H}\times\mathcal{R}\) is a predictor-rejector pair (\(\mathcal{H}\) and \(\mathcal{R}\) are two function families defined on \(\mathcal{X}\): each \(h\) assigns a score \(h(x)_y\) to every label, and each \(r\) maps \(\mathcal{X}\) to \(\mathbb{R}\)), and \(\text{h}(x) = \text{arg max}_{y\in \mathcal{Y}} {h(x)}_y\) is the label predicted for instance \(x\). To simplify the discussion, we assume that the cost function is a constant \(c\in (0, 1)\).

Let \(\mathcal{l}\) be a surrogate for the 0-1 multi-class loss defined over the label set \(\mathcal{Y}\). On this basis, we can further define the abstention surrogate loss \(L\):

\[L(h, r, x, y) = \mathcal{l}(h, x, y)\phi(-\alpha r(x)) + \psi(c) \phi(\beta r(x)) \]

where \(\psi\) is a non-decreasing function, \(\phi\) is a non-increasing auxiliary function (an upper bound on \(z \mapsto \mathbb{I}_{z \leqslant 0}\)), and \(\alpha\), \(\beta\) are positive constants. For simplicity, our analysis below focuses on \(\phi(z) = \exp(-z)\), although a similar analysis applies to other choices of \(\phi\).
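For concreteness, the two losses above can be sketched in Python as follows (a minimal illustration; the function names `abstention_loss` and `surrogate_loss` are ours, with \(\phi(z) = \exp(-z)\) and a constant cost \(c\)):

```python
import math

def abstention_loss(scores, r_x, y, c):
    """Target loss L_abst: 0-1 error if we predict (r(x) > 0), cost c if we abstain."""
    pred = max(range(len(scores)), key=lambda k: scores[k])  # h(x) = argmax_y h(x)_y
    if r_x > 0:
        return float(pred != y)   # not abstained: indicator of misclassification
    return c                      # abstained: constant cost

def surrogate_loss(ell, scores, r_x, y, c, alpha=1.0, psi=lambda z: z):
    """Single-stage surrogate L = ell(h,x,y) * phi(-alpha*r(x)) + psi(c) * phi(alpha*r(x)),
    with phi(z) = exp(-z)."""
    return ell(scores, y) * math.exp(alpha * r_x) + psi(c) * math.exp(-alpha * r_x)
```

Note that \(\phi(-\alpha r(x)) = e^{\alpha r(x)}\), so the prediction term is amplified when \(r(x) > 0\) and the abstention term is amplified when \(r(x) < 0\).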

In the previous post, we also mentioned that the single-stage surrogate loss satisfies the following \((\mathcal{H}, \mathcal{R})\)-consistency bound:

Theorem 1 (\((\mathcal{H}, \mathcal{R})\)-consistency of the single-stage surrogate loss) Assume that \(\mathcal{H}\) is symmetric and complete. Then, for \(\alpha=\beta\) and either \(\mathcal{l} = \mathcal{l}_{\text{mae}}\) with \(\psi(z) = z\), or \(\mathcal{l} = \mathcal{l}_{\rho}\) with \(\psi(z) = z\), or \(\mathcal{l} = \mathcal{l}_{\rho - \text{hinge}}\) with \(\psi(z) = nz\), the following \((\mathcal{H}, \mathcal{R})\)-consistency bound holds for all \(h\in \mathcal{H}, r\in \mathcal{R}\) and any distribution:

\[R_{L_{\text{abst}}}(h, r) - R_{L_{\text{abst}}}^{*}(\mathcal{H}, \mathcal{R}) + M_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}) \leqslant \Gamma(R_L(h, r) - R_{L}^{*}(\mathcal{H}, \mathcal{R}) + M_{L}(\mathcal{H}, \mathcal{R})) \]

where for \(\mathcal{l} = \mathcal{l}_{\text{mae}}\) we take \(\Gamma (z) = \max\{2n\sqrt{z}, nz\}\); for \(\mathcal{l}=\mathcal{l}_{\rho}\), \(\Gamma (z) = \max\{2\sqrt{z}, z\}\); and for \(\mathcal{l} = \mathcal{l}_{\rho - \text{hinge}}\), \(\Gamma (z) = \max\{2\sqrt{nz}, z\}\).

However, the previous post did not show the detailed proof of this \((\mathcal{H}, \mathcal{R})\)-consistency bound for the single-stage surrogate loss. In this article, let us see how to prove the theorem (my supervisor also asked me to take a closer look at the relevant analysis in these papers, hoping that I can master the proof techniques of the single-stage method).

2 Preliminary concepts for the analysis

We assume the labeled sample \(S=((x_1, y_1), \cdots, (x_m, y_m))\) is drawn i.i.d. from \(p(x, y)\). For the target loss \(L_{\text{abst}}\) and the surrogate loss \(L\), we can define the expected abstention loss \(R_{L_{\text{abst}}}(h, r)\) (i.e., the generalization error of the target loss) and the expected surrogate abstention loss \(R_{L}(h, r)\) (i.e., the generalization error of the surrogate loss) as follows:

\[R_{L_{\text{abst}}}(h, r) = \mathbb{E}_{p(x, y)}\left[L_{\text{abst}}(h, r, x, y)\right], \quad R_{L}(h, r) = \mathbb{E}_{p(x, y)}\left[L(h, r, x, y)\right] \]

Let \(R^{*}_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}) = \inf_{h\in \mathcal{H}, r\in \mathcal{R}}R_{L_{\text{abst}}}(h, r)\) and \(R_{L}^{*}(\mathcal{H}, \mathcal{R}) = \inf_{h\in \mathcal{H}, r\in \mathcal{R}}R_{L}(h, r)\) denote the infima of \(R_{L_{\text{abst}}}\) and \(R_L\) over \(\mathcal{H}\times \mathcal{R}\), respectively.
To further simplify the subsequent analysis, we use the product rule of probability to write \(R_L(h, r)\) as:

\[R_{L}(h, r) = \mathbb{E}_{p(x, y)}\left[L(h, r, x, y)\right] = \mathbb{E}_{p(x)}\underbrace{\left[\mathbb{E}_{p(y\mid x)}\left[L(h, r, x, y)\right]\right]}_{\text{conditional risk }C_L} \]

We call the inner conditional expectation the conditional risk of the surrogate loss \(L\) (also known as the pointwise risk of \(L\)). Since the expectation is taken over \(y\), this term depends only on \(h\), \(r\), and \(x\), so we write it as \(C_L(h, r, x)\):

\[C_L(h, r, x) = \mathbb{E}_{p(y\mid x)}\left[L(h, r, x, y)\right] = \sum_{y\in \mathcal{Y}}p(y\mid x)L(h, r, x, y) \]

We use \(C^*_L(\mathcal{H}, \mathcal{R}, x) = \inf_{h\in \mathcal{H}, r\in \mathcal{R}} C_L(h, r, x)\) to denote the best-in-class conditional risk of \(L\). Similarly, we use \(C_{L_{\text{abst}}}\) to denote the conditional risk of the target loss \(L_{\text{abst}}\), and \(C^*_{L_{\text{abst}}}\) to denote the best-in-class conditional risk of \(L_{\text{abst}}\).

From \(R_{L}^*(\mathcal{H}, \mathcal{R})\) and \(C^*_L(\mathcal{H}, \mathcal{R}, x)\), we can define the minimizability gap:

\[M_L(\mathcal{H}, \mathcal{R}) = R_{L}^*(\mathcal{H}, \mathcal{R}) - \mathbb{E}_{p(x)}\left[C_L^*(\mathcal{H}, \mathcal{R}, x)\right] \]

\(M_{L_{\text{abst}}}\) is defined analogously.

Thus, the \((\mathcal{H}, \mathcal{R})\)-consistency bound to be proved can be rewritten as:

\[R_{L_{\text{abst}}}(h, r) - R_{L_{\text{abst}}}^{*}(\mathcal{H}, \mathcal{R}) + M_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}) \leqslant \Gamma(R_L(h, r) - R_{L}^{*}(\mathcal{H}, \mathcal{R}) + M_{L}(\mathcal{H}, \mathcal{R}))\\ \Rightarrow R_{L_{\text{abst}}}(h, r) - \mathbb{E}_{p(x)}\left[C_{L_{\text{abst}}}^*(\mathcal{H}, \mathcal{R}, x)\right] \leqslant \Gamma\left(R_{L}(h, r) - \mathbb{E}_{p(x)}\left[C_{L}^*(\mathcal{H}, \mathcal{R}, x)\right]\right) \]

Since \(R_{L_{\text{abst}}}(h, r)\) and \(R_L(h, r)\) equal \(\mathbb{E}_{p(x)}\left[C_{L_{\text{abst}}}(h, r, x)\right]\) and \(\mathbb{E}_{p(x)}\left[C_{L}(h, r, x)\right]\) respectively, the above inequality becomes

\[\mathbb{E}_{p(x)}\underbrace{\left[C_{L_{\text{abst}}}(h, r, x) - C_{L_{\text{abst}}}^*(\mathcal{H}, \mathcal{R}, x)\right]}_{\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)} \leqslant \Gamma\left(\mathbb{E}_{p(x)}\underbrace{\left[C_{L}(h, r, x) - C_{L}^*(\mathcal{H}, \mathcal{R}, x)\right]}_{\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)}\right) \]

We abbreviate the integrands on the two sides of the above inequality as \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) and \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\), where \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\) is called the calibration gap. Since \(\Gamma(\cdot)\) is by definition concave and non-decreasing, Jensen's inequality gives:

\[\mathbb{E}_{p(x)}\left[\Gamma\left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\right] \leqslant \Gamma\left(\mathbb{E}_{p(x)}\left[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right]\right) \]

Therefore, if we can prove the following pointwise inequality, the original bound follows:

\[\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right) \]

We will see below that a key step in the proof of \((\mathcal{H}, \mathcal{R})\)-consistency is showing that \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) can be bounded by \(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\).
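As an aside, the Jensen step above can be illustrated numerically: for the concave, non-decreasing branch \(\Gamma(z) = 2\sqrt{z}\), we check \(\mathbb{E}[\Gamma(X)] \leqslant \Gamma(\mathbb{E}[X])\) on random non-negative samples (a toy illustration, not part of the proof):

```python
import math
import random

random.seed(0)

def gamma(z):
    return 2.0 * math.sqrt(z)  # concave, non-decreasing on [0, inf)

# Random non-negative "calibration gaps"
xs = [random.uniform(0.0, 2.0) for _ in range(10_000)]

lhs = sum(gamma(x) for x in xs) / len(xs)   # E[Gamma(X)]
rhs = gamma(sum(xs) / len(xs))              # Gamma(E[X])
print(lhs <= rhs + 1e-12)  # prints True: Jensen's inequality for concave Gamma
```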

3 Expression of \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\)

Let us first see how to express \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = C_{L_{\text{abst}}}(h, r, x) - C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x)\). By definition, we have:

\[\begin{aligned} C_{L_{\text{abst}}}(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x)L_{\text{abst}}(h, r, x, y) \\ &= \sum_{y\in \mathcal{Y}}p(y\mid x) \mathbb{I}_{\text{h}(x) \neq y}\mathbb{I}_{r(x) > 0} + c(x) \mathbb{I}_{r(x)\leqslant 0} \end{aligned} \]

Since the conditional expectation is over \(y\), only the term \(\mathbb{I}_{\text{h}(x) \neq y}\) in the last line needs the weighted sum. To further simplify \(C_{L_{\text{abst}}}(h, r, x)\), we split into cases by the sign of \(r(x)\):

  1. \(r(x) > 0\): in this case \(C_{L_{\text{abst}}}(h, r, x) = \sum_{y\in \mathcal{Y}}p(y\mid x) \mathbb{I}_{\text{h}(x) \neq y} = 1 - p(\text{h}(x)\mid x)\)
  2. \(r(x) \leqslant 0\): in this case \(C_{L_{\text{abst}}}(h, r, x) = c\)

Next, let us see how to express \(C^*_{L_{\text{abst}}}\). We assume the rejection function family \(\mathcal{R}\) is complete (that is, for any \(x\in \mathcal{X}, \{r(x): r\in \mathcal{R}\} = \mathbb{R}\)), so \(\mathcal{R}\) can both abstain and not abstain (that is, for any \(x\in \mathcal{X}\), there exist \(r_1, r_2\in \mathcal{R}\) satisfying \(r_1(x) > 0\) and \(r_2(x) \leqslant 0\)). Then we have

\[\begin{aligned} C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x) &= \inf_{h\in \mathcal{H}, r\in \mathcal{R}}C_{L_{\text{abst}}}(h, r, x)\\ & = \min \left\{\min_{h\in \mathcal{H}}\left(1 - p\left( \text{h}(x)\mid x\right)\right), c\right\}\\ & = 1 - \max\left\{\max_{h\in \mathcal{H}}p\left(\text{h}(x)\mid x\right), 1 - c\right\} \end{aligned} \]

We assume \(\mathcal{H}\) is symmetric and complete (see the blog post "Learning Theory: Predictor-Rejector Multi-Class Abstention Learning" for details); then \(\{\text{h}(x): h\in \mathcal{H}\} = \mathcal{Y}\), so

\[C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x) = 1 - \max\left\{\max_{y\in \mathcal{Y}}p\left(y\mid x\right), 1 - c\right\} \]

To further simplify \(C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x)\), we split into cases by comparing \(\max_{y\in \mathcal{Y}}p(y\mid x)\) with \((1 - c)\):

  1. \(\max_{y\in \mathcal{Y}}p(y\mid x) > 1 - c\): in this case \(C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x) = 1 - \max_{y\in \mathcal{Y}}p(y\mid x)\)
  2. \(\max_{y\in \mathcal{Y}}p(y\mid x) \leqslant 1 - c\): in this case \(C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x) = c\)

So, we have:

\[\begin{aligned} \Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) &= C_{L_{\text{abst}}}(h, r, x) - C^*_{L_{\text{abst}}}(\mathcal{H}, \mathcal{R}, x) \\ & = \left\{\begin{aligned} &\max_{y\in \mathcal{Y}}p(y\mid x) - p(\text{h}(x)\mid x)\quad &\text{if } \max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c),r(x) > 0 \\ &1 - c - p(\text{h}(x)\mid x) \quad &\text{if } \max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c),r(x) > 0 \\ &0 \quad &\text{if } \max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c),r(x) \leqslant 0 \\ &\max_{y\in \mathcal{Y}}p(y\mid x) - 1 + c \quad &\text{if } \max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c),r(x) \leqslant 0 \\ \end{aligned}\right. \end{aligned} \]

4 Expression of \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\)

4.1 Setting up the case analysis

Next, let us see how to express \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) = C_L(h, r, x) - C^*_L(\mathcal{H}, \mathcal{R}, x)\). By definition, for \(\alpha = \beta\) and \(\phi(z) = \exp(-z)\), we have:

\[\begin{aligned} C_L(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x)L(h, r, x, y) \\ &= \sum_{y\in \mathcal{Y}}p(y\mid x) \mathcal{l}(h, x, y)e^{\alpha r(x)} + \psi(c) e^{-\alpha r(x)} \end{aligned} \]

Since the conditional expectation is over \(y\), only the term \(\mathcal{l}(h, x, y)\) in the last line needs the weighted sum. Below, we analyze \(C_L(h, r, x)\) for the following three choices of the loss \(\mathcal{l}\) and the function \(\psi\):

  1. \(\mathcal{l} = \mathcal{l}_{\text{mae}}\), \(\psi(z) = z\)
  2. \(\mathcal{l} = \mathcal{l}_{\rho}\), \(\psi(z) = z\)
  3. \(\mathcal{l} = \mathcal{l}_{\rho-\text{hinge}}\), \(\psi(z) = nz\)

Note: the definitions of these three losses are given in the blog post "Learning Theory: Predictor-Rejector Multi-Class Abstention Learning"; I restate them here:

  • Mean absolute error loss: \(\mathcal{l}_{\text{mae}}(h, x, y) = 1 - \frac{e^{{h(x)}_y}}{\sum_{y^{\prime}\in \mathcal{Y}}e^{{h(x)}_{y^{\prime}}}}\)
  • Constrained \(\rho\)-hinge loss: \(\mathcal{l}_{\rho-\text{hinge}}(h, x, y) = \sum_{y^{\prime}\neq y}\phi_{\rho-\text{hinge}}(-{h(x)}_{y^{\prime}}), \rho > 0\), where \(\phi_{\rho-\text{hinge}}(z) = \max\{0, 1 - \frac{z}{\rho}\}\) is the \(\rho\)-hinge loss, subject to the constraint \(\sum_{y\in \mathcal{Y}}{h(x)}_y=0\)
  • \(\rho\)-margin loss: \(\mathcal{l}_{\rho}(h, x, y) = \phi_{\rho}({\rho_h (x, y)})\), where \(\rho_{h}(x, y) = h(x)_y - \max_{y^{\prime} \neq y}h(x)_{y^{\prime}}\) is the confidence margin and \(\phi_{\rho}(z) = \min\{\max\{0, 1 - \frac{z}{\rho}\}, 1\}, \rho > 0\), is the \(\rho\)-margin loss.
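These three losses can be written directly in Python (an illustrative sketch; for \(\mathcal{l}_{\rho-\text{hinge}}\), the score vector is assumed to already satisfy the constraint \(\sum_y h(x)_y = 0\)):

```python
import math

def ell_mae(scores, y):
    """Mean absolute error loss: 1 - softmax(h(x))_y."""
    exps = [math.exp(s) for s in scores]
    return 1.0 - exps[y] / sum(exps)

def ell_rho(scores, y, rho=1.0):
    """rho-margin loss applied to the confidence margin rho_h(x, y)."""
    margin = scores[y] - max(s for k, s in enumerate(scores) if k != y)
    return min(max(0.0, 1.0 - margin / rho), 1.0)

def ell_rho_hinge(scores, y, rho=1.0):
    """Constrained rho-hinge loss; `scores` should satisfy sum(scores) == 0."""
    return sum(max(0.0, 1.0 + s / rho) for k, s in enumerate(scores) if k != y)
```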

4.2 \(\mathcal{l} = \mathcal{l}_{\text{mae}}\), \(\psi(z) = z\)

In this case, \(C_L(h, r, x)\) can be expressed as:

\[\begin{aligned} C_L(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x) \underbrace{\left(1 - \frac{e^{{h(x)}_y}}{\sum_{y^{\prime}\in \mathcal{Y}}e^{{h(x)}_{y^{\prime}}}}\right)}_{\mathcal{l}_{\text{mae}}}e^{\alpha r(x)} + c e^{-\alpha r(x)} \\ &= \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} \end{aligned} \]

where \(s_h(x, y) = \frac{e^{{h(x)}_y}}{\sum_{y^{\prime}\in \mathcal{Y}}e^{{h(x)}_{y^{\prime}}}}\) is the softmax score.

Then

\[\begin{aligned} C_L^*(\mathcal{H}, \mathcal{R}, x) &= \inf_{h\in \mathcal{H}, r\in\mathcal{R}} \left\{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)}\right\} \\ &= \inf_{r\in\mathcal{R}} \left\{\inf_{h\in \mathcal{H}}\left\{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)\right\}e^{\alpha r(x)} + c e^{-\alpha r(x)}\right\} \end{aligned} \]

Since \(\mathcal{H}\) is assumed to be symmetric and complete, we have

\[\begin{aligned} &\inf_{h\in \mathcal{H}}\left\{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y\right))\right\} \\ &= 1 - \sup_{h\in \mathcal{H}}\sum_{y\in \mathcal{Y}}p(y\mid x)s_h(x, y) \\ &= 1 - \max_{y\in \mathcal{Y}}p(y\mid x)\quad \left(s_h(x, y)\in (0, 1)\right) \end{aligned} \]

Note: in fact, for any \(h\in \mathcal{H}\), we have:

\[\begin{aligned} &\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right) \\ &= \max_{y\in \mathcal{Y}} p(y\mid x) - \sum_{y\in \mathcal{Y}}p(y\mid x)s_h(x, y) \\ &= \max_{y\in \mathcal{Y}} p(y\mid x) - \left(p\left(\text{h}(x)\mid x\right)s_h\left(x, \text{h}(x)\right) + \sum_{y\neq \text{h}(x)}p(y\mid x)s_h(x, y)\right) \\ &\geqslant \max_{y\in \mathcal{Y}} p(y\mid x) - \left(p\left(\text{h}(x)\mid x\right)s_h\left(x, \text{h}(x)\right) + \max_{y\in \mathcal{Y}}p(y\mid x)\left(1 - s_h\left(x, \text{h}(x)\right)\right)\right) \\ &= s_h\left(x, \text{h}(x)\right)\left(\max_{y\in \mathcal{Y}}p(y\mid x) - p\left(\text{h}(x)\mid x\right)\right) \\ &\geqslant \frac{1}{n} \left(\max_{y\in \mathcal{Y}}p(y\mid x) - p\left(\text{h}(x)\mid x\right)\right) \end{aligned} \]

We will use this conclusion many times in the subsequent proofs. Intuitively, if the classifier \(h^*\) is the Bayes-optimal classifier (i.e., \(p(\text{h}^*(x)\mid x) = \max_{y\in \mathcal{Y}} p(y\mid x)\)), then the lower bound above becomes \(0\), so \(\mathbb{E}_{p(y\mid x)}\left[\mathcal{l}_{\text{mae}}\right]\) is allowed to get closer to its best-in-class value \(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\).
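The inequality in this note can be spot-checked numerically with random distributions and score vectors (an illustrative sketch; names are ours):

```python
import math
import random

random.seed(0)

for _ in range(10_000):
    n = random.randint(2, 6)
    w = [random.random() + 1e-6 for _ in range(n)]
    p = [v / sum(w) for v in w]                      # p(.|x)
    h = [random.uniform(-3.0, 3.0) for _ in range(n)]  # scores h(x)
    exps = [math.exp(v) for v in h]
    s = [e / sum(exps) for e in exps]                # softmax s_h(x, .)
    pred = max(range(n), key=lambda k: h[k])         # h(x)

    lhs = sum(p[y] * (1 - s[y]) for y in range(n)) - (1 - max(p))
    rhs = (max(p) - p[pred]) / n
    assert lhs >= rhs - 1e-12
```

The key fact the last step of the derivation relies on is that \(s_h(x, \text{h}(x)) \geqslant 1/n\), since \(\text{h}(x)\) is the argmax of the softmax scores.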

Then

\[C_L^*(\mathcal{H}, \mathcal{R}, x) = \inf_{r\in\mathcal{R}} \left\{\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)}\right\} \]

The expression to be minimized above is a functional \(F(r)\); its functional derivative is

\[\frac{\delta F}{\delta r(x)} = \alpha \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)e^{\alpha r(x)} - c\alpha e^{-\alpha r(x)} \]

Setting \(\frac{\delta F}{\delta r(x)} = 0\) (for all \(\forall x\in \mathcal{X}\)) and solving gives \(r^*(x) = -\frac{1}{2\alpha}\log \left(\frac{1 - \max_{y\in \mathcal{Y}}p(y\mid x)}{c}\right)\). Substituting it into \(F(r)\) yields:

\[C_L^*(\mathcal{H}, \mathcal{R}, x) = 2\sqrt{c(1 - \max_{y\in \mathcal{Y}}p(y\mid x))} \]
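The stationary point and the closed-form minimum can be verified numerically with a brute-force grid search over \(r(x)\) (an illustrative sketch; \(\alpha\), \(c\), and \(\max_y p(y\mid x)\) are chosen arbitrarily):

```python
import math

alpha, c, pmax = 1.0, 0.2, 0.7
B = 1.0 - pmax  # best-in-class inner risk 1 - max_y p(y|x)

def F(r):
    return B * math.exp(alpha * r) + c * math.exp(-alpha * r)

# Closed-form minimizer and minimum value derived above
r_star = -math.log(B / c) / (2.0 * alpha)
f_star = 2.0 * math.sqrt(c * B)

# Brute-force grid search over r in [-5, 5]
f_grid = min(F(i / 1000.0) for i in range(-5000, 5001))

print(abs(F(r_star) - f_star) < 1e-12, f_grid >= f_star - 1e-9)  # prints True True
```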

Then

\[\begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= C_L(h, r, x) - C^*_L(\mathcal{H}, \mathcal{R}, x) \\ &= \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c(1 - \max_{y\in \mathcal{Y}}p(y\mid x))} \end{aligned} \]

To relate \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) and \(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), we follow the approach of Section 3 and split \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\) into cases according to the comparison of \(\max_{y\in \mathcal{Y}} p(y\mid x)\) with \(1 - c\) and the sign of \(r(x)\):

  1. \(\max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c)\), \(r(x) > 0\):
    In this case,

    \[\begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ &= \underbrace{\left(\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)\right)}_{\geqslant 0}e^{\alpha r(x)} \\ &\quad + \underbrace{\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)}}_{\geqslant 0 \ (\text{AM-GM inequality})} \\ &\geqslant \left(\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)\right)e^{\alpha r(x)} \\ &\geqslant \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right) \quad \left(e^{\alpha r(x)} \geqslant 1 \text{ since } r(x) > 0\right)\\ &\geqslant \frac{1}{n} \left(\max_{y\in \mathcal{Y}}p(y\mid x) - p\left(\text{h}(x)\mid x\right)\right) \\ &= \frac{1}{n} \Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \end{aligned} \]

    (Here, the AM-GM inequality refers to the arithmetic mean-geometric mean inequality \(u + v \geqslant 2\sqrt{uv}\) for \(u, v \geqslant 0\).)
    Taking \(\Gamma_1 (z) = nz\), we get \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), as required.

  2. \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c)\), \(r(x) > 0\):

    \[ \begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ & \geqslant \underbrace{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h\left(x, y\right)\right)}_{\geqslant c}e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(\sum_{y\in \mathcal{Y}}p(y\mid x)\left(1 - s_h(x, y)\right)\right)} \\ & \geqslant \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) + c - 2\sqrt{c\left(\sum_{y\in \mathcal{Y}}p(y\mid x)\left(1 - s_h(x, y)\right)\right)} \\ &= \left(\sqrt{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)} - \sqrt{c}\right)^2 \\ &= \left(\frac{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - c}{\sqrt{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)} + \sqrt{c}}\right)^2 \\ &\geqslant \left(\frac{\sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right) - \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right) + \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x) - c\right)}{2}\right)^2 \\ &\geqslant \left(\frac{\frac{1}{n} \left(\max_{y\in \mathcal{Y}}p(y\mid x) - p\left(\text{h}(x)\mid x\right)\right) + \frac{1}{n}\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x) - c\right)}{2}\right)^2 \\ &= \frac{1}{4n^2}\left(1 - c - p\left(\text{h}(x)\mid x\right)\right)^2 \\ &= \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4n^2} \end{aligned} \]

    Taking \(\Gamma_2 (z) = 2n\sqrt{z}\), we get \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), as required.

  3. \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c)\), \(r(x) \leqslant 0\):
    In this case \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = 0\), so \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma\left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\) holds for any \(\Gamma \geqslant 0\).

  4. \(\max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c)\), \(r(x) \leqslant 0\):

    \[ \begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x) \left(1 - s_h(x, y)\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ &\geqslant \left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)\underbrace{e^{\alpha r(x)}}_{\leqslant 1} + c \underbrace{e^{-\alpha r(x)}}_{\geqslant 1} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ &\geqslant 1 - \max_{y\in \mathcal{Y}}p(y\mid x) + c - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p(y\mid x)\right)} \\ &= \left(\sqrt{1 - \max_{y\in \mathcal{Y}}p(y\mid x)} - \sqrt{c}\right)^2 \\ &= \left(\frac{1 - \max_{y\in \mathcal{Y}}p(y\mid x) - c}{\sqrt{1 - \max_{y\in \mathcal{Y}}p(y\mid x)} + \sqrt{c}}\right)^2 \\ &\geqslant \left(\frac{\max_{y\in \mathcal{Y}}p(y\mid x) - 1 + c}{2}\right)^2 \\ &= \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4} \end{aligned} \]

    Taking \(\Gamma_3 (z) = 2\sqrt{z}\), we get \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), as required.

To sum up, if we take \(\Gamma(z) = \max\{\Gamma_1(z), \Gamma_2(z), \Gamma_3(z)\} = \max\{2n\sqrt{z}, nz\}\), then \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\) always holds. The \((\mathcal{H}, \mathcal{R})\)-consistency bound of the single-stage surrogate loss with \(\mathcal{l} = \mathcal{l}_{\text{mae}}\), \(\psi(z) = z\) is thus proved.
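As a sanity check on the whole case analysis for \(\mathcal{l}_{\text{mae}}\), the sketch below draws random distributions, score vectors, and rejection values, computes both sides from the closed-form expressions derived above, and verifies the pointwise bound with \(\Gamma(z) = \max\{2n\sqrt{z}, nz\}\) (taking \(\alpha = 1\); all names are ours):

```python
import math
import random

random.seed(0)

for _ in range(20_000):
    n = random.randint(2, 5)
    w = [random.random() + 1e-6 for _ in range(n)]
    p = [v / sum(w) for v in w]                        # p(.|x)
    h = [random.uniform(-3.0, 3.0) for _ in range(n)]  # score vector h(x)
    r = random.uniform(-2.0, 2.0)                      # rejection value r(x)
    c = random.uniform(0.05, 0.95)                     # abstention cost

    exps = [math.exp(v) for v in h]
    s = [e / sum(exps) for e in exps]                  # softmax s_h(x, .)
    pred = max(range(n), key=lambda k: h[k])           # predicted label h(x)
    pmax = max(p)

    # Surrogate side: C_L minus its best-in-class value 2*sqrt(c*(1 - pmax))
    C_L = sum(p[y] * (1 - s[y]) for y in range(n)) * math.exp(r) + c * math.exp(-r)
    d_L = max(C_L - 2.0 * math.sqrt(c * (1.0 - pmax)), 0.0)  # clamp rounding error

    # Target side: Delta C_{L_abst} via the four-case expression of Section 3
    C_abst = (1.0 - p[pred]) if r > 0 else c
    d_abst = C_abst - (1.0 - max(pmax, 1.0 - c))

    assert d_abst <= max(2.0 * n * math.sqrt(d_L), n * d_L) + 1e-9
```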

4.3 \(\mathcal{l} = \mathcal{l}_{\rho}\), \(\psi(z) = z\)

In this case, \(C_L(h, r, x)\) can be expressed as:

\[\begin{aligned} C_L(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x) \underbrace{\min\left\{\max\left\{0, 1 - \frac{\rho_h(x, y)}{\rho}\right\}, 1\right\}}_{\mathcal{l}_{\rho}}e^{\alpha r(x)} + c e^{-\alpha r(x)} \\ &= \left(1 - \sum_{y\in \mathcal{Y}} p(y\mid x)\max\left\{\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\}, 0\right\}\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} \\ &= \left(1 - \sum_{y\in \mathcal{Y}} p(y\mid x)\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\}\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} \end{aligned} \]

where \(\rho_h(x, y) = h(x)_y - \max_{y^{\prime}\neq y}h(x)_{y^{\prime}}\) is the margin.

Since \(\mathcal{H}\) is assumed to be symmetric and complete, we have

\[\begin{aligned} &\inf_{h\in \mathcal{H}}\left\{1 - \sum_{y\in \mathcal{Y}} p (y\mid x)\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\}\right\} \\ &= 1 - \sup_{h\in \mathcal{H}}\sum_{y\in \mathcal{Y}}p(y\mid x)\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\} \\ &= 1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\quad (\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\}\in [0, 1]) \end{aligned} \]

Note: in fact, for any \(h\in \mathcal{H}\), we have:

\[\begin{aligned} &\left(1 - \sum_{y\in \mathcal{Y}} p (y\mid x)\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\}\right) - \left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right) \\ &= \max_{y\in \mathcal{Y}} p(y\mid x) - \sum_{y\in \mathcal{Y}} p (y\mid x)\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\} \\ &= \max_{y\in \mathcal{Y}} p(y\mid x) - \min \left\{1, \frac{\rho_h\left(x, \text{h}(x)\right)}{\rho}\right\}p\left(\text{h}(x)\mid x\right) \\ &\geqslant \max_{y\in \mathcal{Y}}p\left(y\mid x\right) - p\left(\text{h}(x)\mid x\right) \end{aligned} \]

As in the earlier proof for \(\mathcal{l}_{\text{mae}}\), we will use this conclusion many times in the subsequent proof.

So, as with \(\mathcal{l}_{\text{mae}}\) before, we have

\[C_L^*(\mathcal{H}, \mathcal{R}, x) = 2\sqrt{c(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right))} \]

Then

\[\begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= C_L(h, r, x) - C^*_L(\mathcal{H}, \mathcal{R}, x) \\ &= \left(1 - \sum_{y\in \mathcal{Y}} p (y\mid x)\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\}\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right))} \end{aligned} \]

To relate \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) and \(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), we proceed as in the proof for \(\mathcal{l}_{\text{mae}}\) and split \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\) into cases according to the comparison of \(\max_{y\in \mathcal{Y}} p(y\mid x)\) with \(1 - c\) and the sign of \(r(x)\):

  1. \(\max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c)\), \(r(x) > 0\):
    In this case,

    \[\begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= \left(1 - \sum_{y\in \mathcal{Y}} p (y\mid x)\min\left\{1, \frac{\rho_h(x, y)}{\rho}\right\}\right)e^{\alpha r(x)} + c e^{-\alpha r(x)} - 2\sqrt{c\left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right)} \\ &\geqslant \max_{y\in \mathcal{Y}}p(y\mid x) - p\left(\text{h}(x)\mid x\right) \\ &= \Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \end{aligned} \]

    (Since the proof steps parallel those for \(\mathcal{l}_{\text{mae}}\), they are abbreviated here; the same applies below.)
    Taking \(\Gamma_1 (z) = z\), we get \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), as required.

  2. \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c)\), \(r(x) > 0\):

    \[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant \frac{1}{4}\left(1 - c - p\left(\text{h}(x)\mid x\right)\right)^2 = \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4} \]

    Taking \(\Gamma_2 (z) = 2\sqrt{z}\), we get \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), as required.

  3. \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c)\), \(r(x) \leqslant 0\):
    In this case \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = 0\), so \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\) holds for any \(\Gamma \geqslant 0\).

  4. \(\max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c)\), \(r(x) \leqslant 0\):

    \[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant \left(\frac{\max_{y\in \mathcal{Y}}p(y\mid x) - 1 + c}{2}\right)^2 = \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4} \]

    Taking \(\Gamma_3 (z) = 2\sqrt{z}\), we get \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), as required.

To sum up, if we take \(\Gamma(z) = \max\{\Gamma_1(z), \Gamma_2(z), \Gamma_3(z)\} = \max\{2\sqrt{z}, z\}\), then \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\) always holds. The \((\mathcal{H}, \mathcal{R})\)-consistency bound of the single-stage surrogate loss with \(\mathcal{l} = \mathcal{l}_{\rho}\), \(\psi(z) = z\) is thus proved.
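The same kind of random sanity check can be run for \(\mathcal{l}_{\rho}\), verifying the pointwise bound with \(\Gamma(z) = \max\{2\sqrt{z}, z\}\) (taking \(\alpha = 1\), \(\rho = 1\); an illustrative sketch, names are ours):

```python
import math
import random

random.seed(0)
rho = 1.0

for _ in range(20_000):
    n = random.randint(2, 5)
    w = [random.random() + 1e-6 for _ in range(n)]
    p = [v / sum(w) for v in w]                        # p(.|x)
    h = [random.uniform(-3.0, 3.0) for _ in range(n)]  # score vector h(x)
    r = random.uniform(-2.0, 2.0)
    c = random.uniform(0.05, 0.95)
    pred = max(range(n), key=lambda k: h[k])           # predicted label h(x)
    pmax = max(p)

    def ell_rho(y):
        """rho-margin loss phi_rho(rho_h(x, y))."""
        margin = h[y] - max(h[k] for k in range(n) if k != y)
        return min(max(0.0, 1.0 - margin / rho), 1.0)

    C_L = sum(p[y] * ell_rho(y) for y in range(n)) * math.exp(r) + c * math.exp(-r)
    d_L = max(C_L - 2.0 * math.sqrt(c * (1.0 - pmax)), 0.0)  # clamp rounding error

    C_abst = (1.0 - p[pred]) if r > 0 else c
    d_abst = C_abst - (1.0 - max(pmax, 1.0 - c))

    assert d_abst <= max(2.0 * math.sqrt(d_L), d_L) + 1e-9
```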

4.4 \(\mathcal{l} = \mathcal{l}_{\rho-\text{hinge}}\), \(\psi(z) = nz\)

In this case, \(C_L(h, r, x)\) can be expressed as:

\[\begin{aligned} C_L(h, r, x) &= \sum_{y\in \mathcal{Y}}p(y\mid x) \underbrace{\sum_{y^{\prime} \neq y}\max\left\{0, 1 + \frac{h(x)_{y^{\prime}}}{\rho}\right\}}_{\mathcal{l}_{\rho}-\text{hinge}}e^{\alpha r(x)} + nce^{-\alpha r(x)} \\ &= \sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\max\left\{0, 1 + \frac{h(x)_y}{\rho}\right\}e^{\alpha r(x)} + nce^{-\alpha r(x)} \\ \end{aligned} \]

Since \(\mathcal{H}\) is assumed to be symmetric and complete, we have

\[\begin{aligned} &\inf_{h\in \mathcal{H}}\left\{\sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\max\left\{0, 1 + \frac{h(x)_y}{\rho}\right\}\right\} \\ &= n - \sup_{h\in \mathcal{H}}\sum_{y\in \mathcal{Y}}p(y\mid x)\max\left\{0, 1 + \frac{h(x)_y}{\rho}\right\} \\ &= n\left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right) \end{aligned} \]

Note: in fact, define \(h_{\rho}\) by \(h_{\rho}(x)_y = \left\{\begin{aligned} &h(x)_y\quad &\text{if } y\notin \left\{y_{\max}, \text{h}(x)\right\} \\ &-\rho \quad &\text{if } y = \text{h}(x) \\ &h\left(x\right)_{y_{\text{max}}} + h\left(x\right)_{\text{h}(x)} + \rho \quad &\text{if } y = y_{\text{max}} \\ \end{aligned}\right.\), which satisfies the constraint \(\sum_{y\in \mathcal{Y}}h_{\rho}(x)_y=0\), where \(y_{\max} = \text{arg max}_{y\in \mathcal{Y}}p(y\mid x)\). Then, for any \(h\in \mathcal{H}\), we have:

\[\begin{aligned} &\sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\max\left\{0, 1 + \frac{h(x)_y}{\rho}\right\} - n\left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right) \\ &\geqslant \sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\min\left\{n, \max\left\{0, 1 + \frac{h(x)_y}{\rho}\right\}\right\} - n\left(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right)\right) \\ &\geqslant \sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\min\left\{n, \max\left\{0, 1 + \frac{h(x)_y}{\rho}\right\}\right\} \\ &\quad - \sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\min\left\{n, \max\left\{0, 1 + \frac{h_{\rho}(x)_y}{\rho}\right\}\right\} \\ &= \left(p(y_{\text{max}}\mid x) - p(\text{h}(x)\mid x)\right)\min\left\{n, 1 + \frac{h(x)_{\text{h}(x)}}{\rho}\right\} \\ &\geqslant \max_{y\in \mathcal{Y}}p\left(y\mid x\right) - p\left(\text{h}\left(x\right)\mid x\right) \end{aligned} \]

As in the earlier proofs for \(\mathcal{l}_{\text{mae}}\) and \(\mathcal{l}_{\rho}\), we will use this conclusion many times in the subsequent proof.

So, as with \(\mathcal{l}_{\text{mae}}\) and \(\mathcal{l}_{\rho}\) before, we have

\[C_L^*(\mathcal{H}, \mathcal{R}, x) = 2\sqrt{n^2c(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right))} \]

Then

\[\begin{aligned} \Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) &= C_L(h, r, x) - C^*_L(\mathcal{H}, \mathcal{R}, x) \\ &= \sum_{y\in \mathcal{Y}}\left(1 - p(y\mid x)\right)\max\left\{0, 1 + \frac{h(x)_y}{\rho}\right\}e^{\alpha r(x)} + nc e^{-\alpha r(x)} - 2\sqrt{n^2c(1 - \max_{y\in \mathcal{Y}}p\left(y\mid x\right))} \end{aligned} \]

To relate \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)\) and \(\Gamma \left(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\right)\), we proceed as in the proofs for \(\mathcal{l}_{\text{mae}}\) and \(\mathcal{l}_{\rho}\) and split \(\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x)\) into cases according to the comparison of \(\max_{y\in \mathcal{Y}} p(y\mid x)\) with \(1 - c\) and the sign of \(r(x)\):

  1. \(\max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c)\), \(r(x) > 0\):
    In this case,

    \[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant \max_{y\in \mathcal{Y}}p(y\mid x) - p(\text{h}(x)\mid x) = \Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \]

    Taking \(\Gamma_1 (z) = z\), we get \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), as required.

  2. \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c)\), \(r(x) > 0\):

    \[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant \frac{1}{4n}\left(1 - c - p\left(\text{h}\left(x\right)\mid x\right)\right)^2 = \frac{\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4n} \]

    Taking \(\Gamma_2 (z) = 2\sqrt{nz}\), we get \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), as required.

  3. \(\max_{y\in \mathcal{Y}} p(y\mid x) \leqslant (1 - c)\), \(r(x) \leqslant 0\):
    In this case \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) = 0\), so \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\) holds for any \(\Gamma \geqslant 0\).

  4. \(\max_{y\in \mathcal{Y}} p(y\mid x) > (1 - c)\), \(r(x) \leqslant 0\):

    \[\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x) \geqslant n\left(\frac{\max_{y\in \mathcal{Y}}p(y\mid x) - 1 + c}{2}\right)^2 = \frac{n\Delta C_{\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x)^2}{4} \]

    Taking \(\Gamma_3 (z) = 2\sqrt{z/n}\), we get \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\), as required.

To sum up, if we take \(\Gamma(z) = \max\{\Gamma_1(z), \Gamma_2(z), \Gamma_3(z)\} = \max\{2\sqrt{nz}, z\}\), then \(\Delta C_{L_\text{abst}, \mathcal{H}, \mathcal{R}}(h, r, x) \leqslant \Gamma (\Delta C_{L, \mathcal{H}, \mathcal{R}}(h, r, x))\) always holds. The \((\mathcal{H}, \mathcal{R})\)-consistency bound of the single-stage surrogate loss with \(\mathcal{l} = \mathcal{l}_{\rho-\text{hinge}}\), \(\psi(z) = nz\) is thus proved.
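Finally, a random sanity check for \(\mathcal{l}_{\rho-\text{hinge}}\), verifying the pointwise bound with \(\Gamma(z) = \max\{2\sqrt{nz}, z\}\); the score vectors are recentered so that the constraint \(\sum_y h(x)_y = 0\) holds (taking \(\alpha = 1\), \(\rho = 1\); an illustrative sketch, names are ours):

```python
import math
import random

random.seed(0)
rho = 1.0

for _ in range(20_000):
    n = random.randint(2, 5)
    w = [random.random() + 1e-6 for _ in range(n)]
    p = [v / sum(w) for v in w]                        # p(.|x)
    h = [random.uniform(-3.0, 3.0) for _ in range(n)]
    mean = sum(h) / n
    h = [v - mean for v in h]                          # enforce sum_y h(x)_y = 0
    r = random.uniform(-2.0, 2.0)
    c = random.uniform(0.05, 0.95)
    pred = max(range(n), key=lambda k: h[k])           # predicted label h(x)
    pmax = max(p)

    # C_L for ell_{rho-hinge} with psi(z) = n*z, after reindexing the double sum
    C_L = sum((1.0 - p[y]) * max(0.0, 1.0 + h[y] / rho) for y in range(n)) \
          * math.exp(r) + n * c * math.exp(-r)
    d_L = max(C_L - 2.0 * n * math.sqrt(c * (1.0 - pmax)), 0.0)  # clamp rounding

    C_abst = (1.0 - p[pred]) if r > 0 else c
    d_abst = C_abst - (1.0 - max(pmax, 1.0 - c))

    assert d_abst <= max(2.0 * math.sqrt(n * d_L), d_L) + 1e-9
```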

References

  • [1] Mao A, Mohri M, Zhong Y. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms[C]//International Conference on Algorithmic Learning Theory. PMLR, 2024: 822-867.
  • [2] Cortes C, DeSalvo G, Mohri M. Boosting with abstention[J]. Advances in Neural Information Processing Systems, 2016, 29.
  • [3] Ni C, Charoenphakdee N, Honda J, et al. On the calibration of multiclass classification with rejection[J]. Advances in Neural Information Processing Systems, 2019, 32.
  • [4] Han Bao: Learning Theory Bridges Loss Functions
  • [5] Crammer K, Singer Y. On the algorithmic implementation of multiclass kernel-based vector machines[J]. Journal of machine learning research, 2001, 2(Dec): 265-292.
  • [6] Awasthi P, Mao A, Mohri M, et al. Multi-class H-consistency bounds[J]. Advances in Neural Information Processing Systems, 2022, 35: 782-795.