
The mathematical theory behind greedy algorithms


Greedy algorithms feel entirely intuitive and can solve many problems. Yet in everyday life the word "greedy" is derogatory: we are taught from childhood that greed is short-sighted. So why, in some algorithms, do greedy, short-sighted choices lead straight to the global optimum? Which problems can greed be applied to, and which cannot?

Greedy Algorithm

As a programmer, you have probably run into puzzles like these:

  • The minimum spanning tree problem (Kruskal's algorithm) is easily solved with greed: the code is concise and the result is perfect;
  • But for the set cover problem, the same greedy routine only yields an "approximately optimal solution", and an interviewer may even ask: "Why don't you need dynamic programming?"
  • The 0-1 knapsack problem is just choosing items by value, so why does greed fail completely?
  • Some functions can be optimized directly with gradient descent, while others cannot.

Core contradiction
Greedy algorithms may seem “simple and brainless”, but their effectiveness depends entirely on the hidden structure of the problem.

  • Matroid structure (such as the minimum spanning tree, Dijkstra): no greedy step destroys future possibilities, and the final result is necessarily globally optimal.
  • Submodular structure (such as set cover): greed applies, marginal returns diminish, and only approximate optimality can be guaranteed.
  • No structural guarantees (such as the 0-1 knapsack problem): greed may fail completely, and dynamic programming is required.

The value of mathematical theory
By understanding the mathematical rules underlying greedy algorithms (such as matroids and submodularity), you can quickly classify a problem:

  • Independent-selection problems (such as task scheduling and conflict-free resource allocation): check whether the exchange property of a matroid holds: if the current choice does not restrict future freedom, greed is viable.
  • Coverage or influence maximization problems (such as ad placement and viral spreading): check for diminishing marginal returns (submodularity): greed then gives a provably good approximate solution.
  • Divisible problems (such as the fractional knapsack): greedy is optimal; indivisible problems (such as the 0-1 knapsack): dynamic programming is required (see the sketch after this list).
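
To make the divisible versus indivisible distinction concrete, here is a minimal Python sketch with made-up numbers (capacity 10; items of weight 6, 5, 5 and value 48, 35, 35): greedy by value density is optimal for the fractional knapsack, while the very same rule is suboptimal for the 0-1 version.

    # Hypothetical instance: capacity 10, items given as (weight, value); densities 8, 7, 7.
    items = [(6, 48), (5, 35), (5, 35)]
    capacity = 10

    # Fractional knapsack: greedily take items in order of value density; provably optimal.
    total, remaining = 0.0, capacity
    for w, v in sorted(items, key=lambda it: it[1] / it[0], reverse=True):
        take = min(w, remaining)          # take the whole item, or whatever still fits
        total += v * take / w
        remaining -= take
    print(total)   # 76.0: all of the first item plus 4/5 of the second

    # 0-1 knapsack: the same greedy rule takes only the first item (value 48),
    # while the true optimum packs the two lighter items for value 70 -- greed fails here.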

How to use this correctly

  • Mathematical theory → pattern library: turn concepts such as matroids and submodularity into "problem features" and add them to your personal library of algorithm patterns.
  • Formulas → decision rules: no formal proofs are needed; just remember rules of the form "if the problem has structure XX, the greedy strategy is expected to achieve YY".

For example:

  • When the problem requires "selecting elements from a set, with some subsets forbidden (such as those forming a cycle)", think of matroids and try greed directly.
  • When the problem asks you to "maximize coverage, where each new element brings less and less benefit", think of submodularity and confidently use greed for an approximate solution.

Matroids and greedy algorithms

Programmer's Perspective: Why do some greedy strategies never fail?
Suppose you are developing a task scheduling system that must select a combination of non-conflicting tasks from a pool and maximize the total gain. Intuition says that "always pick the most profitable task" sounds reasonable, but how do you ensure that this greedy choice does not miss a better combination?

A matroid is a mathematical system consisting of two parts:

  • Ground set \(E\) (such as all tasks);
  • Family of independent sets \(\mathcal{I}\) (such as all conflict-free task combinations).

It must satisfy three axioms (a brute-force check of them is sketched after this list):

  1. Hereditary: if a combination is legal (e.g., tasks A+B do not conflict), then any subset of it (e.g., task A alone) is also legal.
  2. Exchange: if A and B are two legal combinations and A contains more tasks than B, you can always take some task from A that is not in B and add it to B so that B remains legal.
  3. Non-empty: at least one legal combination exists (e.g., the empty set).
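
As a quick sanity check, the three axioms can be verified by brute force on small ground sets. The sketch below is purely illustrative (it enumerates all subsets, so it is exponential in \(|E|\)); both example families are made up.

    from itertools import combinations

    def subsets(s):
        s = list(s)
        return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

    def is_matroid(E, independent):
        I = {frozenset(s) for s in independent}
        if frozenset() not in I:                          # non-empty axiom
            return False
        for A in I:                                       # hereditary axiom
            if any(S not in I for S in subsets(A)):
                return False
        for A in I:                                       # exchange axiom
            for B in I:
                if len(A) > len(B) and not any(B | {x} in I for x in A - B):
                    return False
        return True

    E = {1, 2, 3}
    uniform = [s for s in subsets(E) if len(s) <= 2]      # "at most 2 elements": a matroid
    broken = [set(), {1}, {2}, {3}, {1, 2}]               # {3} cannot be extended: not a matroid
    print(is_matroid(E, uniform), is_matroid(E, broken))  # True False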

Minimum spanning tree problem

Why does the matroid structure guarantee that greed is correct?

Taking Kruskal's algorithm as an example, we find that the minimum spanning tree algorithm is essentially the generic matroid greedy algorithm in disguise.

  1. Ground set \(E\): all edges of the graph.
  2. Independent sets \(\mathcal{I}\): all edge sets that contain no cycle (forests).
  3. Exchange: if two forests A and B satisfy \(|A| > |B|\), you can always find an edge in A but not in B whose addition to B creates no cycle.

Therefore, a maximum-size independent set is exactly a spanning tree: any acyclic edge set that does not yet span the graph can keep absorbing edges until it becomes one.

Therefore, sort the edges in ascending order of weight and consider them one by one, keeping an edge whenever adding it creates no cycle (i.e., independence is maintained). The exchange axiom guarantees that short-sightedly taking the smallest edge currently available never hurts.
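
A minimal sketch of this greedy selection, using a small union-find structure to detect cycles; the 4-vertex example graph is made up for illustration.

    def kruskal(num_vertices, edges):
        """edges: list of (weight, u, v). Returns the edges of a minimum spanning forest."""
        parent = list(range(num_vertices))

        def find(x):                       # union-find with path halving
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        chosen = []
        for w, u, v in sorted(edges):      # greedy: scan edges in ascending weight
            ru, rv = find(u), find(v)
            if ru != rv:                   # keeping the edge preserves independence (no cycle)
                parent[ru] = rv
                chosen.append((w, u, v))
        return chosen

    # Hypothetical graph on 4 vertices; the greedy result has total weight 1 + 2 + 3 = 6.
    print(kruskal(4, [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 2), (5, 1, 3)]))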

The interplay between matroids and greedy algorithms is interesting precisely because the matroid structure is what guarantees that greed attains the optimal solution. Beyond the minimum spanning tree (Kruskal's algorithm), further examples illustrate when the greedy strategy is correct and when it breaks down.

Task assignment

Suppose you have the following set of tasks, with their time windows and gains:

Task   Start time   End time   Gain
A      1            3          10
B      2            5          15
C      4            6          20
D      5            8          30
E      7            9          25

The goal is to choose a conflict-free set of tasks that maximizes the total gain.

  1. Tasks A and B conflict and cannot both be selected.
  2. Tasks B and C conflict and cannot both be selected.
  3. Tasks C and D conflict and cannot both be selected.
  4. Tasks D and E conflict (their intervals overlap during times 7 to 8) and cannot both be selected.

Suppose you choose task B (gain 15) and then task D (gain 30). Since B and D do not conflict, this combination is legal: {B, D} is an independent set with total gain 45. Note, however, that it is not the best independent set here: tasks A, C, and E are mutually conflict-free and yield a total gain of 55, as the brute-force check below confirms.
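
A brute-force check of this toy instance, enumerating every subset and keeping the conflict-free one with the highest gain, confirms that the best choice is {A, C, E} with gain 55. The sketch assumes that tasks sharing only an endpoint (such as B ending and D starting at time 5) do not conflict.

    from itertools import combinations

    # (name, start, end, gain) -- the table above
    tasks = [("A", 1, 3, 10), ("B", 2, 5, 15), ("C", 4, 6, 20),
             ("D", 5, 8, 30), ("E", 7, 9, 25)]

    def conflict(t1, t2):
        # Two tasks conflict if their time intervals properly overlap.
        return t1[1] < t2[2] and t2[1] < t1[2]

    best, best_gain = (), 0
    for r in range(len(tasks) + 1):
        for subset in combinations(tasks, r):
            if all(not conflict(a, b) for a, b in combinations(subset, 2)):
                gain = sum(t[3] for t in subset)
                if gain > best_gain:
                    best, best_gain = subset, gain

    print([t[0] for t in best], best_gain)   # ['A', 'C', 'E'] 55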

Task-selection problems come in many forms and are not always solvable by greed; the example above merely illustrates what an independent set looks like.

Counterexample: without the exchange property, greed fails

Suppose task A needs exclusive use of resource X and has gain 100, while tasks B and C can share resource X, with gains 80 and 90 respectively. Suppose we apply the greedy strategy of always picking the highest-gain task currently available.

  • Task A (gain 100) requires exclusive use of resource X;
  • Task B (gain 80) and task C (gain 90) can share resource X;
  • The greedy algorithm first selects task A (gain 100); it can then select neither B nor C, because resource X is already occupied by A, so the total gain is 100.
  • The actual optimal solution is to select tasks B and C (total gain 170), which share resource X.

In this scenario, the independent sets of the task-scheduling problem do not satisfy the exchange property: you cannot move a task from the combination {B, C} into {A} while staying legal, so the greedy strategy cannot guarantee the optimal solution.
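
A tiny simulation of this counterexample, using only the resource rule and numbers from the text: greedy-by-gain keeps only task A, while enumerating all legal sets finds {B, C}.

    from itertools import combinations

    tasks = [("A", 100), ("B", 80), ("C", 90)]   # (name, gain)

    def legal(names):
        # A set is legal unless A (exclusive use of X) appears together with a sharer of X.
        return not ("A" in names and names & {"B", "C"})

    # Greedy: repeatedly take the highest-gain task that keeps the selection legal.
    greedy = []
    for name, gain in sorted(tasks, key=lambda t: -t[1]):
        if legal({n for n, _ in greedy} | {name}):
            greedy.append((name, gain))
    print("greedy :", [n for n, _ in greedy], sum(g for _, g in greedy))    # ['A'] 100

    # Brute force over all subsets: the true optimum shares resource X.
    best = max((s for r in range(len(tasks) + 1) for s in combinations(tasks, r)
                if legal({n for n, _ in s})),
               key=lambda s: sum(g for _, g in s))
    print("optimal:", [n for n, _ in best], sum(g for _, g in best))        # ['B', 'C'] 170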

Submodularity and greed under diminishing returns

Submodular functions are behind another class of greedy problems. Suppose you are designing a news recommendation system and must select 10 of 100 candidate articles to push to a user, with the goal of covering the most interest tags. What happens if you simply keep choosing the article that covers the most tags? Is that optimal?

Definition of a submodular function
A set function \(F(S)\) is submodular if and only if, for any sets \(A \subseteq B\) and any element \(e \notin B\), it satisfies:

\[F(A \cup \{e\}) - F(A) \geq F(B \cup \{e\}) - F(B) \]

In layman's terms, the benefit of adding one more element shrinks as the existing set grows.
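
As a concrete (made-up) instance, the coverage function "number of distinct tags covered" from the recommendation example satisfies this inequality; the snippet below checks it for one choice of \(A \subseteq B\) and \(e\).

    # Hypothetical articles and their interest tags.
    tags = {"a1": {"sports"}, "a2": {"sports", "tech"}, "a3": {"health"}}

    def F(selected):
        covered = set()
        for article in selected:
            covered |= tags[article]
        return len(covered)                      # F(S) = number of distinct tags covered

    A, B, e = set(), {"a1"}, "a2"                # A is a subset of B, e is not in B
    gain_A = F(A | {e}) - F(A)                   # adding a2 to the empty set covers 2 new tags
    gain_B = F(B | {e}) - F(B)                   # adding a2 after a1 covers only 1 new tag
    print(gain_A, gain_B, gain_A >= gain_B)      # 2 1 True: the marginal gain shrinks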

The approximation guarantee of greed
How do we maximize such a function? If the objective function is submodular and non-decreasing (adding elements never reduces its value), it can be proven that the greedy algorithm (selecting the element with the largest marginal gain at each step) achieves at least \((1-1/e) \approx 63\%\) of the optimal value.

Going back to the news recommendation question: the goal is to select 10 of the 100 candidate articles to push to the user so as to maximize the number of interest tags they cover. We can model this as an optimization problem:

  • Candidate set: 100 candidate news articles, each carrying one or more interest tags (for example, sports, technology, health).
  • Objective: select 10 articles that together cover as many distinct tags as possible.

Here we run into an important phenomenon called diminishing marginal returns: the tags already covered by the selected articles reduce how much new coverage an additional article can contribute. The function \(F(\text{set of articles}) = \text{number of distinct tags covered}\) is submodular.

For example, suppose you have already selected 3 articles covering 5 distinct interest tags. The next article you consider may cover one or more new tags, or only tags that are already covered. If it covers only already-covered tags, its marginal gain is small; if it covers new tags, its marginal gain is large. By submodularity, as the selected set grows (from \(A\) to \(B\)), the marginal gain of adding a new element \(e\) can only decrease. This is the core of diminishing marginal returns.

By the properties of submodular functions, if we use the greedy algorithm (at each step, short-sightedly choose the article with the largest current marginal gain), we are guaranteed that even in the worst case our solution is at least 63% of the optimal one.
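
A minimal sketch of that greedy selection on a made-up pool of five articles with a budget of three. On this toy instance greedy covers 5 of the 6 tags, while the true optimum (a2, a3, a5) covers all 6, so greedy is suboptimal but still well above the 63% bound.

    # Hypothetical candidate articles with their interest tags; pick k of them.
    articles = {
        "a1": {"sports", "tech"},
        "a2": {"sports", "health"},
        "a3": {"tech", "finance"},
        "a4": {"health"},
        "a5": {"travel", "food"},
    }
    k = 3

    def coverage(selected):
        covered = set()
        for a in selected:
            covered |= articles[a]
        return len(covered)

    selected = []
    for _ in range(k):
        # Greedy step: take the article with the largest marginal gain in coverage.
        best = max((a for a in articles if a not in selected),
                   key=lambda a: coverage(selected + [a]) - coverage(selected))
        selected.append(best)

    print(selected, coverage(selected), "of", len(set().union(*articles.values())))
    # here: ['a1', 'a5', 'a2'] 5 of 6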

Convexity, gradient descent, and continuous greed

Programmer's Perspective: Why are gradient descent and greedy algorithms "distant relatives"?

Gradient descent is a powerful tool in machine learning for minimizing continuous functions. Suppose you are training a linear regression model whose goal is to minimize a loss function \(L(w)\). When gradient descent updates along the negative gradient direction, is that a "greedy" strategy?

A function \(L(w)\) is convex if, for any \(w_1, w_2\) and \(\theta \in [0,1]\), it satisfies:

\[L(\theta w_1 + (1-\theta) w_2) \leq \theta L(w_1) + (1-\theta) L(w_2) \]

We already met convex functions in high school and college: their graph is a "smooth valley" with no local pits. Calculus and convex optimization tell us that for a convex function, choosing the direction of fastest descent (the locally optimal direction) at every step is enough to find the minimum; because there are no local-optimum traps, the iteration converges to the global optimum no matter where it starts.

1. Least Squares Method in Linear Regression

Problem: Suppose you have training data \((x_i, y_i)\), and the goal is to find parameters \(w\) such that the error between the prediction \(\hat{y}_i = w \cdot x_i\) and the actual value \(y_i\) is minimized. This is usually done by minimizing the loss function:

\[L(w) = \sum_{i=1}^{n} (y_i - w \cdot x_i)^2 \]

Why is it convex?
This loss function is a quadratic function of \(w\); its graph is a "smooth valley", a standard convex function. Because it is convex, its only local minimum is the global one, and gradient descent finds that global minimum from any initial point.

How to optimize?
Using gradient descent, the algorithm updates the parameter \(w\) along the negative gradient direction at every step. This is a greedy strategy: always choose the direction of fastest descent right now. Because the function is convex, these greedy steps cannot get trapped in a local optimum and instead converge to the global optimum.

Conclusion: in this example, convexity ensures that the greedy strategy (gradient descent) finds the global optimal solution.
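
A minimal sketch of this greedy descent, assuming made-up 1-D data that roughly follow \(y = 3x\); because the loss is convex, the same answer is reached from any starting value of \(w\).

    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [3.1, 6.0, 8.9, 12.1, 15.0]          # toy data, roughly y = 3x

    w, lr = 0.0, 0.01                          # any initial w works: the loss is convex
    for _ in range(500):
        grad = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys))   # dL/dw
        w -= lr * grad                                             # greedy steepest-descent step

    print(w)   # converges to about 3.0, the global minimum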

2. Convex optimization in Support Vector Machine (SVM)

Problem: Support vector machines are used for classification, with the goal of finding an optimal hyperplane that maximizes the margin separating the data points. The (hard-margin) SVM objective is to minimize the following convex function, subject to the margin constraints \(y_i (w \cdot x_i + b) \geq 1\):

\[L(w, b) = \frac{1}{2} \|w\|^2 \]

Why is it convex?
The objective is the squared norm of the weight vector \(w\), a quadratic function, which is convex. Together with the linear margin constraints, the SVM optimization problem is a typical convex quadratic programming problem.

How to optimize?
The SVM optimization problem is solved with convex optimization methods (gradient descent, Newton's method, dedicated quadratic-programming solvers, and so on). Greedy choices, such as taking the steepest-descent step at every iteration, always converge toward the global optimum because the convex objective has no spurious local optima.

Conclusion: since the loss function is convex, the greedy strategy finds the global optimal solution, and convexity guarantees the effectiveness of gradient descent.
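
The sketch below illustrates the same point with the soft-margin form of the objective, \(\frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_i \max(0,\ 1 - y_i(w \cdot x_i + b))\), which folds the margin constraints into a convex hinge loss. The 2-D toy data are randomly generated and the hyperparameters are arbitrary choices, not prescribed values.

    import random

    random.seed(0)
    data = []                                   # toy 2-D points labeled by the sign of x1 + x2
    while len(data) < 100:
        x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
        if abs(x1 + x2) > 0.2:                  # leave a margin band empty so the data are separable
            data.append(((x1, x2), 1 if x1 + x2 > 0 else -1))

    w, b, C, lr = [0.0, 0.0], 0.0, 10.0, 0.05
    for _ in range(2000):                       # batch subgradient descent on a convex objective
        gw, gb = [w[0], w[1]], 0.0              # gradient of the regularizer 0.5 * ||w||^2
        for (x1, x2), y in data:
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:       # hinge term active for this point
                gw[0] -= C * y * x1 / len(data)
                gw[1] -= C * y * x2 / len(data)
                gb -= C * y / len(data)
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb

    errors = sum(1 for (x1, x2), y in data if y * (w[0] * x1 + w[1] * x2 + b) <= 0)
    print("w =", w, "b =", b, "misclassified:", errors)   # typically 0 on this separable toy set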

3. Counterexample: Non-convex loss function

Problem: When we train deep learning models, the loss function is non-convex, for example the cross-entropy loss of a neural network viewed as a function of the network's weights, or other complex nonlinear models. In this case the loss surface may contain multiple local minima.

Why not convex?
These non-convex functions usually have multiple local minima and saddle points, so during training gradient descent may get stuck at some local minimum and never reach the global minimum.

How to optimize?
Gradient descent can still be applied, but because the function is non-convex, the greedy steepest-descent step may leave the algorithm stuck in a local optimum. To mitigate this, practitioners rely on tricks such as momentum, learning-rate schedules, and random (re)initialization to help escape poor local minima, or on more sophisticated optimizers such as SGD with momentum and Adam.

Conclusion: for non-convex problems, gradient descent no longer guarantees the global optimal solution; the convexity argument no longer applies, and the greedy strategy may fail.
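
A one-dimensional illustration using the made-up non-convex function \(f(x) = x^4 - 3x^2 + x\), which has a global minimum near \(x \approx -1.30\) and a worse local minimum near \(x \approx 1.13\): plain gradient descent ends up in whichever basin it starts from.

    def f(x):  return x ** 4 - 3 * x ** 2 + x
    def df(x): return 4 * x ** 3 - 6 * x + 1      # derivative of f

    def gradient_descent(x, lr=0.01, steps=2000):
        for _ in range(steps):
            x -= lr * df(x)                       # greedy steepest-descent step
        return x

    for x0 in (-2.0, 2.0):
        x = gradient_descent(x0)
        print(f"start {x0:+.1f} -> x = {x:+.3f}, f(x) = {f(x):+.3f}")
    # starting at -2.0 reaches the global minimum (f about -3.51);
    # starting at +2.0 gets stuck in the local minimum (f about -1.07).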

Summary

In general, greedy algorithms are a powerful strategy that is widely used across optimization problems.

In problems with specific structural characteristics, such as matroids, submodular functions, and convex objectives, that structure is exactly what makes greed work. We have seen that when such a structure is present, greedy algorithms not only solve the problem efficiently but also come with theoretical guarantees that the result is optimal or provably close to optimal.

However, greedy algorithms are not a cure-all. On some problems they get stuck at local optima and fail to find the global optimum. Understanding the structural characteristics of the problem and choosing the right algorithm is therefore the key to solving complex optimization problems. By analyzing greedy algorithms in depth, we can better understand their strengths and limitations and make smarter choices in practice.