\(\text{KMP}\) Notes!
In the last contest, the questioner came up with a\(\text{KMP}\) Template.I knocked a\(\text{SAM}\) run awayBut there are a lot of good questions given by the seniors.\(\text{KMP}\), and so rolled back to cramming basic string algorithms.
\(\text{KMP}\) I learned it last winter break.Why is it only recently that I fully understoodLibyan Arab Jamahiriya\(\text{KMP}\) It's short, extremely concise, and really hard to follow, so I avoided it for a long time, and was inspired to write this recently.
Prior knowledge:String Basics,string matching(Basic concepts)
pull into
\(\text{KMP}\) Algorithm for solvingString Matching Problems. This class of problems can generally be formulated as "in the main string\(S\) Find pattern strings in\(T\) of certain information".
Obviously we have a very violent algorithm: enumerating pattern strings\(T\) of the starting point and compare them one by one going forward. Remember\(S\) The length of the\(n\),\(T\) The length of the\(m\), it is clear that this algorithm is at worst\(O(nm)\) The.
Give the realization of this violence:
//
int n,m;
char s[N],t[M];
void force(){
for(int st=1;st+m-1<=n;st++){
int i=st,j=1;
while(j<=m&&s[i]==t[j]) i++,j++;
if(j>m){
// do something
}
}
}
Of course we can usestring hashaccomplish\(O(n+m)\) preprocessing.\(O(1)\) Comparing two strings, the complexity becomes\(O(n+m)\). In fact, it's\(\text{KMP}\) complexity, but the\(\text{KMP}\) Can do more.
Prefix Functions-Definition and Properties
Define a string\(s\) prefix function of\(nxt[i]\) because of\(s\) lengths\(i\) prefix of a compound word\(s[1\colon i]\) (used form a nominal expression)Length of the longest public true prefix and suffix, in order to facilitate the use of arrays of tabular forms.
Dizzy? Don't panic, look at an example.
for example\(s=\texttt{abcab}\)It's easy to find out.\(\texttt{ab}\) being the case that\(s\) The true prefix of the\(s\) The true suffix of\(s\) (used form a nominal expression)public prefix and suffixAnd then we find out.\(s\) only\(\texttt{ab}\) anpublic prefix and suffixSo.\(\texttt{ab}\) naturallylongest true public prefix and suffixSo.\(s\) (used form a nominal expression)Length of the longest public true prefix and suffixwould be tantamount to\(2\)。
Next, can you write\(\texttt{ababc}\) (used form a nominal expression)\(nxt[]\) What? - I don't know.
- insofar as\(nxt[1]\)The corresponding prefix is\(\texttt{a}\),longest true public prefix and suffixis the empty string (the true prefix/suffix doesn't count by itself), so the\(nxt[1]=0\)。
- insofar as\(nxt[2]\)The corresponding prefix is\(\texttt{ab}\),longest true public prefix and suffixis the empty string, so\(nxt[2]=0\)。
- insofar as\(nxt[3]\)The corresponding prefix is\(\texttt{aba}\),longest true public prefix and suffixbecause of\(\texttt{a}\)And so\(nxt[3]=1\)。
- insofar as\(nxt[4]\)The corresponding prefix is\(\texttt{abab}\),longest true public prefix and suffixbecause of\(\texttt{ab}\)And so\(nxt[4]=2\)。
- insofar as\(nxt[5]\)The corresponding prefix is\(\texttt{ababc}\),longest true public prefix and suffixis the empty string, so\(nxt[5]=0\)。
In summary.\(\texttt{ababc}\) (used form a nominal expression)\(nxt[]\) because of\(\{0,0,1,2,0\}\)。
Two more.\(nxt[]\) The nature of arrays:
- characteristic \(\bf{1}\):\(nxt[i]<i\), obtained by definition.
- characteristic \(\bf{2}\): Set string\(s\),\(p=\lvert s \rvert\)Keep doing it.\(p=nxt[p]\), and each value that comes out is theThe length of each common true prefix and suffix of the original string. (Includes\(nxt[p]\)、\(nxt[nxt[p]]\)、\(nxt[nxt[nxt[p]]]\)……)
characteristic \(\bf{2}\) show that
Take the first two times, the initial\(nxt[p]\) beLength of each longest common true prefix and suffix of the original stringSo.\(nxt[nxt[i]]\) which is the length of\(nxt[p]\) prefixedLength of the longest public true prefix and suffixRemember this.longest true public prefix and suffixbecause of\(u\)Then the original string has a prefix\(u\)。
But again, because the length of\(nxt[p]\) of the prefix is equal to a prefix of length\(nxt[p]\) suffix, so the length of the\(nxt[p]\) The suffix of also has a length of\(nxt[nxt[p]]\) (used form a nominal expression)longest true public prefix and suffix \(u\)Then the original string also has a suffix.\(u\). So the original string haspublic prefix and suffix \(u\)。
So to continue with this recursive understanding a bit, nested like this, each value is obviously the original string of thepublic prefix and suffix(interpret the empty string as a public true prefix and suffix as well).
\(\bf{KMP}\)-Matching process
Think about how to optimize what we talked about earlier\(O(nm)\) Violence.
for example\(S=\texttt{abababc}\),\(T=\texttt{ababc}\):
\(\texttt{\color{grey}{abababc}}\)
\(\texttt{\color{grey}{ababc}}\)
At first, it was clear that the first four would all match up:
\(\texttt{\color{green}{abab}\color{grey}{abc}}\)
\(\texttt{\color{green}{abab}\color{grey}{c}}\)
But by the fifth position, the characters are different, at which point we claim that an occurrence of amismatches:
\(\texttt{\color{green}{abab}\color{red}{a}\color{grey}{bc}}\)
\(\texttt{\color{green}{abab}\color{red}{c}}\)
For normal violence, we need to move the pattern string one place to the right and re-match them all, like this:
\(\texttt{\color{grey}{abababc}}\)
\(\texttt{ \color{grey}{ababc}}\)
Then the comparison continues:
\(\texttt{\color{grey}{a}\color{red}{b}\color{grey}{ababc}}\)
\(\texttt{ \color{red}{a}\color{grey}{babc}}\)
However, there is clearly a smarter way to do this, due to the pattern string\(T\) A prefix has been matched\(\texttt{abab}\), and there are two identical characters in this prefix\(\texttt{ab|ab}\)So the next paragraph\(\texttt{ab}\) well-matched\(S\) The third and fourth digits of can be given directly to the\(T\) of the first two for use and directly from the\(T\) of the third position for the start of the comparison:
\(\texttt{\color{grey}{ab}\color{green}{ab}\color{grey}{abc}}\)
\(\texttt{\color{white}{--}\color{green}{ab}\color{grey}{abc}}\)
Then compare backward, and the match is successful:
\(\texttt{\color{grey}{ab}\color{green}{ababc}}\)
\(\texttt{\color{white}{--}\color{green}{ababc}}\)
Got it and don't seem to get it? To summarize, the basic idea is to use the parts and pattern strings that have already been matched\(T\) The same parts of itself are optimized.
The part that has been matched is simple, but now the question arises: what kind of\(T\) What about the same parts that can be utilized? How to jump around?
Let's start with the conclusion: let the current\(S[i+1] \neq T[j+1]\)namely\(S\) cap (a poem)\(T\) About to be mismatched, then the nextMinimum possible match positionbecause of\(i-nxt[j]+1\)(\(T\) (the beginning position) without moving the\(i\)direct\(j\leftarrow nxt[j]\)Then continue to compare\(S[i+1] \neq T[j+1]\) Ready to go.
For example, in the above example, the\(\texttt{ababc}\) matches to the\(S[4]=T[4]\)But in the next one.\(S[5] \neq T[5]\), which means it's going to be a mismatch, so let's\(j\leftarrow nxt[j]\) assume (office)\(j\leftarrow nxt[4]\) assume (office)\(j\leftarrow 2\)And so the comparison continues\(S[5]\) cap (a poem)\(T[3]\) Ready to go.
show that
in the first place\(i-nxt[i]+1\) The correctness of this is, obviously, the use of\(nxt[i]\) Provides information on the front and back phases directly before completion\(nxt[i]\) Bit Matching.
The counter-argument is then used to prove that\(i-nxt[i]+1\) beMinimum possible match position. Suppose that there exists a possible matching position\(1\lt p \lt i-nxt[j]+1\), as shown below:
sign\(p\) until (a time)\(j\) The length of the substring of\(len\)Obviously.\(len>nxt[i]\). We draw the pattern string after the move\(T\)(the third orange line), then after moving the third orange line\(T\) The length of the\(len\) The prefix of is equal to the second orange line\(p\) until (a time)\(j\) of the substring i.e.\(T\) grow to be\(j\) The length of the prefix of\(len\) of the suffix (top and bottom correspond).
So in the case of the long\(T\) (used form a nominal expression)\(j\) In the prefix of the prefix with a length of\(len\) The prefix of is equal to a prefix of length\(len\) suffix, so the string is\(T\) (used form a nominal expression)public prefix and suffixBut then again, we said earlier\(len>nxt[i]\)This is not the same as\(nxt[]\) The definition of is contradictory, so the assumption is not valid and the proof is obtained.
So the next possible matching position would be\(i-nxt[j]+1\)and then the length of\(nxt[j]\) prefix, it is possible to utilize our previously matched prefix of length\(nxt[j]\) The suffixes match well, so from the\((i-nxt[j]+1)+nxt[j]-1=i\) Just keep matching (the equivalent of\(i\) (no need to change); as to\(j\)As a result of the previous\(nxt[j]\) for having matched it, so\(j\leftarrow nxt[j]\) Ready to go.
And what if there's still a mismatch? Go on.\(j\leftarrow nxt[j]\) Until\(S[i+1]=T[j+1]\) maybe\(j=0\) Until.
Combined with the preceding prefix function\(nxt[]\) A perceptual understanding of the nature of: according tocharacteristic \(\bf{2}\), which is a process of finding every common true prefix and suffix once, according to thecharacteristic \(\bf{1}\), after each operation, the length continues to decrease, the closer the starting point, this is actually a continuous "second best" idea, closer means more reuse of the previous matching knot results, too close to match, go back, go too far, then it will be beyond the range of the matched, there is no available results, then the Not the current\(i\) The next problem that needs to be solved now.
Easy to write into code:
//
int n,m,nxt[M];
char s[N],t[M];
void KMP(){
// calculate nxt[]
for(int i=1,j=0;i<=n;i++){
while(j&&t[j+1]!=s[i]) j=nxt[j];
j+=(t[j+1]==s[i]);
if(j==m){
// do something
}
}
}
Note that we originally said compare\(S[i+1] \neq T[j+1]\)However, due to the\(i+1\) value remains the same throughout a round of loops, so we'll set the\(i\) The updating is left to the loop to do ahead of time, and we do it inside the loop with the\(i\) That's all, don't write it as\(i+1\)。
And why not?j++
What about it? Because it's possible that it won't match at all, so you have to determine if it can match, and when it does match then++
namelyj+=(t[j+1]==s[i]);
。
Speaking in great detail, there should be no problem with the code, now there is only one question left: how to find the prefix function\(nxt[]\)?
Prefix Functions-Summation
Well, actually the prefix function\(nxt[]\) It is the method of seeking that is\(\text{KMP}\) Algorithms are most often tested, and even most of the questions do not require the matching process; the focus is on testing the matching of the prefix function\(nxt[]\) of understanding.
The solution of the prefix function utilizes theincremental approach, i.e., we know the previous\(nxt[]\) time to determine the new function value is certainly a dp.
Let the original string be\(s\)Next, we're going to ask for the\(nxt[i]\)All of the previous\(nxt[]\) value is known, then it is clear that the initial state is\(nxt[1]=0\)。
Apparently, the Wakara string looks like this:
\(\texttt{[aba]c}\cdots\texttt{[aba]}\)
included in\(\texttt{[ ]}\) The package islongest true public prefix and suffixWell that's in two ways:
- (coll.) fail (a student)\(s[nxt[i-1]+1]=s[i]\) When, i.e., when adding\(\texttt{c}\) whenlongest true public prefix and suffixchange into\(\texttt{[aba]c}\)That is to say, there are\(nxt[i]=nxt[i-1]+1\)。
- (coll.) fail (a student)\(s[nxt[i-1]+1] \neq s[i]\), i.e., by adding a variable that is not\(\texttt{c}\) character, then it can't be used with the precedinglongest true public prefix and suffixIt matches up, so "back off": the largest doesn't work, what about the second smallest? I'm sure you've thought of that, which is to use the prefix function of thecharacteristic \(\bf{2}\), keep looking for smaller ones to see if they match, find the last, and if they don't yet, then\(nxt[i]=0\)。
Equally easy to write into code:
int n,m,nxt[M];
char s[N],t[M];
void KMP(){
nxt[1]=0;
for(int i=2,j=0;i<=m;i++){
while(j&&t[j+1]!=t[i]) j=nxt[j];
j=nxt[i]=j+(t[j+1]==t[i]);
}
for(int i=1,j=0;i<=n;i++){
while(j&&t[j+1]!=s[i]) j=nxt[j];
j+=(t[j+1]==s[i]);
if(j==m){
// do something
}
}
}
Compare and contrast prefixing functions\(nxt[]\) code and the matching code, the two codes turned out to be strikingly similar!
This is because, given that a prefix of the original string has been solved for\(s\) of all the prefix functions, then at this point, if you add another character\(c\), which is equivalent to the\((s+c)\) neutralization\(s\) The process of making the final round of matches is just that we don't care if the matches are good or not, it's good to understand this perceptually.
\(\bf{KMP}\)-complexity
We analyze the prefix function\(nxt[]\) The complexity of the
int m,nxt[M];
char t[M];
nxt[1]=0;
for(int i=2,j=0;i<=m;i++){
while(j&&t[j+1]!=t[i]) j=nxt[j];
j=nxt[i]=j+(t[j+1]==t[i]);
}
The only thing that's a little confusing is this.while
It's up. Everything else is.\(O(m)\) The.
evidentlywhile
The statement in\(O(1)\) of the world, so the focus is on thewhile
The number of executions of the
Let's consider it this way, according to the prefix function ofcharacteristic \(\bf{1}\),\(j\) existwhile
At least one reduction at a time in\(1\)But\(j\) will only add up to\(1\)So.\(j\) All increases will not exceed\(m\)So it's up to you to enter thewhile
circulate\(O(m)\) times, so the total complexity is\(O(m)\)。
The complexity analysis of matching is similar, with complexity\(O(n+m)\)。
In summary.\(\texttt{KMP}\) The complexity of the algorithm is\(O(n+m)\)。
It is easy to construct a worst-case scenario that\(S=\texttt{aaa}\cdots\texttt{ab}\),\(T=\texttt{aaa}\cdots\texttt{a}\)。
In fact, there are many more general manifestations of\(\text{KMP}\) string matching algorithms such as\(\text{BM}\)、\(\text{Sunday}\) Wait, but\(\text{KMP}\) exist\(\text{OI}\) It's enough to get by, and to really figure out the\(\text{KMP}\) It's already hard enough too~.
coda
Finished writing!!! Tired!
This work usesCC BY-SA 4.0 License is granted and additional terms can be used.