String Basics

Definitions

String

A string \(S\) consists of \(n\) characters, where \(n\) is the length of \(S\), written \(|S|\).

Substring

A contiguous segment \(T\) taken from a string \(S\) is called a substring of \(S\).

Subsequence

A new string formed by taking some characters out of \(S\), keeping their original order, is called a subsequence of \(S\).

Prefix

A substring of \(S\) that starts at its first character.

Suffix

A substring of \(S\) that ends at its last character.

Palindrome

A string that reads the same forward and backward.

Periodicity

If a string is the concatenation of one or more copies of the same string of length \(k\), we call it a string with period \(k\).
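
The palindrome and period definitions above can be checked directly from their wording. The following is only a minimal sketch: the helper names `is_palindrome` and `has_period` are made up for illustration, and strings are 0-indexed here, as in `std::string`.

```cpp
#include <cstddef>
#include <string>

// True if s reads the same forward and backward.
bool is_palindrome(const std::string& s) {
    for (std::size_t i = 0, j = s.size(); i + 1 < j; ++i, --j)
        if (s[i] != s[j - 1]) return false;
    return true;
}

// True if s is one or more copies of its first k characters,
// i.e. s is a string with period k in the sense defined above.
bool has_period(const std::string& s, std::size_t k) {
    if (k == 0 || s.size() % k != 0) return false;
    for (std::size_t i = k; i < s.size(); ++i)
        if (s[i] != s[i - k]) return false;
    return true;
}
```

For instance, `has_period("abcabc", 3)` holds, while `has_period("abcab", 3)` does not, because the last copy is incomplete.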

C++ standard library

s.c_str() / s.data(): returns a pointer to the underlying character array.
s.size(): returns the number of characters.
s.find(c, begin): searches for c starting from position begin and returns the position where it is found.
s.substr(begin, len): extracts the substring of length len starting at begin.
s.append(s', pos, len): appends the len characters of s' starting at pos to the end of s.
s.replace(pos, n, s'): replaces the substring of s of length n starting at pos with s'.
s.erase(pos, len): deletes len characters starting at pos.
s.insert(pos, s'): inserts s' at position pos.
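
As a quick illustration of the member functions listed above, here is a small sketch using `std::string`. The function name `demo_string_ops` and the sample strings are invented for this example, and positions are 0-indexed.

```cpp
#include <cstddef>
#include <string>

// Applies the operations from the list above in sequence and returns
// the final string followed by two of the intermediate results.
std::string demo_string_ops() {
    std::string s = "hello world";
    std::size_t n = s.size();            // 11 characters
    std::size_t at = s.find('o', 5);     // first 'o' at index >= 5, which is 7
    std::string sub = s.substr(6, 5);    // "world"
    std::string extra = "!!??";
    s.append(extra, 0, 2);               // "hello world!!"
    s.replace(0, 5, "HELLO");            // "HELLO world!!"
    s.erase(5, 6);                       // "HELLO!!"
    s.insert(5, sub);                    // "HELLOworld!!"
    return s + "@" + std::to_string(at) + "@" + std::to_string(n);
}
```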

Hash

Definition

A hash is essentially a mapping: a function turns something abstract into a concrete number, which makes information easy to compare.

String hash

A string hash maps a string to a number so that strings can be compared quickly. Typically the string is viewed as a multi-digit number in some base, a prime is chosen as the base, and the value is taken modulo a large modulus.

Properties

If two hash values differ, the original strings must differ.
If two hash values are equal, the original strings are not necessarily equal.

If the originals differ but the hash values are equal, we call this a hash collision.
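
A minimal sketch of such a polynomial hash might look like the following. The function name `poly_hash` and the default base 131 and modulus \(10^9+7\) are illustrative choices, not values fixed by the text.

```cpp
#include <cstdint>
#include <string>

// Polynomial hash: read s as a number written in base `base`,
// reduced modulo `mod` after every digit.
std::uint64_t poly_hash(const std::string& s,
                        std::uint64_t base = 131,
                        std::uint64_t mod = 1000000007ULL) {
    std::uint64_t h = 0;
    for (unsigned char c : s) h = (h * base + c) % mod;
    return h;
}
```

With these defaults, `poly_hash("ab")` is \(97\times131+98=12805\), since the character codes of `a` and `b` are 97 and 98.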

Hash collision

Suppose hash values range over \(mod\) possible values and there are \(n\) strings. Then the probability of a hash collision is approximately:

\[p(n,mod)\approx 1-\exp(-\frac{n(n-1)}{2mod}) \]

Proof:

First compute the probability that no collision occurs:

\[\overline{p}(n,mod)=1\times(1-\frac{1}{mod})\times(1-\frac{2}{mod})\times\dots\times(1-\frac{n-1}{mod}) \]

Then, by Taylor's formula:

\[\exp(x)=\sum_{i=0}^{\infty}\frac{x^i}{i!}=1+x+\frac{x^2}{2}+\frac{x^3}{6}+\dots \]

When \(x\) is very small, \(\exp(x)\) is approximately \(1+x\), so \(1-\frac{i}{mod}\approx\exp(-\frac{i}{mod})\).

So the previous expression can be written as:

\[\begin{aligned} \overline{p}(n,mod)&\approx 1\times\exp(-\frac{1}{mod})\times\exp(-\frac{2}{mod})\dots\times\exp(-\frac{n-1}{mod})\\ &=\exp(-\frac{1}{mod}-\frac{2}{mod}-\dots-\frac{n-1}{mod})\\ &=\exp(-\frac{n(n-1)}{2mod}) \end{aligned} \]

So the hash collision probability is:

\[p(n,mod)\approx 1-\exp(-\frac{n(n-1)}{2mod}) \]
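
To get a feel for the numbers, the formula can be evaluated directly. This is only a sketch: the function name `collision_probability` is made up, and it assumes the hash values behave like uniformly random values in \([0,mod)\).

```cpp
#include <cmath>

// Approximate probability that n uniformly random hash values
// in [0, mod) contain at least one collision: 1 - exp(-n(n-1)/(2 mod)).
double collision_probability(double n, double mod) {
    double x = n * (n - 1.0) / (2.0 * mod);
    return -std::expm1(-x);   // 1 - exp(-x), accurate for small x
}
```

With \(mod=10^9+7\), already \(n=10^4\) strings give a collision probability of roughly 0.049, and \(n=10^6\) strings make a collision practically certain. With a value range near \(10^{18}\), the same \(10^6\) strings collide with probability of about \(5\times10^{-7}\).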

So how do we avoid collisions when actually solving problems?

Double hash

As the name suggests, we hash each string with two separate hash functions, and treat two strings as equal only if both hash values are equal.
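
A minimal sketch of a double hash, reusing the bases 133 and 233 and the moduli \(10^9+7\) and \(10^9+9\) that appear in the author's code later in this post. The names `HashPair` and `double_hash` are invented for this example.

```cpp
#include <cstdint>
#include <string>
#include <utility>

using HashPair = std::pair<std::uint64_t, std::uint64_t>;

// Hash s with two independent (base, modulus) pairs.
// Two strings are considered equal only if both components agree.
HashPair double_hash(const std::string& s) {
    const std::uint64_t b1 = 133, m1 = 1000000007ULL;
    const std::uint64_t b2 = 233, m2 = 1000000009ULL;
    std::uint64_t h1 = 0, h2 = 0;
    for (unsigned char c : s) {
        h1 = (h1 * b1 + c) % m1;
        h2 = (h2 * b2 + c) % m2;
    }
    return {h1, h2};
}
```

Because `std::pair` already compares component by component, `double_hash(a) == double_hash(b)` is exactly the "both hash values are equal" test.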

Examples

P3370 [Template] String Hashing (Luogu): a template problem, so there is not much to say.

P2757 [National Training Team] Arithmetic Progression (Luogu). Codeforces Problem 452F is the same problem, so you get double the experience.

Notes

Restated, the problem asks for a position \(i\) in the sequence and some \(t\) such that \(a_i-t\) and \(a_i+t\) lie on opposite sides of \(a_i\). The first idea is to enumerate the middle number and search by brute force.

How do we optimize this? Suppose we keep an array \(b\) indexed by value: while enumerating the middle number, every number to its left is marked 1. For the enumerated number \(x\), we need to check, for every possible \(t\), whether the values of \(b\) at \(x-t\) and \(x+t\) are equal. This is like taking a substring of a sequence and determining whether it is a palindrome, while also supporting point updates.

At this point a segment tree is enough: maintain the hashes in both forward and reverse order, with point updates, and we are done.

~~(Since I didn't write it, I won't put the code.)~~

P3449 [POI2006] PAL-Palindromes (Luogu)

Notes

First look for the properties hidden in the problem. The first thing to notice is that the inputs are all palindromes. Then think about what is required for two strings to concatenate into a larger palindrome.

Suppose there are two palindromes \(s\) and \(t\), and let \(c\) be the minimal period of \(s\), so that \(s\) can be written as \(c^p\). If \(s\) and \(t\) can be concatenated into a palindrome, then \(t\) must be of the form \(c^q\). So two palindromes can be put together into a palindrome exactly when they have the same minimal period.

We just keep a map that counts how many times each minimal period occurs, and we are done. Checking the candidate periods is bounded by a harmonic series, so the time complexity is \(O(n\log n)\).

Code

int n, m;
const int b1 = 133, b2 = 233, m1 = 1e9 + 7, m2 = 1e9 + 9;
ll h1[N], h2[N], s1[N], s2[N], ans;
char s[N];
map < pll , ll > mp;
pll aim;

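// Double hash of s[l..r] (1-indexed), computed from the prefix hashes.
pll geth(int l, int r){
    ll x1 = (h1[r] - h1[l - 1] * s1[r - l + 1] % m1) + m1; x1 %= m1;
    ll x2 = (h2[r] - h2[l - 1] * s2[r - l + 1] % m2) + m2; x2 %= m2;
    return mkp(x1, x2);
}
// Two double hashes are equal only if both components are equal.
bool eq(pll x, pll y){
    return x.first == y.first and x.second == y.second;
}
// Check whether s[1..m] is a repetition of its prefix of length len (whose hash is aim).
bool chk(int len){
    for(int i = len + 1; i <= m; i += len)if(! eq(geth(i, i + len - 1), aim))return false;
    return true;
}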

signed main(){
    n = rd(); s1[0] = s2[0] = 1;
    for(int i = 1; i < N; ++i)s1[i] = s1[i - 1] * b1 % m1, s2[i] = s2[i - 1] * b2 % m2;
    while(n--){
        m = rd(); scanf("%s", s + 1);
        for(int i = 1; i <= m; ++i)h1[i] = (h1[i - 1] * b1 % m1 + s[i]) % m1, h2[i] = (h2[i - 1] * b2 % m2 + s[i]) % m2;
        for(int i = 1; i <= m; ++i)if(m % i == 0){
            aim = geth(1, i);
            if(chk(i)){ans += (mp[aim]++ << 1) + 1; break;}
        }
    }
    printf("%lld", ans);
    return 0;
}

KMP

Introduction

The KMP algorithm is an efficient algorithm for string matching problems. The most basic problem: given a text string and a pattern string, find where and how many times the pattern appears in the text.

Matching problems

There are actually many approaches to this problem.

Brute force: enumerate each position of the text as a starting point and match character by character; on a mismatch, move on to the next starting point.
Hashing: precompute the hash values of the text and the pattern, then compare in \(O(|S|)\).
KMP!

Preface

For problems with one pattern and one text, there is no visible gap between hashing and KMP. But the idea behind KMP extends to problems that match multiple patterns against a text, and that is why KMP is worth learning.

String border

Preface

The border is a very important concept, so we need to understand borders and some of their properties before introducing the matching algorithms.

Concept

We call a string a border of \(s\) if it is both a proper prefix and a proper suffix of \(s\).

Properties

Periodicity

Here \(p\) is called a period of \(s\) if \(s[i]=s[i+p]\) for every \(1\le i\le|s|-p\).

Property 1: if the prefix \(s[1\dots i]\) is a border of \(s\), then \(|s|-i\) is a period of \(s\).
Property 2: if \(p,q\) are both periods of \(s\) and \(p+q\le|s|\), then \(\gcd(p,q)\) is also a period of \(s\).

Property 1 is easy to verify yourself by drawing a diagram.

Proof of Property 2: without loss of generality let \(p<q\), and set \(d=q-p,n=|s|\). We need \(s[i]=s[i+d]\) for every \(i\in[1,n-d]\).

For \(i\in[p+1,n-d]\), we have \(s[i]=s[i-p]=s[i-p+q]=s[i+d]\);
for \(i\in[1,n-q]\), we similarly have \(s[i]=s[i+q]=s[i+q-p]=s[i+d]\).

\(\because p+q\le n,\therefore p\le n-q\), so the two cases together cover every position, and \(d\) is also a period of \(s\). Then let \(q=p,p=d\) and repeat the process (as in the Euclidean algorithm), which yields Property 2.

Borders themselves

Property 3: for a string \(s\), the lengths of all borders of length \(\ge\frac{|s|}{2}\) form an arithmetic progression.
Property 4: the lengths of all borders of a string form \(O(\log n)\) arithmetic progressions.

Proof of Property 3: suppose the longest border has length \(d\); then the minimum period of the string is \(r=|s|-d\). Because \(r\) is a period, \(2r,3r,4r\dots\) are also periods, and they correspond to borders of length \(d-r,d-2r,d-3r\dots\). Conversely, a border of length \(\ge\frac{|s|}{2}\) gives a period \(q\le\frac{|s|}{2}\) with \(r\le q\), so \(r+q\le|s|\) and by Property 2 \(\gcd(r,q)\) is a period; since \(r\) is the smallest period, \(r\mid q\). Hence these border lengths form an arithmetic progression with common difference \(r\).

Proof of Property 4: borders of length \(\ge\frac{|s|}{2}\) are handled by Property 3, so we only discuss the remaining borders. Let \(B\) be the longest of the remaining borders (as shown in the figure). Then every other remaining border \(A\) is also a border of \(B\); this follows from the definition and is ~~(not)~~ obvious from the figure.

So we can again use Property 3 to split off the borders of length \(\ge\frac{|B|}{2}\), and so on. Because \(|B|\) is at least halved each time, there are \(O(\log n)\) progressions.
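
To experiment with these properties on small strings, one can list all border lengths by brute force. The sketch below is for illustration only: the name `border_lengths` is made up, the check is quadratic, and the empty border of length 0 is left out.

```cpp
#include <string>
#include <vector>

// All border lengths of s (proper prefixes that are also suffixes),
// longest first. Brute force, O(n^2), not meant for real inputs.
std::vector<int> border_lengths(const std::string& s) {
    std::vector<int> res;
    int n = static_cast<int>(s.size());
    for (int len = n - 1; len >= 1; --len)
        if (s.compare(0, len, s, n - len, len) == 0) res.push_back(len);
    return res;
}
```

For `abababab` this gives the border lengths 6, 4, 2, so the periods \(|s|-\text{border}\) are 2, 4, 6, and the borders of length at least \(|s|/2=4\) form an arithmetic progression with common difference equal to the minimum period 2.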

Example problem: WC2016 Battle Bundle of Bamboo Poles

Notes

Consider a single string: the different amounts it can contribute are its length minus each of its border lengths, so the problem becomes: given some numbers, how many values not exceeding \(w-n\) can be formed as sums of them. If some value with a given residue \(\bmod n\) can be formed, then adding \(n\) each time yields new values, so this suggests shortest paths over residue classes. Adding the edges by brute force costs \(O(n^2)\), so we consider optimizing.

Because the borders form \(O(\log n)\) arithmetic progressions, we can handle each progression separately and then transfer between different progressions. For an arithmetic progression \(A_i=x+d\times i\), working \(\bmod x\) the residues split into \(\gcd(x,d)\) rings; the reader can draw a diagram to verify this. Connect each point \(i\) to \((i+d)\%x\), and we get the transition:

\[f_i=\min(f_j+x+(id_i-id_j)\times d) \]

During the transition, \(f_j-id_j\times d\) can be maintained with a monotonic queue. On each ring we simply pick the point with the smallest \(f_i\) as the starting point of the transition.

Finally, consider how to convert between two arithmetic progressions. A point \(i\) must have been reached by adding \(x'\) some number of times to an earlier value, now taken \(\bmod x\). This can be turned into connecting \(i\) to \((i+x')\%x\), and the transition is then essentially identical to the equation above (just without the \(x\) term).

~~The author is too much of a novice to write this, so I won't post a half-finished product~~

Prefix function

Before formally discussing KMP, we need to learn about the prefix function.

For a position \(i\), \(\pi(i)\) denotes the prefix function at \(i\): it records the length of the longest equal proper prefix and proper suffix of the substring \(s[1\dots i]\). In other words, the prefix function is the length of the longest border.

As an example: for the string \(s=aabaaba\), the prefix function \(\pi\) at each position is: 0, 1, 0, 1, 2, 3, 4.

How do we compute the prefix function?

For a string, suppose we have already computed \(\pi(1\dots i-1)\). How do we find \(\pi(i)\)?

We know the prefix function records the longest equal proper prefix and proper suffix, so we can compare \(s[i]\) with \(s[\pi(i-1)+1]\). Because \(s[1\dots\pi(i-1)]=s[i-\pi(i-1)\dots i-1]\), if the next characters are also equal, we can directly set \(\pi(i)=\pi(i-1)+1\).

So what if they differ? We keep a variable \(j\), initially \(i-1\). Whenever the comparison fails, we let \(j=\pi(j)\) and continue: \(s[1\dots\pi(j)]=s[i-\pi(j)\dots i-1]\) still holds, so we compare \(s[\pi(j)+1]\) with \(s[i]\) again.

You can refer to the diagram I drew above for details ~~(so ugly)~~. If the two large parts (\(s[1\dots\pi(i-1)]\) and \(s[i-\pi(i-1)\dots i-1]\)) are equal, then only the next character needs to be compared; if those are unequal (like the characters \(c\) and \(a\) in the figure), we keep jumping back along the prefix function. It is easy to see that the substrings at the positions of the four small rectangles in the figure are equal, so once \(s[1\dots\pi(j)]=s[i-\pi(j)\dots i-1]\) and the next characters match, we have found \(\pi(i)\).
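
The procedure described above can be sketched as a short function. This is only an illustration: the name `prefix_function` is chosen here, and the string is shifted to 1-indexed positions so the code lines up with the notation in the text. The variable `j` holds the candidate border length, which corresponds to \(\pi(j)\) in the description.

```cpp
#include <string>
#include <vector>

// Returns pi[0..n], where pi[i] (1-indexed) is the length of the
// longest border of the prefix s[1..i]; pi[0] is unused and stays 0.
std::vector<int> prefix_function(const std::string& str) {
    int n = static_cast<int>(str.size());
    std::string s = " " + str;             // shift to 1-indexed positions
    std::vector<int> pi(n + 1, 0);
    for (int i = 2, j = 0; i <= n; ++i) {  // on entry, j == pi[i - 1]
        while (j > 0 && s[i] != s[j + 1]) j = pi[j];  // jump back on mismatch
        if (s[i] == s[j + 1]) ++j;         // extend the border by one character
        pi[i] = j;
    }
    return pi;
}
```

For `aabaaba`, the entries `pi[1..7]` come out as 0, 1, 0, 1, 2, 3, 4. Since `j` grows by at most one per position and every jump shrinks it, the whole loop runs in linear time.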

We have derived the prefix function, so how do we solve the matching problem using the prefix function?

KMP algorithm

Here is the picture first, qwq.

After looking at the diagram we can see that this process is very similar to computing the prefix function. We first compute the prefix function of the pattern \(t\), then sweep over the text \(s\) in the same manner: if the current character \(s_i\) does not match \(t_{j+1}\), jump \(j\) back along the prefix function. When the pointer \(j\) equals \(|t|\), we have found an occurrence of \(t\) in \(s\).

Code (a template I wrote long ago)

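// N (an upper bound on the string lengths) is assumed to be defined elsewhere.
int n, m, j, k[N];   // k[i]: prefix function of the pattern prefix b[1..i]
char a[N], b[N];     // a: text, b: pattern, both stored 1-indexed
void solve(){
	scanf("%s", a + 1);
	scanf("%s", b + 1);
	n = strlen(a + 1); m = strlen(b + 1);
	// Build the prefix function of the pattern.
	for(int i = 2; i <= m; i++){
		while(j and b[i] != b[j + 1])j = k[j];
		if(b[i] == b[j + 1])j++;
		k[i] = j;
	}
	j = 0;
	// Scan the text; j is the length of the pattern prefix matched so far.
	for(int i = 1; i <= n; i++){
		while(j > 0 and a[i] != b[j + 1])j = k[j];
		if(a[i] == b[j + 1])j++;
		if(j == m){
			cout << i - m + 1 << '\n';   // start position of this occurrence
			j = k[j];
		}
	}
	for(int i = 1; i <= m; i++)cout << k[i] << ' ';
}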

Examples

USACO15FEB Censoring S

Solution:

Looks like a template problem? Just use a stack instead of an array when matching against the text: when a match is found, pop that segment off the stack, and record the matched length for each stack position so that matching can resume from there.

Code

int n, m, fail[N], j, top, cur[N];
char a[N], b[N], st[N];

signed main(){
    // fileio(fil);
    scanf("%s %s", a + 1, b + 1); n = strlen(a + 1), m = strlen(b + 1);
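    // Prefix function of the pattern b.
    for(int i = 2; i <= m; ++i){
        while(j and b[i] ^ b[j + 1])j = fail[j];
        if(b[i] == b[j + 1])++j;
        fail[i] = j;
    }
    j = 0;
    for(int i = 1; i <= n; ++i){
        st[++top] = a[i];                    // push the next text character
        while(j and st[top] ^ b[j + 1])j = fail[j];
        if(st[top] == b[j + 1])++j;
        if(j == m)top -= m, j = cur[top];    // pop the matched pattern, restore the old state
        cur[top] = j;                        // matched length at this stack height
    }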
    for(int i = 1; i <= top; ++i)putchar(st[i]);
    return 0;
}

P4391 [BOI2009] Radio Transmission

Notes

To find the period, simply subtract \(\pi(|S|)\) from the length.
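
As a sketch of this idea, the function below (with the made-up name `shortest_period`) computes the prefix function and returns \(|S|-\pi(|S|)\). Here a period means a length \(p\) with \(S[i]=S[i+p]\) wherever both positions exist, so the last copy may be cut off.

```cpp
#include <string>
#include <vector>

// Shortest period of str, computed as |S| - pi(|S|).
int shortest_period(const std::string& str) {
    int n = static_cast<int>(str.size());
    std::string s = " " + str;             // 1-indexed, as in the text
    std::vector<int> pi(n + 1, 0);
    for (int i = 2, j = 0; i <= n; ++i) {
        while (j > 0 && s[i] != s[j + 1]) j = pi[j];
        if (s[i] == s[j + 1]) ++j;
        pi[i] = j;
    }
    return n - pi[n];
}
```

For `cabcabca` the longest border is `cabca` of length 5, so the shortest period is \(8-5=3\), corresponding to `cab`.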

Trie

Introduction

As the name suggests, it is a plain, unadorned tree. There is a root with some subtrees below it, and the path from the root to a node spells out a string.

A diagram from OI Wiki is quoted here so that the reader can visualize the trie. For example, node 12 represents the string \(\text{caa}\).

The trie template is given first.

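// FL(i, a, b) is the author's loop macro, assumed to mean for(int i = a; i <= b; i++).
int T, n, m;
char s[N];
map < char, int > mp;   // maps a character to an index in 1..62, see init()
struct trie{
	// nex: child pointers; ext[p]: number of inserted strings that pass through node p
	int cnt, nex[N][63], ext[N];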
	void ins(char *s, int l){
		int p = 0;
		FL(i, 1, l){
			int ch = mp[s[i]];
			if(! nex[p][ch])nex[p][ch] = ++cnt;
			p = nex[p][ch]; ext[p]++;
		}
	}
	int query(char *s, int l){
		int p = 0;
		FL(i, 1, l){
			int ch = mp[s[i]];
			if(! nex[p][ch])return 0;
			p = nex[p][ch];
		}
		return ext[p];
	}
	void clear(){
		FL(i, 0, cnt){
			ext[i] = 0;
			FL(j, 0, 62)nex[i][j] = 0;
		}cnt = 0;
	}
}t;
void init(){
	int tot = 0;
	for(char i = 'a'; i <= 'z'; i++)mp[i] = ++tot;
	for(char i = 'A'; i <= 'Z'; i++)mp[i] = ++tot;
	for(char i = '0'; i <= '9'; i++)mp[i] = ++tot;
}

So what is a trie used for?

Applications

Querying strings. Just walk along the edges of the tree; if you reach an empty node, the string does not exist.
This will be covered in other blogs.
01 trie

A brief word about the 01 trie here.

01 trie

The 01 trie is a special kind of trie that stores strings of 0s and 1s. Any integer can be written in binary, so we can maintain a set of numbers with a 01 trie: inserting the numbers keeps them sorted automatically, and you may notice that a 01 trie then behaves like a balanced tree. However, the author has studied data structures, so I won't talk about maintaining a 01 trie here. For strings, it is enough to know that a trie can query strings. (runs away)

Examples

P2580 And so his erroneous roll call begins.

Notes

Just do plain insert operations on the trie, and for each query keep track of whether that string is being asked for the first time.

P4551 Longest XOR Path

Notes

First recall the properties of \(\oplus\): for example, XOR-ing the same value an even number of times cancels out. The problem asks for the path in the tree with the largest XOR sum. We start by computing \(d_i\), the XOR sum from the root to each node; then for any two nodes \(u,v\) the XOR sum of the path between them is \(d_u\oplus d_v\), because the shared part from the root is cancelled. We then insert all \(d_i\) into a 01 trie, enumerate each node, and greedily walk the trie to maximize the XOR. Since a higher bit outweighs all lower bits combined, the greedy choice is correct, and we are done.

Code

// Each insertion creates up to 32 trie nodes, so the trie is sized N << 5.
int n, hd[N], cnt, dis[N], trie[N << 5][2], cnt_t = 1, ans;
bool ed[N << 5];
struct edge{int nxt, to, d;}e[N << 1];
void add(int x, int y, int z){e[++cnt] = (edge){hd[x], y, z}; hd[x] = cnt;}
void dfs(int u, int fa)
{
	for(int i = hd[u]; i; i = e[i].nxt)
	{
		int v = e[i].to, val = e[i].d;
		if(v == fa)continue;
		dis[v] = dis[u] xor val; dfs(v, u);
	}
}
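// Insert x into the 01 trie, from the highest bit down to the lowest.
void ins(int x)
{
	int k = 1;
	for(int i = 31; i >= 0; i--)
	{
		int ret = (x >> i) & 1;
		if(!trie[k][ret])trie[k][ret] = ++cnt_t;
		k = trie[k][ret];
	}
	ed[k] = true;
}
// Greedily walk the trie to maximize x xor (an inserted value).
int find(int x)
{
	int k = 1, ans = 0;
	for(int i = 31; i >= 0; i--)
	{
		int ret = (x >> i) & 1;
		// Prefer the child with the opposite bit, which sets bit i of the result.
		if(trie[k][ret xor 1])k = trie[k][ret xor 1], ans += 1 << i;
		else k = trie[k][ret];
	}
	return ans;
}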
signed main()
{
	cin >> n;
	for(int i = 1, x, y, z; i < n; i++)
	{
		scanf("%d %d %d", &x, &y, &z);
		add(x, y, z); add(y, x, z);
	}
	dfs(1, 0);
	for(int i = 1; i <= n; i++)ins(dis[i]);
	for(int i = 1; i <= n; i++)ans = max(ans, find(dis[i]));
	cout << ans; return 0;
}

Conclusion

That's all for this blog, which focuses on the basics (hash, KMP, trie) with a few harder topics mixed in. There will be three more string blogs, after which the string content will be complete, so stay tuned.

References

Some properties of border - kymru - cnblogs

Introduction to the String Section - OI Wiki

[Algorithm Graphic Animation Series] KMP String Matching Search Algorithm - Tencent Cloud Developer Community

Border / Palindromic Border Theory Notes - Lgx_Q - cnblogs