Location>code7788 >text

Suffix Automation (SAM) Learning Notes

Popularity:171 ℃/2025-02-05 19:32:51

1. Foreword

A suffix automata is an efficient finite state machine that represents all possible suffix substrings of a string. Compared with traditional suffix trees, suffix automata has less space complexity and faster build speed.

It is an algorithm that can store all substrings in the next string. Generally speaking, there should be a state, while SAM can use O(N) states to represent all substrings because it makes many essentially similar substrings of .


2. Basic concepts

a. Status
  • definition: The state represents all substrings with the same longest common prefix.
  • property
    • len: The maximum length of the substring represented by this state.
    • link: A pointer to another state, used to represent the relationship between the current state and the previous state.
b. Transfer edges
  • definition: Transfer edges represent paths from one state to another, usually triggered by characters.
  • structure: Each state can have multiple transfer edges through which other states can be reached.
c. Suffix chain
  • definition: A suffix chain is a series of state connections used to represent the relationship between suffix substrings of different lengths in a string.
  • effect: When building an automaton, the suffix chain helps maintain information with the longest common prefix.

3. Build a suffix automaton

a. Initialization
  1. Create an initial state (init),Thatlenis -1,linkThe pointer points to itself.
  2. The node corresponding to the last character inserted islast
b. Extend node
  • step
    1. Create a new node: When processing a new character, first create a new state and put itlenSet to Currentlastoflen + 1。
    2. Backtracking check: From the current onelastStart backtracking, looking for states where new transfer edges can be added along the transfer edges and suffix chains. If a child node in a certain state has already existed
      In, updatelinkpointer.
    3. Maintain the last public state: During the backtracking process, make sure that the suffix chains of all relevant states are correct.
c. Implementation details
  • Use classes or structures in C++ to represent states and transfer edges:
struct state {
     int len;
     int link;
     unordered_map<char, int> next; // Transfer edges: character to state mapping
 };

4. Core algorithm

a. Extend node
  • Function description
    • Given the currentlastStatus and a new characterc, create and return a new state.
  • Implement steps
    1. Initialize a new statepand will = last->len + 1
    2. Traversal fromlastThe suffix chain at the beginning, looking for the state where the transfer edge can be addedq
    3. renewqchild nodes to ensure there is no conflict.
    4. Return to a new statuspAnd updatelast
b. Calculate the number of occurrences
  • method
    • Take advantage of the properties of the suffix automaton to traverse all states and count themlenAttribute to calculate the number of occurrences of different substrings.
  • Code example
int countOccurrences() {
    int res = 0;
    for (const auto& state : states) {
        if ( != -1) {
            res +=  - ;
        }
    }
    return res;
}
c. Longest common prefix
  • method
    • Find the longest common prefix in both strings.
  • Implement steps
    1. Convert the problem to find the corresponding path in the suffix automata.
    2. Use the backtracking algorithm to start from the root node and look for the longest common prefix along the transfer edge.

5. Advanced Applications

a. The longest substring of the scripture
  • Thinking
    • Use a suffix automata and some auxiliary data structures (such as hash tables and managers) to record symmetric information at each position.
    • Traverse all possible center points and look for the longest palindrome.
b. Multi-pattern matching
  • method
    • Merge multiple pattern strings into the same suffix automaton.
    • Use an automaton for efficient multi-pattern matching.

6. FAQs and debugging skills

a. Memory leak
  • Make sure that each newly created state is released correctly.
b. Pointer error
  • Check all statuses carefullylinkandnextWhether the attribute is initialized correctly.
c. Performance optimization
  • Replace the default hash table with more efficient data structures such as skip tables or balanced binary trees.
  • Parallelize certain sequence-independent calculation steps.

7. More model extensions

1. Check whether the string appears
  • Question page
    • Given a text string s and multiple pattern strings t, we want to check whether the string t appears as a substring of s.
  • Thinking
    • Let's build an automaton for s first. To check whether the pattern string t appears in s, we transfer from the starting point along the transfer edge according to the character of t. If it cannot be transferred at a certain point, the pattern string t is not a substring of s. If we can handle the complete string t like this, then the pattern string appears in s.

2. Calculate how many different substrings are in the given string.
  • Question page
    • Give a string s and calculate the number of different substrings.
  • Thinking
    • Constructs a suffix automata for string s. Each substring of s is equivalent to some paths in the automaton. Therefore, the number of different substrings is equal to the number of different paths in the automaton with t_0 as the starting point. make\(d_{v}\)is the number of paths starting from state v (including paths with zero length).

3. Total length of all different substrings
  • Question page
    • Given a string s, calculate the total length of all different substrings.
  • Thinking
    • This question is similar to the previous question, but now we need to consider dynamic programming in two parts: the number of different substrings\(d_{v}\)and their total length\(ret_{v}\)

4. Dictionary order k big substring
  • Question page
    • Given a string s. Multiple groups of queries, each group of queries gives a number k, and query the substrings with the largest dictionary kth in all substrings of S.
  • Thinking
    • The k-th substring of the dictionary order corresponds to the k-th path in the dictionary order in SAM. Therefore, after calculating the number of paths in each state, we can easily find the k-th path from the root of the SAM.

5. Minimum cyclic shift
  • Question page
    • Given a string s. Find the cyclic shift with the smallest dictionary order.
  • Thinking
    • The string S+S contains all loop shifts of the string S as a substring. So the problem is simplified to find the smallest length on the suffix automaton corresponding to S+S\(\left|S\right|\)This can be done in an ordinary way: we start from the initial state and greedily access the smallest characters.

6. The first time it appears
  • Question page
    • Given a text string t, multiple sets of queries. Each time the query string s first appears in the string t (the beginning of s).
  • Thinking
    • This kind of simple O(T|t|) approach, but obviously it can't be passed. Consider optimization:
      We construct a suffix automata. We preprocess all state locations in SAM\(\operatorname{firstpos}\). That is, for each state v we want to find the position at the end of the state that first appears\(\operatorname{firstpos}[v]\). In other words, we want to find each collection first\(\operatorname{endpos}\)The smallest element in (apparently we cannot explicitly maintain all\(\operatorname{endpos}\)gather).

      For maintenance\(\operatorname{firstpos}\)In these positions, we extend the original function to sam_extend(). When we create a new state\(\textit{cur}\)When we ordered:

      \[\operatorname{firstpos}(\textit{cur})=\operatorname{len}(\textit{cur})-1 \]

      When we copy node q to\(\textit{clone}\)When we ordered:

      \[\operatorname{firstpos}(\textit{clone})=\operatorname{firstpos}(q) \]

      (Because the only other option of the value\(\operatorname{firstpos}(\textit{cur})\)Obviously too big).

      Then the answer to the query is\(\operatorname{firstpos}(t)-\left|s\right|+1\), where t is the state of the corresponding string s.


7. All locations appear
  • Question page
    • The same problem as above, this time you need to query all the locations where the pattern string appears in the text string t.
  • Thinking
    • Use the tree structure of the suffix automaton to traverse the subtree and output once the end point node is found.

8. The shortest string without occurrence
  • Question page
    • Given a string s and a specific character set, we want to find a string with the shortest length that has not appeared in s.
  • Thinking
    • make\(d_{v}\)For node v's answer, that is, we have processed part of the substring, currently in state v, and want to find the minimum number of characters to add to the discontinuous transfer. calculate\(d_{v}\)Very simple. If there is no transfer using at least one character in the character set,\(d_{v}=1\). Otherwise, adding a character is not enough, we need to find the minimum value of all transfers