This article is participating in the “Java Theme Month – Java Brush questions punch card”, see the activity link for details

Topic describes

This is 28. Implement strStr() on LeetCode, and the difficulty is simple.

Tag: “string matching”, “KMP”

Implement the strStr() function.

Given two strings haystack and needle, find the first place in the haystack string where needle appears (the subscript starts at 0).

If none exists, -1 is returned.

Description:

  • What value should we return when needle is an empty string? This is a good question to ask in an interview.

  • For this case, we should return 0 when needle is an empty string. This is consistent with the C language STRSTR () and the Java definition of indexOf().

Example 1:

Input: haystack = "hello", needle = "ll" Output: 2Copy the code

Example 2:

Input: haystack = "aaAAA ", needle = "bba" Output: -1Copy the code

Example 3:

Input: haystack = "", needle = "" Output: 0Copy the code

Tip:

  • 0 <= haystack.length, needle.length <= 5 *
    1 0 4 10^4
  • Haystack and needle are only lowercase characters

Simple solution

Enumerates each character in ss as the “starting point” of the original string and the “first” of the matching string each time:

  • Match successful: Return the original “origin point” of the match.
  • Failed match: Enumerate the next “origin point” of the string and retry the match.

Code:

class Solution {
    public int strStr(String ss, String pp) {
        int n = ss.length(), m = pp.length();
        char[] s = ss.toCharArray(), p = pp.toCharArray();
        // enumerate the origin point of the string
        for (int i = 0; i <= n - m; i++) {
            // Start from the "starting point" of the original string and the "first" of the matching string
            int a = i, b = 0;
            while (b < m && s[a] == p[b]) {
                a++;
                b++;
            }
            // If an exact match is found, return the "origin point" subscript of the string
            if (b == m) return i;
        }
        return -1; }}Copy the code
  • Time complexity: n is the length of the original string, m is the length of the matching string. The complexity of enumeration is O(n− M)O(n-m)O(N − M), and the complexity of constructing and comparing strings is O(m)O(m)O(m) O(m). The overall complexity is O((n−m)∗m)O((n-m) * m)O((n− M)∗m).

  • Space complexity: O(1)O(1)O(1).


KMP algorithm

KMP algorithm is a fast search matching string algorithm, its role is actually the question: how to quickly find “matching string” in the “original string”.

Without pruning, the complexity of the above naive solution is O(m∗n)O(M * n)O(m∗n) O(m * n)O(m∗n), while the complexity of the KMP algorithm is O(M +n)O(M + N)O(M + N).

KMP was able to work in the
O ( m + n ) O(m + n)
The search is completed within complexity because it can extract effective information in the process of “incomplete matching” for reuse to reduce the consumption of “repeated matching”.

You may not understand, but that’s ok, we can understand KMP by using 🌰.

1. Matching process

Before simulating the KMP matching process, we first establish two concepts:

  • Prefix: for stringsabcxxxxefg, we callabcBelong toabcxxxxefgA prefix of.
  • Suffix: for stringsabcxxxxefg, we callefgBelong toabcxxxxefgA suffix of.

Then we assume that the original string is abeababeabf and the matching string is abeabf:

We can start by looking at how the match would work without using KMP (without using the substring function).

First, there is a pointer to the current matching position in the “original string” and “match string” respectively.

The “start point” for the first match is the first character, a. Obviously, the subsequent ABEAb matches, and both Pointers move to the right at the same time (black bar).

“Naive matching” is no different from “KMP” in the parts where abeAB matches.

Until the first different position appears (red label) :

Here’s where “naive matching” and “KMP” differ:

  • Let’s start with the “naive matching” logic:

1. Move the pointer of the original string to the next position of the “start point” (bCharacter); The pointer to the matching string moves to the starting position.

2. If no match is found, the pointer to the original string is moved backward until it matches the matching string.

As shown in figure:

That is, in the case of naive matching, if a match fails, the pointer to the original string is moved to the next “start point,” the pointer to the matching string is moved to the start position, and the match is attempted again.

It is easy to understand why the complexity of “naive matching” is O(m∗n)O(m * n)O(m∗n).

  • Then let’s look at the “KMP matching” process:

First, the matching string checks to see if the same “prefix” and “suffix” exists in the previously matched parts. If so, skip to the next position of the “prefix” to continue matching:

After jumping to the next matching position and trying to match, it is found that the characters of the two Pointers do not match, and there is no same “prefix” and “suffix” in front of the matching string pointer. At this time, you can only go back to the starting position of the matching string and start again:

At this point, you should know why KMP is faster than the naive solution:

  • Because KMP uses the same “prefix” and “suffix” in the matched part to speed up the next match.

  • Because the KMP string pointer does not backtrack (there is no naive matching to the next “start point”).

The first point is intuitive and easy to understand.

We can focus on the second point, what does it mean that the original string does not go back to the “origin point”?

What this means is that as the matching process progresses and the string pointer moves to the right, we are essentially constantly rejecting “impossible” solutions.

When our string pointer fromiPosition moved back tojThe position does not just mean that the string subscript range is
[ i . j ) [i,j)
The character matches or does not match the “match string”, and is rejected by those whose subscript range is “original string”
[ i . j ) [i,j)
Is a subset of match initiation points.

2. Analyze the implementation

Is this the end of it? Ready to start implementing the matching process?

We can look at the complexity first. In the worst case, we need to scan the whole string, O(n)O(n)O(n). At the same time, if the match fails, check the same prefix and suffix of the matched part and go to the corresponding position. If the match fails, check whether the previous part has the same prefix and suffix and go to the corresponding position. The complexity of this part is O(m2)O(m^2)O(m2), so the overall complexity is O(n∗m2)O(n * m^2)O(n∗m2) O(n∗m2)O(n ∗m2) while our naive solution is O(m∗n)O(m * n)O(m∗n) O(m∗n).

So there are some properties that we haven’t used yet.

Obviously, scanning the entire string is unavoidable, and the only thing we can optimize is the process of “checking for the same prefix and suffix for matched parts”.

Further, the purpose of checking “prefix” and “suffix” is “to determine where the next segment in the matching string starts to match”.

At the same time, we find that for any position in the matching string, the position of the next matching point initiated by this position is actually independent of the original string.

For example 🌰, for character D matching string abcabd, the next matching point jump initiated by it must be the position of character C. Because the same “prefix” and “suffix” of character D is the character C next to character AB.

It can be seen that the process of jumping from one matching position to the next matching position is independent of the original string, and we call this process searchnextPoints.

Obviously we can preprocess itnextArray, the value of each position in the array is the target position to which the subscript should jump (nextPoint).

What is the complexity when we do this optimization?

The complexity of the preprocessing next array is unknown, and the matching process can scan at most a complete string, and the complexity is O(n)O(n)O(n).

So if we want the entire KMP process to be O(m+n)O(m +n)O(m +n)O(m +n), then we need to preprocess the next array within the complexity of O(m)O(m)O(m) O(m).

So what we’re focusing on is how do we get in
O ( m ) O(m)
Processing within complexitynextThe array.

3. nextBuilding arrays

Next, let’s look at how the next array is preprocessed in O(m)O(m)O(m).

Assuming there is a matching string aaABBAB, let’s see how the corresponding next is constructed.

This is how the entire next array is built, O(m)O(m)O(m).

So far the complexity of the whole KMP matching process is O(m+n)O(m +n)O(m +n).

4. Code implementation

When actually coding, I usually append a space (sentry) to the headers of the original and matching strings.

The goal is to make the j subscript start at 0, save the trouble of j starting at -1.

The whole process is exactly the same as the above analysis, and I have written some relevant comments into the code.

Code:

class Solution {
    / / KMP algorithm
    // ss: original string pp: matching string
    public int strStr(String ss, String pp) {
        if (pp.isEmpty()) return 0;
        
        // Read the length of the original and matching strings respectively
        int n = ss.length(), m = pp.length();
        // Both the original string and the matching string are preceded by a space so that their subscripts start at 1
        ss = "" + ss;
        pp = "" + pp;

        char[] s = ss.toCharArray();
        char[] p = pp.toCharArray();

        // Construct the next array with the length of the matching string (next array is related to matching string)
        int[] next = new int[m + 1];
        // I = 2, j = 0, I < or equal to the length of the matching string
        for (int i = 2, j = 0; i <= m; i++) {
            // next(j) = next(j)
            while (j > 0&& p[i] ! = p[j +1]) j = next[j];
            // if the match is successful, make j++ first
            if (p[i] == p[j + 1]) j++;
            // Update next[I] to end the loop, I ++
            next[i] = j;
        }

        I = 1, j = 0, I is less than or equal to the length of the original string.
        for (int i = 1, j = 0; i <= n; i++) {
            J = next(j)
            while (j > 0&& s[i] ! = p[j +1]) j = next[j];
            // if the match is successful, j++ is used first, and i++ is used after the loop is completed
            if (s[i] == p[j + 1]) j++;
            // If the whole section matches, return the index directly
            if (j == m) return i - m;
        }

        return -1; }}Copy the code
  • Time complexity:nIs the length of the original string,mIs the length of the matching string. Complexity of
    O ( m + n ) O(m + n)
    .
  • Space complexity: BuiltnextThe array. Complexity of
    O ( m ) O(m)
    .

conclusion

The application scope of KMP algorithm is wider than that of Manacher algorithm. Manacher algorithm can only be applied to “palindrome string” problem, which is relatively limited, while “substring matching” problem is very common.

The significance of memorizing such an algorithm is that it is equivalent to having a time complexity in the brain
O ( n ) O(n)
This API passes in a string and a matching string and returns the matching string at the location of the original string.

Therefore, Mitsuha strongly recommends that you memorize the template on the basis of “understanding KMP”


The last

This is the 28th article in our “Brush through LeetCode” series. The series started on 2021/01/01, and there are 1916 questions in LeetCode by the start date. Some of them have locks, so we will finish all the questions without locks.

In this series of articles, in addition to explaining how to solve the problem, I will also present the most concise code possible. If a general solution is involved, the corresponding code template will also be used.

In order to facilitate the students to debug and submit the code on the computer, I set up the relevant warehouse: github.com/SharingSour…

In the repository, you’ll see links to the series of articles, the corresponding code for the series of articles, the LeetCode source links, and other preferred solutions.