Introduction to the

In this quick tutorial, learn how to detect multiple words in a string.

Our example

Let’s assume we have strings:

String inputString = "hello there, william";

Copy the code

Our task is to find if inputString contains the words “hello” and “William”.

So, let’s put our keywords in an array:

String[] words = {"hello"."william"};
Copy the code

Also, the order of the words doesn’t matter; matching is case sensitive.

Use string.contains ()

First, we’ll show how to use the string.contains () method to achieve our goal.

Let’s traverse the key word count and check the occurrence of each item in inputString:


public static boolean containsWords(String inputString, String[] items) {
    boolean found = true;
    for (String item : items) {
        if(! inputString.contains(item)) { found =false;
            break; }}return found;
}

Copy the code

This example is fairly straightforward, and although we need to write more code, this solution is fast for simple use cases.

Use string.indexof ()

Similar to the solution using the string.contains () method, we can use the string.indexof () method to check the indexOf a keyword. To do this, we need a method that accepts inputString and keyword list:

public static boolean containsWordsIndexOf(String inputString, String[] words) {
    boolean found = true;
    for (String word : words) {
        if (inputString.indexOf(word) == -1) {
            found = false;
            break; }}return found;
}
Copy the code

The index inputString of the inner word returned by the indexOf() method. When we have no words in the text, the index will be -1.

Using regular expressions

Now, let’s use regular expressions to match our words. To do this, we will use the Pattern class.

First, let’s define a string expression. Since we need to match two keywords, we’ll build our regular expression rules using two forward-looking:

Pattern pattern = Pattern.compile("(? =.*hello)(? =.*william)");
Copy the code

For the general case:

StringBuilder regexp = new StringBuilder();
for (String word : words) {
    regexp.append("(? =. *").append(word).append(")");
}
Copy the code

After that, we’ll use the matcher() method find() :

public static boolean containsWordsPatternMatch(String inputString, String[] words) {
 
    StringBuilder regexp = new StringBuilder();
    for (String word : words) {
        regexp.append("(? =. *").append(word).append(")");
    }
 
    Pattern pattern = Pattern.compile(regexp.toString());
 
    return pattern.matcher(inputString).find();
}
Copy the code

However, regular expressions have a performance cost. This solution may not perform optimally if we are looking for more than one word.

Use Java 8 and List

Finally, we can use the Java 8 Stream API. But first, we need to perform some simple transformations on the original data:

List<String> inputString = Arrays.asList(inputString.split(""));
List<String> words = Arrays.asList(words);
Copy the code

Now, it’s time to use the Stream API:


public static boolean containsWordsJava8(String inputString, String[] words) {
    List<String> inputStringList = Arrays.asList(inputString.split(""));
    List<String> wordsList = Arrays.asList(words);
 
    return wordsList.stream().allMatch(inputStringList::contains);
}

Copy the code

The above operation returns true if the input string contains all of our keywords.

Alternatively, we could simply use the Collections framework’s containsAll() method to achieve the desired result:

public static boolean containsWordsArray(String inputString, String[] words) {
    List<String> inputStringList = Arrays.asList(inputString.split(""));
    List<String> wordsList = Arrays.asList(words);
 
    return inputStringList.containsAll(wordsList);
}
Copy the code

However, this method only applies to whole words. Therefore, our keywords will only be found if they are separated from Spaces in the text.

Aho-corasick algorithm is used

In short, the Aho-Corasick algorithm is used for text searches using multiple keywords. No matter how many keywords we search or how long the text is, it has O(n) time complexity

Let’s include the Aho-Corasick algorithm dependency in pom.xml:

< the dependency > < groupId > org. Ahocorasick < / groupId > < artifactId > ahocorasick < / artifactId > < version > 0.4.0 < / version > </dependency>Copy the code

First, maven introduces dependency packages. The internal structure will use a tree data structure:

Trie trie = Trie.builder().onlyWholeWords().addKeywords(words).build();
Copy the code

After that, let’s call the parser method with the inputString text, where we want to find the keyword and save the result in the emits collection:

Collection<Emit> emits = trie.parseText(inputString);
Copy the code

Finally, print the result of the run:

emits.forEach(System.out::println);
Copy the code

For each keyword, we look at the starting position of the keyword, the ending position, and the keyword itself in the text:

0:4=hello
13:19=william
Copy the code

Finally, let’s look at the full implementation:

public static boolean containsWordsAhoCorasick(String inputString, String[] words) {
    Trie trie = Trie.builder().onlyWholeWords().addKeywords(words).build();
 
    Collection<Emit> emits = trie.parseText(inputString);
    emits.forEach(System.out::println);
 
    boolean found = true;
    for(String word : words) {
        boolean contains = Arrays.toString(emits.toArray()).contains(word);
        if(! contains) { found =false;
            break; }}return found;
}

Copy the code

In this example, we just look for the whole word. Therefore, if we want to match not only inputString but also helloBaeldung, we should simply remove the onlyWholeWords() attribute from the Trie builder pipeline.

Also, keep in mind that we remove duplicate elements from the emits collection, because there may be multiple matches for the same keyword.

conclusion

In this article, you learned how to find multiple keywords in a string. In addition, we demonstrate examples using the core JDK and the Aho-Corasick library.