Small knowledge, big challenge! This article is participating in the creation activity of “Essential Tips for Programmers”

187. Repeated DNA sequences

All DNA is made up of A series of nucleotides abbreviated to ‘A’, ‘C’, ‘G’ and ‘T’, such as’ ACGAATTCCG ‘. When studying DNA, it can sometimes be very helpful to identify repeated sequences in DNA.

Write a function to find all target substrings of length 10 that occur more than once in the DNA string s.

  • Example 1:

Input: s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT” output: [” AAAAACCCCC “, “CCCCCAAAAA”]

  • Example 2:

Input: s = “AAAAAAAAAAAAA” Output: [“AAAAAAAAAA”]

Their thinking

Use sliding window + bit operation

  1. Using Map, you can obtain the number of occurrences of the target substring in O (1) time complexity
  2. But because 0 <= S.length <= 100000, the space complexity is very high if you use map to count the target substring of length 10. But because all DNA is made up of A series of nucleotides abbreviated to ‘A’, ‘C’, ‘G’ and ‘T’, we can use A two-bit binary to represent A character, so A target substring of length 10 would need only A 32-bit integer
  3. In the naive solution, we need to enumerate all the target substrings of length 10, the time complexity is O (10* S.length), we can use the sliding window, maintain the string composition in the window, the time complexity is O (s.length).

code

class Solution {
    public  String decode(int tar) {
        StringBuilder sb = new StringBuilder();
        for (int i=0; i<10; i++) {int cur=tar&3;
            if (cur==0)
                sb.append('A');
            else if(cur==1)
                sb.append('C');
            else if(cur==2)
                sb.append('G');
            else sb.append('T');
            tar>>=2;
        }
        return sb.reverse().toString();
    }
    public  int count(char c) {

        if (c=='A')
            return 0;
        else if(c=='C')
            return 1;
        else if(c=='G')
            return 2;
        else return 3;
    }
    public  List<String> findRepeatedDnaSequences(String s) {
        ArrayList<String> strings = new ArrayList<>();
        Map<Integer,Integer> map=new HashMap<>();
        int cur=0,l=0,r=0,n=s.length(),mask=(1<<20) -1;
        while (r<n)
        {
          if (r-l>=10)
          {
              map.put(cur,map.getOrDefault(cur,0) +1);
              cur-=(count(s.charAt(l++))<<18);

          }
          cur<<=2;
          cur+=count(s.charAt(r++));
        }
        map.put(cur,map.getOrDefault(cur,0) +1);
        for (Map.Entry<Integer, Integer> entry : map.entrySet()) {
            if (entry.getValue()>1)
                strings.add(decode(entry.getKey()));
        }
        returnstrings; }}Copy the code