1. Talk about the improved TF-IDF

IDF in TF-IDF is a weighting scheme that tries to suppress noise: it simply assumes that words appearing in few documents are more important and words appearing in many documents are less useful. This assumption breaks down on a corpus of similar documents, where the keywords shared by similar texts are easily masked. For example, if corpus D contains many education articles and text J is an article on education and culture, the IDF values of education-related words will be small, so the recall of the keywords extracted from J will be low.

Improved method: TF−IWF (Term Frequency-Inverse Word Frequency)
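A minimal sketch of TF-IWF, assuming the common formulation in which IWF replaces the document count with the term's frequency over all words in the corpus (the function name and the toy corpus below are illustrative, not from the original text):

```python
import math
from collections import Counter

def tf_iwf(doc_tokens, corpus_tokens):
    """TF-IWF score for each term of one document.

    TF  = count of term t in the document / document length
    IWF = log(total word occurrences in the corpus / occurrences of t)
    """
    corpus_counts = Counter(corpus_tokens)
    total = sum(corpus_counts.values())
    doc_counts = Counter(doc_tokens)
    n = len(doc_tokens)
    return {t: (c / n) * math.log(total / corpus_counts[t])
            for t, c in doc_counts.items()}
```

Unlike IDF, a frequent topic word in a topic-heavy corpus is down-weighted smoothly rather than being crushed toward zero by its document count.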


2. Talk about K-means and spectral clustering

Clustering algorithms are unsupervised machine learning algorithms: there is no class label y, and similar samples are grouped together according to their features. K-means works as follows: randomly select k points as cluster centers; compute the distance from every other point to each center and assign each point to its nearest center; after all points are assigned, recompute the center of each cluster, then reassign every point to its nearest new center. Repeat this process until the centers no longer change.

The idea of spectral clustering is to treat samples as vertices and the similarity between samples as weighted edges, which turns the clustering problem into a graph-partitioning problem: find a cut such that the total weight of edges between groups is as low as possible (low similarity between groups) and the total weight of edges within groups is as high as possible (high similarity within groups).
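A compact sketch of that idea for two groups, assuming an RBF similarity graph and the unnormalized Laplacian (real implementations, e.g. scikit-learn's SpectralClustering, are more elaborate; the function name and sigma are illustrative):

```python
import numpy as np

def spectral_bipartition(X, sigma=3.0):
    # weighted graph: RBF similarity between every pair of samples
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # unnormalized graph Laplacian L = D - W
    L = np.diag(W.sum(axis=1)) - W
    # the sign of the Fiedler vector (eigenvector of the second-smallest
    # eigenvalue) gives a cut with low between-group edge weight
    _, vecs = np.linalg.eigh(L)
    return (vecs[:, 1] > 0).astype(int)
```

For k > 2 groups, one takes the first k eigenvectors and runs K-means on the resulting spectral embedding.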

3. The idea of distillation: why distill?

Knowledge distillation distills the knowledge of a trained model into another model. Specifically, it transfers knowledge from one network to another, which may have the same or a different architecture. The idea is to first train a teacher network, then train the student network on both the teacher network's outputs (soft targets) and the true labels of the data.

During training we need complex models and large computational resources in order to extract information from very large, highly redundant data sets. In experiments, the best-performing models are often large, or even ensembles of multiple models. However, a large model is inconvenient to deploy as a service; the common bottlenecks are:

  • Slow inference;
  • High demands on deployment resources (memory, GPU memory, etc.), while deployment imposes strict limits on latency and computing resources.

Therefore, model compression (reducing the number of parameters in a model while maintaining performance) becomes an important issue. Model distillation is a method of model compression.
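A minimal NumPy sketch of a typical distillation objective: a temperature-softened KL term against the teacher's outputs plus the ordinary cross-entropy on the true label (the function names, T, and alpha are illustrative assumptions, not from the original text):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, true_label, T=2.0, alpha=0.5):
    # soft target: KL(teacher || student) at temperature T, scaled by T^2
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = (p_t * (np.log(p_t) - np.log(p_s))).sum() * T * T
    # hard target: cross-entropy of the student against the true label
    hard = -np.log(softmax(student_logits)[true_label])
    return alpha * soft + (1 - alpha) * hard
```

The temperature softens the teacher's distribution so the student also learns the relative probabilities of the wrong classes, not just the argmax.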

4. What are the methods of distillation? What is the student model in distillation?

Taking the BERT model as an example:

  • Logit distillation
  • Distillation beyond logits (intermediate layers): TinyBERT
  • Curriculum distillation
  • Dynamic early exit: FastBERT

5. What memory optimizations did Python make?

Python uses memory pools to reduce memory fragmentation and improve execution efficiency. Garbage collection is done mainly by reference counting; the problem of circular references among container objects is solved by mark-and-sweep, and the efficiency of garbage collection is improved by generational collection.
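The reference-cycle case can be demonstrated directly with the `gc` module (a small illustrative snippet):

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

def make_cycle():
    a, b = Node(), Node()
    a.ref, b.ref = b, a  # circular reference: refcounts never reach zero

gc.collect()              # start from a clean state
make_cycle()              # both Nodes are now unreachable, but still alive
collected = gc.collect()  # mark-and-sweep detects and frees the cycle
print(collected)          # at least the two Node objects were collected
print(gc.get_threshold()) # generational thresholds, e.g. (700, 10, 10)
```

Without the cycle collector, the two `Node` objects would leak: each keeps the other's reference count at one forever.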

6. How to save memory?

  • Manually reclaim variables that are no longer needed (e.g. `del` followed by garbage collection);
  • Downcast numeric data to 32-bit or 16-bit types (subject to each column's value range and dtype)

The following is a code example:

import numpy as np

def reduce_mem_usage(props):
    start_mem_usg = props.memory_usage().sum() / 1024 ** 2
    print("Memory usage of the dataframe is:", start_mem_usg, "MB")
    NAlist = []  # columns that contained NaN and were filled with -999
    for col in props.columns:
        if props[col].dtype != object:
            print("**************************")
            print("column:", col)
            print("dtype before:", props[col].dtype)
            mmax = props[col].max()
            mmin = props[col].min()
            # integer dtypes cannot hold NaN, so NaN must be filled first
            if not np.isfinite(props[col]).all():
                NAlist.append(col)
                props[col].fillna(-999, inplace=True)
            # test whether the column can be represented as integers
            asint = props[col].fillna(0).astype(np.int64)
            result = np.fabs(props[col] - asint).sum()
            isInt = result < 0.01
            if isInt:
                # make integer / unsigned integer datatypes
                if mmin >= 0:
                    if mmax <= 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mmax <= 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mmax <= 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mmin > np.iinfo(np.int8).min and mmax < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mmin > np.iinfo(np.int16).min and mmax < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mmin > np.iinfo(np.int32).min and mmax < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mmin > np.iinfo(np.int64).min and mmax < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)
            else:
                # note: floats are converted to float16 here
                props[col] = props[col].astype(np.float16)
            print("dtype after:", props[col].dtype)
            print("**************************")
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024 ** 2
    print("Memory usage is:", mem_usg, "MB")
    print("This is", 100 * mem_usg / start_mem_usg, "% of the initial size")
    return props, NAlist

7. How can pandas read oversized files?

The data can be read in chunks.

The following is a code example:

import pandas as pd

data_path = r'E:\python\Study\BiGData\demo.csv'

def read_bigfile(path):
    # read the file in chunks; each get_chunk call returns one chunk
    df = pd.read_csv(path, engine='python', encoding='GBK', iterator=True)
    loop = True
    chunkSize = 10000
    chunks = []
    while loop:
        try:
            chunk = df.get_chunk(chunkSize)
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped.")
    df = pd.concat(chunks, ignore_index=True)
    return df

after_df = read_bigfile(path=data_path)

8. Longest substring without repeating characters

The problem is Leetcode-3, difficulty: medium.

Method: two pointers + sliding window

We define two pointers, start and end, which delimit the sliding window.

start begins at 0; end scans the string linearly, character by character, while a dictionary record stores the latest index of each character.

If the current character already appears in record, move start to max(start, record[char] + 1) so that the window no longer contains a duplicate of that character.

Note: record must be updated in both cases: a character not seen before is added to record, and for a character already in record, its stored value is updated to the new index.

Code:

class Solution:
    def lengthOfLongestSubstring(self, s: str) -> int:
        record = {}
        start,res = 0,0
        for end in range(len(s)):
            if s[end] in record:
                start = max(start, record[s[end]] + 1)
            record[s[end]] = end
            res = max(res, end - start + 1)
        return res

Time complexity: O(n)

Space complexity: O(|Σ|), where Σ is the character set (commonly treated as O(1))

9. Determine whether a linked list has a cycle, and find the entry of the cycle

Determine whether the linked list has a cycle (Leetcode-141)

Two methods of solving the problem are provided as follows:

Method one: hash table

Traverse the nodes one at a time, checking whether each node has been visited before.

If it has, the list has a cycle and we return True. If not, add the node to the hash set and continue; if the traversal finishes without revisiting any node, return False.

The code is as follows:

class Solution:
    def hasCycle(self, head: ListNode) -> bool:
        seen = set()
        while head:
            if head in seen:
                return True
            seen.add(head)
            head = head.next
        return False

Time complexity: O(n)

Space complexity: O(n)

Method two: fast and slow pointers

Define two pointers, one fast and one slow. The slow pointer moves one step at a time and the fast pointer moves two. Since the fast pointer is faster than the slow one, if the list has a cycle the fast pointer will eventually catch up with and meet the slow pointer inside the cycle.

The code is as follows:

class Solution:
    def hasCycle(self, head: ListNode) -> bool:
        fast = slow = head
        while fast and fast.next:
            slow = slow.next
            fast = fast.next.next
            if fast == slow:
                return True
        return False

Time complexity: O(N)

Space complexity: O(1)

Finding the entry of the cycle (Leetcode-142)

Again use the fast and slow pointers.

First, set things up as follows: suppose the list has a + b nodes, with a nodes from the head to the cycle entry (excluding the entry node) and b nodes in the cycle; let the fast pointer take f steps and the slow pointer take s steps.

Since the fast pointer moves twice as fast as the slow one, f = 2s.

When the two pointers first meet, f = s + nb for some integer n, i.e. fast has traveled n more full laps of the cycle than slow.

Combining the two equations, at the first meeting s = nb and f = 2nb.

The remaining question is: how do we find a?

Consider the moment the two pointers first meet:

For the slow pointer: reaching the cycle entry takes a + nb steps in total, and it has already taken nb steps, so walking a more steps brings it exactly to the entry;

For the fast pointer: if we reset it to the head and let it move one step at a time, then after a steps it is also exactly at the entry;

It follows that when the two pointers meet again, they are both at the entry of the cycle.

The code is as follows:

class Solution:
    def detectCycle(self, head: ListNode) -> ListNode:
        fast = slow = head
        while True:
            if not (fast and fast.next):
                return None
            fast = fast.next.next
            slow = slow.next
            if fast == slow:
                break
        fast = head
        while fast != slow:
            fast = fast.next
            slow = slow.next
        return fast

Time complexity O(N)

Space complexity O(1)

