Directory:

Why use Naive Bayes for text classification
Introduction
Prepare data: build word vectors from text
Training algorithm: calculate probabilities from word vectors
Test algorithm: modify the classifier according to reality
Prepare data: the bag-of-words document model

Why naive Bayes

Each comment is converted into a word vector (zeros and ones indicating whether each vocabulary word appears), and the model treats every feature in that vector as independent of the others. This is exactly the conditional-independence assumption that naive Bayes relies on.

1/ Introduction

To extract features from text, you first need to split the text into tokens. The features here are tokens from the text: a token can be a single word, or a non-word string such as a URL, an IP address, or any other character sequence. A document is then represented as a word vector, where a value of 1 means the token is present and 0 means it is not. Take the message board of an online community as an example. To keep the community healthy, we need to block abusive comments, so we want to build a fast filter: if a message contains an abusive or insulting word, it is flagged as inappropriate. Filtering this kind of content is a common requirement. We create two categories (labels) for this problem: insulting and non-insulting, denoted by 1 and 0 respectively.

2/ Prepare data: Build word vectors from text

We treat each piece of text as a word vector (also called a token vector); that is, we turn each comment into a vector of words. Create a Python file called bayes.py and build the word vectors with the code below.
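A minimal sketch of this data-preparation code is given below. It assumes the classic toy message-board data that yields the vocabulary shown in the next session; loadDataSet and createVocabList are illustrative names, while setOfWords2Vec, myVocabList and listOPosts are the names that appear in the interactive session that follows.

def loadDataSet():
    """Toy data: six tokenized comments and their labels (1 = insulting)."""
    postingList = [
        ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

def createVocabList(dataSet):
    """Union of all documents: the list of unique words (the vocabulary)."""
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)   # set union removes duplicates
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    """Turn a document into a 0/1 vector over the vocabulary."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1   # 1 = word is present
    return returnVec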

First, generate the vocabulary at the Python prompt. The program makes clever use of Python's set data type: taking the union over all documents produces a list of every distinct word that appears. The output is:

['mr', 'cute', 'please', 'to', 'steak', 'worthless', 'not', 'how', 'so', 'I', 'stop', 'ate', 'buying', 'help', 'has', 'maybe', 'dog', 'him', 'flea', 'posting', 'stupid', 'is', 'food', 'garbage', 'take', 'park', 'my', 'quit', 'licks', 'dalmation', 'love', 'problems']

Check the list and you will find that it contains no duplicate words. Then take a look at setOfWords2Vec in action:

>>> bayes.setOfWords2Vec(myVocabList, listOPosts[0])
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
>>> bayes.setOfWords2Vec(myVocabList, listOPosts[3])
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

3/ Training algorithm: Calculate probabilities from word vectors

Now that you've seen how to convert a set of words into a vector of numbers, let's use those numbers to compute probabilities. Take Bayes' rule from the previous article and replace x and y with w. The bold w indicates a vector: it consists of multiple values (0s and 1s), one for each word in the vocabulary.
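For reference, Bayes' rule written with the word vector w in place of x is:

p(ci | w) = p(w | ci) p(ci) / p(w)

The classifier computes this value for each class ci and picks the class with the larger result; since p(w) is the same for both classes, only the numerators need to be compared.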

Using the formula above, compute the value for each category in the training set and then compare the two probability values. First, P(ci) is computed by dividing the number of documents in category i (insulting or non-insulting comments) by the total number of documents. Next we compute P(w|ci), and this is where the naive Bayes assumption comes in. If w is expanded into its individual features, the probability becomes P(w0, w1, w2, ..., wN | ci). Assuming all words are independent of each other, which is the conditional independence assumption mentioned earlier, we can instead compute P(w0|ci) P(w1|ci) P(w2|ci) ... P(wN|ci), which makes the calculation much easier. The pseudocode for the training function is:

Count the number of documents in each category (insulting and non-insulting)
For each training document:
    If a token appears in the document: increment the count for that token
    Increment the total token count
For each category:
    For each token:
        Divide the token count by the total token count to get the conditional probability
Return the conditional probabilities for each category
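A minimal NumPy sketch of this training routine, following the pseudocode above (trainNB0 is the conventional name for this function; the exact code in bayes.py may differ):

from numpy import zeros

def trainNB0(trainMatrix, trainCategory):
    """Estimate P(ci) and P(w|ci) from the training word vectors.

    trainMatrix   -- list of 0/1 word vectors, one per document
    trainCategory -- list of class labels (1 = insulting, 0 = not)
    """
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    # P(c1): fraction of documents that are insulting
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = zeros(numWords)          # per-word counts for class 0
    p1Num = zeros(numWords)          # per-word counts for class 1
    p0Denom = 0.0                    # total word count for class 0
    p1Denom = 0.0                    # total word count for class 1
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom         # P(w|c1) for every vocabulary word
    p0Vect = p0Num / p0Denom         # P(w|c0) for every vocabulary word
    return p0Vect, p1Vect, pAbusive

Training on the toy data then looks like this in the interactive session:

>>> trainMat = [bayes.setOfWords2Vec(myVocabList, doc) for doc in listOPosts]
>>> p0V, p1V, pAb = bayes.trainNB0(trainMat, listClasses)

The values pAb, p0V and p1V are the ones discussed next.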

First, we find that the probability of a document falling into the insult category (pAb) is 0.5, which matches the training data: half of the comments are insulting. Next, look at the vocabulary word cute: it appears once in the category-0 documents and never in the category-1 documents, and its conditional probabilities are 0.04166667 and 0.0 respectively. The calculation is correct.

4/ Test algorithm: Modify classifier according to reality

When a document is classified with the Bayesian classifier, the probabilities of its words are multiplied together to obtain the probability that the document belongs to a given category. If any one of those probabilities is 0, the final product is also 0. To reduce this effect, initialize the occurrence count of every word to 1 and the denominators to 2. This means modifying the initialization code in trainNB0():

p0Num = ones(numWords); p1Num = ones(numWords)
p0Denom = 2.0; p1Denom = 2.0

Another problem is numerical underflow: multiplying many very small probabilities can round down to 0, so we take the natural log of the products, which turns the multiplications into additions:

p1Vect = log(p1Num/p1Denom)
p0Vect = log(p0Num/p0Denom)
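With the smoothed, log-transformed vectors, classifying a new comment amounts to summing the log-probabilities of the words that appear, adding log P(ci), and comparing the scores of the two classes. A minimal sketch of such a classification function (classifyNB and its argument names are illustrative):

from numpy import array, log

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """Return 1 if the word vector is more likely insulting, else 0.

    p0Vec, p1Vec -- log P(w|c0) and log P(w|c1) from the modified training step
    pClass1      -- P(c1), the prior probability of the insulting class
    """
    # The element-wise product zeroes out the log-probabilities of absent
    # words; summing the rest and adding log P(ci) gives log(P(w|ci) * P(ci)).
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    return 1 if p1 > p0 else 0

# Example use with the session variables from earlier (hypothetical input):
# thisDoc = array(setOfWords2Vec(myVocabList, ['stupid', 'garbage']))
# classifyNB(thisDoc, p0V, p1V, pAb)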

5/ Prepare data: The bag-of-words document model

So far we have treated the presence or absence of each word as a feature; this is called the set-of-words model. If a word appears more than once in a document, however, that carries information which presence or absence alone cannot express. A representation that records how many times each word appears is known as the bag-of-words model: in a bag of words a word can appear multiple times, while in a set of words each word can appear only once. The function below adapts the naive Bayes code to the bag-of-words model. It is almost identical to setOfWords2Vec(), except that each time it encounters a word it increments the corresponding value in the word vector instead of just setting it to 1.
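A minimal sketch of this bag-of-words converter (bagOfWords2VecMN is a conventional name; only the increment distinguishes it from setOfWords2Vec):

def bagOfWords2VecMN(vocabList, inputSet):
    """Bag-of-words model: count how many times each vocabulary word occurs."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1   # increment instead of setting to 1
    return returnVec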