
I. A brief introduction to the TF-IDF algorithm

Term Frequency-Inverse Document Frequency (TF-IDF) is a weighting technique commonly used in information retrieval and data mining. It is a statistical measure of how important a term is to a document within a collection or corpus: a term's importance increases with the number of times it appears in the document, but decreases with how frequently it appears across the corpus.

II. The principle of TF-IDF

If a term appears frequently in one document (high TF) but rarely in other documents (high IDF), it is considered to have good discriminating power and is well suited for classification. The quantities involved are: TF, the number of times a term appears in a document; DF, the number of documents in which the term appears (each document counted once); and N, the total number of documents.

The formula for weighting a term w in a document d is then:

weight(w, d) = TF(w, d) × log( N / DF(w) )
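For example (illustrative numbers, not taken from the sample data below): a term that appears 2 times in a tweet, in a corpus of N = 1000 tweets where the term occurs in DF = 10 of them, gets weight 2 × log(1000 / 10) = 2 × 2 = 4 with a base-10 logarithm.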

III. MR implementation of the algorithm

3.1. Original sample data

Weibo ID            Weibo content

3823890453754000    Wow, support!
3823890482816512    Happy weekend, hot pot is a must
3823890499663321    Have you made a date today? Thousands of parents meet in front of the kindergarten... today is kindergarten registration day! Pity the parents of the world! We meet here today
3823890520592496    Getting together with friends, how could good food be missing? Let go and eat your fill, then lose the weight together afterwards! Today I made a date with a good friend for a big dinner. Today I have a date with soy milk and fried dough sticks. I set the rice cooker to cook automatically a few hours later, and I'd like a soy milk maker on a timer too, so I can sleep an hour longer in the morning and still wake up to fresh soy milk.
3823890591683238    Today the kid's godmother is coming over to my house to make cookies
3823890621005004    Spring came early this year; in such good weather, of course I arranged to enjoy the flowers and watch the fish with my husband and daughter. How pleasant!
3823890629460577    Passing by, going for a run, come on
3823890629779115    We made a date to visit our primary school again tomorrow and reminisce about campus life
3823890642051141    A date to eat, drink and play
3823890671749747    Today I took my little girl to the bookstore to buy stationery ~
3823890675629620    A little spring at last! Outing date!

Second, algorithm implementation

First, according to the formula above, there are three values to calculate (a sketch after the list shows how they combine):

  • TF (term frequency): the number of times each term appears in a single document. Note that this is the term's frequency within each tweet, not across all of the data.
  • N: the total number of tweets.
  • DF (document frequency): the number of tweets in which a term appears.
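These three statistics are produced by the MapReduce jobs below. A minimal sketch (not part of the original code) of how they combine into the final weight once all jobs have run; the base-10 logarithm is an assumption, since any base only rescales the weights:

// Combines a per-tweet term frequency, the corpus size, and the
// document frequency into the TF-IDF weight from the formula above.
public class TfIdf {
    public static double weight(int tf, long n, int df) {
        return tf * Math.log10((double) n / df);
    }
}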

Next, the algorithm is implemented

2.1. The first MapReduce

First of all, TF and N are both computed in the first MapReduce job: the total number of tweets can be obtained in the same pass over the raw data, since it does not depend on any relationship between tweets. For each tweet, the content is extracted and tokenized with the word segmenter, and TF is counted.

2.1.1 Mapper

Each input record (one tweet) is handled by one call to the mapper, which counts both the tweet itself and the TF of each of its terms; the occurrences of each term are obtained through the word segmenter. Why the output key is term + tweet ID: TF is the number of occurrences of a term within each tweet, so occurrences in different tweets must be kept apart. If the output key were w instead of w_id, all identical w would be grouped together during the shuffle (the default grouping is by key equality), and the resulting count would be the term's total across all tweets, which is not what we want.

StringReader sr = new StringReader(content);
IKSegmenter ikSegmenter = new IKSegmenter(sr, true);
Lexeme word = null;
while ((word = ikSegmenter.next()) != null) {
    String w = word.getLexemeText();
    context.write(new Text(w + "_" + id), new IntWritable(1));
}

At the same time, the number of tweets must be counted: for every tweet the mapper also emits the key "count" with value 1, and the summing is done in reduce.

context.write(new Text("count"), new IntWritable(1));
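Only these two write calls are shown above. A full FirstMapper, reconstructed as a sketch under the assumption that the job reads its input with KeyValueTextInputFormat (as configured in the runner in 2.1.4, so the tweet ID arrives as the key and the content as the value), might look like this:

package com.chb.weibo1;

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class FirstMapper extends Mapper<Text, Text, Text, IntWritable> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        String id = key.toString();          // tweet ID (the part before the tab)
        String content = value.toString();   // tweet content

        // Tokenize the content and emit (term_tweetId, 1) for every token.
        StringReader sr = new StringReader(content);
        IKSegmenter ikSegmenter = new IKSegmenter(sr, true);
        Lexeme word = null;
        while ((word = ikSegmenter.next()) != null) {
            String w = word.getLexemeText();
            context.write(new Text(w + "_" + id), new IntWritable(1));
        }

        // Emit one "count" record per tweet so a reducer can sum up N.
        context.write(new Text("count"), new IntWritable(1));
    }
}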

2.1.2 Partitioner

The record holding the total number of tweets is a single piece of data. If we used the default partitioning (the key's hashcode modulo the number of reduce tasks), that record would land in the same files as the TF output, which would complicate the subsequent computation. So a custom partitioner is used to send the total-count statistic to a dedicated reducer, and the final total ends up alone in its own output file. All that is needed is to route every record whose key is "count" to the last reducer:

        if (key.toString().equals("count")) {
            return numReduceTasks - 1;
        }

The TF statistics are sent to the other reduce tasks. This job is configured with four reduce tasks; the first three hold the TF values. Note that super.getPartition is called with numReduceTasks - 1, so ordinary keys are hashed over only the first three reducers.

package com.chb.weibo1;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

/**
 * Custom partitioner: routes the "count" record to the last reducer,
 * and hashes every other key over the remaining reducers.
 * @author 12285
 */
public class FirstPartitioner extends HashPartitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (key.toString().equals("count")) {
            return numReduceTasks - 1;
        } else {
            return super.getPartition(key, value, numReduceTasks - 1);
        }
    }
}

2.1.3 Reducer

The reduce side is relatively simple: TF and the total tweet count are handled in different reducers, and both only need to accumulate their values.

package com.chb.weibo1;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FirstReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sums the 1s emitted by the mapper: the TF for each w_id key,
    // and the total number of tweets for the "count" key.
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable iw : values) {
            sum += iw.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

2.1.4 Running the first MapReduce

package com.chb.weibo1;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FirstRunJob {
    public static void main(String[] args) throws Exception {
        System.setProperty("HADOOP_USER_NAME", "chb");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Job job = Job.getInstance(conf);
        job.setJobName("First Job");
        job.setJar("C:\\Users\\12285\\Desktop\\weibo.jar");
        job.setJarByClass(FirstRunJob.class);

        job.setMapperClass(FirstMapper.class);
        job.setReducerClass(FirstReducer.class);
        job.setPartitionerClass(FirstPartitioner.class);
        job.setNumReduceTasks(4);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Input lines are "id<TAB>content", so the tweet ID becomes the map key.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        Path in = new Path("/user/chb/input/weibo/");
        FileInputFormat.addInputPath(job, in);

        Path out = new Path("/user/chb/output/outweibo1");
        if (fs.exists(out)) {
            fs.delete(out, true);
        }
        FileOutputFormat.setOutputPath(job, out);

        boolean f = job.waitForCompletion(true);
        if (f) {
            System.out.println("Job executed successfully!");
        }
    }
}

Results

A total of four reduce tasks were configured, so the job produces four output files.



Each TF record is w_id followed by the term frequency, separated by a tab:

0.03_3824246315213843	2
0.25_3824246315213843	2
0.33catty_3824020946550570	2
0.83catty_3823927795601059	2
0.88catty_3823953501983192	2
002244_3824015577543437	2
002244_3824025388441161	2
002245_3824015577543437	2
002245_3824025388441161	2
002250_3824015577543437	2
002250_3824025388441161	2
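The later jobs will need to read these records back. A minimal parsing sketch (a hypothetical helper, not from the original code; it splits on the last underscore in case a term itself contains one):

// Parse one TF record of the form "term_tweetId<TAB>frequency".
String line = "002244_3824015577543437\t2";   // sample record from above
String[] kv = line.split("\t");
int idx = kv[0].lastIndexOf('_');
String term = kv[0].substring(0, idx);        // "002244"
String tweetId = kv[0].substring(idx + 1);    // "3824015577543437"
int tf = Integer.parseInt(kv[1].trim());      // 2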

The file from the last reducer holds the total number of tweets and contains only a single record (key "count", value N).