Hadoop 倒排索引

　　倒排索引是文檔檢索系統中最常用的數據結構，被廣泛地應用于全文搜索引擎。它主要是用來存儲某個單詞（或詞組）在一個文檔或一組文檔中存儲位置的映射，即提供了一種根據內容來查找文檔的方式。由于不是根據文檔來確定文檔所包含的內容，而是進行相反的操作，因而稱為倒排索引（Inverted Index）。

一、實例描述

　　倒排索引簡單地就是，根據單詞，返回它在哪個文件中出現過，而且頻率是多少的結果。這就像百度里的搜索，你輸入一個關鍵字，那么百度引擎就迅速的在它的服務器里找到有該關鍵字的文件，并根據頻率和其他的一些策略（如頁面點擊投票率）等來給你返回結果。這個過程中，倒排索引就起到很關鍵的作用。

　　樣例輸入：

　　樣例輸出：

二、設計思路

　　倒排索引涉及幾個過程：Map過程，Combine過程，Reduce過程。

　　Map過程：?

　　當你把需要處理的文檔上傳到hdfs時，首先默認的TextInputFormat類對輸入的文件進行處理，得到文件中每一行的偏移量和這一行內容的鍵值對<偏移量，內容>做為map的輸入。在改寫map函數的時候，我們就需要考慮，怎么設計key和value的值來適合MapReduce框架，從而得到正確的結果。由于我們要得到單詞,所屬的文檔URL,詞頻，而<key,value>只有兩個值，那么就必須得合并其中得兩個信息了。這里我們設計key=單詞＋URL，value=詞頻。即map得輸出為<單詞＋URL，詞頻>，之所以將單詞＋URL做為key，時利用MapReduce框架自帶得Map端進行排序。

　　Combine過程：

　　Combine過程將key值相同得value值累加，得到一個單詞在文檔上得詞頻。但是為了把相同得key交給同一個reduce處理，我們需要設計為key=單詞，value＝URL+詞頻。

　　Reduce過程：

　　Reduce過程其實就是一個合并的過程了，只需將相同的key值的value值合并成倒排索引需要的格式即可。

三、程序代碼

　　程序代碼如下：

 1 import java.io.IOException;
 2 import java.util.StringTokenizer;
 3 
 4 import org.apache.hadoop.conf.Configuration;
 5 import org.apache.hadoop.fs.Path;
 6 import org.apache.hadoop.io.LongWritable;
 7 import org.apache.hadoop.io.Text;
 8 import org.apache.hadoop.mapreduce.Job;
 9 import org.apache.hadoop.mapreduce.Mapper;
10 import org.apache.hadoop.mapreduce.Reducer;
11 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
12 import org.apache.hadoop.mapreduce.lib.input.FileSplit;
13 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
14 import org.apache.hadoop.util.GenericOptionsParser;
15 
16 
17 public class InvertedIndex {
18 
19     public static class Map extends Mapper<LongWritable, Text, Text, Text>{
20         private static Text word = new Text();
21         private static Text one = new Text();
22         
23         @Override
24         protected void map(LongWritable key, Text value,Mapper<LongWritable, Text, Text, Text>.Context context)
25                 throws IOException, InterruptedException {
26             //  super.map(key, value, context);
27             String fileName = ((FileSplit)context.getInputSplit()).getPath().getName();
28             StringTokenizer st = new StringTokenizer(value.toString());
29             while (st.hasMoreTokens()) {
30                 word.set(st.nextToken()+"\t"+fileName);
31                 context.write(word, one);
32             }
33         }
34     }
35     
36     public static class Combine extends Reducer<Text, Text, Text, Text>{
37         private static Text word = new Text();
38         private static Text index = new Text();
39         
40         @Override
41         protected void reduce(Text key, Iterable<Text> values,Reducer<Text, Text, Text, Text>.Context context)
42                 throws IOException, InterruptedException {
43             //  super.reduce(arg0, arg1, arg2);
44             String[] splits = key.toString().split("\t");
45             if (splits.length != 2) {
46                 return ;
47             }
48             long count = 0;
49             for(Text v:values){
50                 count++;
51             }
52             word.set(splits[0]);
53             index.set(splits[1]+":"+count);
54             context.write(word, index);
55         }
56     }
57     
58     public static class Reduce extends Reducer<Text, Text, Text, Text>{
59         private static StringBuilder sub = new StringBuilder(256);
60         private static Text index = new Text();
61         
62         @Override
63         protected void reduce(Text word, Iterable<Text> values,Reducer<Text, Text, Text, Text>.Context context)
64                 throws IOException, InterruptedException {
65             // super.reduce(arg0, arg1, arg2);
66             for(Text v:values){
67                 sub.append(v.toString()).append(";");
68             }
69             index.set(sub.toString());
70             context.write(word, index);
71             sub.delete(0, sub.length());
72         }
73     }
74     
75     public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
76         Configuration conf = new Configuration();
77         String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
78         if(otherArgs.length!=2){
79             System.out.println("Usage:wordcount <in> <out>");
80             System.exit(2);
81         }
82         Job job = new Job(conf,"Invert Index ");
83         job.setJarByClass(InvertedIndex.class);
84         
85         job.setMapperClass(Map.class);
86         job.setCombinerClass(Combine.class);
87         job.setReducerClass(Reduce.class);
88         
89         job.setMapOutputKeyClass(Text.class);
90         job.setMapOutputValueClass(Text.class);
91         job.setOutputKeyClass(Text.class);
92         job.setOutputValueClass(Text.class);
93         
94         FileInputFormat.addInputPath(job,new Path(args[0]));
95         FileOutputFormat.setOutputPath(job, new Path(args[1]));
96         System.exit(job.waitForCompletion(true)?0:1);
97     }
98 
99 }

轉載于:https://www.cnblogs.com/xiaoyh/p/9361356.html

本文來自互聯網用戶投稿，該文觀點僅代表作者本人，不代表本站立場。本站僅提供信息存儲空間服務，不擁有所有權，不承擔相關法律責任。
如若轉載，請注明出處：http://www.pswp.cn/news/389020.shtml
繁體地址，請注明出處：http://hk.pswp.cn/news/389020.shtml
英文地址，請注明出處：http://en.pswp.cn/news/389020.shtml

如若內容造成侵權/違法違規/事實不符，請聯系多彩編程網進行投訴反饋email:809451989@qq.com，一經查實，立即刪除！