How to Write a Simple MapReduce Program for Hadoop in Python

Michael G. Noll describes on his blog how to write MapReduce programs for Hadoop in Python, and gogamza in Korea describes on his blog how to write MapReduce programs in C (I modified his original program slightly, because his Map split words on the tab character). I have merged their two articles so that Hadoop users in China can also write MapReduce programs in languages other than Java.

First you need a working Hadoop cluster. There are plenty of guides online; one of them is "Hadoop Study Notes 2: Installation and Deployment" (Hadoop学习笔记二安装部署).

Hadoop Streaming lets us use MapReduce from programming languages other than Java. Streaming uses STDIN (standard input) and STDOUT (standard output) to exchange data with the Map and Reduce programs we write. Any language that can read STDIN and write STDOUT can be used to write a MapReduce program, for example Python with sys.stdin and sys.stdout, or C with stdin and stdout.

We again use Hadoop's WordCount example to show how to write a MapReduce program. In WordCount the task is to compute how often each word occurs in a batch of documents. The Map program receives the documents one line at a time, splits each line on whitespace into an array, then walks the array and writes every word to standard output together with a "1", meaning the word occurred once. The Reduce program then totals the occurrences of each word.

Python Code

Map: mapper.py

    #!/usr/bin/env python
    import sys

    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # split the line into words; split() without arguments
        # already drops any empty strings
        words = line.split()
        for word in words:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial word count is 1
            print('%s\t%s' % (word, 1))

Reduce: reducer.py

    #!/usr/bin/env python
    from operator import itemgetter
    import sys

    # maps words to their counts
    word2count = {}

    # input comes from STDIN
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        try:
            # parse the input we got from mapper.py
            word, count = line.split()
            # convert count (currently a string) to int
            count = int(count)
            word2count[word] = word2count.get(word, 0) + count
        except ValueError:
            # the line was malformed or count was not a number,
            # so silently ignore/discard this line
            pass

    # sort the words lexicographically;
    #
    # this step is NOT required, we just do it so that our
    # final output will look more like the official Hadoop
    # word count examples
    sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

    # write the results to STDOUT (standard output)
    for word, count in sorted_word2count:
        print('%s\t%s' % (word, count))

C Code

Map: Mapper.c

    #include <stdio.h>
    #include <string.h>

    #define BUF_SIZE 2048
    #define DELIM " \t\n"  /* split on spaces, tabs and the trailing newline */

    int main(int argc, char *argv[])
    {
        char buffer[BUF_SIZE];

        while (fgets(buffer, BUF_SIZE, stdin)) {
            /* emit "<word>\t1" for every token on the line */
            char *query = strtok(buffer, DELIM);
            while (query) {
                printf("%s\t1\n", query);
                query = strtok(NULL, DELIM);
            }
        }
        return 0;
    }

Reduce: Reducer.c

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BUFFER_SIZE 1024
    #define DELIM "\t"

    int main(int argc, char *argv[])
    {
        char strLastKey[BUFFER_SIZE];
        char strLine[BUFFER_SIZE];
        int count = 0;

        *strLastKey = '\0';
        *strLine = '\0';

        while (fgets(strLine, BUFFER_SIZE, stdin)) {
            char *strCurrKey = strtok(strLine, DELIM);
            char *strCurrNum = strtok(NULL, DELIM);

            /* skip malformed lines that lack a key or a count */
            if (strCurrKey == NULL || strCurrNum == NULL)
                continue;

            if (strLastKey[0] == '\0')
                strcpy(strLastKey, strCurrKey);

            if (strcmp(strCurrKey, strLastKey)) {
                /* the key changed: flush the previous key's total */
                printf("%s\t%d\n", strLastKey, count);
                count = atoi(strCurrNum);
            } else {
                count += atoi(strCurrNum);
            }
            strcpy(strLastKey, strCurrKey);
        }
        /* flush the count of the last key, if there was any input */
        if (strLastKey[0] != '\0')
            printf("%s\t%d\n", strLastKey, count);
        return 0;
    }

First, let us test the source code locally:

    chmod +x mapper.py
    chmod +x reducer.py
    echo "foo foo quux labs foo bar quux" | ./mapper.py | ./reducer.py
    bar     1
    foo     3
    labs    1
    quux    2

    gcc Mapper.c -o Mapper
    gcc Reducer.c -o Reducer
    chmod +x Mapper
    chmod +x Reducer
    echo "foo foo quux labs foo bar quux" | ./Mapper | ./Reducer
    foo     2
    quux    1
    labs    1
    foo     1
    bar     1
    quux    1

You can see that the C output differs from the Python output. The Python reducer accumulates the counts in a dictionary, whereas the C reducer only merges identical words that arrive on consecutive lines. When the job runs on Hadoop, the Map output is sorted before it reaches the Reduce program, so identical words do arrive consecutively on standard input.

Running the Program on Hadoop

First we need to download our test documents:

    wget …

… a MapReduce program written in PHP, taken from the … page, for the reference of PHP programmers:

Map: mapper.php

    #!/usr/bin/php
    <?php

    $word2count = array();

    // input comes from STDIN (standard input)
    while (($line = fgets(STDIN)) !== false) {
        // remove leading and trailing whitespace and lowercase
        $line = strtolower(trim($line));
        // split the line into words while removing any empty string
        $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
        // increase counters
        foreach ($words as $word) {
            if (!isset($word2count[$word])) {
                $word2count[$word] = 0;
            }
            $word2count[$word] += 1;
        }
    }

    // write the results to STDOUT (standard output)
    // what we output here will be the input for the
    // Reduce step, i.e. the input for reducer.php
    foreach ($word2count as $word => $count) {
        // tab-delimited
        echo $word, chr(9), $count, PHP_EOL;
    }

    ?>

Reduce: reducer.php

    #!/usr/bin/php
    <?php

    $word2count = array();

    // input comes from STDIN
    while (($line = fgets(STDIN)) !== false) {
        // remove leading and trailing whitespace
        $line = trim($line);
        // parse the input we got from mapper.php
        list($word, $count) = explode(chr(9), $line);
        // convert count (currently a string) to int
        $count = intval($count);
        // sum counts
        if ($count > 0) {
            if (!isset($word2count[$word])) {
                $word2count[$word] = 0;
            }
            $word2count[$word] += $count;
        }
    }

    // sort the words lexicographically
    //
    // this step is NOT required, we just do it so that our
    // final output will look more like the official Hadoop
    // word count examples
    ksort($word2count);

    // write the results to STDOUT (standard output)
    foreach ($word2count as $word => $count) {
        echo $word, chr(9), $count, PHP_EOL;
    }

    ?>

Author: 马士华 (Ma Shihua). Posted: 2008-03-05
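The article above notes that Hadoop sorts the Map output, so identical words reach the reducer on adjacent lines. As a minimal sketch of a reducer that relies on this (the function names `parse` and `reduce_sorted` are illustrative and not part of the original programs), the counts can be summed with `itertools.groupby` without holding every word in a dictionary:

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from tab-delimited 'word<TAB>count' lines."""
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        # Skip malformed lines instead of failing the whole task.
        if word and count.isdigit():
            yield word, int(count)


def reduce_sorted(lines):
    """Sum counts per word, assuming equal words arrive on adjacent lines."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Already-sorted mapper output, standing in for what Hadoop would deliver.
sample = ["bar\t1\n", "foo\t1\n", "foo\t1\n", "foo\t1\n", "quux\t1\n", "quux\t1\n"]
for word, total in reduce_sorted(sample):
    print("%s\t%d" % (word, total))
```

In an actual streaming job the `sample` list would be replaced by `sys.stdin`. If the input were not sorted, this sketch would behave like the C reducer and emit the same word more than once.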

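The steps for setting up a Python environment on Hadoop are as follows:

Install Hadoop: install Hadoop on your machine.

Install Python: make sure Python is already installed on your machine.

Configure the Hadoop environment: edit the Hadoop configuration files so that Hadoop can work together with Python.

Install the required modules: install the Python modules your programs need, so that they can be used in the Hadoop environment.

Test the Python installation: run a few test scripts to confirm that Python works properly in the Hadoop environment.

These steps help you set up Python in a Hadoop environment. Note that the exact steps may vary with the Hadoop version and the environment, so check the relevant documentation carefully.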
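The last two steps can be checked with a small script run on each node. The following is only a sketch, assuming Python 3 on the node; the module list is a placeholder to be replaced with whatever your own mapper and reducer import, and `check_modules` is a hypothetical helper name:

```python
import importlib.util
import sys


def check_modules(names):
    """Map each top-level module name to True if it can be imported here."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Placeholder list: replace with the modules your mapper and reducer import.
required = ["sys", "operator", "itertools"]

print("python %d.%d" % sys.version_info[:2])
for name, available in sorted(check_modules(required).items()):
    print("%s\t%s" % (name, "ok" if available else "MISSING"))
```

Any module reported as MISSING would need to be installed on that node before a job that imports it can run there.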

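What kind of files are item.txt and 'user_profile.txt'?

If they are data files, they should be placed on HDFS, or you should implement your own InputFormat to provide access to them. The program itself reads its data from standard input.

If they hold parameters that the program needs at run time, use the -files option so that the Hadoop framework ships the files to the target machines and places them in the same temporary directory as the MapReduce jar; only then can your program find them. The -files option must come first, for example: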

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar -files item.txt -mapper ./python/map.py -reducer ./python/reduce.py -input /home/hadoop/hello -output /home/hadoop/outpath

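If you can guarantee that the file exists at the same path on every host, you can also refer to it by an absolute path.

The command as written also has a problem: it does not specify the input and output directories. With both specified: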
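As a sketch of how a mapper might use a file shipped with -files, the code below opens item.txt by its bare name and keeps only the records whose first field appears in it. The file format (one ID per line), the tab-delimited records, and the names `load_items` and `filter_records` are assumptions made for illustration; the question does not say what item.txt contains.

```python
import os
import tempfile


def load_items(path="item.txt"):
    """Read one item ID per line from the side file; blank lines are skipped."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_records(lines, items):
    """Keep tab-delimited records whose first field is a known item ID."""
    for line in lines:
        record = line.rstrip("\n")
        if record.split("\t", 1)[0] in items:
            yield record


# Local demonstration with a temporary stand-in for the shipped item.txt.
with tempfile.TemporaryDirectory() as workdir:
    side_file = os.path.join(workdir, "item.txt")
    with open(side_file, "w") as handle:
        handle.write("a1\nb2\n")
    items = load_items(side_file)
    kept = list(filter_records(["a1\tx\n", "c3\ty\n", "b2\tz\n"], items))
    print(kept)
```

Inside a streaming task the mapper would call `load_items()` with the default relative path and pass `sys.stdin` to `filter_records`.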

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar -mapper ./python/map.py -reducer ./python/reduce.py -input /home/hadoop/hello -output /home/hadoop/outpath

The output path /home/hadoop/outpath must be a path that does not exist yet; when the MapReduce job runs, Hadoop checks this and creates the directory.
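The ordering rule and the required -input and -output arguments can be captured in a small helper that assembles the command line. This is only a sketch: `build_streaming_command` is a hypothetical name, the paths are the ones used in the commands above, and the function only builds the argument list without running Hadoop.

```python
def build_streaming_command(jar, mapper, reducer, input_path, output_path, files=()):
    """Return the argument list for a Hadoop Streaming job.

    The -files option is placed right after the jar, before the
    streaming options, as the answer above requires.
    """
    command = ["hadoop", "jar", jar]
    if files:
        # Several files are passed as one comma-separated value.
        command += ["-files", ",".join(files)]
    command += [
        "-mapper", mapper,
        "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ]
    return command


argv = build_streaming_command(
    jar="$HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar",
    mapper="./python/map.py",
    reducer="./python/reduce.py",
    input_path="/home/hadoop/hello",
    output_path="/home/hadoop/outpath",
    files=["item.txt"],
)
print(" ".join(argv))
```

The printed line is meant to be pasted into a shell, which expands $HADOOP_HOME. Making the input and output paths required parameters means a command without them cannot be built by accident.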

