How to Write a Simple MapReduce Program for Hadoop in Python

Author and submitter: 寿仁 (if you have any objection, please write to the e-mail address at the bottom of the page)


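In this tutorial I will show how to write a simple MapReduce program for Hadoop in Python.
Although the Hadoop framework itself is written in Java, programs for Hadoop do not have to be: they can also be written in other languages such as C++ or Python. The example program on the official Hadoop website is written in Jython and packaged into a jar file, which is clearly inconvenient (have a look at the example in /src/examples/python/WordCount.py and you will see what I mean). It does not have to be done that way: plain Python can be hooked up to Hadoop directly.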

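What do we want to do?

We will write a simple MapReduce program in plain CPython, rather than in Jython packaged into a jar file.
Our example mimics WordCount, implemented in Python: it reads text files and counts how often each word occurs. The output is also plain text, one line per word, containing the word and its count separated by a tab.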

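Prerequisites

Before writing this program you need a running Hadoop cluster, otherwise you will get stuck later on. If you have not set one up yet, the following short tutorials explain how to do so on Ubuntu Linux (they also apply to other Linux distributions and Unix systems):

How to set up a single-node Hadoop cluster on Ubuntu Linux using the Hadoop Distributed File System (HDFS)

How to set up a multi-node Hadoop cluster on Ubuntu Linux using the Hadoop Distributed File System (HDFS)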

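Python MapReduce code

The trick behind writing MapReduce code in Python is that we use HadoopStreaming to pass data between the Map and Reduce steps via STDIN (standard input) and STDOUT (standard output). We simply read input from Python's sys.stdin and write output to sys.stdout; HadoopStreaming takes care of everything else. Really, it does!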
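To make the line format concrete, here is a minimal sketch of how a tab-delimited line breaks down into a key and a value. By default Hadoop Streaming treats everything up to the first tab as the key and the rest of the line as the value. The helper name split_key_value and the sample lines are my own illustration; they are not part of Hadoop.

```python
def split_key_value(line, separator="\t"):
    """Split one Streaming-style text line into (key, value).

    Everything before the first separator is the key, the rest is the value.
    A line without a separator is treated as a key with an empty value.
    """
    key, _, value = line.rstrip("\n").partition(separator)
    return key, value


# Demonstration on in-memory sample lines (not real Hadoop output).
for sample in ["foo\t1\n", "quux\t2\textra\n", "no-tab-here\n"]:
    print(split_key_value(sample))
```

Only the first tab matters: in the second sample the value keeps its embedded tab.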

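Map: mapper.py

Save the following code as /home/hadoop/mapper.py. It reads data from STDIN, splits each line into words, and writes one line per word to STDOUT that maps the word to its (intermediate) count:
Note: make sure the script is executable (chmod +x /home/hadoop/mapper.py).

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

This script does not compute the total number of occurrences of a word. Instead it emits "<word> 1" immediately, even if the word occurs several times in the input, and leaves the summing to the subsequent Reduce step (or program). You are of course free to change the coding style to suit your own habits.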
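If you would like to unit-test the map logic without piping anything through a shell, one option is to put it into a function. The sketch below is a variation of my own, not the script from this tutorial: map_words and format_pairs are hypothetical names, and the sample input is an in-memory stand-in for sys.stdin.

```python
def map_words(lines):
    """Yield a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def format_pairs(pairs, separator="\t"):
    """Turn (key, value) pairs into the tab-delimited lines a mapper prints."""
    return ["%s%s%s" % (key, separator, value) for key, value in pairs]


# In-memory stand-in for sys.stdin; a real mapper would pass sys.stdin instead.
sample_input = ["foo foo quux labs foo bar quux\n"]
for output_line in format_pairs(map_words(sample_input)):
    print(output_line)
```

Because map_words is a generator, it never holds more than one input line in memory, which matters once the input is a whole ebook rather than one sentence.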

Reduce: reducer.py

Save the following code as /home/hadoop/reducer.py. It reads the results of mapper.py from STDIN, sums the occurrences of each word, and writes the result to STDOUT.
Again, make sure the script is executable: chmod +x /home/hadoop/reducer.py

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    try:
        # parse the input we got from mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # the line had no tab or count was not a number,
        # so silently ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))

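The reducer above keeps every distinct word in a dictionary. Since Hadoop sorts the map output by key before it reaches the reducer, an alternative is to sum runs of adjacent equal keys and hold only one word at a time. The following is a sketch of that idea under the assumption that the input is already sorted; parse_pairs and reduce_sorted are hypothetical names of my own, and the sample lines stand in for sys.stdin.

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines, separator="\t"):
    """Yield (word, count) from tab-delimited lines, skipping malformed ones."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition(separator)
        if not sep:
            continue  # no separator: not a mapper output line
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not a number


def reduce_sorted(lines):
    """Sum counts per word, assuming equal words arrive on adjacent lines."""
    for word, group in groupby(parse_pairs(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# In-memory stand-in for sorted mapper output; a real reducer would read sys.stdin.
sorted_lines = ["bar\t1\n", "foo\t1\n", "foo\t1\n", "foo\t1\n",
                "labs\t1\n", "quux\t1\n", "quux\t1\n"]
for word, total in reduce_sorted(sorted_lines):
    print("%s\t%d" % (word, total))
```

The trade-off is that this version gives wrong totals on unsorted input, so a local test has to include the sort step between the mapper and the reducer.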
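Test your code (cat data | map | sort | reduce)

I recommend testing mapper.py and reducer.py by hand before using them in a MapReduce job; otherwise the job may run and leave you without any results.
Here are some suggestions for testing the Map and Reduce functionality: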

# very basic test
hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1

hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py | sort | /home/hadoop/reducer.py
bar 1
foo 3
labs 1
quux 2

# using one of the ebooks as example input
# (see below on where to get the ebooks)
hadoop@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt | /home/hadoop/mapper.py
The 1
Project 1
Gutenberg 1
EBook 1
of 1
[...]
(you get the idea)
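To double-check a pipeline result, you can compare it with an independent count. The sketch below does this with collections.Counter for the basic test sentence; the reducer output lines are typed in by hand here, and expected_counts and parse_reducer_output are hypothetical helper names of my own.

```python
from collections import Counter


def expected_counts(text):
    """Count whitespace-separated words directly, without map or reduce."""
    return Counter(text.split())


def parse_reducer_output(lines):
    """Read tab-delimited 'word<TAB>count' lines into a dict."""
    counts = {}
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        counts[word] = int(count)
    return counts


text = "foo foo quux labs foo bar quux"
# Output of the mapper | sort | reducer pipeline for the sentence above.
reducer_output = ["bar\t1\n", "foo\t3\n", "labs\t1\n", "quux\t2\n"]

assert parse_reducer_output(reducer_output) == expected_counts(text)
print("reducer output matches an independent count")
```

For larger inputs the same comparison works if you redirect the pipeline output into a file and read the lines from there.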

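Running the Python code on Hadoop

For this example we will use three ebooks:

The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson

The Notebooks of Leonardo Da Vinci

Ulysses by James Joyce
Download them, store the uncompressed files in us-ascii encoding, and keep them in a temporary directory such as /tmp/gutenberg.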

hadoop@ubuntu:~$ ls -l /tmp/gutenberg/
total 3592
-rw-r--r-- 1 hadoop hadoop 674425 2007-01-22 12:56 20417-8.txt
-rw-r--r-- 1 hadoop hadoop 1423808 2006-08-03 16:36 7ldvc10.txt
-rw-r--r-- 1 hadoop hadoop 1561677 2004-11-26 09:48 ulyss12.txt
hadoop@ubuntu:~$

Copy local data to HDFS

Before we run the MapReduce job, we need to copy the local files into HDFS:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 1 items
/user/hadoop/gutenberg
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg
Found 3 items
/user/hadoop/gutenberg/20417-8.txt 674425
/user/hadoop/gutenberg/7ldvc10.txt 1423808
/user/hadoop/gutenberg/ulyss12.txt 1561677

Run the MapReduce job

Now that everything is ready, we can run the Python MapReduce job on the Hadoop cluster. As mentioned above, we use HadoopStreaming to pass data between the Map and Reduce steps via STDIN and STDOUT.

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
-mapper /home/hadoop/mapper.py -reducer /home/hadoop/reducer.py -input gutenberg/*
-output gutenberg-output
If you want to change some Hadoop settings for the run, such as increasing the number of Reduce tasks, you can use the "-jobconf" option:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
-jobconf mapred.reduce.tasks=16 -mapper ...
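With more than one Reduce task, Hadoop has to decide which task receives which word, and it sends every occurrence of a key to the same task so that the counts for a word are summed in one place. The sketch below only illustrates that idea: zlib.crc32 is a stand-in I chose for the demonstration, not the hash function Hadoop actually uses, and pick_reducer is a hypothetical name.

```python
import zlib


def pick_reducer(key, num_reduce_tasks):
    """Map a key to a reduce task index in [0, num_reduce_tasks).

    Illustration only: crc32 stands in for Hadoop's own key hashing.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


# The same word always lands on the same (illustrative) reduce task.
for word in ["foo", "bar", "quux", "labs", "foo"]:
    print("%s -> reduce task %d" % (word, pick_reducer(word, 16)))
```

Each reduce task writes its own output file, so a job with several Reduce tasks leaves several part files in the output directory instead of one.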

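An important note: Hadoop does not honor mapred.map.tasks beyond treating it as a hint.
This job reads all the files in the HDFS directory gutenberg, processes them, and stores the results in a single result file in the HDFS directory gutenberg-output.
The output of an earlier run looks like this: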

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
-mapper /home/hadoop/mapper.py -reducer /home/hadoop/reducer.py -input gutenberg/*
-output gutenberg-output

additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/usr/local/hadoop-datastore/hadoop-hadoop/hadoop-unjar54543/]
[] /tmp/streamjob54544.jar tmpDir=null
[...] INFO mapred.FileInputFormat: Total input paths to process : 7
[...] INFO streaming.StreamJob: getLocalDirs(): [/usr/local/hadoop-datastore/hadoop-hadoop/mapred/local]
[...] INFO streaming.StreamJob: Running job: job_200803031615_0021
[...]
[...] INFO streaming.StreamJob: map 0% reduce 0%
[...] INFO streaming.StreamJob: map 43% reduce 0%
[...] INFO streaming.StreamJob: map 86% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 33%
[...] INFO streaming.StreamJob: map 100% reduce 70%
[...] INFO streaming.StreamJob: map 100% reduce 77%
[...] INFO streaming.StreamJob: map 100% reduce 100%
[...] INFO streaming.StreamJob: Job complete: job_200803031615_0021

[...] INFO streaming.StreamJob: Output: gutenberg-output
hadoop@ubuntu:/usr/local/hadoop$

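As you can see in the output above, Hadoop reports the progress of the job. It also provides a basic web interface for statistics and information.
While the Hadoop cluster is running, you can open http://localhost:50030/ in a browser.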

Check that the result has been written to the HDFS directory gutenberg-output:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg-output
Found 1 items
/user/hadoop/gutenberg-output/part-00000 903193 2007-09-21 13:00
hadoop@ubuntu:/usr/local/hadoop$

You can then inspect the contents of the file with the dfs -cat command:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat gutenberg-output/part-00000
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 2
"A_ 1
"Absoluti 1
[...]
hadoop@ubuntu:/usr/local/hadoop$

Note that the quote signs (") in the output above were not inserted by Hadoop; they come from the input text.
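The quotes survive because line.split() only splits on whitespace, so punctuation stays attached to the words. If you would rather count bare words, one option (not part of the scripts in this tutorial) is to normalize the tokens in the mapper. The following is a minimal sketch; the tokenize helper and its regular expression are assumptions of mine, and you may want different rules for hyphens or apostrophes.

```python
import re

# Letters, digits and apostrophes count as word characters in this sketch.
WORD_PATTERN = re.compile(r"[A-Za-z0-9']+")


def tokenize(line):
    """Return lowercase words with surrounding punctuation stripped."""
    return WORD_PATTERN.findall(line.lower())


sample = '"A horse," he said.'
print(sample.split())    # whitespace split keeps the punctuation
print(tokenize(sample))  # normalized tokens
```

With this tokenizer "A and A would be counted under the same key, which changes the totals compared with the output shown above.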

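Reposted for reference only; copyright belongs to the original author. Have a nice day, and please accept this answer if it helped.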


Michael G. Noll describes on his blog how to write MapReduce programs for Hadoop in Python, and gogamza from Korea describes on his blog how to write them in C (I modified his original program slightly, because his Map split words on tabs). I have combined their two articles so that Hadoop users in China can also write MapReduce programs in other languages.

First you need a configured Hadoop cluster. There are plenty of guides for this online; here is one link (Hadoop学习笔记二安装部署). HadoopStreaming lets us use MapReduce from programming languages other than Java: Streaming exchanges data with our Map and Reduce programs through STDIN (standard input) and STDOUT (standard output). Any language that can read STDIN and write STDOUT can be used to write a MapReduce program, for example Python with sys.stdin and sys.stdout, or C with stdin and stdout.

We again use Hadoop's WordCount example to show how to write a MapReduce program. In WordCount the task is to compute how often each word occurs in a collection of documents. The Map program receives the documents line by line, splits each line on whitespace into an array, then iterates over the array and writes "<word> 1" to standard output for every element, meaning that the word occurred once. The Reduce program then totals the occurrences of each word.

Python Code

Map: mapper.py

#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words while removing any empty strings
    words = filter(lambda word: word, line.split())
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

Reduce: reducer.py

#!/usr/bin/env python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the input we got from mapper.py
        word, count = line.split()
        # convert count (currently a string) to int
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # malformed line or count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))

C Code

Map: Mapper.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE 2048
#define DELIM " "

int main(int argc, char *argv[])
{
    char buffer[BUF_SIZE];

    while (fgets(buffer, BUF_SIZE - 1, stdin)) {
        int len = strlen(buffer);
        char *query = NULL;

        /* strip the trailing newline */
        if (buffer[len - 1] == '\n')
            buffer[len - 1] = 0;

        /* emit "<word>\t1" for every space-separated word */
        query = strtok(buffer, DELIM);
        while (query) {
            printf("%s\t1\n", query);
            query = strtok(NULL, DELIM);
        }
    }
    return 0;
}

Reduce: Reducer.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFFER_SIZE 1024
#define DELIM "\t"

int main(int argc, char *argv[])
{
    char strLastKey[BUFFER_SIZE];
    char strLine[BUFFER_SIZE];
    int count = 0;

    *strLastKey = '\0';
    *strLine = '\0';

    while (fgets(strLine, BUFFER_SIZE - 1, stdin)) {
        char *strCurrKey = NULL;
        char *strCurrNum = NULL;

        strCurrKey = strtok(strLine, DELIM);
        strCurrNum = strtok(NULL, DELIM);

        /* skip lines that do not contain "<key>\t<count>" */
        if (strCurrKey == NULL || strCurrNum == NULL)
            continue;

        if (strLastKey[0] == '\0') {
            strcpy(strLastKey, strCurrKey);
        }

        if (strcmp(strCurrKey, strLastKey)) {
            /* the key changed: print the finished key, start a new count */
            printf("%s\t%d\n", strLastKey, count);
            count = atoi(strCurrNum);
        } else {
            count += atoi(strCurrNum);
        }
        strcpy(strLastKey, strCurrKey);
    }
    printf("%s\t%d\n", strLastKey, count); /* flush the count */
    return 0;
}

First, let us test the source code locally:

chmod +x mapper.py
chmod +x reducer.py
echo "foo foo quux labs foo bar quux" | ./mapper.py | ./reducer.py
bar 1
foo 3
labs 1
quux 2
g++ Mapper.c -o Mapper
g++ Reducer.c -o Reducer
chmod +x Mapper
chmod +x Reducer
echo "foo foo quux labs foo bar quux" | ./Mapper | ./Reducer
foo 2
quux 1
labs 1
foo 1
bar 1
quux 1

As you can see, the C output differs from the Python output, because the Python reducer collects the words in a dictionary while the C reducer only sums adjacent equal keys. On Hadoop the map output is sorted first, so identical words arrive consecutively and the C reducer prints one total per word.

Running the programs on Hadoop

First we need to download our test documents: wget [...]

[...] the following MapReduce program written in PHP, taken from the [...] page, is included for the reference of PHP programmers:

Map: mapper.php

#!/usr/bin/php
<?php
$word2count = array();

// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace and lowercase
    $line = strtolower(trim($line));
    // split the line into words while removing any empty string
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increase counters
    foreach ($words as $word) {
        $word2count[$word] = isset($word2count[$word]) ? $word2count[$word] + 1 : 1;
    }
}

// write the results to STDOUT (standard output)
// what we output here will be the input for the
// Reduce step, i.e. the input for reducer.php
foreach ($word2count as $word => $count) {
    // tab-delimited
    echo $word, chr(9), $count, PHP_EOL;
}
?>

Reduce: reducer.php

#!/usr/bin/php
<?php
$word2count = array();

// input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace
    $line = trim($line);
    // skip lines without a tab-separated count
    if (strpos($line, chr(9)) === false) continue;
    // parse the input we got from mapper.php
    list($word, $count) = explode(chr(9), $line);
    // convert count (currently a string) to int
    $count = intval($count);
    // sum counts
    if ($count > 0) {
        $word2count[$word] = isset($word2count[$word]) ? $word2count[$word] + $count : $count;
    }
}

// sort the words lexicographically
//
// this step is NOT required, we just do it so that our
// final output will look more like the official Hadoop
// word count examples
ksort($word2count);

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo $word, chr(9), $count, PHP_EOL;
}
?>

Author: 马士华, posted 2008-03-05


