Tuesday, May 8, 2012

Hadoop Learning Note


This installation and configuration were done under the Windows XP OS.
Prepare three software packages:
1、cygwin (http://cygwin.com/setup.exe)
2、hadoop (http://mirror.bjtu.edu.cn/apache/hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz)
3、jdk (version 6 or above)

Install Cygwin under the D:\ directory.
Extract hadoop under D:\cygwin.
Install the JDK under C:\.
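
For reference, extracting the hadoop tarball can be done from the Cygwin shell roughly like this (the download location D:\ is just an assumption, adjust the paths to where you saved the file):

$ cd /cygdrive/d
$ tar -xzf hadoop-0.20.2.tar.gz -C /cygdrive/d/cygwin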

Then do the configuration: put the commands below in .bashrc.
export JAVA_HOME=/cygdrive/c/Java/jdk1.7.0_03
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar
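
To check that the Java settings took effect, reopen the Cygwin shell (or source .bashrc) and verify:

$ source ~/.bashrc
$ echo $JAVA_HOME
$ java -version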

Additionally, under hadoop/conf we need to modify conf/hadoop-env.sh to configure JAVA_HOME:
export JAVA_HOME=/cygdrive/c/Java/jdk1.7.0_03
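
One way to make this change (a sketch; you can also just edit the file in a text editor) is to append the line from the Hadoop root directory:

$ echo 'export JAVA_HOME=/cygdrive/c/Java/jdk1.7.0_03' >> conf/hadoop-env.sh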

The configuration is now done. Running bin/hadoop with no arguments prints the usage:
 
$ bin/hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME <src>* <dest> create a hadoop archive
  daemonlog            get/set the log level for each daemon
or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
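
As a quick sanity check that the setup works, the version command listed above should print the Hadoop release information:

$ bin/hadoop version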


Next we can run the wordcount example program:
1、create an input folder (the program will automatically create the output folder)
2、put some test files into the input folder (a shell sketch of steps 1 and 2 follows the list)
3、$ bin/hadoop  jar hadoop-0.20.2-examples.jar wordcount input output
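
A minimal sketch of steps 1 and 2 in the Cygwin shell, assuming Hadoop was extracted to /cygdrive/d/cygwin/hadoop-0.20.2 (article.txt is just a placeholder for whatever text file you want to count):

$ cd /cygdrive/d/cygwin/hadoop-0.20.2
$ mkdir input
$ cp /cygdrive/c/temp/article.txt input/

The first run of step 3 produced the following output: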
12/03/05 04:05:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/03/05 04:05:43 INFO input.FileInputFormat: Total input paths to process : 1
12/03/05 04:05:44 INFO mapred.JobClient: Running job: job_local_0001
12/03/05 04:05:44 INFO input.FileInputFormat: Total input paths to process : 1
12/03/05 04:05:44 INFO mapred.MapTask: io.sort.mb = 100
12/03/05 04:05:44 INFO mapred.MapTask: data buffer = 79691776/99614720
12/03/05 04:05:44 INFO mapred.MapTask: record buffer = 262144/327680
12/03/05 04:05:44 INFO mapred.MapTask: Starting flush of map output
12/03/05 04:05:44 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: Expecting a line not the end of stream
        at org.apache.hadoop.fs.DF.parseExecResult(DF.java:109)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:179)
        at org.apache.hadoop.util.Shell.run(Shell.java:134)
        at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1221)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1129)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:549)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:623)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
12/03/05 04:05:45 INFO mapred.JobClient:  map 0% reduce 0%
12/03/05 04:05:45 INFO mapred.JobClient: Job complete: job_local_0001
12/03/05 04:05:45 INFO mapred.JobClient: Counters: 0
The problem above is Hadoop's DF class failing to parse the output of df (see parseExecResult in the stack trace); setting the LANG environment variable solves it:
export LANG=en.utf8
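
To make this permanent, the same line can also be appended to ~/.bashrc:

$ echo 'export LANG=en.utf8' >> ~/.bashrc

With LANG set, re-running the same command succeeds: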
$ bin/hadoop  jar hadoop-0.20.2-examples.jar wordcount input output
12/03/05 04:07:18 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/03/05 04:07:18 INFO input.FileInputFormat: Total input paths to process : 1
12/03/05 04:07:19 INFO mapred.JobClient: Running job: job_local_0001
12/03/05 04:07:19 INFO input.FileInputFormat: Total input paths to process : 1
12/03/05 04:07:19 INFO mapred.MapTask: io.sort.mb = 100
12/03/05 04:07:19 INFO mapred.MapTask: data buffer = 79691776/99614720
12/03/05 04:07:19 INFO mapred.MapTask: record buffer = 262144/327680
12/03/05 04:07:19 INFO mapred.MapTask: Starting flush of map output
12/03/05 04:07:19 INFO mapred.MapTask: Finished spill 0
12/03/05 04:07:19 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/03/05 04:07:19 INFO mapred.LocalJobRunner:
12/03/05 04:07:19 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
12/03/05 04:07:19 INFO mapred.LocalJobRunner:
12/03/05 04:07:19 INFO mapred.Merger: Merging 1 sorted segments
12/03/05 04:07:19 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 5204 bytes
12/03/05 04:07:19 INFO mapred.LocalJobRunner:
12/03/05 04:07:19 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/03/05 04:07:19 INFO mapred.LocalJobRunner:
12/03/05 04:07:19 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/03/05 04:07:19 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to output
12/03/05 04:07:19 INFO mapred.LocalJobRunner: reduce > reduce
12/03/05 04:07:19 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
12/03/05 04:07:20 INFO mapred.JobClient:  map 100% reduce 100%
12/03/05 04:07:20 INFO mapred.JobClient: Job complete: job_local_0001
12/03/05 04:07:20 INFO mapred.JobClient: Counters: 12
12/03/05 04:07:20 INFO mapred.JobClient:   FileSystemCounters
12/03/05 04:07:20 INFO mapred.JobClient:     FILE_BYTES_READ=325874
12/03/05 04:07:20 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=356160
12/03/05 04:07:20 INFO mapred.JobClient:   Map-Reduce Framework
12/03/05 04:07:20 INFO mapred.JobClient:     Reduce input groups=383
12/03/05 04:07:20 INFO mapred.JobClient:     Combine output records=383
12/03/05 04:07:20 INFO mapred.JobClient:     Map input records=75
12/03/05 04:07:20 INFO mapred.JobClient:     Reduce shuffle bytes=0
12/03/05 04:07:20 INFO mapred.JobClient:     Reduce output records=383
12/03/05 04:07:20 INFO mapred.JobClient:     Spilled Records=766
12/03/05 04:07:20 INFO mapred.JobClient:     Map output bytes=6912
12/03/05 04:07:20 INFO mapred.JobClient:     Combine input records=663
12/03/05 04:07:20 INFO mapred.JobClient:     Map output records=663
12/03/05 04:07:20 INFO mapred.JobClient:     Reduce input records=383



OK! Look at the result (the part-r-00000 file saved in the output folder):
$ cat part-r-00000
"Glory  1
"Grandiose      1
"I      1
"Putin  1
"Putinism",     1
"These  1
"We     4
"every  1
"the    1
"unfair 1
"would  1
'victory'       2
(14:00  1
-       1
--------------------------------------------------------------------------------        1
17%.    1
18:00   1
2008    1
58.3%   1
6,000   1
60%     2
62.3%.  1
64%,    1
Alexey  1
Analysis        1
BBC     1
BBC:    1
Bridget 1
But     2
Continue        2
December's      1
December,       1
Diplomatic      1
Dmitry  1
ElectionRussia  1
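
If you want the most frequent words first, the counts can also be sorted numerically by the second column, e.g.:

$ sort -k 2 -n -r part-r-00000 | head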
