How to read an ORC file in Hadoop Streaming?

I want to read an ORC file in a MapReduce job written in Python. I tried running:

hadoop jar /usr/lib/hadoop/lib/hadoop-streaming-2.6.0.2.2.6.0-2800.jar \
-file /hdfs/price/mymapper.py \
-mapper '/usr/local/anaconda/bin/python mymapper.py' \
-file /hdfs/price/myreducer.py \
-reducer '/usr/local/anaconda/bin/python myreducer.py' \
-input /user/hive/orcfiles/* \
-libjars /usr/hdp/2.2.6.0-2800/hive/lib/hive-exec.jar \
-inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat \
-numReduceTasks 1 \
-output /user/hive/output

But I get the error:

-inputformat: class not found: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

I found a similar question, OrcNewInputformat as an inputformat for hadoop streaming, but the answer is not clear.

Please give an example of how to correctly read an ORC file in Hadoop Streaming.

Here is one of my examples, where I use an ORC-partitioned Hive table as input:

hadoop jar /usr/hdp/2.2.4.12-1/hadoop-mapreduce/hadoop-streaming-2.6.0.2.2.4.12-1.jar \
-libjars /usr/hdp/current/hive-client/lib/hive-exec.jar \
-Dmapreduce.task.timeout=0 -Dmapred.reduce.tasks=1 \
-Dmapreduce.job.queuename=default \
-file RStreamMapper.R RStreamReducer2.R \
-mapper "Rscript RStreamMapper.R" -reducer "Rscript RStreamReducer2.R" \
-input /hive/warehouse/asv.db/rtd_430304_fnl2 \
-output /user/Abhi/MRExample/Output \
-inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat \
-outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat

Here /apps/hive/warehouse/asv.db/rtd_430304_fnl2 is the path to the data backing the ORC Hive table. The rest is just supplying the appropriate streaming and Hive jars.

PS: Please don't be distracted by my use of /usr/R64/bin/Rscript; ideally the code should work without passing an explicit executable path. I only gave this path because the default R executable in my environment is 32-bit, while the path mentioned above points to 64-bit R.
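For the Python case in the question, once the input format is picked up, the mapper just reads rows from stdin. With OrcInputFormat under streaming, each value typically arrives as the OrcStruct's string representation, e.g. `{1, foo, 2.5}` (worth verifying against your own data). A minimal mapper sketch under that assumption, with a hypothetical column layout where the first column is the key:

```python
import sys

def parse_orc_struct(line):
    """Parse the text form of an OrcStruct value, e.g. '{1, foo, 2.5}'.
    Assumes a flat struct: no nested structs or embedded commas."""
    line = line.strip()
    if line.startswith("{") and line.endswith("}"):
        line = line[1:-1]
    return [field.strip() for field in line.split(",")]

def main():
    for line in sys.stdin:
        if not line.strip():
            continue
        fields = parse_orc_struct(line)
        # Emit the first column as the key for the shuffle,
        # the remaining columns as a comma-joined value.
        print(fields[0] + "\t" + ",".join(fields[1:]))

if __name__ == "__main__":
    main()
```

If your rows contain nested structs or strings with commas, this naive split will break and you would need a proper parser, or better, convert the table to a delimited text staging table in Hive first.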
