I have millions of small single-line s3 files that I want to merge together. I have the s3distcp syntax working, but I found that after merging, the merged set does not contain line breaks. I want to know whether s3distcp contains any option to force line breaks, or whether there is some other way to accomplish this without directly modifying the source files (or copying them and doing the same).
If your text files begin/end with a unique character sequence, you can first use s3distcp to merge them into a single file (I achieved this by setting --targetSize to a very large number), and then use sed with Hadoop Streaming to add the newlines. In the following example, each file contains a single JSON object (the file names begin with 0), and the sed command inserts a newline between each }{ instance:
hadoop fs -mkdir hdfs:///tmpoutputfolder/
hadoop fs -mkdir hdfs:///finaloutputfolder/
hadoop jar lib/emr-s3distcp-1.0.jar \
            --src s3://inputfolder \
            --dest hdfs:///tmpoutputfolder \
            --targetSize 1000000000 \
            --groupBy ".*(0).*"
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
            -D mapred.reduce.tasks=1 \
            --input hdfs:///tmpoutputfolder \
            --output hdfs:///finaloutputfolder \
            --mapper /bin/cat \
            --reducer '/bin/sed "s/}{/}\n{/g"'
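You can sanity-check the sed expression locally before running the streaming job; a minimal example with made-up JSON (note the \n-in-replacement behavior relies on GNU sed, which is what the Linux EMR hosts ship):

printf '{"a":1}{"b":2}{"c":3}' | sed 's/}{/}\n{/g'
{"a":1}
{"b":2}
{"c":3}

Also note that -D mapred.reduce.tasks=1 forces a single reducer, which is what collapses everything into one output file in finaloutputfolder.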