Hadoop – How to make s3distcp insert newlines when merging files

I have millions of small one-line S3 files that I want to merge together. I have the s3distcp syntax working, but I found that after merging, the combined file does not contain any newline characters between records.

I want to know whether s3distcp has any option to force line breaks, or whether there is another way to accomplish this without directly modifying the source files (or copying them and doing the same).

If your text files start/end with a unique character sequence, you can first use s3distcp to merge them into a single file (I achieved this by setting --targetSize to a very large number), and then insert the newlines using sed with Hadoop Streaming. In the example below, each file contains a single JSON object (and the file names start with 0), so the sed command inserts a newline between each `}{` pair:

hadoop fs -mkdir hdfs:///tmpoutputfolder/
hadoop fs -mkdir hdfs:///finaloutputfolder/
hadoop jar lib/emr-s3distcp-1.0.jar \
--src s3://inputfolder \
--dest hdfs:///tmpoutputfolder \
--targetSize 1000000000 \
--groupBy ".*(0).*"
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
-D mapred.reduce.tasks=1 \
-input hdfs:///tmpoutputfolder \
-output hdfs:///finaloutputfolder \
-mapper /bin/cat \
-reducer '/bin/sed "s/}{/}\n{/g"'
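The sed substitution can be verified locally before running the streaming job. A minimal sketch (the file names `merged.json` and `split.json` are only illustrative; note that the `\n` in the replacement relies on GNU sed, the version shipped on EMR nodes):

```shell
# Three one-line JSON records concatenated without newlines,
# as s3distcp would leave them after merging.
printf '{"id":1}{"id":2}{"id":3}' > merged.json

# Insert a newline between each }{ pair, one record per line.
sed 's/}{/}\n{/g' merged.json > split.json

cat split.json
# {"id":1}
# {"id":2}
# {"id":3}
```

Because the split happens only at `}{` boundaries, this is safe as long as that sequence never occurs inside a record (e.g. inside a string value).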

