Hadoop – write Spark DataFrame as Parquet to S3 without creating a _temporary folder

Using PySpark, I am reading a DataFrame from a Parquet file on Amazon S3:

dataS3 = sql.read.parquet("s3a://" + s3_bucket_in)

This works without problems. But when I then try to write the data:

dataS3.write.parquet("s3a://" + s3_bucket_out)

I get the following exception:

py4j.protocol.Py4JJavaError: An error occurred while calling o39.parquet. 
: java.lang.IllegalArgumentException: java.net.URISyntaxException:
Relative path in absolute URI: s3a://_temporary

It seems that Spark first tries to create a _temporary folder before writing to the given bucket. Can this somehow be prevented, so that Spark writes directly to the given output bucket?

You cannot eliminate the _temporary folder, because it is used to keep the intermediate output of a query hidden until the job completes.

But that’s okay, because that’s not the problem here. The problem is that the output committer gets somewhat confused trying to write to the root directory (it can’t delete it, you see).

You need to write with a full prefix to a subdirectory under the bucket, for example
s3a://mybucket/work/out.
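
In terms of the snippet above, that means appending a key prefix to the output path instead of writing to the bucket root. A minimal sketch, assuming s3_bucket_out holds just the bucket name and "work/out" is a placeholder prefix of your choosing:

# sketch: write under a prefix inside the bucket, not to the bucket root,
# so the committer can create and later clean up its _temporary folder
dataS3.write.parquet("s3a://" + s3_bucket_out + "/work/out")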

I should add that trying to commit data to S3A is not reliable, precisely because the way it mimics rename() is something like ls -rlf src | xargs -p8 -I% "cp % dst/% && rm %". Because ls has delayed consistency on S3, it can miss newly created files, and so not copy them.

For more information, see: Improving Apache Spark.

For now, you can only commit reliably by writing to HDFS and then copying the finished output to s3a. EMR’s s3 connector works around this problem by using DynamoDB to offer a consistent listing.
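
A rough sketch of that workaround in PySpark, assuming a cluster with HDFS available and the hadoop distcp tool on the PATH; all paths and the subprocess call are illustrative, not the only way to do the copy:

import subprocess
from pyspark.sql import SparkSession

# sketch: commit to HDFS first, then copy the completed output to S3;
# bucket and directory names below are placeholders
spark = SparkSession.builder.appName("hdfs-then-s3").getOrCreate()

dataS3 = spark.read.parquet("s3a://mybucket/input/")

# 1) commit the job output to HDFS, where rename() is atomic and cheap
hdfs_out = "hdfs:///tmp/work/out"
dataS3.write.mode("overwrite").parquet(hdfs_out)

# 2) copy the already-committed files to S3 in a separate step
subprocess.check_call(["hadoop", "distcp", hdfs_out, "s3a://mybucket/work/out"])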
