Performance – How to measure the time cost of reading data from HDFS in Spark

Spark's timeline contains:

>Scheduler Delay
>Task Deserialization Time
>Shuffle Read Time
>Executor Computing Time
>Shuffle Write Time
>Result Serialization Time
>Getting Result Time

It seems that the time cost of reading data from the source (such as HDFS) is included in Executor Computing Time, but I am not sure.

If it is included in Executor Computing Time, how can I get the read time alone, without the computation cost?

Thank you.

It is difficult to cleanly separate the time spent on the read itself, because the data is being processed while it is being read.

A simple workaround is to apply a trivial action (such as a count) whose own overhead is very small. If your file is fairly large, the read will heavily dominate that trivial operation, especially for a count, which can be completed without moving data between nodes (apart from the single-value result).
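A minimal sketch of that idea in PySpark: wrap the trivial action in a wall-clock timer. The HDFS path and the `sc` SparkContext here are assumptions for illustration, not from the original post.

```python
import time

def time_action(action):
    """Run a zero-argument action and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed

# With a live SparkContext `sc` (pyspark), the read time can be approximated
# by timing a count over the file -- "hdfs:///data/input" is a hypothetical
# path:
#
#     n, read_secs = time_action(lambda: sc.textFile("hdfs:///data/input").count())
#
# Because count() does almost no per-record computation and returns a single
# value, read_secs is dominated by the HDFS read itself.
```

Note this measures wall-clock time on the driver, so it includes scheduling overhead; for a large file that overhead is negligible relative to the read.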

