I have a Map in my driver node. I then use this map for processing inside each of the executors, using the forEach() action. So, essentially, there is only one Spark job, with a parallelism of 10. As the documentation states, explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
So, for my use case, what should I consider when deciding whether to broadcast this map? I have 5 executors running for this Spark application.
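
To make the trade-off concrete, here is a rough back-of-the-envelope sketch in plain Java (no Spark dependency). It rests on two assumptions: a map captured in the task closure is deserialized before every task, which is how the documentation describes automatically broadcast data, and an explicitly broadcast value is deserialized at most once per executor that runs a task. The class and method names are hypothetical and are not part of any Spark API.

```java
// Hypothetical helper, not part of Spark: estimates how many times the map
// is deserialized within a single stage under each approach.
final class BroadcastEstimate {

    // Closure capture: the data is kept in serialized form and
    // deserialized before each task runs, so one deserialization per task.
    static int deserializationsWithClosure(int tasks) {
        return tasks;
    }

    // Explicit broadcast (assumption): the value is deserialized at most once
    // per executor that actually runs a task, then reused by later tasks.
    static int deserializationsWithBroadcast(int tasks, int executors) {
        return Math.min(tasks, executors);
    }

    public static void main(String[] args) {
        int tasks = 10;     // parallelism from the question
        int executors = 5;  // executors from the question

        System.out.println("closure capture: " + deserializationsWithClosure(tasks));
        System.out.println("explicit broadcast: " + deserializationsWithBroadcast(tasks, executors));
    }
}
```

Under these assumptions the difference for a single stage is 10 deserializations versus at most 5, so the size of the map and the cost of deserializing it are probably what matter most in deciding.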