import scala.collection.mutable.ArrayBuffer

val calcUsers = new ArrayBuffer[Int]()
nRDD.foreach { item =>
  val arr = item.split(' ')
  val currUserId = arr(1).toInt
  calcUsers += currUserId
  println("calcUsers", currUserId, calcUsers.length)
}
println("calcUsers", calcUsers.length)
The first println shows the array length growing steadily.
The second println, however, prints 0.
Why?
How can I fix this?
An RDD is a distributed data structure. Its data does not live on the driver node (the node where your main program runs); it is partitioned across the executor nodes. The functions you pass to RDD operations such as map and foreach run on those executors. To make that possible, Spark builds a closure of the function (think of it as an object holding the function together with copies of every variable it references), serializes it, and sends it to each executor, where it executes against that executor's partitions of the RDD.
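The copy semantics can be imitated without Spark. The sketch below is only a rough simulation, not Spark's actual code path: it pushes an ArrayBuffer through plain Java serialization, which is roughly what happens to a variable captured in a closure, and then mutates the copy. The names ClosureCopyDemo and roundTrip are made up for this illustration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

object ClosureCopyDemo {
  // Serialize a value to bytes and read it back, yielding an independent copy.
  // This stands in for shipping a captured variable to an executor.
  def roundTrip[T <: java.io.Serializable](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val driverSide = ArrayBuffer[Int]()         // plays the role of calcUsers on the driver
    val executorSide = roundTrip(driverSide)    // plays the role of the copy a task receives
    executorSide += 42                          // the "task" appends to its own copy
    println(s"executor copy: ${executorSide.length}, driver original: ${driverSide.length}")
    // prints: executor copy: 1, driver original: 0
  }
}
```

The append lands on the deserialized copy only, which mirrors the two println results in the question.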
In simpler terms, Spark creates multiple copies of your calcUsers and sends them to the executors along with the function. Each task then appends to its own copy, which is why the first println shows the length growing. Those copies are never sent back, so the calcUsers on the driver stays empty and the second println prints 0.
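As for fixing it, one common approach is to stop mutating driver-side state inside foreach and instead transform the RDD and bring the results back with collect(). The sketch below assumes the same line layout as in the question (user id in the second space-separated field); the local master, the sample lines, and the names CollectUserIds and extractUserIds are assumptions for illustration. If only a count is needed, an accumulator such as sc.longAccumulator may be enough.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CollectUserIds {
  // Parse the ids on the executors, then return them to the driver.
  def extractUserIds(lines: RDD[String]): Array[Int] =
    lines.map(item => item.split(' ')(1).toInt).collect()

  def main(args: Array[String]): Unit = {
    // Local master and sample data are placeholders, not part of the question.
    val conf = new SparkConf().setAppName("collect-user-ids").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val nRDD = sc.parallelize(Seq("a 1 x", "b 2 y", "c 3 z"))

      // The array now lives on the driver, so its length is what you expect.
      val calcUsers = extractUserIds(nRDD)
      println(s"calcUsers ${calcUsers.length}")

      // Alternative when only a tally is needed: accumulator updates made in
      // tasks are merged back into the driver-side value.
      val seen = sc.longAccumulator("seenUsers")
      nRDD.foreach(_ => seen.add(1))
      println(s"seen ${seen.value}")
    } finally {
      sc.stop()
    }
  }
}
```

Keep in mind that collect() pulls every element into driver memory, so it suits results that are known to be small; for large results it is usually better to keep working on the RDD or write it out.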