Learn about Flow, a parallel computing library for Elixir

"We don't lack computers, we lack ways to use them intelligently."

In daily programming I sometimes unconsciously treat the computer as a person and hand it tasks the way I would speak to one. One of the main differences between a computer and a human, however, is that the computer executes a program verbatim and does not adapt to special situations.

For example, if we want to count the word frequency in a file, the most intuitive way is:

File.stream!("path/to/some/file")
|> Enum.flat_map(&String.split(&1, " "))
|> Enum.reduce(%{}, fn word, acc ->
  Map.update(acc, word, 1, & &1 + 1)
end)
|> Enum.to_list()

The first line uses File.stream!/1 to open the file as a stream that can be read line by line; this step does not read any of the content yet. The second line is the problem: Enum.flat_map/2 is eager, so it reads the entire file and builds the complete list of words in memory. If the file is large enough, this can exhaust memory.

File.stream!("path/to/some/file")
|> Stream.flat_map(&String.split(&1, " "))
|> Enum.reduce(%{}, fn word, acc ->
  Map.update(acc, word, 1, & &1 + 1)
end)
|> Enum.to_list()

Since Enum.flat_map/2 is too eager, we replace it with Stream.flat_map/2, so that nothing is read on the second line. Only on the third line does Enum.reduce/3 start reading the file line by line, using a map to count the word frequencies. This way only the current line and the map of counts need to be held in memory, so a large file no longer blows up memory. But most processors today are multi-core. Can we make use of all the cores?
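To make the laziness concrete, here is a minimal sketch that uses only the standard library. The LazyDemo module and its traced_lines/1 and reads_so_far/0 helpers are hypothetical names introduced for illustration. An in-memory list stands in for the file, and every line sends a message to the current process at the moment it is actually consumed.

```elixir
defmodule LazyDemo do
  # Hypothetical helper: wraps the lines so that every line actually
  # consumed sends a {:read, line} message to the calling process.
  def traced_lines(lines) do
    Stream.map(lines, fn line ->
      send(self(), {:read, line})
      line
    end)
  end

  # Number of {:read, _} messages currently in this process's mailbox.
  def reads_so_far do
    {:messages, messages} = Process.info(self(), :messages)
    Enum.count(messages, &match?({:read, _}, &1))
  end
end

lines = LazyDemo.traced_lines(["rose are red\n", "violets are blue\n"])

# Building the pipeline reads nothing.
words = Stream.flat_map(lines, &String.split/1)
reads_before = LazyDemo.reads_so_far()

# Enum.reduce/3 is eager, so this is where the lines are pulled through.
counts = Enum.reduce(words, %{}, fn word, acc -> Map.update(acc, word, 1, &(&1 + 1)) end)
reads_after = LazyDemo.reads_so_far()

IO.inspect({reads_before, reads_after, counts})
```

Here reads_before is 0 and reads_after is 2: the lines are only touched once the eager Enum.reduce/3 asks for them.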

For convenience, we use the following list to stand in for the lines of a file. (A small list does not show off large-file handling, but all we need to remember is that the program never reads the entire content into memory at once.)

data = [
  "rose are red",
  "violets are blue"
]

The first step, similar to Stream, is to build a lazy Flow data structure:

opts = [stages: 2, max_demand: 1]

flow = data
  |> Flow.from_enumerable(opts)

%Flow{
  operations: [],
  options: [stages: 2, max_demand: 1],
  producers: {:enumerables, [["rose are red", "violets are blue"]]},
  window: %Flow.Window.Global{periodically: [], trigger: nil}
}

stages can be understood as the degree of parallelism: it is the number of GenStage processes that take part in the parallel processing. Here we set it to 2, which is the same as the default on a dual-core machine.
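As a quick check of what that default would be on your own machine: Flow's documentation describes the default for :stages as System.schedulers_online/0, which normally equals the number of logical cores available to the VM. The snippet below uses only the standard library; the variable name default_stages is just for illustration.

```elixir
# Number of scheduler threads the Erlang VM is currently running,
# normally one per logical core. Flow is documented to use this
# value when the :stages option is not given.
default_stages = System.schedulers_online()
IO.puts("default number of stages on this machine: #{default_stages}")
```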

The following flat_map and reduce operations are very similar to the ones above.

flow = flow
  |> Flow.flat_map(&String.split/1)
  |> Flow.reduce(fn -> %{} end, fn word, acc -> Map.update(acc, word, 1, &(&1 + 1)) end)

%Flow{
  operations: [
    {:reduce, #Function<45.65746770/0 in :erl_eval.expr/5>,
     #Function<43.65746770/2 in :erl_eval.expr/5>},
    {:mapper, :flat_map, [&String.split/1]}
  ],
  options: [stages: 2, max_demand: 1],
  producers: {:enumerables, [["rose are red", "violets are blue"]]},
  window: %Flow.Window.Global{periodically: [], trigger: nil}
}

flow |> Enum.to_list()

[{"are", 1}, {"blue", 1}, {"violets", 1}, {"are", 1}, {"red", 1}, {"rose", 1}]

Only when we call an eager function such as Enum.to_list/1 on the Flow does the actual execution finally start. Notice that {"are", 1} appears twice in the result. Why is that?

Remember that we set the options stages: 2, max_demand: 1, which means that 2 stages take part in the processing and each stage asks for at most 1 event at a time. As a result, "rose are red" and "violets are blue" are handed to different stages, each stage counts its own words, and the final result is simply the per-stage results stitched together. Completing the final merge would be an operation that only a single process can perform, which is exactly what we want to avoid.
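The effect described above can be imitated without Flow. The following is a minimal sketch, not how Flow is implemented: one Task per line plays the role of a stage, and the names count_line, partials, stitched and merged are made up for this illustration.

```elixir
data = ["rose are red", "violets are blue"]

# Count the words of a single line into a map.
count_line = fn line ->
  line
  |> String.split()
  |> Enum.reduce(%{}, fn word, acc -> Map.update(acc, word, 1, &(&1 + 1)) end)
end

# One task per line plays the role of one stage; each keeps its own map.
partials =
  data
  |> Enum.map(fn line -> Task.async(fn -> count_line.(line) end) end)
  |> Enum.map(&Task.await/1)

# Stitching the per-stage results together leaves "are" split in two.
stitched = Enum.flat_map(partials, &Map.to_list/1)

# The fix-up has to walk every partial map in a single process.
merged =
  Enum.reduce(partials, %{}, fn partial, acc ->
    Map.merge(acc, partial, fn _word, a, b -> a + b end)
  end)

IO.inspect(stitched)
IO.inspect(merged)
```

The stitched list contains {"are", 1} twice, while merged has "are" => 2, but only after a sequential pass over all partial results.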

Is there a way to avoid this problem when events are assigned to stages? If identical events always go to the same stage, no final merge is needed. Assigning events by their hash achieves this, and Flow.partition does exactly that.
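The idea of routing by hash can be sketched with the standard library alone. This is only an imitation under the assumption of a phash2-style hash; Flow's real dispatching goes through GenStage, and the variable names here are invented for the example.

```elixir
words = ["rose", "are", "red", "violets", "are", "blue"]
stages = 2

# Route each word to partition 0 or 1. :erlang.phash2/2 is deterministic,
# so both occurrences of "are" land in the same partition.
partitions = Enum.group_by(words, fn word -> :erlang.phash2(word, stages) end)

# Each partition counts its own words independently.
partial_counts =
  partitions
  |> Map.values()
  |> Enum.map(&Enum.frequencies/1)

# No word appears in more than one partition, so plain concatenation is enough.
result = Enum.flat_map(partial_counts, &Map.to_list/1)
IO.inspect(result)
```

Which partition a given word ends up in depends on the hash, but every word is counted in exactly one place, so the concatenated result needs no further merging.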

flow = data
  |> Flow.from_enumerable(opts)
  |> Flow.flat_map(&String.split/1)
  |> Flow.partition(opts)
  |> Flow.reduce(fn -> %{} end, fn word, acc -> Map.update(acc, word, 1, &(&1 + 1)) end)

%Flow{
  operations: [
    {:reduce, #Function<45.65746770/0 in :erl_eval.expr/5>,
     #Function<43.65746770/2 in :erl_eval.expr/5>}
  ],
  options: [stages: 2, max_demand: 1],
  producers: {:flows,
   [
     %Flow{
       operations: [{:mapper, :flat_map, [&String.split/1]}],
       options: [stages: 2, max_demand: 1],
       producers: {:enumerables, [["rose are red", "violets are blue"]]},
       window: %Flow.Window.Global{periodically: [], trigger: nil}
     }
   ]},
  window: %Flow.Window.Global{periodically: [], trigger: nil}
}

You can see that after the partition the original Flow is nested inside a new Flow, which is why we need to pass in opts again. The inner Flow runs first and feeds the outer Flow, and this time words are routed to stages according to their hash. (The Flow structure above shows no information about hashing, because hashing is how events are dispatched by default.)

flow |> Enum.to_list()

[{"blue", 1}, {"rose", 1}, {"violets", 1}, {"are", 2}, {"red", 1}]

The result meets our expectation: "are" is counted once with a total of 2, and no additional computation is needed to combine the results.

The code in the text comes from https://hexdocs.pm/flow/Flow.html

Ljzn
