Latency-Optimized Parallelization of the FMM Near-Field Computations
We present a new parallelization scheme for the near-field computations
of the fast multipole method (FMM). The scheme is built on the Global
Arrays Toolkit and uses one-sided communication that is overlapped with
computation. It relies on purely static load balancing to minimize the
number of communication steps and exploits data locality as far as
possible.
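
To illustrate what a purely static load balancing can look like, the following minimal sketch splits an ordered list of near-field boxes into contiguous ranges of roughly equal estimated work, one range per process, before any computation starts. It is not the scheme of this paper: the function name `static_partition`, the per-box work estimates, and the assumption that boxes are already ordered so that neighbors are close in the list are all hypothetical choices made for illustration.

```python
from itertools import accumulate


def static_partition(work, nprocs):
    """Split boxes 0..len(work)-1 into nprocs contiguous index ranges.

    work   : estimated cost per box (hypothetical input, e.g. a particle-pair count)
    nprocs : number of processes
    Returns a list of half-open ranges (start, end), one per process.
    """
    total = sum(work)
    prefix = list(accumulate(work))  # prefix[i] = work of boxes 0..i
    bounds = [0]
    box = 0
    for p in range(1, nprocs):
        # Ideal cumulative work at the end of process p-1's range.
        target = total * p / nprocs
        # Advance while the cumulative work stays within the target.
        while box < len(work) and prefix[box] <= target:
            box += 1
        bounds.append(box)
    bounds.append(len(work))
    return [(bounds[p], bounds[p + 1]) for p in range(nprocs)]


if __name__ == "__main__":
    # Four boxes with uneven cost, distributed over two processes.
    print(static_partition([4, 1, 1, 2], 2))
```

Because each range is contiguous and fixed in advance, a process could in principle fetch the remote data it needs in a small number of large one-sided transfers instead of negotiating work at run time, which is the kind of trade-off a static approach aims for.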