Index Dependent Nested Loops Parallelization with an Even Distributed Number of Steps

Ádám Pintér, Sándor Szénási


Parallel processing of algorithms is an effective way to achieve higher performance on multiprocessor systems rather. During parallelization, it is critical to minimize the difference between the processing time for threads. It is necessary to choose a method that can efficiently distribute the workload evenly across the threads. This paper deals with a special kind of nested loops where the internal loop iterator depends on the outer loop iterator. In such cases, the process can be represented as an upper (or lower) triangular matrix. This paper introduces a method for partitioning the outer loop according to the indices in an almost optimal manner, so that the partial loops in each thread will take nearly the same number of steps. In addition, we examine the potential of a perfect partition and try to determine the maximum (but still meaningful) partition size.

Full Text:



T. Schmidt, J. Stoye, Quadratic time algorithms for finding common intervals in two and more sequences, Proc of CPM 2004 LNCS 3109(2004) 347–358.doi:10.1007/978-3-540-27801-6_26.

C. Lengauer, Loop parallelization in the polytope model, CONCUR’93 (1993) 398–416.

K. Bondalapati, Parallelizing dsp nested loops on reconfigurable architectures using data context switching, Proceedings of the 38th DesignAutomation Conference (IEEE Cat. No.01CH37232) (2001).doi:10.1145/378239.378483.

J. Ramanujam, Optimal software pipelining of nested loops, in: Proceedings of 8th International Parallel Processing Symposium, 1994, pp.335–342.doi:10.1109/IPPS.1994.288280.

S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, P. Sadayappan, Effective automatic parallelization of stencilcomputations, Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (2007) 235–244doi:10.1145/1250734.1250761.

B. J. Gaska, N. Jothi, M. S. Mohammadi, K. Volk, Handling nested parallelism and extreme load imbalance in an orbital analysis code,arXiv:1707.09668 (2017).

B. Bylina, J. Bylina, Strategies of parallelizing nested loops on the multicore architectures on the example of the wz factorization for thedense matrices, Proceedings of the Federated Conference on Computer Science and Information Systems (2015) 629–639doi:10.15439/2015F354.

A. Kejariwal, P. D’Alberto, A. Nicolau, C. D. Polychronopoulos, A geometric approach for partitioning n-dimensional non-rectangular iterationspaces, in: Languages and Compilers for High Performance Computing, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 102–116.

V. Beletskyy, M. P. Parallelizing, Perfectly nested loops with non-uniform dependences (2002).

J. Liu, J. Wickerson, G. A. Constantinides, Loop splitting for efficient pipelining in high-level synthesis, Proceedings of the Federated Con-ference on Computer Science and Information Systems (2016).doi:10.1109/FCCM.2016.27.

S. D. Polychronopoulos, Loop coalescing: a compiler transformation for parallel machines, University of Illinois at Urbana-Champaign, Centerfor Supercomputing Research and Development, Technical Report CSRD-635 (1987).

M. K. Philippe Clauss, Ervin Altıntas, Automatic collapsing of non-rectangular loops, Parallel and Distributed Processing Symposium(IPDPS), Orlando, United States (2017) 778–787.

N. Kafri, J. A. Sbeih, Simple optimal partitioning approch to perfect tringular iteration space, Proceedings of the 2008 High Performance,Computing & Simulation Conference©ECMS, Waleed W. Smari (Ed.), ISBN: 978-0-9553018-7-2 / ISBN: 978-0-9553018-6-5 (CD) (2008)124–131.

R. Sakellariou, A compile-time partitioning strategy for non-rectangular loop nests, Proceedings of the 11th International Parallel ProcessingSymposium, IPPS’97 (Geneva), IEEE Computer Society Press (1997) 633–637.

A. Jackson, O. Agathokleous, Dynamic loop parallelisation, arXiv:1205.2367 (2012).

T. H. Tzen, L. M. Ni, Dependence uniformization: a loop parallelization technique, IEEE Transactions on Parallel and Distributed Systems4 (5) (1993) 547–558.doi:10.1109/71.224217.

B. Bylina, J. Bylina, Parallelizing nested loops on the intel xeon phi on the example of the dense wz factorization, Proceedings of the FederatedConference on Computer Science and Information Systems 8 (2016) 655–664.doi:10.15439/2016F436.

M. Wolf, M. Lam, A loop transformation theory and an algorithm to maximize parallelism, IEEE Transactions on Parallel and DistributedSystems 2 (4) (1991) 452–471.doi:10.1109/71.97902.

H. Blinn, How to solve a quadratic equation?, IEEE Computer Graphics and Applications (Volume: 25 , Issue: 6) (2005)


Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.