FastLA meeting

3 juillet : Les supports d'exécution -- Salle Ada Lovelace (Inria)

9h30 10h30 R. Namyst (Inria)

Titre : Programming modern (accelerator-based) multicore machines: a runtime system's perspective

Résumé : Heterogeneous accelerator-based parallel machines, featuring manycore CPUs and with GPU accelerators provide an unprecedented amount of processing power per node. Dealing with such a large number of heterogeneous processing units -- providing a highly unbalanced computing power -- is one of the biggest challenge that developpers of HPC applications have to face. To Fully tap into the potential of these heterogeneous machines, pure offloading approaches, that consist in running an application on regular processors while offloading part of the code on accelerators, are not sufficient.

In this talk, I will go through the major programming tools that were designed to harness parallel architectures. I will discuss some of the most critical issues programmers have to consider to achieve portability of performance, and I will show how advanced runtime techniques can greatly speed up applications.

Eventually, I will give some insights about the main challenges designers of program-ming environments will have to face in upcoming years.

10h30 -11h Café

11h00 -12h00 G. Bosilca (Inria & UTK)

Titre : Parallel Runtime Scheduling and Execution Controller (ParSEC)

Design of systems exceeding 1 Pflop/s and inching towards 1 Eflop/s forced a dramatic shift in hardware design. Various physical and engineering constraints resulted in introduction of massive parallelism and functional hybridization with the use of accelerator units. This paradigm change brings about a serious challenge for application developers as the management of multicore proliferation and heterogeneity rests on software.

This presentation will introduce a distributed runtime PaRSEC (Parallel Runtime Scheduling and Execution Controller), a runtime controller designed for massively parallel highly heterogeneous systems. Unlike other runtimes, PaRSEC mark a sharp break in its use of parameterized tasks graphs to provide a complete representation of the entire execution pattern, a symbolic representation allowing for an efficient distributed scheduling of tasks. Additionally, this representation disconnects the algorithmic expressions from the data distribution, providing programmers with a higher-level abstraction where exploring different data distributions and load-balancing can be done without a rewriting of the algorithm itself. Based on this runtime, a new dense linear algebra package have been developed, that approach distributed-memory machines with the use of automatic dependence analysis and the Direct Acyclic Graph Engine to deliver scalable high performance over many thousands of heterogeneous computing resources.

4 juillet : Solveurs rapides – Amphi du LaBRI (Bat A30)

9h30 - 10h30 E. Darve (Stanford)

Titre : Fast linear algebra for matrices with low-rank hierarchical structures

Résumé : Dense direct solvers have long had limited application because of their relative computational cost compared to iterative solvers such as multigrid, conjugate gradient, etc. However there are many problems in which iterative solvers fail, for various reasons, including the absence of a good and computationally cheap preconditioner. We are attempting to address these issues by developing a new class of direct solvers with a computational cost of O(N) where N is the size of the matrix. These solvers leverage the low-rank structure of certain off-diagonal blocks in the matrix. They are based on the Gaussian elimination and therefore do not suffer from convergence limitations of iterative methods. We will demonstrate results both on dense and sparse matrices.

10h30 -11h Café

11h00 - 12h00 Benoit Lizé (EADS)

Titre : Parallélisation d'un solveur direct rapide : H-Matrix. Application aux équations des ondes.

Résumé : La discrétisation d'équations intégrales par la méthode des éléments finis de frontière (BEM) mène à la résolution de grands systèmes linéaires pleins, dont le nombre d'inconnues $n$ dépasse largement $10^6$ dans les problèmes industriels.

Cette résolution est soit directe, avec un co√ªt en $O(n^3)$, soit itérative, pour un co√ªt de $O(n^2 × n_{iter})$. Dans les deux cas, ce co√ªt devient prohibitif pour des problèmes de grande taille. La méthode multipile rapide (FMM) amena des avancées spectaculaires dans les années 2000 (particulièrement dans une implémentation parallèle), réduisant la complexité d'un solveur itératif à $O(n log^2(n) n_{iter})$. Elle hérite néanmoins des inconvénient usuels des solveurs itératifs (convergence, grand nombre de seconds membres).

Nous présentons ici l'application de la méthode des H-Matrices, permettant de réaliser un solveur direct avec une complexité $O(n log^α(n))$. Cette méthode, basée sur des principes similaires à la FMM (approximations hiérarchiques, séparation des variables) se formalise par des algorithmes naturellement récursifs. Ceux-ci sont très irréguliers en terme de consommation mémoire et temps de calcul, et possèdent des dépendances non triviales entre les étapes de calcul, ce qui rend leur parallélisation délicate. Nous donnerons une parallélisation efficace de ces algorithmes, basée sur un formalisme récent de graphe de t√¢ches, au-dessus d'un moteur d'exécution, ici QUARK et StarPU.

La description plane de ces algorithmes sous la forme de graphe acyclique orienté (DAG) exprimant les contraintes d'exécution permet une parallélisation efficace, en mémoire partagée et distribuée, ainsi que sur architectures hybrides. Nous présenterons des applications industrielles à des problémes d'électromagnétisme et acoustique aujourd'hui inaccessibles au calcul par d'autres méthodes.