Research Activities

I am working in the TADaaM Inria team (previously the Runtime Inria Team-Project). My primary research interests are (or have been):

Topology-aware High-Performance Computing (2006-...)

The increasing complexity of modern machines, with multiple processors, shared caches, cores, hardware threads, NUMA memory nodes, and I/O devices, causes performance to depend dramatically on where tasks and data buffers are placed. While manually understanding the hardware architecture may be feasible, achieving performance portability requires automatic discovery of the hardware topology and constraints. We build a structural model of the platform as a hierarchical tree of resources. Such knowledge then enables topology-aware placement of tasks according to their behavior and needs.

We exhibited the impact of task and data placement on high-speed network performance and then developed the corresponding automatic placement strategies. We also implemented the hwloc software, which gathers deep knowledge about the hardware and exposes it to applications in an abstracted and portable manner, letting MPI and OpenMP runtime systems place tasks in a clever way depending on their affinities. This model is now being extended to heterogeneous and non-volatile memories, as well as to network topologies through the netloc software.
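As a minimal sketch of the kind of portable topology discovery this enables (using the public hwloc C API; the binding choice is arbitrary and only for illustration), the following program loads the machine topology, counts the cores, and binds the calling thread to one of them:

    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topology;
        hwloc_obj_t core;
        unsigned nbcores;

        /* Discover the topology of the machine we are running on */
        hwloc_topology_init(&topology);
        hwloc_topology_load(topology);

        /* Count the cores and bind the calling thread to the last one */
        nbcores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
        printf("%u cores detected\n", nbcores);
        core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, nbcores - 1);
        if (core)
            hwloc_set_cpubind(topology, core->cpuset, HWLOC_CPUBIND_THREAD);

        hwloc_topology_destroy(topology);
        return 0;
    }

(Compile with -lhwloc; a real runtime system would of course choose the binding from the affinities discussed above rather than hard-coding it.)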

This work was carried out within Stéphanie Moreaud's PhD, and a collaboration between Inria and the Open MPI project.

More details about Hardware Locality (hwloc)

More details about Network Locality (netloc)

All publications about Topology.

Modeling Memory Access Performance (2010-...)

Modeling the hardware as a structural tree lacks information about the impact of each level of the hierarchy on actual application performance. We developed several ways to quantify the performance of memory accesses depending on how the application is mapped onto cores that share some caches. We proposed a memory model to predict the performance of memory-bound kernels, and we extended the roofline model to cope with NUMA and heterogeneous memory architectures.
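To give the intuition behind such NUMA-aware rooflines: attainable performance is bounded by the smaller of the peak compute rate and the product of arithmetic intensity with the bandwidth of whichever memory the kernel actually streams from. The toy computation below (all numbers are made up for illustration; this is not the model from these works) shows how the bound drops when data sits on a remote NUMA node:

    #include <stdio.h>

    /* Classical roofline bound: performance is limited either by the peak
     * compute rate or by arithmetic intensity times memory bandwidth. */
    static double roofline(double peak_gflops, double bw_gbs, double intensity)
    {
        double mem_bound = bw_gbs * intensity;
        return mem_bound < peak_gflops ? mem_bound : peak_gflops;
    }

    int main(void)
    {
        double peak = 1000.0;      /* hypothetical peak, in GFLOP/s */
        double bw_local = 80.0;    /* GB/s from the local NUMA node */
        double bw_remote = 30.0;   /* GB/s across the NUMA interconnect */
        double ai = 2.0;           /* flop per byte of a memory-bound kernel */

        printf("local-node bound : %.0f GFLOP/s\n", roofline(peak, bw_local, ai));
        printf("remote-node bound: %.0f GFLOP/s\n", roofline(peak, bw_remote, ai));
        return 0;
    }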

This work was carried out within Bertrand Putigny's PhD, Nicolas Denoyelle's PhD, and Andrès Rubio's PhD.

All publications about Memory.

HDR Habilitation (2006-2014)

I earned my HDR habilitation at the University of Bordeaux after 7 years in the Runtime Inria team at the Inria Bordeaux - Sud-Ouest research centre and the LaBRI laboratory. The PDF manuscript is available below.

The corresponding research interests are described in other sections of this page.

High-Performance Intra-node MPI Communication (2006-2013)

The widespread use of multicore processors in high-performance computing makes intra-node parallelization as important as inter-node communication. While hybrid models such as MPI+OpenMP are being worked on, many applications still rely on MPI for both intra-node and inter-node communication.

We showed that most existing MPI implementations offer limited throughput for large-message intra-node data transfers and may easily be improved with custom implementations such as Open-MX's specialized intra-node strategies. We then developed the generic kernel module KNEM to provide MPI implementors with an optimized data-transfer model that reduces CPU consumption, cache pollution and memory usage while improving large-message throughput.
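KNEM's actual ioctl-based interface is not reproduced here; as an illustration of the single-copy idea it implements (the kernel copies one process's buffer straight into the other process's address space, instead of two copies through an intermediate shared buffer), here is a sketch using process_vm_readv(), the vanilla-kernel system call that later offered similar functionality:

    #define _GNU_SOURCE
    #include <sys/uio.h>
    #include <unistd.h>

    /* Receiver side: copy 'len' bytes from address 'remote_addr' inside
     * process 'pid' directly into 'local_buf', with a single copy
     * performed by the kernel. */
    static ssize_t single_copy_recv(pid_t pid, void *remote_addr,
                                    void *local_buf, size_t len)
    {
        struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
        return process_vm_readv(pid, &local, 1, &remote, 1, 0);
    }

In both cases the sender first publishes a small descriptor (a cookie or pid, address and length) through shared memory, and the receiver then asks the kernel to perform the copy.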

This work was carried out within Stéphanie Moreaud's PhD, and collaborations between Inria and the MPICH2 team at Argonne National Lab and the Open MPI project.

More details about KNEM

All publications about KNEM.

High-Performance Message-Passing over Generic Ethernet (2006-2012)

The long-awaited convergence between specialized HPC networks (InfiniBand, Myrinet, ...) and traditional networks (Ethernet) is finally occurring (for instance, Myricom achieved 2-microsecond MPI latency over Ethernet in 2005). It raises the question of which hardware and software technologies will be used in future converged networks. While complex features such as RDMA or TOE have been proposed, modern NICs offer simple stateless offload features such as TSO or multiqueue to improve performance in a cost-effective manner.

We developed Open-MX to show that it is possible to achieve high-performance communication by designing an MPI stack for such generic Ethernet hardware. We also proposed some innovative ideas to improve performance by adding a little support in the hardware, without requiring a fully-specialized network stack.

Open-MX was designed within a collaboration between Inria and Myricom.

This work was also extended as part of the CCI project (Common Communication Interface) which aims at offering high performance communication for HPC and data centers.

All publications about Open-MX.

Dynamic Data Placement on NUMA architectures (2007-2011)

The emergence of multicore processors with multiple levels of shared caches and distributed NUMA memory architectures leads to an HPC world where machines are not flat but hierarchical. Running tasks on such machines requires carefully placing them and their data buffers according to their affinities so as to maximize locality. While OpenMP is an interesting way to parallelize codes using threads, most implementations fail to efficiently manage nested and irregular parallelism: they either spread the workload across all cores without maintaining affinities, or maintain affinity only at startup.

We developed ForestGOMP to tackle this problem. The parallel structure of the application (OpenMP parallel sections) and the associated data buffers are taken into account by the scheduler during the whole execution. The whole workload is properly distributed across all cores and memory nodes while maintaining affinities, and it is redistributed whenever imbalance appears.
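The kind of nested, irregular parallelism this targets looks like the toy OpenMP code below (plain OpenMP, nothing ForestGOMP-specific): a scheduler such as ForestGOMP keeps each inner team on cores sharing a cache while spreading outer teams across NUMA nodes.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_nested(1);                      /* enable nested teams */
        #pragma omp parallel num_threads(2)     /* outer teams, e.g. one per NUMA node */
        {
            int outer = omp_get_thread_num();
            #pragma omp parallel num_threads(4) /* inner teams of cache-sharing threads */
            {
                printf("outer %d, inner %d\n", outer, omp_get_thread_num());
            }
        }
        return 0;
    }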

To achieve this goal, we had to develop advanced memory-management abilities in the Linux kernel so as to let ForestGOMP distribute and migrate data buffers near their accessing tasks at runtime, either manually (with optimized memory-migration primitives) or automagically (with convenient next-touch migration strategies).
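For illustration, explicit page migration can be sketched with the standard Linux move_pages() interface from libnuma (this is the stock kernel primitive, not ForestGOMP's optimized or next-touch paths):

    #include <numaif.h>    /* move_pages(), link with -lnuma */
    #include <stdlib.h>
    #include <unistd.h>

    /* Migrate the pages backing [buf, buf+len) to NUMA node 'target_node'. */
    static long migrate_buffer(void *buf, size_t len, int target_node)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        unsigned long count = (len + page_size - 1) / page_size;
        void **pages = malloc(count * sizeof(*pages));
        int *nodes   = malloc(count * sizeof(*nodes));
        int *status  = malloc(count * sizeof(*status));
        unsigned long i;
        long ret;

        for (i = 0; i < count; i++) {
            pages[i] = (char *) buf + i * page_size;
            nodes[i] = target_node;
        }
        /* pid 0 targets the calling process; MPOL_MF_MOVE only moves pages
         * that are not shared with other processes. */
        ret = move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE);

        free(pages);
        free(nodes);
        free(status);
        return ret;
    }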

This work was carried out within François Broquedis' PhD.

All publications about ForestGOMP.

PhD Thesis (2002-2005)

During my PhD (from 2002 to 2005, in the RESO team at the LIP lab, ENS Lyon, France), I worked on distributed file systems in clusters with Loïc Prylli, Olivier Glück, and Pascal Vicat-Blanc Primet. The PDF manuscript is available below.

This work is also described in the next sections.

Making High-Performance Networks work in the Linux kernel (PhD thesis, 2004-2005)

Using high-performance networks such as Myricom's GM inside the Linux kernel for distributed file systems raised severe issues regarding memory registration. First, I had to modify the GM driver to allow user-memory registration. Then I developed the GMKRC kernel module to cache user-memory registrations and avoid deregistration overheads. Finally, I added a new VMASpy infrastructure to the Linux kernel to notify GMKRC of virtual memory area changes (somewhat similar to the MMU notifiers later added to the official Linux kernel in 2.6.27).
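The core of such a registration cache is simple: reuse a previous registration when a buffer is used again, and drop cached entries when the kernel reports that the corresponding virtual memory areas changed. The sketch below only illustrates that idea; driver_register() and driver_deregister() are hypothetical stand-ins, not the actual GM driver entry points:

    #include <stddef.h>
    #include <stdlib.h>

    struct reg_entry {
        void *addr;
        size_t len;
        void *nic_handle;            /* whatever the NIC driver returned */
        struct reg_entry *next;
    };

    static struct reg_entry *cache;

    /* Hypothetical stand-ins for the real driver registration calls */
    extern void *driver_register(void *addr, size_t len);
    extern void driver_deregister(void *handle);

    /* Return a registration handle covering [addr, addr+len), reusing a
     * cached one when possible to avoid the registration cost. */
    void *cached_register(void *addr, size_t len)
    {
        struct reg_entry *e;
        for (e = cache; e; e = e->next)
            if ((char *) addr >= (char *) e->addr &&
                (char *) addr + len <= (char *) e->addr + e->len)
                return e->nic_handle;                 /* cache hit */

        e = malloc(sizeof(*e));
        e->addr = addr;
        e->len = len;
        e->nic_handle = driver_register(addr, len);
        e->next = cache;
        cache = e;
        return e->nic_handle;
    }

    /* Called when the kernel reports that [start, start+len) changed
     * (munmap, etc.), which is what VMASpy/MMU notifiers provide. */
    void cache_invalidate(void *start, size_t len)
    {
        struct reg_entry **prev = &cache, *e;
        while ((e = *prev) != NULL) {
            if ((char *) e->addr < (char *) start + len &&
                (char *) start < (char *) e->addr + e->len) {
                *prev = e->next;
                driver_deregister(e->nic_handle);
                free(e);
            } else {
                prev = &e->next;
            }
        }
    }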

I then used this work to implement the right features directly in the new MX driver (Myrinet Express, which was going to replace GM). We worked with Myricom to enable in-kernel applications such as file systems to fully benefit from the MX performance improvements. This especially included exposing the entire MX API in the kernel so that user memory, kernel memory, and even physical memory could be used in communications. This work was distributed by Myricom and used for production distributed file systems such as Lustre.

Software Downloads

All publications from my PhD thesis work.

Improving High-Performance Network Use for Distributed Storage in Clusters (PhD thesis, 2002-2003)

As high-performance networks were hardly usable for anything but MPI, we developed the ORFA library to study the actual performance bottlenecks when using these networks for distributed storage in clusters.

The ORFA client is implemented as a shared library that transparently intercepts I/O calls from any legacy application (using LD_PRELOAD), as sketched below. No application rewriting or recompiling is required to get full support for remote file manipulation. The client uses TCP over Ethernet, or BIP and GM over Myrinet, to contact a server backed by native or in-memory file systems. While trying to make the ORFA server as efficient as possible, we faced the problem of efficiently mixing Myrinet I/O with standard I/O, and I wrote several patches to get BIP events through the epoll interface.
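The interception technique itself is the classic LD_PRELOAD interposition. A minimal sketch (not ORFA's actual code, and wrapping only open() for brevity) looks like this:

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* Minimal LD_PRELOAD interposer: a real client would look at the path
     * here and forward remote paths to the file server instead of calling
     * the local libc open(). */
    typedef int (*open_fn_t)(const char *, int, ...);

    int open(const char *path, int flags, ...)
    {
        static open_fn_t real_open;
        mode_t mode = 0;

        if (!real_open)
            real_open = (open_fn_t) dlsym(RTLD_NEXT, "open");

        if (flags & O_CREAT) {          /* open() only takes a mode with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, mode_t);
            va_end(ap);
        }

        fprintf(stderr, "intercepted open(%s)\n", path);
        return real_open(path, flags, mode);
    }

Built as a shared library (gcc -shared -fPIC -ldl) and loaded through LD_PRELOAD, this intercepts open() calls in unmodified binaries.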

As file systems are often implemented inside the Linux kernel to benefit from system caches, we later replaced the userspace ORFA client with the ORFS kernel client. This led to the development of the Myrinet Express in-kernel API.

Software Downloads

All publications from my PhD thesis work.

Updated on 2023/06/28.