Heterogeneous Map-Reduce: Scientific Visualisation (Part 3/3)

This article is the final part of the series about the heterogeneous map-reduce approach:
Part 1 – Heterogeneous Map-Reduce: Meet the Task System (Part 1/3)
Part 2 – Heterogeneous Map-Reduce: Seismic Imaging Application (Part 2/3)
Part 3 – [You are here] – Heterogeneous Map-Reduce: Scientific Visualisation (Part 3/3)

In the previous posts we have introduced the heterogeneous map-reduce framework and applied it on a seismic imaging problem. Here, we will use the heterogeneous map-reduce framework for scientific visualisation.

What is scientific visualisation?

Scientific visualisation is a way to showcase the numerical modelling or physical simulations with computer graphics. Why is it important? Because using scientific visualisation we can observe a three-dimensional structures and processes in (close to) real time and zoom in or our as we wish. Also, it is a possibility to produce cool movies like this using a rendering software (ray-tracer):

Rendering is an automatic process of generating images (also called frames) from 2D- or 3D-models that build a scene. Basically, the scene is what is viewed by the camera and it includes objects, lightening, shading, viewpoint, etc.

To make a simulation movie, we need to import the simulation results or objects into the scene. Afterwards, we need to render the scenes and produce frames/images. Finally, the images are collected into a movie.

Why not use Paraview?

Paraview makes it possible to visualise scientific data, but it renders the images with OpenGL. This means that the image can be rendered very quickly, but at the cost of quality. On the other hand, a ray-tracer will be slower but will produce better quality images (even photo realistic quality).

Levels of parallelism

Level 1: Since a movie consists of frames (separate images that compose a moving picture), the highest level of parallelism is over the frames.

Level 2: The second level of parallelism is either the internal multi-threading or multi-GPU implementation of the rendering software. In our case, we can use either the CPUs or GPUs to render the image, so 2 instances of the ray-tracer can run on one machine.

Task system (map-reduce)

The idea is to render each individual image or frame in parallel which makes it easy to assign one task to one frame. By using multiple threads for rendering, an image can be processed in parallel within a task.

It sounds pretty easy: one task for one frame. The complexity comes with the monitoring the tasks in real time. Ideally we would like to know what is each compute node doing at each moment in time and will be able to add/remove compute nodes after the computations have started.

For this we have developed a web server for the task system. The web server allows monitoring, as well as interacting with the server. Through the web-interface, we can check progress, and other indicators such as memory usage or CPU load.

In the figures below you will see snapshots of the web server while doing the scientific visualisation: list of frames to be rendered in the available queue, list of running tasks, statistics, settings and system information such as CPU load and memory usage in real-time.

Frames to be rendered
Frames to be rendered
Frames rendered
Frames rendered
System information
System information

Why use the task system for rendering?

Firstly, to make a movie we need to import the objects computed from a numerical simulation to the scene. In the previous example of the heterogeneous map-reduce (Part 2/3) we described the seismic migration. There, the objects are images of seismic migration. For each frame, a Python script would read and import the objects of interest in the ray-tracer and create the scene.

Secondly, the scene has to be rendered.

We observed that most of the time was being spent in the importing the objects. The total time per frame was too high and had to be done in a parallel, or distributed way. So that, we could use the task system for importing the objects as well as for rendering of the scenes.

Real life example

Here is the example of using the task system for the scientific visualisation from Hans. During the final stage of his PhD work at TU Delft, he implemented several seismic migration algorithms in parallel (on CPUs and GPUs). He thought it would be nice to show how migration images from each shot are forming final migration image in a movie (see more details here).

The Little Green Machine (which is the Dutch smallest supercomputer build from several CPUs connected to two GPUs) was perfect for that as he could use the power of GPUs for rendering. Many thanks to his supervisors Prof. Kees Vuik and Prof. Kees Oosterlee for giving access to the Little Green Machine.


In this series we introduced the heterogeneous map-reduce approach as a universal parallel framework. A very important tool within this framework is the task system that allows to split the work amongst compute nodes and monitor the execution.

We have shown in the previous post how to use the task system in the seismic imaging application to do seismic migration in parallel distributed way. Also, in this post we have shown how to use the task system in the scientific visualisation for importing and rendering the images.

What tools do you use for scientific visualisation?

Get EZNumeric’s future articles in your inbox:

Heterogeneous Map-Reduce: Seismic Imaging Application (Part 2/3)

This article is the second part of the series about the heterogeneous map-reduce approach:
Part 1 – Heterogeneous Map-Reduce: meet the Task System (Part 1/3)
Part 2 – [You are here] – Heterogeneous Map-Reduce: Seismic Imaging Application (Part 2/3)
Part 3 – Heterogeneous Map-Reduce: Scientific Visualisation (Part 3/3)

In the previous part we have described the heterogeneous map reduce framework. Here, we will start with an example from a seismic imaging application.

Problem setting

The oil and gas industry makes use of computational intensive algorithms such as reverse-time migration and full waveform inversion to provide an image of the subsurface. The image is obtained by sending a wave energy into the subsurface and recording the signal required for a seismic wave to reflect back to the surface from the interfaces with different physical properties. A seismic wave is usually generated by shots at known frequencies, placed close to the surface on land or to the water surface in the sea. Returning waves are usually recorded in time by hydrophones in the marine environment or by geophones during land acquisition. The goal of seismic imaging is to transform the seismograms to a spatial image of the subsurface.

Migration algorithms produce an image of the subsurface given seismic data measured at the surface. In particular, pre-stack depth migration produces the depth locations of reflectors by mapping seismic data from the time domain to the depth domain, assuming that a sufficiently accurate velocity model is available. The classic imaging principle is based on the correlation of the forward propagated wavefield from a source and a backward propagated wavefield from the receivers. In frequency domain the image is calculated as follows:

\text{Image} (x,y,z) = \sum_{\text{shots}}\sum_{\omega} W_{\text{source}}^*(x,y,z,\omega) W_{receivers}(x,y,z,\omega)

where W_{\text{source}}(x,y,z,\omega)  denotes the wavefield propagated from the source and W_{\text{receivers}}(x,y,z,\omega)  from the receivers respectively, \omega denotes the frequency. That means for every shot and every frequency we need to simulate the wave propagation twice.

Levels of parallelism

Level 1: The highest level of parallelization for frequency-domain migration is over the shots. Each shot is treated independently. We assume that the migration volume for one shot is computed on one compute node connected to none, one or more GPUs.

Level 2: The next level of parallelism involves the frequencies. For each frequency, a linear system of equations needs to be solved.

Level 3: The third level of parallelism includes matrix decomposition, where the matrix for the linear system of equations is decomposed into subsets of rows that fit on a single GPU.

Level 4: The last level of parallelism for migration in frequency domain is parallelization of matrix-vector multiplications (MVMs) and vector-vector operations.

Task system (Map-Reduce)

Here we describe how the task system works for the migration in frequency domain.

Read more about task system in the previous post: Heterogeneous Map-Reduce: Meet the Task System


Mostly we will use the first level of the parallelism. The server or ‘master node’ creates one task per source. Each task is added to a “Available” queue. When a client requests a task, a given task is moved from the queue to the “Running” list.

As we have seen earlier, the migration algorithm consists of forward and backward modelling in frequency domain for each source. Therefore, the second level of parallelism for migration consist of parallelization over all frequencies for each source.

Let’s assume that we have N_s sources and N_\omega frequencies. Then, one task consists of computations of one frequency \omega_{s_i} for a given source {s_i}, i=1,...,N_s. In total, we have N_\omega\cdot N_s tasks.

For each frequency, a linear system of equations needs to be solved.  The matrix size and memory requirements are the same for each frequency, but the lower frequencies require less compute time than the higher ones. Here, we assume that one frequency for one source in the frequency domain fits in one compute node. At this point, the auto load-balancing of the tasks comes into play. There is no need to know beforehand how to distribute the shots over the compute nodes.

If a compute node is connected to one or more GPUs, we can make use of the third level of parallelism and decompose the matrix across GPU(s). However, this is done within a task.

Migration movie

The movie starts with the velocity model. We run migration on the SEG/EAGE Overtrust model, which represents an acoustic constant-density medium with complex, layered structures and faults. Then the camera rotates and it shows the migrated image, which appears to be empty. The migrated data from each shots is added and shown in the movie as it is being received by the server (‘Reduce’ part of the algorithm).  We can see how layers become visible. At the end we see the final image that we obtained using migration in frequency domain.

More details on the simulation

The volume has a size of 1000×1000×620 m3. The problem is discretized on a grid with 301×301×187 points and a spacing of 3.33 m in each coordinate direction.

The discretization for migration in frequency domain is 2nd-order in space. A Ricker wavelet with a peak frequency of 15 Hz is chosen for the source and the maximum frequency in this experiment is 30 Hz. Note that by reducing the maximum frequency, we can increase the grid spacing. For instance, by choosing a maximum frequency of 8 Hz, the grid spacing can be chosen as 25 m in each direction. The line of sources is located at a depth of 10 m and is equally spaced with an interval of 18.367 m 4 in the x-direction.

The receivers are equally distributed in the two horizontal directions with the same spacing as the sources, at the depth of 20 m. The sampling interval for the modelled seismic data is 4 ms. The maximum simulation time is 0.5 s. For migration in the frequency domain we chose a frequency interval of 2 Hz.

What is next?

In this post we showed how to use the heterogeneous map-reduce framework and the task system on an application from seismic imaging : migration in frequency domain.

In the next post of this series we are going to have a look at scientific visualisation.

Get EZNumeric’s future articles in your inbox:

Heterogeneous Map-Reduce: meet the Task System (Part 1/3)

This article is the first part of the series about the heterogeneous map-reduce approach:
Part 1 – [You are here] – Heterogeneous Map-Reduce: meet the Task System (Part 1/3)
Part 2 – Heterogeneous Map-Reduce: Seismic Imaging Application (Part 2/3)
Part 3 – Heterogeneous Map-Reduce: Scientific Visualisation (Part 3/3)

Did it happen to you: you just finished a parallel implementation of a project and happily enjoying the speedups, until the next project arrives, where you have to use completely different hardware, and you have to start over again with the parallel framework for the new project? It happened to me several times! And Hans as well! Until he thought of a heterogeneous map-reduce approach, that ones implemented it can be easily adjusted for different hardware architectures (CPUs, GPUs, Intel’s Knight Hill, you name it).

Heterogeneous map-reduce

The idea of the heterogeneous map-reduce approach is a universal parallel framework. It assumes commodity hardware with several types of processors with different capabilities and performance (for example a cluster with some computers having one GPU, other having two GPUs or none), hence the name “heterogeneous”. The performance can also be affected by the number of users sharing a node.

The heterogeneous map-reduce approach is actually a way to distribute the computations across available hardware to achieve a good load balancing. After all, you would like to make use of the available compute resources.

The “map-reduce” component comes from the setup of the task system, which runs computations in parallel.

There is no need to know explicitly how to distribute the work beforehand.

Task system

The task system allows to split the work amongst compute nodes and monitor the execution. By a compute node we assume a multi-core CPU that might be connected to one or more GPUs, where GPUs can be used as a replacement or as an accelerator.

Read more about using GPUs as a replacement for CPU or as an accelerator in the post:
2 Ways of Using a GPU to Speedup Your Computations


The common approach for parallelization across multiple CPU nodes in the cluster is the so-called server-to-client approach. The server is aware of the number of nodes and the amount of work involved. It equally distributes the work-load among the nodes. This approach is efficient on clusters with homogeneous nodes as all CPU nodes have the same characteristics.

Since we are targeting the heterogeneous clusters, we propose a client-to-server approach where clients request tasks to the server.

The philosophy behind a task system is that a compute node will ask for a task, process it, and ask for another task as soon as the previous one has finished. A compute node is “busy” 100% of the time, regardless of its performance. The work is spread dynamically according to the speed of the compute nodes and load balancing is automatically achieved. Moreover, the task system gives the flexibility to launch new child-processes if one or more compute nodes crash or hang.

Each task is added to an “Available” queue. When a client (one compute node) requests a task, a given task is moved from the “Available” queue to the “Running” list. When the task is done, it is removed.

Failure-safe scheduling

It can happen that a node will crash due to a hardware failure. In that case, the tasks in the “Running” queue will be ranked against the starting time. Eventually, the task with the earliest starting time will be assigned to a new compute node. It may happen that 2 compute nodes might work on the same task. In this case, the first compute node that manages to compute the output “wins”.

The tasks are moving from the “Available” queue to the “Running” queue and after producing result to the “Done” queue. At the end of the computations, the “Available” queue is empty.
When the “Available” queue is empty, the tasks in the “Running” queue will be ranked against the starting time. The task with the earliest starting time will be assigned to a new compute node. Actually, now two compute nodes are running the same task. When one of them will finish first and produce a result, the (same) result from the second compute node will be discarded.
Why not MPI?

We have chosen the client-to-server approach over MPI for two reasons:

  1. If one MPI-child crashes, the master program will crash as well. We want to avoid that.
  2. It is currently not possible to add compute nodes dynamically to the MPI-server once the child processes are launched. We want to have a flexibility of adding additional compute nodes in case one has crashed or hangs.

You still can use MPI to spawn the child processes.

Levels of parallelism
Summarising, a task is an amount of work for one compute node in a heterogeneous cluster. It has the highest level of parallelism. The next step is to define other levels of parallelism to split this work across multiple threads within the compute node and also across one or more GPUs. The levels of parallelism will depend on the problem and the chosen way to solve it. Interestingly, the levels of parallelism have been introduced already two hundred years ago by Charles Babbage:
  1. Job level (the highest)
  2. Programm level
  3. Loop level
  4. Instruction level (the lowest)

Depending on your algorithm, you could use GPU(s) at the program or the loop level.

What is next?

In this post we have defined the heterogeneous map-reduce approach and the task system, that allow to spread work dynamically across your computing system and achieve automatic load balancing.

The next step is to demonstrate it on some applications from different fields. One example we will take from seismic imaging and another one from scientific visualisation. We will cover these in our next posts.

How many levels of parallelism do you use in your application?

Get EZNumeric’s future articles in your inbox:


2 ways of using a GPU to speedup computations

Did you know that you can use GPU in different ways to speedup your computations? Let’s have a closer look.

High-performance computer architectures are developing quickly by having more and faster cores in the CPUs (Central Processing Units) or GPUs (Graphics Processing Units). Recently, a new generation of GPUs appeared, offering tera-FLOPs performance on a single card.

CPU versus GPU

The GPU and CPU architectures have their own advantages and disadvantages.

CPUs are optimized for sequential performance and good at instruction level parallelism, pipelining, etc. With a powerful hierarchical cache, and scheduling mechanism, the sequential performance is very good.

In contrast, GPUs are designed for high instruction throughput with a much weaker cache or scheduling ability. In GPU programming, users have to spend more time to ensure good scheduling, load balancing and memory access, which can be done automatically on a CPU. As a result, GPU kernels are always simple and computationally intensive.

The GPU was originally designed to accelerate the manipulation of images in a frame buffer that was mapped to an output display. GPUs were used as a part of a so-called graphics pipeline, meaning that the graphics data was sent through a sequence of stages that were implemented as a combination of CPU software and GPU hardware. Nowadays GPUs are more and more used as GPGPU (General Purpose GPU) to speedup computations.

2 ways of using a GPU

A GPU can be used in two different ways:

  • as an independent compute node replacing the CPU or
  • as an accelerator.

In the first case, the algorithm is split to solve a number of independent sub-problems that are then transferred to the GPU and computed separately (with little or no communication). To achieve the best performance, the data is kept on the GPU when possible. As GPUs have generally much less memory available than CPUs, this impacts the size of the problem significantly.

In the second case, the GPU is considered as an accelerator, which means that the problem is solved on the CPU while off-loading some computational intensive parts of the algorithm to the GPU. Here, the data is transferred to and from the GPU for each new task.

Let’s take the wave equation as an example. The wave equation can be formulated in time or frequency domain. The wave equation in the time domain is usually solved using a time-stepping scheme, which does not require a solution of the linear system of equations. The wave equation in the frequency domain (Helmholtz equation), in opposite, is solved with an iterative method that contains the solver of the linear system of equations in matrix form.

The simplicity of the time-stepping algorithms makes it easy to use GPUs of modest size as accelerator to speedup the computations.

However, it is not trivial to use GPUs as accelerators for iterative methods that require solution of a linear system of equations. The main reason for this is that the most iterative methods consist of matrix-vector and vector-vector operations (e.g. matrix-vector multiplication). By using the GPU as an accelerator, the matrices need to be distributed across GPUs. The vectors would “live” on the CPU and are transferred when needed to the relevant GPU to execute matrix-vector multiplications.

Accelerator or replacement?

Ideally, GPUs would be used as a replacement but the limited memory makes this difficult for large numerical problems. There seem to be a trend where CPUs and GPUs are merging so that the same memory can be accessed equally fast from the GPU or the CPU. In that case the question “accelerator or replacement?” would become irrelevant as one can alternate between both hardware without taking into account the data location.

How do you use GPU: as replacement or as accelerator? Let us know in the comment box.

Related posts

Wave propagation simulation in 2D
Parallel wave modelling on a 8-GPU monster machine

Please leave your email address to register and receive our newsletter:


Parallel wave modelling on a 8-GPU monster machine

A few years ago we had the rare opportunity to program on a 8-GPU monster machine provided by NVIDIA. That was an experience! The goal was to parallelise a 3D wave modelling tool in the frequency domain (Helmholtz equation) on a machine with multiple GPUs. The motivation was to run a problem of a realistic size that are usually too large to fit in the memory of one GPU. Actually, we are facing the same issue as 20 years ago with CPUs.

The monster machine
Nvidia System with 8 GPUs

For our numerical experiments NVIDIA provided a Westmere based 12-cores machine connected to 8 GPUs Tesla 2050 as shown on the figure above. The 12-core machine has 48 GB of RAM. Each socket has 6 CPU cores Intel(R) Xeon(R) CPU X5670 @ 2.93GHz and is connected trough 2 PCI- buses to 4 graphics cards. Note that two GPUs are sharing one PCI-bus connected to a socket. Each GPU consist of 448 cores with a clock rate of 1.5 GHz and has 3 GB of memory.

Multi-GPU approach

There are several approaches to deal with multi-GPU:

  1. The data-parallel approach, where all matrix-vector and vector-vector operations are split between multiple GPUs. The advantage of this approach is that it is relatively easy to implement. However, matrix- vector multiplication requires exchange of the data between different GPUs, that can lead to significant data transfer times if the computational part is small.
  2. Split of the algorithm, where different parts of algorithm are executed on different devices. For instance, the solver is executed on one GPU and the preconditioner on another one. In this way the communication between GPUs will be minimized. However this approach requires individual solution for each algorithm.
  3. Domain-Decomposition approach, where the original continuous or discrete problem is decomposed into parts which are executed on different GPUs and the overlapping information (halos) is exchanged by data transfer. For our wave simulation in frequency domain this approach can however have difficulties with convergence for higher frequencies (see Ernst, Gander).

We have chosen the data-parallel approach and split of the algorithms and made a comparison between multi-core and multi- GPUs. We leave out the domain decomposition approach because the convergence of the Helmholtz solver is not guaranteed.

Computations on Multi-GPU

You can do computations on multi-GPU by either:

  • pushing different context to different GPUs, or
  • creating multiple threads on CPU, where each of them will communicate with one GPU.


Multi-GPU Pushing context
Multi-GPU Multi-threading











For our purposes we have chosen the second option, since it was easier to understand and implement.


Implementation on multi-GPUs requires careful consideration of possibilities and optimization options. The issues we encountered during our work are listed below:

  •  Multi-threading implementation, where the life of a thread should be as long as the application. This is crucial for the multi-threading way of implementation on multi-GPU. Note that in case of pushing contexts this is not an issue.
  • Because of limited GPU memory size, large problems need multiple GPUs.
  • Efficient memory reusage to avoid allocation/deallocation. Due to memory limitations the memory should be reused as much as possible, especially in the multigrid method.
    In our work we create a pool of vectors on the GPU and reuse them during the whole solution time.
  • Limit communications CPU\rightarrowGPU and GPU\rightarrowCPU.
  • Back then, it was beneficial to use texture memory when possible, but it was not easy as each GPU needs its own texture reference.
  • Coalescing is difficult since each matrix row has a different number of elements.
Helmholtz solver

The Helmholtz equation represents the time-harmonic wave propagation in the frequency domain and has applications in many fields of science and technology, e.g. in aeronautics, marine technology, geophysics, and optical problems. In particular we consider the Helmholtz equation discretized by a second order finite difference scheme.

As a solver for the Helmholtz equation (wave equation in frequency domain) we use Bi-CGSTAB with a shifted Laplacian multigrid preconditioner.

Since the Bi-CGSTAB is a collection of vector additions, dot products and matrix-vector multiplications, the multi-GPU version of the Bi-CGSTAB is straight forward. We have seen that the speedup on multi-GPUs is smaller than on single GPU due to the data transfer between CPU and GPU. However it is possible to compute a problem of a realistic size on multi-GPUs and the computation on multi-GPU is still many times faster than 12-core Westmere.

The shifted Laplace preconditioner consists of a coarse grid correction based on the Galerkin method with matrix-dependent prolongation and Gauss-Seidel as a smoother. The implementation of coarse grid correction on multi-GPU is straight forward, since the main ingredient of the coarse grid correction is matrix-vector multiplication. The coarse grid matrices are constructed on CPU and then transferred to the GPUs.

The Gauss-Seidel smoother on multi-GPU requires adaptation of the algorithm. We use eight color Gauss-Seidel, since the Helmholtz equation is given in three dimensions and computations at each discretization point should be done independent on the neighbours to allow parallelism.


We implemented the three-dimensional Helmholtz wave equation on a multi-GPU. To keep the double precision convergence the solver (Bi-CGSTAB method) is implemented on GPU in double precision and the preconditioner in single precision.

Two multi-GPU approaches have been considered: data parallel approach and a split of the algorithm.

For the data parallel approach, we were able to solve larger problems than on one GPU
and get a better performance than multi-threaded CPU implementation. However due to
the communication between GPUs and a CPU the resulting speedups have been considerably smaller
compared to the single-GPU implementation.

To minimize the communication but still be able to solve large problems we have introduced split of the  algorithm. In this case the speedup on multi-GPUs is similar to the single GPU compared to the multi-core implementation.

See full comparison of the multi-GPU implementation to a single-GPU and a multi-threaded CPU implementation on a realistic problem size here or request a copy by filling in the contact form or by sending us an email.

We would like to thank again NVIDIA Corporation (in particular François Courteille) for access to the their many-core-multi-GPU architecture.

This work has been presented during the following conferences:

  • GTC 2012, San Jose, US
  • ENUMATH 2011, Leicester, UK