
1 IS4DVAR Algorithm

The Incremental Strong Constraint 4DVAR (IS4DVAR) Algorithm is one of the Data Assimilation modules of the Regional Ocean Modelling System (ROMS) [18,19,20]. It solves a regularized Nonlinear Least Squares (NL-LS) problem of the form (see [2,3,4, 8, 22] for details):

$$ argmin_{\mathbf {u} \in \mathfrak {R}^N} J_{DA}(\mathbf {u}) = argmin_{\mathbf {u} \in \mathfrak {R}^N}\Vert \mathbf {F}_{DA}(\mathbf {u},\mathcal {M}^{\varDelta \times \varOmega }, \mathbf {u}^b_0, \mathbf {R}, \mathbf {B}, \mathbf {v}, \varDelta , \varOmega )\Vert , $$

where \(\mathcal {M}^{\varDelta \times \varOmega }\) is the predictive model defined on the time-and-space physical domain \(\varDelta \times \varOmega \) with initial condition \(\mathbf {u}^b_0\), \(\mathbf {R}\) and \(\mathbf {B}\) are the covariance matrices, and \(\mathbf {v}\) is the vector of observations.

The common approach for solving NL-LS problems consists of defining a sequence of local approximations of \(J_{DA}\), where each member of the sequence is minimized by employing Newton's method or one of its variants (such as Gauss-Newton, L-BFGS, Levenberg-Marquardt). See Algorithms 1 and 2. Approximations are obtained by means of truncated Taylor series, while the minimum is obtained by using second-order sufficient conditions [1, 24] (see step 7 of Algorithm 1). In particular, two approaches can be employed:

  (a) by truncating the Taylor series expansion of \(\mathbf {J}_{DA}\) at the second order, as in Newton's methods (including L-BFGS and Levenberg-Marquardt), following Newton's descent direction (see Algorithm 3);

  (b) by truncating the Taylor series expansion of \(\mathbf {J}_{DA}\) at the first order, as in Gauss-Newton's methods (including Truncated Gauss-Newton or Approximated Gauss-Newton), following the steepest descent direction, which is computed by solving the normal equations arising from the local Linear Least Squares (LLS) problem (see Algorithm 4).


In ROMS-IS4DVAR the NL-LS problem is solved using Gauss-Newton's method, where the solution of the normal equations system is obtained by applying a Krylov subspace iterative method (this task is also referred to as the inner loop, while the steps along the descent direction form the outer loop) (see Algorithm 6). IS4DVAR is described in Algorithms 5 and 6 [13].
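To make the outer/inner loop structure concrete, the following is a minimal sketch (not ROMS code) of a Gauss-Newton iteration for a generic NL-LS problem, with the normal equations of each local LLS problem solved by a Krylov method (a plain conjugate gradient here, standing in for the Lanczos-based solver of IS4DVAR); all names, tolerances and the toy residual are illustrative assumptions.

```python
import numpy as np

def cg(matvec, b, max_iter=50, tol=1e-10):
    """Plain conjugate gradient for the SPD system matvec(x) = b (the inner loop)."""
    x = np.zeros_like(b)
    r = b.copy()            # residual of the linear system (x starts at zero)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def gauss_newton(F, J, u0, n_outer=5, n_inner=50):
    """Outer loop: relinearize F around u; inner loop: CG on the normal equations."""
    u = u0.astype(float)
    for _ in range(n_outer):
        r, A = F(u), J(u)                                # residual and Jacobian at u
        du = cg(lambda v: A.T @ (A @ v), -A.T @ r, max_iter=n_inner)
        u = u + du                                       # step along the descent direction
        if np.linalg.norm(du) < 1e-12 * max(1.0, np.linalg.norm(u)):
            break
    return u

# toy usage with a mildly nonlinear residual map (all names illustrative)
obs = np.array([1.0, 2.0, 3.0])
F = lambda u: u + 0.1 * u**2 - obs          # residual
J = lambda u: np.diag(1.0 + 0.2 * u)        # its Jacobian
print(gauss_newton(F, J, np.zeros(3)))
```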

Figure 1 reports the flowchart of the IS4DVAR Algorithm as it is implemented in ROMS, and Fig. 2 describes the software architecture of ROMS. For details, see the description in [18].


2 Performance Assessment of Parallel IS4DVAR Algorithm

As IS4DVAR is part of ROMS, the parallelization strategy implemented for the IS4DVAR algorithm takes advantage of the one implemented in ROMS. In other words, each part of IS4DVAR which depends on the forecasting model (in particular, the NLROMS, TLROMS and ADROMS modules) implements the two-dimensional DD (2D-DD) approach (i.e. a coarse-grain parallelism), while the Preconditioner and Lanczos Algorithm modules implement the one-dimensional DD (1D-DD) approach (i.e. a fine-grain parallelism). All I/O is performed by the master process unless MPI-I/O is explicitly requested. Concerning the 1D-DD approach, the parallelism in ROMS is introduced (in step (vi) of Fig. 1) by distributing the data among a 1D processor grid blocked by rows (see the parallel version of the ARPACK library [15] for details). We observe that this is the most suitable way to reduce communication overheads in the linear algebra operations required by the concurrently performed Lanczos algorithms.
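The following sketch (not ROMS code; the toy grid and all names are assumptions) illustrates the difference between the two decompositions: 2D-DD splits the horizontal grid over an \(NtileI \times NtileJ\) processor grid, while 1D-DD blocks it by rows only, with the vertical direction kept whole in both cases.

```python
def block(n, nparts, p):
    """[start, end) index range owned by part p of a 1D block split of n points."""
    base, rem = divmod(n, nparts)
    start = p * base + min(p, rem)
    return start, start + base + (1 if p < rem else 0)

def tile_2d_dd(N1, N2, NtileI, NtileJ, i, j):
    """Horizontal index ranges of tile (i, j) in a 2D domain decomposition."""
    return block(N1, NtileI, i), block(N2, NtileJ, j)

def tile_1d_dd(N1, NtileI, i):
    """Index range of tile i in a 1D (row-blocked) decomposition."""
    return block(N1, NtileI, i)

# toy horizontal grid of N1 x N2 points (the N3 vertical levels stay whole)
N1, N2 = 90, 154
print(tile_2d_dd(N1, N2, NtileI=4, NtileJ=4, i=1, j=2))  # one 2D-DD tile
print(tile_1d_dd(N1, NtileI=4, i=1))                     # one 1D-DD row block
```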

Fig. 1.

A flow chart illustrating the ROMS-IS4DVar algorithm, where NLROMS, TLROMS and ADROMS implement the ROMS nonlinear model, the tangent linear model (first-order Taylor approximation of ROMS) and the adjoint model (for computing the adjoint operator of ROMS) [18]. Parameters k and m (where \(k \ll m\)) are the numbers of steps of the linearization (first-order Taylor approximation) and of the minimization algorithm (using the Lanczos algorithm), respectively.

Fig. 2.

ROMS software architecture. Version 3.6 of ROMS, the latest available release, has been installed. The ROMS source code is distributed only through Subversion (SVN). Its parallel framework includes both shared-memory (OpenMP) and distributed-memory (MPI) paradigms. The Middleware level includes a copy of ARPACK in the Lib directory, which is used for the adjoint-based algorithms. The Lib directory also contains a copy of the Model Coupling Toolkit (MCT), which is needed to couple ROMS to other models. ROMS-IS4DVAR is written in F90/F95 with dynamic memory allocation, which allows multiple levels of nesting and/or composed grids. Finally, ROMS-IS4DVAR makes extensive use of C preprocessing (CPP) to configure its various numerical and physical options. ROMS supports serial, OpenMP, and MPI computations, with the user choosing among them at compile time; here we focus on compiling for MPI. Details about the parallelization strategy and the variables involved are available at www.myroms.org/wiki/Parallelization.

Let us briefly model the coarse- and fine-grain parallelization strategies implemented in the IS4DVAR Algorithm.

Definition 1

(1D and 2D Domain Decomposition Strategy). Let the domain \(\varOmega \) be decomposed into Ntile subdomains (also called tiles) with overlap areas, where

$$ Ntile=NtileI \times NtileJ . $$

If

$$ size(\varOmega )= N= N_1 \times N_2 \times N_3, $$

then in 2D-DD

$$ size_{2D-DD}(tile)=\frac{N_1}{NtileI}\times \frac{N_2}{NtileJ}\times N_3 ;$$

while in 1D-DD, it is

$$ size_{1D-DD}(tile)=\frac{N_1}{NtileI}\times N_2\times N_3. $$

\(\spadesuit \)

The surface \(S(N,Ntile)\) of each 2D-DD tile is

$$\begin{aligned} S(N,Ntile)= 2\left( \frac{N_1}{NtileI}\times \frac{N_2}{NtileJ}\right) + 2 \left( 2\frac{N_1}{NtileI}+2\frac{N_2}{NtileJ} \right) \times N_3 \end{aligned}$$
(1)

and the volume is

$$\begin{aligned} V(N,Ntile)= \frac{N_1}{NtileI}\times \frac{N_2}{NtileJ} \times N_3 . \end{aligned}$$
(2)

If the 2D-DD is uniform, i.e. if \( N_1=N_2=N_3=M\) and \(NtileI=NtileJ=p\), then from (1) and (2) it follows that

$$\begin{aligned} S(M,p) = O\left( 2\frac{M^2}{p^2}+ 2\frac{M^2}{p} \right) , \quad V(M,p)=O \left( \frac{M^3}{p^2} \right) . \end{aligned}$$
(3)
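As a numerical illustration of (1)-(3), the sketch below evaluates the surface and volume of a 2D-DD tile for a uniform toy grid and shows how the surface-to-volume ratio grows with p (the value of M is an assumption chosen for illustration only).

```python
# Surface (1) and volume (2) of a 2D-DD tile for a uniform grid
# N1 = N2 = N3 = M with NtileI = NtileJ = p; M is a toy value.
def surface(N1, N2, N3, NtileI, NtileJ):           # Eq. (1)
    return 2 * (N1 / NtileI) * (N2 / NtileJ) \
        + 2 * (2 * N1 / NtileI + 2 * N2 / NtileJ) * N3

def volume(N1, N2, N3, NtileI, NtileJ):            # Eq. (2)
    return (N1 / NtileI) * (N2 / NtileJ) * N3

M = 100
for p in (1, 2, 4, 8, 16):
    S, V = surface(M, M, M, p, p), volume(M, M, M, p, p)
    # the surface-to-volume ratio grows linearly with p, cf. Eq. (3)
    print(f"p={p:2d}  S/V = {S / V:.3f}")
```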

As communication is much slower than computation, and this gap is expected to keep widening over time, we address the performance of the IS4DVAR Algorithm by computing an estimate of the communication overhead, denoted \(Oh_{com}\). In particular, we investigate the behavior of \(Oh_{com}\) in terms of the surface-to-volume ratio for the 2D-DD approach.

Definition 2

(Surface-to-volume). The surface-to-volume ratio is a measure of the amount of data exchange (proportional to the surface area of the subdomain) per unit of computation (proportional to the volume of the subdomain).

Definition 3

(Communication Overhead). Let \(T_{com}\) denote the total communication time and \(T_{flop}\) the total computation time, then

$$Oh_{com}:=\frac{T_{com}}{T_{flop}}.$$

Proposition 1

Let \(t_{com}\) be the sustained communication time for sending/receiving one datum in IS4DVAR and \(t_{flop}\) the sustained execution time of one floating-point operation in IS4DVAR, such that (see Footnote 1)

$$ t_{com}=\alpha t_{flop},\quad \alpha =10^q,\quad q > 1 .$$

For the IS4DVAR Algorithm it holds that

$$\begin{aligned} Oh_{com}<1 \Leftrightarrow 0< k< r-q, \quad q \in ]0,r[. \end{aligned}$$
(4)

Proof:

For each m, from (3) it follows that

$$Oh_{com}=\frac{T_{com}}{T_{flop}}=\frac{S\cdot t_{com}}{V\cdot t_{flop}}= \frac{(2+2p)\, t_{com}}{M\cdot t_{flop}}.$$

Writing \(N=10^{r}\), \(p=10^k\) and \(t_{com}=10^q\, t_{flop}\), we have

$$Oh_{com}= O \left( \frac{10^q(2+2p)}{10^r}\right) = O\left( 10^{q-r}(2+2\cdot 10^{k}) \right) = O(10^{q-r+k})$$

i.e., the bound in (4).

The expression in (4) states that, in order to increase the upper bound on \(k=\log_{10} (p)\), the problem size should increase and/or the ratio of the sustained unitary communication time to the sustained computation time (i.e. the parameter \(\alpha = 10^q\)) should decrease. Since the experiments considered here use realistic configurations of medium size, the performance results will confirm that the efficiency degrades below 50% for \(p>16\).
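As a quick numerical check of (4), the sketch below evaluates the order-of-magnitude estimate \(Oh_{com} \approx 10^{q-r+k}\) for the parameter values discussed in Sect. 3 (\(r=4\), \(q \simeq 2.8\) and \(r=5\), \(q \simeq 3.8\)); in both cases the overhead reaches order one around \(p = 16\), consistently with the observed efficiency degradation.

```python
import math

# Order-of-magnitude estimate of Oh_com from (4); r and q follow the values
# quoted in the experiments (Sect. 3), p is the number of tiles/processors.
def oh_com(p, r, q):
    return 10 ** (q - r + math.log10(p))

for r, q in [(4, 2.8), (5, 3.8)]:
    for p in (4, 16, 64):
        print(f"r={r}  q={q}  p={p:3d}  Oh_com ~ {oh_com(p, r, q):.2f}")
```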

3 Experiments

We describe the configurations we have chosen for testing and analysing the performance of IS4DVAR on the California Current System, the Caspian Sea and the Angola Basin. All the experiments are carried out on the CX2 (Helen) computing system provided by Imperial College London (see Footnote 2). For each experiment, we report strong scaling results in terms of execution time, speed up and efficiency. In the tables, the variable proc refers to the number of processors involved, \(T_{p}\) is the execution time, \(Ntile=p\), \(S_p=\frac{T_1}{T_p}\), and \(E_p=\frac{S_p}{p}\); a short sketch of how these metrics are computed is given after the list of test cases. Finally, we use the mapping \( proc \leftrightarrow MPI\ process \). The test cases we have chosen refer to:

  • TC1: the California Current System (CCS) with 30 km (horizontal) resolution and 30 levels in the vertical direction. The global grid is then:

    $$N=54\times 53\times 30=8.586 \times 10^4 .$$
  • TC2: the Caspian Sea with 8 km resolution and 32 vertical layers. The vertical resolution is set with a minimum depth of 5 m. Then, problem dimension in terms of the grid/mesh size consists of

    $$ N= 90 \times 154 \times 32=4.43520 \times 10^5$$

    grid points. A set of sensitivity experiments (not shown) suggests that \(k=1\) and \(m=50\). In each of these experiments, only one assimilation cycle (4 days) is conducted.

  • TC3: the Angola Basin with 10 km resolution and 40 terrain-following vertical levels. The vertical levels are stretched so as to increase resolution near the surface. The model domain with highlighted bathymetry is shown in Fig. 1. The experiments consist of a 4-day window using IS4DVar, assimilating satellite Sea Surface Temperature (SST), in situ T&S profiles and Sea Surface Height (SSH) observations from the 1st to the 5th of January 2013 (Table 1).
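For completeness, here is a minimal sketch of how the strong-scaling metrics \(S_p\) and \(E_p\) are obtained from the measured execution times; the timing values below are placeholders, not the CX2 measurements reported in Table 1.

```python
# Strong-scaling metrics: S_p = T_1 / T_p, E_p = S_p / p.
# The timings below are hypothetical placeholders, not measured values.
timings = {1: 1000.0, 4: 280.0, 16: 95.0, 64: 48.0}   # T_p in seconds
T1 = timings[1]
for p in sorted(timings):
    Sp = T1 / timings[p]
    Ep = Sp / p
    print(f"p={p:3d}  T_p={timings[p]:7.1f} s  S_p={Sp:6.2f}  E_p={Ep:5.2f}")
```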

Table 1. Strong scaling results for TC1, TC2 and TC3 on CX2. As \(k=\log_{10} (p)=1.2\) and \(r=4\), the experimental results confirm the upper bound in (4).

In all the experiments we use realistic configurations of medium size, and the performance results show that the efficiency degrades below 50% for \(p>16\). As \(k=\log_{10} (p)=1.2\) and \(r=4\) or, at most, \(r=5\), the experimental results confirm the upper bound in (4), assuming that for the ROMS implementation of the IS4DVAR Algorithm the ratio of the sustained unitary communication time to the sustained computation time is \(\alpha = 10^q\) with \(q \simeq 2.8\) or \(q \simeq 3.8\).

4 Conclusion and Future Work

The analysis showed that the surface-to-volume ratio of the current parallelization strategy of the IS4DVAR Algorithm strongly limits the performance of the ROMS software, as it does not match the features of emerging architectures, where the unitary sustained communication time should be comparable to the computation time. In line with these issues, and relying on previous activities of the authors [23], the approach we are going to adopt in the NASDAC research activity meets the following demand: the parallelization of the IS4DVAR Algorithm has to be considered from the beginning, which means at the level of the numerical model [6, 9, 10].

As a next step, we will focus on infrastructure improvements, with particular regard to data movement [5, 7, 11, 14, 16, 17, 21], in order to implement a reliable mechanism able to move acquired data for processing, publishing and usage, with techniques devoted to improving scalability on HPC systems [12].