
Section 12 Exercise: Magnetic Relaxation -- Supercomputing


Subsections                               approx time

  CVS                                           1 min
  Understanding the code pieces                60 min
  Testing locally                              30 min
  Accessing the DCSC-KU compute clusters       20 min
  Submitting batch jobs                        10 min
  Quick-look at output data                    55 min
  Home work                                   2 hours

NOTE:

  • It's time for course evaluations! You can find the evaluation form under the course on Absalon. Please fill it in, even if (especially if ;-!) you are happy with the course!

  • The Thursday night deadlines do not apply to Project 2, and there are no specific internal deadlines for the various parts.


CVS

[about 1 minute]

To extract the files for this week's exercise, do
    cd ~/ComputerPhysics
    cvs update -d

In case of problems, see the CVS update help page.


Understanding the code pieces

[about 60 minutes]

The magnetic relaxation simulation code is broken up into functional pieces, each one no longer than can be printed on a (in some cases double-sided ;-) sheet of paper. It is a good idea to actually print out the most important parts of the code; specifically main.f90, grid.f90, timestep.f90, mhd.f90, and fourier_field.f90.

Please take your time, study one piece of code at a time, and answer the related questions:

main.f90

The listing of the main program illustrates both the main structure of the code and some formatting conventions. To start with the latter:
Formatting conventions:
In order to make the code both compact and well documented, most comments are placed at the end of the lines, following an exclamation mark (the comment sign) in column 80. There is nothing magic about the number 80 -- it is just a compromise that leaves enough space for both the commands and the comments. Current computer screens are large enough to easily show the whole lines.

To make it easier to find the relevant subroutines, the name of the file that contains each subroutine is given (in parentheses) at the end of the line.

Use of UPPER / lower case is only to increase readability; case is not significant in Fortran. By writing PROGRAM, SUBROUTINE, FUNCTION, and the corresponding END statements in upper case it becomes easier to see the code structure.

The USE clauses (also in upper case to make them easier to spot) often have "only:" clauses, listing explicitly the variables that come from each module. This is not necessary, but makes it easier to track the use of shared variables.
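
As a minimal, invented illustration of these conventions (the module, program, and variable names below are made up, not taken from the actual code):
  MODULE demo_m                                              ! stand-in module (names invented)
    real, dimension(4,4,4) :: rho, drhodt                    ! a variable and its time derivative
  END MODULE demo_m
  PROGRAM demo
    USE demo_m, only: rho, drhodt                            ! explicit "only:" list          (demo_m.f90)
    rho = 1.0;  drhodt = 0.0                                 ! end-of-line comments,
    print *, sum(rho)                                        ! aligned in a fixed column
  END PROGRAM demo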

The print statement near the end of the listing computes and prints the number of microseconds needed to update one mesh point (the wc variables measure wall clock time in seconds, while the variable it holds the number of time steps the code has taken).
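
As a rough sketch of the quantity being printed (the variable names and the test values below are assumptions, not taken from main.f90):
  PROGRAM timing_demo
    implicit none
    integer :: it = 10                                       ! number of time steps taken
    integer :: mx = 32, my = 32, mz = 32                     ! mesh dimensions
    real    :: wc = 2.5                                      ! elapsed wall clock time [s]
    print *, 'microsec/point/step:', 1e6*wc/(real(it)*mx*my*mz)
  END PROGRAM timing_demo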

If an experiment with 100x100x100 mesh points takes 1 hour to run 900 time steps, how many microseconds per point does it use? Credits: 5/-2

Assuming that the code parallelizes perfectly (so N processors can do an N times larger problem in the same time as one processor uses on the smaller problem):

How many processors working in parallel are needed to run a 200x200x200 version of the same experiment in 1 hour, assuming that 1800 time steps are needed? Credits: 5/-2

grid.f90

The grid.f90 file contains a module grid_m, with a list of parameters and a number of 3-D variables (arrays) that correspond to scalar and vector variables which occur in the partial differential equations. In addition there are some scratch variables, which are used and re-used in different places in the code.
Space requirements:
The code contains about 53 three-dimensional arrays, which may seem like a lot (and certainly some of them could be saved by using and re-using the scratch arrays more systematically). However, memory is almost never a bottleneck in large-scale numerical simulations. This may seem like a strange statement, since a lot of memory may certainly be needed. The point is, however, that if the problem is large then a large number of time steps will be needed, and hence the problem size held by each processor must be small enough that one time step does not take too long.
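
As a back-of-the-envelope sketch of the corresponding memory estimate (assuming 4-byte reals; the precision actually used by the code may differ):
  PROGRAM memory_demo
    implicit none
    integer, parameter :: narrays = 53                       ! 3-D arrays declared in grid.f90
    integer :: mx = 48, my = 48, mz = 48                     ! mesh points held by one processor
    real    :: mbytes
    mbytes = narrays * real(mx)*my*mz * 4.0 / 1024.0**2      ! assuming 4 bytes per real
    print *, 'approximate memory per processor [MB]:', mbytes
  END PROGRAM memory_demo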

Questions (10% accuracy):

Assuming the code actually uses 53 arrays, how many megabytes (MB, defined as 1024^2 bytes) does it need per processor in the first case mentioned above (one processor holding 100x100x100 mesh points)? Credits: 5/-2

How about the second case (the same experiment at twice the resolution)? Credits: 5/-2

Consider the answers -- there is a point ;-!

timestep.f90

The timestep.f90 file contains subroutines related to the time stepping. There is no need to go into details about the particular method used (there is a reference to a journal paper in the code comments if you are interested). The particular advantage of this method is that it uses no extra space beyond what is needed for the (eight) variables and the (eight) time derivatives. The method uses three Runge-Kutta-like sub-steps, and performs the linear combination of time derivative values "in place" (cf. mhd.f90; e.g. "drhodt = drhodt + ...").
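
The "in place" idea can be illustrated with a generic low-storage Runge-Kutta sketch; the coefficients and the toy right-hand side below are invented, and the real sub-step structure is the one in timestep.f90:
  PROGRAM lowstorage_demo
    implicit none
    real, dimension(8,8,8) :: rho, drhodt                    ! one variable and its time derivative
    real, parameter :: alpha(3) = (/ 0.0, -0.6, -1.4 /)      ! illustrative coefficients only
    real, parameter :: beta(3)  = (/ 0.3,  0.8,  0.5 /)      ! illustrative coefficients only
    real    :: dt = 0.01
    integer :: isub
    rho = 1.0;  drhodt = 0.0
    do isub=1,3                                              ! three sub-steps, no extra scratch arrays
      drhodt = alpha(isub)*drhodt - 0.1*rho                  ! "in place" combination; -0.1*rho is a toy rhs
      rho    = rho + beta(isub)*dt*drhodt                    ! the variable is also updated in place
    end do
    print *, 'mean rho after one step:', sum(rho)/size(rho)
  END PROGRAM lowstorage_demo
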
Courant conditions:
The subroutine Courant is called in the first sub-step, checking that the (physical) time step is optimal; not too large, not too small. It does this by computing various rates-of-change, e.g. due to wave propagation, diffusion, and motions of steep density and energy transitions. The routine also tries to estimate the remaining time of execution, by comparing the time it has taken to get to the current point in time with the requested total experiment time. This can of course be somewhat misleading, especially if (as is the case here) the initial evolution is much faster than the subsequent evolution.
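
A hedged sketch of what a Courant-type time step limit looks like (the names, rates, and safety factor below are invented; the actual routine combines several such rates):
  PROGRAM courant_demo
    implicit none
    real :: dx    = 1.0/48.0                                 ! mesh spacing
    real :: cmax  = 2.0                                      ! largest signal (wave) speed
    real :: numax = 0.01                                     ! largest diffusion coefficient
    real :: cdt   = 0.3                                      ! dimensionless safety factor
    real :: dt
    dt = cdt * min(dx/cmax, dx**2/numax)                     ! limited by the fastest process
    print *, 'allowed time step:', dt
  END PROGRAM courant_demo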

Questions: Assume that there are density gradients (e.g. moving shock fronts) where the density changes by a factor of five from point to point, and that we only allow a 20% change of the density from time step to time step. Then we need about log(5)/log(1.2) time steps to move a shock front across one cell. If we would like shock fronts to have enough time to move across 200 cells (also called "zones" / "grids" / "meshes") ...

... how many time steps do we need to run (10% accuracy)? Credits: 5/-2

Given the answer to the previous question, and the speeds assumed in the 1st and 2nd questions (900 steps in one hour) ...

... how many wall clock hours is it going to take? Credits: 5/-2

io.f90

The io.f90 file is very short and simple, and essentially only reads in parameters that determine how often snapshots of the solution are written to disk. The actual code that writes to disk depends on whether the code runs in parallel (using MPI) or not, and that part of the code is therefore placed in the Mpi/support.f90 and NoMpi/support.f90 files, respectively.
Wall clock:
Notice the wallclock() function, which returns the wall clock time since the start of the run. It uses the standard Fortran intrinsic subroutine system_clock.
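
A minimal sketch of how such a function can be built on system_clock (this is not the code's actual implementation, which may handle details such as counter wrap-around differently):
  MODULE wallclock_m
    implicit none
    integer, save :: count0 = -1, count_rate = 1
  contains
    FUNCTION wallclock () result (seconds)                   ! wall clock seconds since the first call
      real :: seconds
      integer :: count
      if (count0 < 0) call system_clock (count0, count_rate) ! remember the start of the run
      call system_clock (count)
      seconds = real(count-count0)/real(count_rate)
    END FUNCTION wallclock
  END MODULE wallclock_m
  PROGRAM wallclock_demo
    USE wallclock_m, only: wallclock
    print *, 'elapsed:', wallclock(), 'seconds'
  END PROGRAM wallclock_demo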

Question: Assuming that you are going to save 100 snapshots of the type handled by write_snapshot / read_snapshot, and assuming that the dimensions of each snapshot (with eight arrays) are 200x200x200:

How many gigabytes (GB, defined as 1024^3 bytes) of disk space will you need (10% accuracy)? Credits: 5/-2

mhd.f90

The mhd.f90 file contains the most important part of the program: the code that corresponds to the partial differential equations that we are going to solve. Compare the code piece that implements, for example, the computation of the electric current with the analytical definition, and the code implementation of the 'induction equation' for dB/dt with the equation itself.
We are simulating the relaxation of a theoretical magnetic field configuration with little relation to anything realistic, so for this particular experiment we do not worry much about the actual physical quantities (in CGS or SI units). The code is written using scaled units (also called code units), where everything is of the order of unity. To translate these to real physical numbers it is sufficient to choose three units of measurement -- for example units of time, length, and energy -- everything else can be derived from these. This also shows that, by changing one or more of these units, the result can be scaled to a totally different parameter regime.
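
To see concretely what "comparing with the analytical definition" means for the current, here is a self-contained sketch that computes J_z = (curl B)_z with simple second-order central differences on a periodic mesh, taking the scaled-unit form J = curl B (constants such as mu_0 absorbed into the units); this is only an illustration, not the differencing actually used in mhd.f90:
  PROGRAM current_demo
    implicit none
    integer, parameter :: m = 32
    real, parameter :: pi = 3.1415926536
    real, dimension(m,m,m) :: Bx, By, Jz
    real :: dx
    integer :: i, j, k
    dx = 2.0*pi/m                                            ! periodic box of size 2*pi
    do k=1,m; do j=1,m; do i=1,m
      Bx(i,j,k) = -sin((j-1)*dx)                             ! test field with (curl B)_z = cos(x)+cos(y)
      By(i,j,k) =  sin((i-1)*dx)
    end do; end do; end do
    Jz = (cshift(By,1,1)-cshift(By,-1,1))/(2.0*dx) &         ! Jz = dBy/dx - dBx/dy
       - (cshift(Bx,1,2)-cshift(Bx,-1,2))/(2.0*dx)
    print *, 'max Jz =', maxval(Jz), ' (analytically 2)'
  END PROGRAM current_demo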

fourier_field.f90

The fourier_field.f90 file contains code that defines the initial magnetic field configuration. From Eq. (125) in the notes, it can be seen that there is significant liberty in choosing the complex amplitudes of the vector potential. Not all combinations are equally lucky choices when looking for a magnetic field with lots of complexity. We have therefore built in a fixed setup, where one can only change the wave numbers and the amplitudes of the vector potential.

We have here limited the series to three k vectors with corresponding amplitudes. Using only one level creates a large-scale magnetic field that contains 8 nulls within the 3D domain. The nulls all have real eigenvalues and a locally symmetric structure. Adding more wave numbers increases the complexity of the field by introducing new nulls, scattered as satellites around the major nulls. Some of these carry a strong current through them and therefore appear as spiralling nulls.

By changing the amplitudes together with the wave numbers one can change the positions of the nulls and possibly also the type (real/complex) of some of them. So feel free to experiment with these parameters -- more on this under the visualisation part later.
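
A toy sketch of the kind of construction involved, evaluating a vector potential built from three sine modes at a single point (the amplitudes, wave vectors, and the use of a plain sin() are invented here; the actual form is the one in Eq. (125) and fourier_field.f90):
  PROGRAM fourier_modes_demo
    implicit none
    real, parameter :: pi = 3.1415926536
    real :: amp(3,3), kvec(3,3), x(3), Avec(3)               ! 3 modes, each with 3 components
    integer :: n
    amp(:,1) = (/1.0, 0.0, 0.0/);  kvec(:,1) = (/0.0, 1.0, 1.0/)
    amp(:,2) = (/0.0, 0.5, 0.0/);  kvec(:,2) = (/1.0, 0.0, 2.0/)
    amp(:,3) = (/0.0, 0.0, 0.3/);  kvec(:,3) = (/2.0, 1.0, 0.0/)
    x = (/0.1, 0.2, 0.3/)                                    ! a sample point in a unit box
    Avec = 0.0
    do n=1,3                                                 ! sum over the three k vectors
      Avec = Avec + amp(:,n)*sin(2.0*pi*dot_product(kvec(:,n), x))
    end do
    print *, 'A at the sample point:', Avec                  ! the initial B is then the curl of A
  END PROGRAM fourier_modes_demo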

NoMpi/support.f90, Mpi/support.f90

These files contain support routines that differ between the single processor case and the case where MPI is used for parallelization. There is no need to read and understand the details of these routines, but if you would like to understand how MPI is used there are code pieces in the Mpi/support.f90 file that can be used as examples and templates. Also, since the NoMpi/support.f90 file contains simpler, non-parallel versions of the same routines, you can see there what the MPI code is supposed to do.
Parallel file input / output:
A case in point is the input/output handling, where the MPI version contains a simple interface that replaces the direct-access standard Fortran code in the non-MPI case.
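
If you want a minimal, self-contained MPI starting point before digging into Mpi/support.f90 (the program below is generic MPI, not code taken from that file):
  PROGRAM mpi_hello
    implicit none
    include 'mpif.h'                                         ! or "USE mpi" with most modern MPI libraries
    integer :: rank, nproc, ierr
    call MPI_Init (ierr)
    call MPI_Comm_rank (MPI_COMM_WORLD, rank, ierr)          ! which process am I?
    call MPI_Comm_size (MPI_COMM_WORLD, nproc, ierr)         ! how many processes in total?
    print *, 'hello from rank', rank, 'of', nproc
    call MPI_Finalize (ierr)
  END PROGRAM mpi_hello
Compile it with mpif90 and launch it with mpirun, exactly as described for ffMpi.x further down.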



Testing locally

[about 30 minutes]

Once you have gone through the code and answered the questions above it is time to start using the code. To compile, as usual we just do
 make -j
If you are doing this at home, or on some machine where you need to use a different compiler, check if there is a corresponding file in the directory Configure:
 ls -l Configure/*.make
If there is, do for example
 make -j COMPILER=g95
If there isn't, just copy and edit one of the existing files so it suits the compiler you have access to. There should be no problem compiling the code, and an executable ffNoMpi.x should be produced (contact a teacher if you have a problem).

Take a moment and look at how the compiler choice is implemented in the Makefile, using an include statement that tries to read Configure/$(COMPILER).make.

Running the Magnetic Relaxation test

Open the input.txt file and read the comments in it. This file contains namelists with values of parameters that control the code and the experiments. The file is divided into sections, with a first section defining a very simple setup with only a single wave number in the definition of the initial magnetic field. Try running the experiment without changing the parameters:
 ./ffNoMpi.x
This will only run the setup routine and save the initial snapshot in the file snapshot.dat. Looking at the Benergy output you will see a number, namely the magnetic energy of the given setup.

The init_model lines contain the definitions of the Fourier amplitudes and wave numbers; only the first of these lines is read by the code. Now change the order, so that the one with three non-zero amp values becomes the first line, and run the code again.

Notice that the magnetic energy is higher. The energy difference between these two states could possibly be the free energy that can be released in the relaxation process. How can this in principle be checked?

Let's make a quick test; the small data size should allow the program to run relatively fast even on a small laptop. Move the lines from the Very small (sanity) test section up above the previous definitions of the variables. Here nsnap is changed to 21, allowing the program to run until the internal time 2 -- depending on CPU speed this will take from one to ten minutes. Look at the development of the magnetic energy and the two following numbers, which give the rms and peak values of the plasma velocity in the domain. Now try to do the same for the case with only one wave number.

What is the ratio of the magnetic energy between the two runs just after the 21st snapshot (small/large number with 1% accuracy)? Credits: 5/-2

Is it surprising that even in the constant $\alpha$ case the magnetic energy decreases with time?


Accessing the DCSC-KU compute clusters

[about 20 minutes]

The next task is to log on to the DCSC-KU cluster, compile the code there, make a few tests, and then run a large experiment, the output of which you can use for the visualization part of the project that starts on Monday. This year it has been decided that we only get one shared user account on Steno. To get the userid and password, contact your teacher.

Then try

 ssh fend03.dcsc.ku.dk
and contact a teacher if you cannot login.
There are three hosts, fend01, fend02, and fend03, that are in principle identical. At present the latter two are the only ones that can be accessed from outside Steno.

Compiling the code at DCSC-KU

Once you have succeeded in logging in the first time, you should use the command mpi-selector to select the combination of compiler and MPI library (this only needs to be done once, and has already been done for this user!). Here's how to do it:
 mpi-selector --list
 mpi-selector --set openmpi_intel-1.2.8
 exec bash            ; (or "exec tcsh" if you use the tcsh shell)
The first line is not strictly necessary, but shows the choices that are available, so if you know about this you may be able to use "mpi-selector" on other systems (such as your own laptop/PC if you install mpi-selector there; it is a part of the standard Linux distribution).
For the 'advanced / adventurous home user' only: Note that, if you want to try this at home you should first understand that "mpi-selector" is only the "glue" that binds a compiler to an MPI-library. So, you need (as a system administrator on your own laptop) to tell "mpi-selector" where to find the compiler and the library (and these must of course first exist / be installed).

As only one user exists, the setup is slightly different. You need to create two directories where your data will be stored. The first is in the scratch directory, where the large data files go, to avoid having them backed up. The second is in the users directory, where your more general data will be stored.

  cd ~/scratch;  mkdir "computer_fysik_id"
  cd ~/users;  mkdir "computer_fysik_id"
Then retrieve the ComputerPhysics files with CVS, and compile the code again.

  cd ~/users/"computer_fysik_id"
 cvs -d :pserver:$USER@astro.ku.dk:/usr/local/cvs/comp-phys login
 cvs -d :pserver:$USER@astro.ku.dk:/usr/local/cvs/comp-phys checkout ComputerPhysics
 cd ComputerPhysics/12_Relaxation
 make -j
Before you test the code again, you must set up a data directory in which the large data file from the run is linked into scratch, to avoid it being backed up over night.
 cd ~/users/"computer_fysik_id"/ComputerPhysics/12_Relaxation     ; go to the experiment directory
 mkdir data                           ; make a directory to contain the data output
 cd data                              ; go into the data directory
 ln -s ~/scratch/"computer_fysik_id"/snapshot.dat .    ; make a soft link, physically placing snapshot.dat in scratch
 ln -s ../ffNoMpi.x .                 ; make links to the executable in the directory above
 ln -s ../ffMpi.x .
 cp ../input.txt .                    ; copy the input file into the data directory
There should be no problem, and you can try a short run again to verify the code is working:
 ./ffNoMpi.x                    ; run the code from the data directory

Compiling the code for MPI (Message Passing Interface)

To run the code in parallel you need to compile a parallel (MPI) version of the code:
 make MPI=Mpi -j
Take a look in the Makefile, and in the file Configure/ifort.make, to see what is going on here: when you set the MPI 'macro' to Mpi in the make call, make chooses files from the Mpi/ subdirectory instead of from the default NoMpi/ subdirectory. It also compiles with the command mpif90 instead of the normal ifort command.
mpif90 is just a "front end" to the ifort compiler, which adds a few default libraries and include directories as hidden options during compilation. Thus all options that can be used with ifort can also be used with mpif90.

Running the code interactively with MPI

To run the code in parallel interactively, just do, for example
 mpirun -np 2 ./ffMpi.x
This runs the code using two processes on the frontend host (fend03 is a double dual-core host, so it has four CPUs).
mpirun is a "front end" command that starts the job in a parallel environment. When run on just one node it needs no further options, but when running on several compute nodes it needs to have a list of host names -- see below.
You may notice that the code runs faster, but the small test runs very fast anyway, so to appreciate the difference
 time mpirun -np 4 ./ffMpi.x
 time mpirun -np 2 ./ffMpi.x
 time mpirun -np 1 ./ffMpi.x
 time ./ffNoMpi.x
The "micro-seconds-per-point" remains reasonably constant, but not perfectly so. The slight loss of speed per core when running on more cores is due to memory access speed limitations (if you notice a large reduction in speed it may be due to other people running on the same cores).

Now you are ready to try submitting batch jobs.


Submitting batch jobs

[about 10 minutes]

The directory Configure/DCSC-KU/ contains a README.txt file that explains how to submit jobs, and it also contains the template job scripts astro2-queue.sh and astro-queue.sh. To submit a test, do (still from the 12_Relaxation directory):
 cp Configure/DCSC-KU/astro2-queue.sh ./
 llsubmit astro2-queue.sh
Wait a few seconds and then try
 llqstat -u $USER
to see the status of the job, looking particularly at the column marked "S". An "I" for your job means it hasn't started yet, "R" means it is running, and if no job can be seen it has either finished or else hasn't yet entered the queue.

Looking at the output log file

The default job runs the code on 8 cores on 1 node, so it should give essentially the same result as before, and it should finish quickly. The output log file has the same name as the job number, followed by .log.

Editing the job script, re-submitting

Edit the job script so it runs on 2 nodes (16 cores) and re-submit; note that you need to change both the number of nodes and the number of processes, but not the number of processes per node. As we are all using the same user account, you need a simple way to see which job is yours: change the job_name entry, so that the value after the "=" is your userid. Then copy the file (either with copy/paste or with ssh) to the machine you are running the browser on and upload it for verification:

OK, I have updated the astro2-queue.sh file -- here it is! Locate and upload your astro2-queue.sh file: Credits: 5/-5

Setting up a large job

So far you should still be running with nsnap=5, or some similarly small number of snapshots, and with 48x48x48 mesh points.

Testing the larger job

To run with output in the data directory, copy your files there and submit:
 cp astro2-queue.sh data/
 cd data
 llsubmit astro2-queue.sh
If / when that works you are ready to start the production run:

Starting a large job

The large job will run for several hours and produce the data for the visualization part of the project. When everything seems to be OK, submit it:
 llsubmit astro2-queue.sh
 ... wait a bit ...
 llqstat -u $USER

Cancelling a job

If, for some reason, you want to cancel a job (kill a running job, or remove a job from the queue), find the job name with llqstat -u $USER and then do
 llcancel mgnt03.nnnnnnn             ; (do not include the trailing dot in the job name)



Quick-look at output data

[about 55 minutes]

While the job is running you can actually already start looking at the data, to make sure it is there and is increasing in size.


Home work

[about 2 hours]

You should log in to fend0?.dcsc.ku.dk from home (either directly or through scharff and/or lynx -- this requires that you have a fys account!), and make sure your run is OK.


$Id: index.php,v 1.18 2009/07/12 09:46:53 aake Exp $