
D2.1.4 IST-033576
B.5 Checkpointing/restarting an MPI application
The best source of information on any BLCR-aware MPI implementation is the
documentation provided with that implementation. However, here are some hints
that may be helpful.
B.5.1 Checkpoint/restart with LAM/MPI
• See the LAM/MPI documentation for the most detailed information on using
LAM/MPI with BLCR.
• When building your own LAM/MPI, do NOT configure LAM in debug
mode, i.e. do not pass --with-debug to LAM’s configure script.
• To start a checkpointable LAM/MPI application, simply run it with the regular
LAM mpirun launcher:
% mpirun C hello_mpi
Note: you may need to start up the LAM environment first by running lamboot
before starting your application.
• To checkpoint the entire MPI application (across all nodes and processes),
simply run
% lamcheckpoint 12305
Where ’12305’ is the process ID of the mpirun command. Do not pass the
pid of your MPI executable: when mpirun is checkpointed, it automatically
takes care of transitively checkpointing all of the processes involved in the
MPI job.
• To restart your MPI job, simply run the following against the mpirun
process’s context file:
% lamrestart context.12305
All processes in the MPI job will be restarted as they were at checkpoint
time.
• By default, LAM places the checkpoint files for the MPI application
processes in your $HOME directory. To have them placed elsewhere, override
the default location where LAM stores checkpoints; the LAM/MPI documentation
describes how.
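The checkpoint and restart steps above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names (run, context_file, checkpoint_job, restart_job) and the DRY_RUN variable are hypothetical and are not part of LAM/MPI or BLCR. The command forms are copied from the hints above, and a real installation may need additional options, so check the LAM/MPI documentation before relying on it. By default the helper only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper around the checkpoint/restart hints above.
# DRY_RUN and all function names are illustrative assumptions.

DRY_RUN=${DRY_RUN:-1}   # 1 = only print commands, 0 = print and execute them

run() {
    # Print the command line; execute it only when DRY_RUN=0.
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

context_file() {
    # Context file name for a given mpirun PID, following the
    # context.12305 example in the text.
    echo "context.$1"
}

checkpoint_job() {
    # $1 must be the PID of mpirun, not of the MPI executable.
    run lamcheckpoint "$1"
}

restart_job() {
    # Restart the whole job from the context file of mpirun PID $1.
    run lamrestart "$(context_file "$1")"
}
```

With the default DRY_RUN=1, calling checkpoint_job 12305 prints "lamcheckpoint 12305" and restart_job 12305 prints "lamrestart context.12305". Setting DRY_RUN=0 would also execute the printed commands, which assumes a running LAM environment.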
Comments on this Manual