
D2.1.4 IST-033576
B.5 Checkpointing/restarting an MPI application
The best source of information on dealing with any BLCR-aware MPI
implementation is the documentation provided with that MPI. However, here are
some hints that may be helpful.
B.5.1 Checkpoint/restart with LAM/MPI
See the LAM/MPI documentation for the most detailed information on using
LAM/MPI with BLCR.
When building your own LAM/MPI, do NOT configure LAM in debug
mode, i.e. do not pass --with-debug to LAM's configure script.
To start a checkpointable LAM/MPI application, simply run it with the
regular LAM mpirun launcher:
% mpirun C hello_mpi
Note: you may need to start up the LAM environment first by running
lamboot before starting your application.
To checkpoint the entire MPI application (across all nodes and processes),
simply run
% lamcheckpoint 12305
where '12305' is the process ID of the mpirun command. Do not pass the
pid of your MPI executable: when mpirun is checkpointed, it automatically
takes care of transitively checkpointing all of the processes involved in the
MPI job.
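One way to get hold of the right process ID is to record it at launch time. The following is a minimal POSIX shell sketch, not part of LAM or BLCR: the helper name start_job and its log file argument are assumptions made for illustration. It runs a command in the background and prints the PID of that command, which for an mpirun launch is the PID the checkpoint command expects.

```shell
# start_job LOG CMD [ARGS...]
# Hypothetical helper (not a LAM or BLCR tool): run CMD in the background,
# send its stdout and stderr to LOG, and print the PID of the background
# process on stdout.
start_job() {
    log=$1
    shift
    "$@" >"$log" 2>&1 &
    echo "$!"
}

# Example use with the commands from this section (assumes LAM is booted):
#   pid=$(start_job hello_mpi.log mpirun C hello_mpi)
#   lamcheckpoint "$pid"
```

Redirecting the job's output to a file keeps the command substitution from waiting on the background process.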
To restart your MPI job, simply run the following against the mpirun
process's context file:
% lamrestart context.12305
All processes in the MPI job will be restarted as they were at checkpoint
time.
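Before restarting, it can help to confirm that the context file for the mpirun process is actually present. The sketch below assumes the context.<pid> naming used in the example above and, as a further assumption, looks in the current directory unless another directory is given. Both helper names are hypothetical.

```shell
# context_file_for PID [DIR]
# Hypothetical helper: print the path of the context file for a given
# mpirun PID, assuming the context.<pid> naming from the example above.
# DIR defaults to the current directory (an assumption).
context_file_for() {
    printf '%s/context.%s\n' "${2:-.}" "$1"
}

# restartable PID [DIR]
# Succeed only if that context file exists and is readable.
restartable() {
    [ -r "$(context_file_for "$@")" ]
}

# Example use (assumes LAM is booted):
#   restartable 12305 && lamrestart "$(context_file_for 12305)"
```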
By default, LAM places the checkpoint files for the MPI application
processes in your $HOME directory. If you want them placed elsewhere,
override the default location in the filesystem where LAM stores
checkpoints; see the LAM/MPI documentation for the setting that controls it.
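Whichever location is chosen, the directory has to exist and be writable before a checkpoint is taken. The following sketch is a simple pre-flight check in POSIX shell. It does not set any LAM option, and the helper name usable_ckpt_dir is an assumption for illustration.

```shell
# usable_ckpt_dir DIR
# Hypothetical pre-flight check: succeed only if DIR is an existing
# directory that the current user can write to and enter.
usable_ckpt_dir() {
    [ -d "$1" ] && [ -w "$1" ] && [ -x "$1" ]
}

# Example use:
#   usable_ckpt_dir "$HOME" || echo "checkpoint directory is not usable" >&2
```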
47/49 XtreemOS–Integrated Project