Dear All,
I am running a long explicit analysis on a cluster. The simulation may need about 400 hrs to complete, but the maximum walltime I can request on the cluster is 150 hrs. After 150 hrs the analysis is stopped because the allotted cluster time has run out. I would like to continue the explicit analysis from the 150 hr point, where it actually stopped. How can I continue the simulation?
I tried running the same explicit analysis from the command prompt using the suspend and resume commands, and that works fine. I also tried the same thing across nodes, i.e. suspending the job from one node and resuming it from another node, but that doesn't work. I found in the Abaqus documentation that the recover option is generally used for restarting the same job, but somehow it doesn't work for me. Here I have attached my script files for running an Abaqus job.
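For reference, the three Abaqus commands mentioned above can be written out as a small shell sketch. The job name is the one used in the scripts in this post; the `run` helper and the `DRY_RUN` variable are hypothetical names introduced here only so that the commands can be printed without starting Abaqus. This illustrates the command syntax only and is not a verified fix for the cross-node problem.

```shell
#!/bin/sh
# Sketch only: prints (or runs) the Abaqus commands described above.
# The job name is taken from the scripts in this post.
job="Job-Dynamic-Model"

# Hypothetical helper: run a command, or just print it when DRY_RUN is non-empty.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Pause a running analysis and continue it later.
suspend_job() { run abaqus suspend job="$1"; }
resume_job()  { run abaqus resume job="$1"; }

# Restart an Abaqus/Explicit analysis that was terminated prematurely.
recover_job() { run abaqus job="$1" recover; }
```

With `DRY_RUN=1` set, `recover_job "$job"` only prints `abaqus job=Job-Dynamic-Model recover`. Without it, the command is executed in the current working directory, which is assumed to still contain the files written by the interrupted run.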
start.qsub script file
#!/bin/sh -login
#PBS -l nodes=1:ppn=1,walltime=00:10:00
#PBS -j oe
#PBS -W x=gres:explicit:5%abaqus:5
cd $PBS_O_WORKDIR
inputfile="Job-Dynamic-Model"
# Automatically calculate the number of processors
np=$(wc -l < "$PBS_NODEFILE")
module unload mvapich
module load abaqus_parallel
# Make a temporary scratch space (this should be on /mnt/scratch)
scratch=/mnt/scratch/${USER}/${PBS_JOBID}
export TMPDIR=$scratch
mkdir -p $scratch
# Change to the working directory
cd ${PBS_O_WORKDIR}
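# Run abaqus in the background and keep its process ID
abaqus job=$inputfile recover cpus=$np interactive &
PID=$!
# Wait 600 s (10 minutes, the same as the requested walltime)
sleep 600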
# Remove scratch space
rm -rf $scratch
restart.qsub script file
#!/bin/sh -login
#PBS -l nodes=1:ppn=1,walltime=00:10:00
#PBS -j oe
#PBS -W x=gres:explicit:5%abaqus:5
cd $PBS_O_WORKDIR
inputfile="Job-Dynamic-Model"
# Automatically calculate the number of processors
np=$(wc -l < "$PBS_NODEFILE")
module unload mvapich
module load abaqus_parallel
# Make a temporary scratch space (this should be on /mnt/scratch)
scratch=/mnt/scratch/${USER}/${PBS_JOBID}
export TMPDIR=$scratch
mkdir -p $scratch
# Change to the working directory
cd ${PBS_O_WORKDIR}
# Run abaqus
# abaqus restartjoin originalodb=odb-file-name
# restartodb=odb-file-name
# [copyoriginal] [history] [compressresult]
abaqus job=${inputfile} recover
echo "sleeping"
sleep 600
echo "done sleeping"
#abaqus terminate job=$inputfile
#qsub restart.qsub
# Remove scratch space
rm -rf $scratch
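Since 400 hrs of analysis cannot fit into one 150 hr allocation, the run has to be split across several submissions, which is what the commented-out `qsub restart.qsub` line hints at. The following is a minimal sketch of that bookkeeping, assuming the scheduler accepts the Torque/PBS `-W depend=afterany:<jobid>` option. The function names and the job id `12345` used below are hypothetical.

```shell
#!/bin/sh
# Sketch only: estimate how many walltime-limited submissions are needed
# and build the qsub command for a dependent restart job.

# Ceiling division: number of submissions of $2 hours needed to cover $1 hours.
segments_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Print the qsub command for the restart segment that follows job id $1.
# "-W depend=afterany:<id>" starts the job after <id> has ended, however it ended.
restart_command() {
    echo "qsub -W depend=afterany:$1 restart.qsub"
}
```

Here `segments_needed 400 150` prints `3`, i.e. one `start.qsub` submission followed by two `restart.qsub` submissions, and `restart_command 12345` prints `qsub -W depend=afterany:12345 restart.qsub`. Whether each restart segment actually picks up where the previous one stopped still depends on the recover step working on the cluster.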
Any suggestions would be greatly appreciated. Thank you in advance.
Thanks
Dr. Parimal Maity
Michigan State University
ME