Getting CANDLE running on Biowulf


=== Whenever Singularity is used (as it is here), bind pertinent directories ===

{{{
export SINGULARITY_BINDPATH="/gs3,/gs4,/gs5,/gs6,/gs7,/gs8,/gs9,/gs10/,/gs11,/gpfs,/spin1,/data,/scratch,/fdb,/lscratch"
}}}

You might want to put this line in your ~/.bashrc or ~/.bash_profile so that it is set automatically when you log in.
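
For example, one way to append the same export line to your ~/.bashrc (adjust the file name if you prefer ~/.bash_profile) is:

{{{
# Append the bind-path setting to ~/.bashrc and pick it up in the current shell
cat >> ~/.bashrc << 'EOF'
export SINGULARITY_BINDPATH="/gs3,/gs4,/gs5,/gs6,/gs7,/gs8,/gs9,/gs10/,/gs11,/gpfs,/spin1,/data,/scratch,/fdb,/lscratch"
EOF
source ~/.bashrc
}}}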

=== Run a CANDLE benchmark ===

This is the most straightforward way to make sure everything is working; you don't have to run it to completion.

==== (1) Set variables ====

{{{
working_dir=
gpu_type=
}}}
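
These values are user-specific; purely as an illustration (the directory and GPU type below are assumptions, so substitute a directory you own and a GPU type actually available on Biowulf's gpu partition):

{{{
# Illustrative values only
working_dir=/data/$(whoami)/candle_test   # any directory you can write to
gpu_type=k80                              # e.g. k80, p100, v100 -- check the Biowulf GPU documentation
}}}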

==== (2) Clone CANDLE benchmarks from Github ====

{{{
mkdir -p /data/$(whoami)/candle
cd /data/$(whoami)/candle
git clone https://github.com/ECP-CANDLE/Benchmarks.git
}}}
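
If the clone succeeded, the benchmark script used by the job below should now exist; a quick sanity check:

{{{
# Confirm the P1B1 benchmark script is where the job script expects it
ls /data/$(whoami)/candle/Benchmarks/Pilot1/P1B1/p1b1_baseline_keras2.py
}}}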

==== (3) Run benchmark ====

{{{
cd $working_dir
echo '#!/bin/bash' > ./jobrequest.sh
echo "module load singularity" >> ./jobrequest.sh
echo "singularity exec --nv /data/classes/candle/candle-gpu.img python /data/$(whoami)/candle/Benchmarks/Pilot1/P1B1/p1b1_baseline_keras2.py" >> ./jobrequest.sh
sbatch --partition=gpu --mem=50G --gres=gpu:$gpu_type:1 ./jobrequest.sh
}}}

You should see your job queued or running in SLURM (e.g., {{{squeue -u $(whoami)}}}) and output being produced in $working_dir.

You can also SSH into the node on which the job is running (listed under "NODELIST (REASON)" in the {{{squeue}}} output) and even confirm that the node's GPU is being used by running the {{{nvidia-smi}}} command.

Now that you know everything is working, you can kill the job using {{{scancel <jobid>}}}, where {{{<jobid>}}} is listed under JOBID in the {{{squeue}}} output. Or, if you're interested, you can let the job run to completion; it should take about 30 minutes.
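
Putting those monitoring steps together ({{{<node>}}} and {{{<jobid>}}} are placeholders you read off your own {{{squeue}}} output):

{{{
squeue -u $(whoami)   # job state; note the JOBID and NODELIST columns
ssh <node>            # log in to the compute node running the job
nvidia-smi            # confirm the GPU is in use
exit                  # return to the login node
scancel <jobid>       # cancel the job once you're satisfied it works
}}}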

=== Run a grid search (a type of hyperparameter optimization) using output from a test model ===

In our case the test model just returns random numbers, but this allows you to test the complete workflow you'll ultimately need for running your own model.

==== (1) Set variables ====

{{{
working_dir=
expt_name=
ntasks= # should be greater than 2
job_time=
memory=
gpu_type=
}}}
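
Again, purely as an illustration (all of these values are assumptions; adjust them for your own account and model):

{{{
# Illustrative values only
working_dir=~/grid_search
expt_name=test_grid_search
ntasks=3            # should be greater than 2
job_time=01:00:00
memory=20G
gpu_type=k80        # a GPU type available on Biowulf
}}}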

==== (2) Copy grid search template to working directory ====

{{{
cp -rp /data/classes/candle/grid-search-template/* $working_dir
}}}

==== (3) Edit one file ====

In $working_dir/swift/swift-job.sh change {{{./turbine-workflow.sh}}} to {{{swift/turbine-workflow.sh}}}.
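
One way to make that edit non-interactively (assuming the string appears exactly as shown) is:

{{{
# Point swift-job.sh at the workflow script's location under swift/
sed -i 's|\./turbine-workflow\.sh|swift/turbine-workflow.sh|' $working_dir/swift/swift-job.sh
}}}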

==== (4) "Compile" and run the grid search ====

{{{
cd $working_dir
echo '#!/bin/bash' > ./compile_job.sh
echo "module load singularity" >> ./compile_job.sh
echo "singularity exec /data/classes/candle/candle-gpu.img swift/stc-workflow.sh $expt_name" >> ./compile_job.sh
sbatch -W --time=1 ./compile_job.sh
experiment_id=${expt_name:-experiment}
sbatch --output=experiments/$experiment_id/output.txt --error=experiments/$experiment_id/error.txt --partition=gpu --gres=gpu:$gpu_type:1 --cpus-per-task=2 --ntasks=$ntasks --mem=$memory --job-name=$experiment_id --time=$job_time --ntasks-per-node=1 swift/swift-job.sh $experiment_id
}}}
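
Once the second {{{sbatch}}} command has submitted the workflow job, you can watch its progress via the output file specified above, for example:

{{{
squeue -u $(whoami)                             # confirm the workflow job is queued or running
tail -f experiments/$experiment_id/output.txt   # follow the workflow output (Ctrl-C to stop following)
}}}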

=== Run a grid search using your own model ===

We already transferred the CANDLE scripts to a local directory (in the example above, working_dir=~/grid_search). With this directory structure in place, we will now adapt some of these scripts to your own data and model.

==== (1) Set variables ====

{{{
expt_name=
ntasks= # should be greater than 2
job_time=
memory=
gpu_type=
}}}

==== (2) Copy over new grid search template scripts ====

'''Warning:''' This will overwrite the two scripts in $working_dir/scripts.

{{{
cp -f /data/BIDS-HPC/public/grid_search_template/* $working_dir/scripts
}}}

==== (3) Edit files ====

* '''$working_dir/scripts/run_model.sh:''' This is a helper shell script that accepts the hyperparameters as command-line arguments and calls the model via a Python script, train_model.py, below (a hypothetical sketch of run_model.sh follows this list). Typically you only need to edit the setting of $ml_model_path.
* '''$working_dir/scripts/train_model.py:''' This is the main, customizable Python script that calls the machine learning model with a particular set of hyperparameters. Its inputs are a string defining a dictionary of hyperparameters (generated automatically in run_model.sh) and the text file containing the result of the model on the data with the current set of hyperparameters.
* '''$working_dir/data/dice-params.txt:''' A text file containing one hyperparameter combination per line; it can be generated with a script, e.g., $working_dir/data/data-generator.py.
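
The copies in /data/BIDS-HPC/public/grid_search_template are the authoritative versions; purely as a hypothetical sketch of the structure described above (the hyperparameter names, result file name, and $ml_model_path value below are illustrative assumptions), run_model.sh might look roughly like this:

{{{
#!/bin/bash
# Hypothetical sketch only -- consult the actual template copied into $working_dir/scripts/run_model.sh.
# One hyperparameter combination (one line of dice-params.txt) arrives as command-line arguments,
# e.g.: ./run_model.sh 0.001 64

# In the real template, $ml_model_path points at your own model code and is typically
# the only setting you need to edit; here it is exported so train_model.py could read it.
export ml_model_path=/data/$(whoami)/my_model   # hypothetical path

# Build the dictionary string of hyperparameters that train_model.py expects, then call
# train_model.py, which records the model's result for this combination in a text file.
hyperparams="{'learning_rate': $1, 'batch_size': $2}"
python "$(dirname "$0")/train_model.py" "$hyperparams" result.txt
}}}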

==== (4) "Compile" and run the grid search ====

These are the same steps as for the test model.

{{{
cd $working_dir
echo '#!/bin/bash' > ./compile_job.sh
echo "module load singularity" >> ./compile_job.sh
echo "singularity exec /data/classes/candle/candle-gpu.img swift/stc-workflow.sh $expt_name" >> ./compile_job.sh
sbatch -W --time=1 ./compile_job.sh
experiment_id=${expt_name:-experiment}
sbatch --output=experiments/$experiment_id/output.txt --error=experiments/$experiment_id/error.txt --partition=gpu --gres=gpu:$gpu_type:1 --cpus-per-task=2 --ntasks=$ntasks --mem=$memory --job-name=$experiment_id --time=$job_time --ntasks-per-node=1 swift/swift-job.sh $experiment_id
}}}