Getting CANDLE running on Biowulf
=== Whenever Singularity is used (as it is here), bind pertinent directories ===

{{{
export SINGULARITY_BINDPATH="/gs3,/gs4,/gs5,/gs6,/gs7,/gs8,/gs9,/gs10,/gs11,/gpfs,/spin1,/data,/scratch,/fdb,/lscratch"
}}}

This is something you might want to put in your ~/.bashrc or ~/.bash_profile so it's loaded automatically upon login.
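For example, to append it to your ~/.bashrc (a minimal sketch, assuming bash is your login shell):

{{{
cat >> ~/.bashrc <<'EOF'
export SINGULARITY_BINDPATH="/gs3,/gs4,/gs5,/gs6,/gs7,/gs8,/gs9,/gs10,/gs11,/gpfs,/spin1,/data,/scratch,/fdb,/lscratch"
EOF
}}}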
=== Run a CANDLE benchmark ===

This is the most straightforward way to make sure everything is working; you don't have to run it to completion.

==== (1) Set variables ====

{{{
working_dir=
gpu_type=
}}}
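For example (illustrative values only; choose a directory you own and a GPU type actually available on Biowulf):

{{{
working_dir=/data/$(whoami)/candle_test   # hypothetical path
gpu_type=p100                             # hypothetical GPU type
}}}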
==== (2) Clone the CANDLE benchmarks from GitHub ====

Note that the job script in step (3) looks for the benchmarks under /data/$(whoami)/candle, so clone them there:

{{{
mkdir /data/$(whoami)/candle
cd /data/$(whoami)/candle
git clone https://github.com/ECP-CANDLE/Benchmarks.git
}}}
==== (3) Run the benchmark ====

{{{
cd $working_dir
echo '#!/bin/bash' > ./jobrequest.sh
echo "module load singularity" >> ./jobrequest.sh
echo "singularity exec --nv /data/classes/candle/candle-gpu.img python /data/$(whoami)/candle/Benchmarks/Pilot1/P1B1/p1b1_baseline_keras2.py" >> ./jobrequest.sh
sbatch --partition=gpu --mem=50G --gres=gpu:$gpu_type:1 ./jobrequest.sh
}}}
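The generated ./jobrequest.sh should then look like this, with $(whoami) expanded to your username:

{{{
#!/bin/bash
module load singularity
singularity exec --nv /data/classes/candle/candle-gpu.img python /data/<your-username>/candle/Benchmarks/Pilot1/P1B1/p1b1_baseline_keras2.py
}}}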
You should see your job queued or running in SLURM (e.g., {{{squeue -u $(whoami)}}}) and output being produced in $working_dir.
You can also SSH into the node on which the job is running (listed under "NODELIST(REASON)" in the {{{squeue}}} output) and confirm that the node's GPU is actually being used by running {{{nvidia-smi}}} there.
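For example (the node name here is hypothetical; use the one {{{squeue}}} reports):

{{{
squeue -u $(whoami)   # note the node under NODELIST(REASON), e.g., cn0613
ssh cn0613            # hypothetical node name
nvidia-smi            # GPU utilization should be nonzero while the benchmark runs
}}}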
Now that you know everything is working, you can kill the job using {{{scancel <job_id>}}}, where <job_id> is the job ID shown by {{{squeue}}}.
=== Run a grid search (a type of hyperparameter optimization) using output from a test model ===

In our case the test model just returns random numbers, but this allows you to test the complete workflow you'll ultimately need for running your own model.

==== (1) Set variables ====

{{{
working_dir=
expt_name=
ntasks=
job_time=
memory=
gpu_type=
}}}
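For example (working_dir=~/grid_search is the example referenced later on this page; the other values are hypothetical):

{{{
working_dir=~/grid_search   # where the template will be copied
expt_name=grid_test         # hypothetical experiment name
ntasks=4                    # hypothetical number of parallel tasks
job_time=01:00:00           # hypothetical SLURM time limit
memory=20G                  # hypothetical SLURM memory request
gpu_type=p100               # hypothetical GPU type
}}}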
==== (2) Copy the grid search template to your working directory ====

{{{
cp -rp /data/classes/candle/grid-search-template/* $working_dir
}}}
==== (3) Edit one file ====

In $working_dir/swift/swift-job.sh, change {{{./turbine-workflow.sh}}} to {{{swift/turbine-workflow.sh}}}.
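If you prefer to make this edit non-interactively, a one-liner like the following should work (a sketch; double-check the file afterward):

{{{
sed -i 's|\./turbine-workflow\.sh|swift/turbine-workflow.sh|' $working_dir/swift/swift-job.sh
}}}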
==== (4) "Compile" and run the grid search ====

{{{
cd $working_dir
echo '#!/bin/bash' > ./compile_job.sh
echo "module load singularity" >> ./compile_job.sh
echo "singularity exec /data/classes/candle/candle-gpu.img swift/stc-workflow.sh $expt_name" >> ./compile_job.sh
sbatch -W --time=1 ./compile_job.sh
experiment_id=${expt_name:-experiment}
sbatch --output=experiments/$experiment_id/output.txt --error=experiments/$experiment_id/error.txt --partition=gpu --gres=gpu:$gpu_type:1 --cpus-per-task=2 --ntasks=$ntasks --mem=$memory --job-name=$experiment_id --time=$job_time --ntasks-per-node=1 swift/swift-job.sh $experiment_id
}}}
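The first {{{sbatch -W}}} call blocks until the short "compile" job finishes; the second submits the grid search itself. Once it is running, you can watch its progress in the experiment's output file:

{{{
# Assumes the variables from step (1) are still set in your shell
tail -f experiments/${expt_name:-experiment}/output.txt
}}}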
=== Run a grid search using your own model ===

We already transferred the CANDLE scripts to a local directory (in the above example, working_dir=~/grid_search). With this directory structure in place, we will now adapt some of these scripts to your own data and model.

==== (1) Set variables ====

{{{
expt_name=
ntasks=
job_time=
memory=
gpu_type=
}}}
==== (2) Copy over the new grid search template scripts ====

'''Warning:''' This will overwrite the two scripts in $working_dir/scripts.

{{{
cp -f /data/BIDS-HPC/public/grid_search_template/* $working_dir/scripts
}}}
==== (3) Edit files ====

* '''$working_dir/scripts/run_model.sh:''' a helper shell script that accepts the hyperparameters as command-line arguments and calls the model via the Python script train_model.py (described next). Typically you only need to edit the setting of $ml_model_path.
* '''$working_dir/scripts/train_model.py:''' the main, customizable Python script that calls the machine learning model with a particular set of hyperparameters. Its inputs are a string defining a dictionary of hyperparameters (generated automatically by run_model.sh) and the name of the text file that will contain the model's result for the current set of hyperparameters.
* '''$working_dir/data/dice-params.txt:''' a text file containing one hyperparameter combination per line. It can be generated with a script, e.g., $working_dir/data/data-generator.py; a shell sketch follows this list.
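As a rough illustration (the specific hyperparameters and their order are assumptions; they must match what your run_model.sh expects as command-line arguments), dice-params.txt could also be generated with a shell loop:

{{{
# Illustrative only: writes one hyperparameter combination per line,
# here a hypothetical learning rate and batch size
for lr in 0.001 0.01 0.1; do
    for bs in 32 64; do
        echo "$lr $bs"
    done
done > $working_dir/data/dice-params.txt
}}}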
==== (4) "Compile" and run the grid search ====

These are the same steps as for the test model.

{{{
cd $working_dir
echo '#!/bin/bash' > ./compile_job.sh
echo "module load singularity" >> ./compile_job.sh
echo "singularity exec /data/classes/candle/candle-gpu.img swift/stc-workflow.sh $expt_name" >> ./compile_job.sh
sbatch -W --time=1 ./compile_job.sh
experiment_id=${expt_name:-experiment}
sbatch --output=experiments/$experiment_id/output.txt --error=experiments/$experiment_id/error.txt --partition=gpu --gres=gpu:$gpu_type:1 --cpus-per-task=2 --ntasks=$ntasks --mem=$memory --job-name=$experiment_id --time=$job_time --ntasks-per-node=1 swift/swift-job.sh $experiment_id
}}}