Getting CANDLE running on Biowulf
=== Whenever Singularity is used (as it is here), bind pertinent directories ===

{{{
export SINGULARITY_BINDPATH="/gs3,/gs4,/gs5,/gs6,/gs7,/gs8,/gs9,/gs10,/gs11,/gpfs,/spin1,/data,/scratch,/fdb,/lscratch"
}}}

This is something you might want to put in your ~/.bashrc or ~/.bash_profile so it's loaded automatically upon login.
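For example, to append it to your ~/.bashrc (a minimal sketch, assuming bash is your login shell):

{{{
cat >> ~/.bashrc <<'EOF'
export SINGULARITY_BINDPATH="/gs3,/gs4,/gs5,/gs6,/gs7,/gs8,/gs9,/gs10,/gs11,/gpfs,/spin1,/data,/scratch,/fdb,/lscratch"
EOF
}}}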
=== Run a CANDLE benchmark ===

This is the most straightforward way to make sure everything is working; you don't have to run it to completion.

==== (1) Set variables ====

{{{
working_dir=
gpu_type=
}}}
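For example (illustrative values only; choose a directory you own and a GPU type actually available on Biowulf):

{{{
working_dir=/data/$(whoami)/candle_test   # hypothetical path
gpu_type=p100                             # hypothetical GPU type
}}}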
==== (2) Clone the CANDLE benchmarks from GitHub ====

Note that the job script in step (3) looks for the benchmarks under /data/$(whoami)/candle, so clone them there:

{{{
mkdir /data/$(whoami)/candle
cd /data/$(whoami)/candle
git clone https://github.com/ECP-CANDLE/Benchmarks.git
}}}
==== (3) Run the benchmark ====

{{{
cd $working_dir
echo '#!/bin/bash' > ./jobrequest.sh
echo "module load singularity" >> ./jobrequest.sh
echo "singularity exec --nv /data/classes/candle/candle-gpu.img python /data/$(whoami)/candle/Benchmarks/Pilot1/P1B1/p1b1_baseline_keras2.py" >> ./jobrequest.sh
sbatch --partition=gpu --mem=50G --gres=gpu:$gpu_type:1 ./jobrequest.sh
}}}
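The generated ./jobrequest.sh should then look like this, with $(whoami) expanded to your username:

{{{
#!/bin/bash
module load singularity
singularity exec --nv /data/classes/candle/candle-gpu.img python /data/<your-username>/candle/Benchmarks/Pilot1/P1B1/p1b1_baseline_keras2.py
}}}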
You should see your job queued or running in SLURM (e.g., {{{squeue -u $(whoami)}}}) and output being produced in $working_dir.
You can also SSH into the node on which the job is running (listed under "NODELIST(REASON)" in the {{{squeue}}} output) and confirm that the node's GPU is actually being used by running {{{nvidia-smi}}} there.
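For example (the node name here is hypothetical; use the one {{{squeue}}} reports):

{{{
squeue -u $(whoami)   # note the node under NODELIST(REASON), e.g., cn0613
ssh cn0613            # hypothetical node name
nvidia-smi            # GPU utilization should be nonzero while the benchmark runs
}}}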
Now that you know everything is working, you can kill the job using {{{scancel <job_id>}}}, where <job_id> is the job ID shown by {{{squeue}}}.
=== Run a grid search (a type of hyperparameter optimization) using output from a test model ===

In our case the test model just returns random numbers, but this allows you to test the complete workflow you'll ultimately need for running your own model.

==== (1) Set variables ====

{{{
working_dir=
expt_name=
ntasks=
job_time=
memory=
gpu_type=
}}}
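For example (working_dir=~/grid_search is the example referenced later on this page; the other values are hypothetical):

{{{
working_dir=~/grid_search   # where the template will be copied
expt_name=grid_test         # hypothetical experiment name
ntasks=4                    # hypothetical number of parallel tasks
job_time=01:00:00           # hypothetical SLURM time limit
memory=20G                  # hypothetical SLURM memory request
gpu_type=p100               # hypothetical GPU type
}}}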
==== (2) Copy the grid search template to your working directory ====

{{{
cp -rp /data/classes/candle/grid-search-template/* $working_dir
}}}
==== (3) Edit one file ====

In $working_dir/swift/swift-job.sh, change {{{./turbine-workflow.sh}}} to {{{swift/turbine-workflow.sh}}}.
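If you prefer to make this edit non-interactively, a one-liner like the following should work (a sketch; double-check the file afterward):

{{{
sed -i 's|\./turbine-workflow\.sh|swift/turbine-workflow.sh|' $working_dir/swift/swift-job.sh
}}}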
==== (4) "Compile" and run the grid search ====

{{{
cd $working_dir
echo '#!/bin/bash' > ./compile_job.sh
echo "module load singularity" >> ./compile_job.sh
echo "singularity exec /data/classes/candle/candle-gpu.img swift/stc-workflow.sh $expt_name" >> ./compile_job.sh
sbatch -W --time=1 ./compile_job.sh
experiment_id=${expt_name:-experiment}
sbatch --output=experiments/$experiment_id/output.txt --error=experiments/$experiment_id/error.txt --partition=gpu --gres=gpu:$gpu_type:1 --cpus-per-task=2 --ntasks=$ntasks --mem=$memory --job-name=$experiment_id --time=$job_time --ntasks-per-node=1 swift/swift-job.sh $experiment_id
}}}
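The first {{{sbatch -W}}} call blocks until the short "compile" job finishes; the second submits the grid search itself. Once it is running, you can watch its progress in the experiment's output file:

{{{
# Assumes the variables from step (1) are still set in your shell
tail -f experiments/${expt_name:-experiment}/output.txt
}}}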
=== Run a grid search using your own model ===

We already transferred the CANDLE scripts to a local directory (in the above example, working_dir=~/grid_search). With this directory structure in place, we will now adapt some of these scripts to your own data and model.

==== (1) Set variables ====

{{{
expt_name=
ntasks=
job_time=
memory=
gpu_type=
}}}
==== (2) Copy over the new grid search template scripts ====

'''Warning:''' This will overwrite the two scripts in $working_dir/scripts.

{{{
cp -f /data/BIDS-HPC/public/grid_search_template/* $working_dir/scripts
}}}
==== (3) Edit files ====

* '''$working_dir/scripts/run_model.sh:''' a helper shell script that accepts the hyperparameters as command-line arguments and calls the model via the Python script train_model.py (described next). Typically you only need to edit the setting of $ml_model_path.
* '''$working_dir/scripts/train_model.py:''' the main, customizable Python script that calls the machine learning model with a particular set of hyperparameters. Its inputs are a string defining a dictionary of hyperparameters (generated automatically by run_model.sh) and the name of the text file that will contain the model's result for the current set of hyperparameters.
* '''$working_dir/data/dice-params.txt:''' a text file containing one hyperparameter combination per line. It can be generated with a script, e.g., $working_dir/data/data-generator.py; a shell sketch follows this list.
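As a rough illustration (the specific hyperparameters and their order are assumptions; they must match what your run_model.sh expects as command-line arguments), dice-params.txt could also be generated with a shell loop:

{{{
# Illustrative only: writes one hyperparameter combination per line,
# here a hypothetical learning rate and batch size
for lr in 0.001 0.01 0.1; do
    for bs in 32 64; do
        echo "$lr $bs"
    done
done > $working_dir/data/dice-params.txt
}}}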
==== (4) "Compile" and run the grid search ====

These are the same steps as for the test model.

{{{
cd $working_dir
echo '#!/bin/bash' > ./compile_job.sh
echo "module load singularity" >> ./compile_job.sh
echo "singularity exec /data/classes/candle/candle-gpu.img swift/stc-workflow.sh $expt_name" >> ./compile_job.sh
sbatch -W --time=1 ./compile_job.sh
experiment_id=${expt_name:-experiment}
sbatch --output=experiments/$experiment_id/output.txt --error=experiments/$experiment_id/error.txt --partition=gpu --gres=gpu:$gpu_type:1 --cpus-per-task=2 --ntasks=$ntasks --mem=$memory --job-name=$experiment_id --time=$job_time --ntasks-per-node=1 swift/swift-job.sh $experiment_id
}}}