Message ID: 2208     Entry time: Fri Jun 22 00:35:31 2018
Author: awade 
Type: HowTo 
Category: Computers 
Subject: Running parallel tasks on Caltech LIGO cluster 

Shruti and I are running various training routines for the machine learning / non-linear controls work.  It can be hard to guess the best learning rate, random-action injection rate and other hyperparameters of the NN and the tensorflow optimization.  Although the best approach is to work intuitively on simple examples and then scale up, the optimization and rates of learning can be a little opaque.  At some stage we will want to throw a bunch of computing power at systematically narrowing down what works and what doesn't.

We basically want to spin up a bunch of training trials to test a range of hyperparameters without having to wait a full day of turnaround for each iteration through a list of values.  Running tensorflow-based training on a GPU might speed up each step, but it won't necessarily help if the problem isn't well parallelized.  It's not clear to me that, for instance, the baselines deepq minibatching will run faster if we simultaneously draw samples from the buffer and do the gradient descent in parallel with between-graph replication.  At the end of each training episode the gradients from the separate minibatches have to be combined (which seems non-trivial) and then redistributed across the GPUs (which sounds like it will carry some hefty overhead as we scale up).  Managing this kind of parallelization seems too far down the rabbit-hole of optimization science for our investigations.

I've been poking the people over at the jupyterhub LIGO chat channel about running parallelized clusters from notebooks.  LIGO is now running python notebooks on the LDG at http://jupyter.ligo.caltech.edu (and the test server http://jupyter2.ligo.caltech.edu).  These can now launch a cluster of n nodes directly from the jupyter GUI, and we can use the ipyparallel python module to run parallel tasks directly from jupyter.  The only problem is that it ships with a generic virtualenv for the python3 kernel that doesn't include our gym or baselines environments from OpenAI.  We've also made modifications to these packages, making them even more non-standard.  Furthermore, there is a problem with ipyparallel clusters: we've found that they won't launch worker engines unless the python version matches exactly.  The jupyter notebook kernels we are using are python 3.5.4 while the workers are something like python 3.6.
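As a quick sanity check of the version mismatch (a minimal sketch, assuming ipyparallel is installed in the notebook kernel and a cluster is already running on the default profile), you can compare the notebook's python against what the engines report:

    import sys
    import ipyparallel as ipp

    rc = ipp.Client()                  # connect to the running cluster (default profile)
    dview = rc[:]                      # direct view over all engines

    print("notebook kernel:", sys.version)
    # ask every engine for its interpreter version; any mismatch here is the problem
    print(dview.apply_sync(lambda: __import__('sys').version))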

As a workaround we can launch our own ipcluster on the ldas-pcdev14 headnode (or ldas-pcdev5) and connect to it directly from jupyter notebooks.  ipyparallel manages all of the scheduling, and we can launch over 20 learning runs simultaneously and/or schedule a longer list to run.  This is relatively easy to do and doesn't involve much hackery; the rough recipe is sketched below.
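Something like the following (a sketch only; the profile name 'rltrials' and the engine count are placeholders, and the exact ipcluster flags may vary with the ipyparallel version installed in the virtualenv on the headnode):

    # On ldas-pcdev14 (or ldas-pcdev5), inside the virtualenv that has our gym/baselines:
    #     ipcluster start -n 20 --profile=rltrials --daemonize
    #
    # Then, from the jupyter notebook (assuming the notebook can see the same
    # ~/.ipython/profile_rltrials connection files, i.e. a shared home directory):
    import ipyparallel as ipp

    rc = ipp.Client(profile='rltrials')    # reads the connection file written by ipcluster
    print(len(rc.ids), "engines connected")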

I've got this working from within a python notebook (attached) and have documented the steps needed to get it running.  The worker nodes run just a little slower (maybe 15%) than our MacBook Pros.  The advantage is that we now have more scope to run a bunch of parallel trials and to detach from those instances.
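For queueing the trials themselves, the pattern in the notebook is roughly this (a minimal sketch; run_trial and the learning-rate list are placeholder stand-ins for whichever baselines training wrapper you are using):

    import ipyparallel as ipp

    rc = ipp.Client(profile='rltrials')
    lview = rc.load_balanced_view()     # ipyparallel queues jobs onto whichever engines are free

    def run_trial(lr):
        # placeholder body: each engine imports its own dependencies and builds its
        # own tensorflow graph, so nothing leaks between runs
        import time
        time.sleep(1)                   # stand-in for a full baselines training run at learning rate lr
        return lr, 0.0                  # (hyperparameter, final score)

    learning_rates = [1e-2, 3e-3, 1e-3, 3e-4, 1e-4]
    async_result = lview.map_async(run_trial, learning_rates)
    # detach and come back later; async_result.get() collects results once they are done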

Edit (awade) Sun Jun 24 00:24:20 2018: fixed an issue where the tensorflow graph was somehow being kept as a global variable by baselines.

 

Attachment 1: MinWorkingExample_baselines_with_ipyparallel.ipynb.zip  5 kB  Uploaded Sun Jun 24 01:25:20 2018