Create a docker directory with the kuber template for parallelism by expansion

This function creates a directory with 5 components from the kuber template: a Dockerfile (designed for expanded parallel tasks), an R file exec.R (which contains some code that guides how your program should work), a job-tmpl.yaml (a template for the YAML files that will launch the Docker jobs), an RDS file list.rds (which contains as.list(seq(1, 10)) just so you don't forget to set up the inputs that each job will take) and a jobs/ folder (where the actual job YAML files will go once you run kub_push_task()). For more information, please see the sections below.

kub_create_task(path, cluster_name, bucket_name, image_name,
  service_account)

Arguments

path	Directory where to initialize task (will be set as default for other Kuber functions)
cluster_name	Name of the cluster where to execute jobs (must exist already)
bucket_name	Name of the storage bucket where the files will be stored (will be created if necessary)
image_name	Name of the Docker image to build for the container (either its full path in the form `[HOSTNAME]/[PROJECT_ID]/[IMAGE_NAME]:[VERSION]` or simply `[IMAGE_NAME]` for it to be automatically pushed to the Google Container Registry)
service_account	Path to the Service Account JSON file that will be used to authenticate `gsutil` inside each container (must be a storage object administrator)

Value

Path where the kuber folder was created

Dockerfile

This is a very simple Dockerfile based on rocker/tidyverse that installs tidyverse, devtools and abjutils, and copies exec.R and list.rds to the home directory. If you have any extra dependencies or want to use more files for your script, this is where you should add them.

Exec.R

The exec.R file is only a guide for what your script should probably be doing. It gets the number of the current job as its only argument, saves a result file as an RDS and uploads that file to the desired bucket. For more information, see the "Toy example" vignette; if you're having problems, see the "Debugging exec.R" vignette.
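
The steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the contents of the generated exec.R: the helper names (parse_job_number, run_job, upload_command), the result-N.rds file naming, the squaring "work" and the bucket name my-bucket are all hypothetical, and the gsutil upload command is built but not executed.

```r
# Hypothetical sketch of the steps exec.R is described as performing.
# Helper names, file naming and bucket name are illustrative assumptions.

# Turn the command-line arguments into a single positive job number.
parse_job_number <- function(args) {
  stopifnot(length(args) == 1)
  n <- suppressWarnings(as.integer(args[[1]]))
  if (is.na(n) || n < 1) stop("job number must be a positive integer")
  n
}

# Run one job: pick input N, do some work, save the result as an RDS file.
run_job <- function(n, inputs, out_dir = tempdir()) {
  input <- inputs[[n]]
  result <- input^2  # placeholder for the real work
  out_file <- file.path(out_dir, sprintf("result-%d.rds", n))
  saveRDS(result, out_file)
  out_file
}

# Build (but do not run) the upload command for the result file.
upload_command <- function(file, bucket) {
  sprintf("gsutil cp %s gs://%s/", shQuote(file), bucket)
}

# In a real script the argument would come from the command line:
# n <- parse_job_number(commandArgs(trailingOnly = TRUE))
n <- parse_job_number("3")
out_file <- run_job(n, inputs = as.list(seq(1, 10)))
cmd <- upload_command(out_file, "my-bucket")
# system(cmd)  # only inside the container, where gsutil is authenticated
```

Keeping the argument parsing, the work and the upload in separate functions makes it possible to try each step locally before building the image.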

Job-tmpl.yaml

The job template is a very simple file that describes what each job should run once its pod starts, which is essentially Rscript --vanilla exec.R [JOB_NUMBER]. Since this template uses parallelism via expansion, kub_push_task() will expand this template into as many job files as you want.
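
To make the idea of expansion concrete, here is a minimal sketch that turns one template into several job files by substituting a placeholder. It is not how kub_push_task() is implemented: the template text, the $ITEM placeholder, the image name my-image, the job-N.yaml file names and the expand_template() helper are all assumptions made for illustration.

```r
# Hypothetical sketch of "parallelism by expansion": one template, N job files.
# The template text and the $ITEM placeholder are illustrative assumptions.
template <- c(
  "apiVersion: batch/v1",
  "kind: Job",
  "metadata:",
  "  name: process-item-$ITEM",
  "spec:",
  "  template:",
  "    spec:",
  "      containers:",
  "      - name: c",
  "        image: my-image",
  "        command: [\"Rscript\", \"--vanilla\", \"exec.R\", \"$ITEM\"]",
  "      restartPolicy: Never"
)

# Write one YAML file per job, replacing the placeholder with the job number.
expand_template <- function(template, n_jobs, jobs_dir) {
  dir.create(jobs_dir, showWarnings = FALSE, recursive = TRUE)
  vapply(seq_len(n_jobs), function(i) {
    job <- gsub("$ITEM", as.character(i), template, fixed = TRUE)
    path <- file.path(jobs_dir, sprintf("job-%d.yaml", i))
    writeLines(job, path)
    path
  }, character(1))
}

jobs_dir <- file.path(tempdir(), "jobs")
job_files <- expand_template(template, n_jobs = 3, jobs_dir = jobs_dir)
```

Each resulting file describes an independent job whose only difference from the others is the number passed to exec.R.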

Jobs/

This is simply a folder that will store the job files once the template is expanded.

List.rds

This is more of a suggestion than a required file. It contains (by default) a list() with each integer from 1 to 10, but it could be any list of your choosing. The goal here is to be able to get the arguments for your script just by extracting the element with index equal to the job number (meaning that job number N might read and use everything stored in list.rds at index [[N]]). To illustrate this concept, take for example a web scraping script: list.rds would contain a list where each element is a character vector of URLs to scrape; each job would therefore read the file but only scrape list[[N]] so that it doesn't overlap with any other job. For more information, see the "Toy example" vignette.
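
The web scraping illustration above could look roughly like the sketch below. The URLs, the chunk size of 2 and the temporary file location are made up for the example; in a real task the file would be the list.rds inside the task directory.

```r
# Hypothetical sketch: build a list.rds for a scraping task and read one slot.
# The URLs and the chunk size are made up for illustration.
urls <- sprintf("https://example.com/page/%d", 1:6)

# Split the URLs into one character vector per job (2 URLs each here).
inputs <- unname(split(urls, ceiling(seq_along(urls) / 2)))

rds_path <- file.path(tempdir(), "list.rds")
saveRDS(inputs, rds_path)

# Inside job N, only element N is used, so jobs never overlap.
job_number <- 2
my_urls <- readRDS(rds_path)[[job_number]]
```

With this layout the number of elements in the list is also the number of jobs needed to cover all inputs.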

Service account

In order to create a service account (necessary to manage storage resources remotely), go to https://console.cloud.google.com/iam-admin/serviceaccounts, create the account and grant it the "storage object administrator" role. You should then click on the account and generate a key as a JSON file. This file will be downloaded to your computer and can be referenced via the service_account argument.

References

https://kubernetes.io/docs/tasks/job/parallel-processing-expansion/

https://cloud.google.com/container-registry/docs/quickstart