R/task.R
kub_create_task.Rd
This function creates a directory with 5 components from the kuber template: a Dockerfile (designed for expanded parallel tasks), an R file exec.R (which contains some code that guides how your program should work), a job-tmpl.yaml (a template for the YAML files that will launch the Docker jobs), an RDS file list.rds (which contains as.list(seq(1, 10)), just so you don't forget to set up the inputs that each job will take) and a jobs/ folder (where the actual job YAML files will go once you run kub_push_task()).
For more information, please see the sections below.
kub_create_task(path, cluster_name, bucket_name, image_name, service_account)
path | Directory where to initialize task (will be set as default for other Kuber functions) |
---|---|
cluster_name | Name of the cluster where to execute jobs (must exist already) |
bucket_name | Name of the storage bucket where the files will be stored (will be created if necessary) |
image_name | Name of the docker image where to build the container (either its full path in the form |
service_account | Path to the Service Account JSON file that will be used to authenticate |
Path where the kuber folder was created
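As a rough illustration of the usage above, the call might look like the sketch below. Every argument value is a hypothetical placeholder (the cluster, bucket, image and key file are not taken from this page), and the call is guarded so that it only runs in an interactive session with kuber installed, since it needs an existing cluster and valid credentials.

```r
# All argument values below are hypothetical placeholders.
task_args <- list(
  path = file.path(tempdir(), "my-task"),
  cluster_name = "my-cluster",              # assumed to exist already
  bucket_name = "my-bucket",                # created if necessary
  image_name = "my-image",
  service_account = "~/service-account.json"
)

# Only attempt the call interactively and when kuber is available.
if (interactive() && requireNamespace("kuber", quietly = TRUE)) {
  task_path <- do.call(kuber::kub_create_task, task_args)

  # Inspect the components created from the template.
  print(list.files(task_path))
}
```

Keeping the arguments in a named list makes it easy to check them against the function signature before anything is created remotely.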
This is a very simple Dockerfile based on rocker/tidyverse that installs tidyverse, devtools and abjutils, and copies exec.R and list.rds to the home directory. If you have any extra dependencies or want to use more files for your script, this is where you should add them.
The exec.R file is only a guide for what your script should probably be doing. It gets the number of the current job as its only argument, saves a result file as an RDS and uploads that file to the desired bucket. For more information, see the "Toy example" vignette; if you're having problems, see the "Debugging exec.R" vignette.
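The flow described above (read the job number, do some work, save the result as an RDS) can be sketched as follows. This is not the exec.R that kuber generates; run_job() and the squaring step are hypothetical stand-ins, and the upload to the bucket is left as a comment because it needs credentials.

```r
# Hypothetical helper; the exec.R generated by kuber may look different.
run_job <- function(job_number, list_path = "list.rds", out_dir = ".") {
  # Pick the input that belongs to this job.
  inputs <- readRDS(list_path)
  input <- inputs[[job_number]]

  # Placeholder for the actual work of the job.
  result <- input^2

  # Save the result as an RDS file named after the job.
  out_file <- file.path(out_dir, sprintf("result-%d.rds", job_number))
  saveRDS(result, out_file)

  # The upload of out_file to the bucket would happen here (omitted).
  out_file
}

# When launched as `Rscript --vanilla exec.R [JOB_NUMBER]`, the job number
# is the first trailing argument.
args <- commandArgs(trailingOnly = TRUE)
job_number <- suppressWarnings(as.integer(args[1]))
if (length(args) >= 1 && !is.na(job_number)) {
  run_job(job_number)
}

# Local dry run with the default list, in a throwaway directory.
demo_dir <- tempfile("exec-demo-")
dir.create(demo_dir)
saveRDS(as.list(seq(1, 10)), file.path(demo_dir, "list.rds"))
out <- run_job(3L, list_path = file.path(demo_dir, "list.rds"), out_dir = demo_dir)
readRDS(out)
```

Wrapping the work in a function makes it possible to test a single job locally before building the image.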
The job template (job-tmpl.yaml) is a very simple file that describes how the job should be run once it is activated by the pod, which essentially means running Rscript --vanilla exec.R [JOB_NUMBER]. Since this template uses parallelism via expansion, kub_push_task() will expand this template into as many job files as you want.
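To make the idea of expansion concrete, here is a minimal sketch of turning one template into several job files. It is not how kub_push_task() is implemented: the $JOB_NUMBER placeholder, the job and image names, the file naming scheme and expand_template() are all assumptions made for illustration.

```r
# Hypothetical template; the real job-tmpl.yaml may use another placeholder.
template <- c(
  "apiVersion: batch/v1",
  "kind: Job",
  "metadata:",
  "  name: task-job-$JOB_NUMBER",
  "spec:",
  "  template:",
  "    spec:",
  "      containers:",
  "      - name: task",
  "        image: my-image",
  "        command: [\"Rscript\", \"--vanilla\", \"exec.R\", \"$JOB_NUMBER\"]",
  "      restartPolicy: Never"
)

# Write one YAML file per job, replacing the placeholder with the job number.
expand_template <- function(template, n_jobs, jobs_dir) {
  dir.create(jobs_dir, showWarnings = FALSE, recursive = TRUE)
  vapply(seq_len(n_jobs), function(i) {
    job <- gsub("$JOB_NUMBER", as.character(i), template, fixed = TRUE)
    path <- file.path(jobs_dir, sprintf("job-%d.yaml", i))
    writeLines(job, path)
    path
  }, character(1))
}

jobs_dir <- file.path(tempfile("task-"), "jobs")
job_files <- expand_template(template, n_jobs = 3, jobs_dir = jobs_dir)
basename(job_files)
```

Each generated file differs only in the job number, which is what lets every job run the same script on a different input.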
The jobs/ folder simply stores the job files once the template is expanded.
The list.rds file is more of a suggestion than a required file. It contains (by default) a list() with each integer from 1 to 10, but it could be any list of your choosing. The goal here is to be able to get the arguments for your script just by extracting the element with index equal to the job number (meaning that job number N might read and use everything stored in list.rds at index [[N]]). To illustrate this concept, take for example a web scraping script: list.rds would contain a list where each element is a character vector of URLs to scrape; each job would therefore read the file but only scrape list[[N]], so that it doesn't overlap with any other job. For more information, see the "Toy example" vignette.
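The web scraping case can be sketched in a few lines. The URLs and the grouping into three batches are hypothetical; only the pattern of saving a list and indexing it by job number comes from the description above.

```r
# Hypothetical URLs, grouped so that each element belongs to exactly one job.
url_batches <- list(
  c("https://example.com/page/1", "https://example.com/page/2"),
  c("https://example.com/page/3", "https://example.com/page/4"),
  c("https://example.com/page/5")
)

# Save the inputs where the task expects them (a throwaway folder here).
task_dir <- tempfile("scraping-task-")
dir.create(task_dir)
list_path <- file.path(task_dir, "list.rds")
saveRDS(url_batches, list_path)

# Inside job N, only the N-th element is read and used.
job_number <- 2
urls_for_job <- readRDS(list_path)[[job_number]]
urls_for_job
```

With this layout the number of jobs to launch is simply length(url_batches), and no URL is handled by more than one job.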
To create a service account (necessary to manage storage resources remotely), go to https://console.cloud.google.com/iam-admin/serviceaccounts, create the account and add the "storage object administrator" role to it. You should then click on the account and generate a key as a JSON file. This file will be downloaded to your computer and can be referenced via the service_account argument.
https://kubernetes.io/docs/tasks/job/parallel-processing-expansion/
https://cloud.google.com/container-registry/docs/quickstart