Galactic Radio Telescope
This is an opt-in service which Galaxy admins can configure to contribute their job run data back to the community. We hope that by collecting this information we can build accurate models of tool CPU/memory/time requirements. In turn, admins will be able to use this analyzed data to optimize their job distribution across highly heterogeneous clusters.
Registration
You will need to register your Galaxy instance with the Galactic Radio Telescope (GRT). This can be done https://telescope.galaxyproject.org.
About the Script
Once you’ve registered your Galaxy instance, you’ll receive an instance ID and
an API key which are used to run scripts/grt/export.py
. The tool itself is very
simple to run. GRT will run and produce a directory of reports that can be
synced with the GRT server. Every time it is run, GRT only processes the list
of jobs that were run since the last time it was run. On first run, GRT will
attempt to export all job data for your instance which may be very slow
depending on your instance size. We have attempted to optimize this as much as
is feasible.
Data Privacy
All data submitted to the GRT will be released into the public domain. If there
are certain tools you do not want included, or certain parameters you wish to
hide (e.g. because they contain API keys), then you can take advantage of the
built-in sanitization. scripts/grt/grt.yml.sample
file allows you to build up
sanitization for the job logs.
sanitization:
# Blacklist the entire tool from appearing
tools:
- __SET_METADATA__
- upload1
# Or you can blacklist individual parameters from being submitted, e.g. if
# you have API keys as a tool parameter.
tool_params:
# Or to blacklist under a specific tool, just specify the ID
some_tool_id:
- dbkey
# If you need to specify a parameter multiple levels deep, you can
# do that as well. Currently we only support blacklisting via the
# full path, rather than just a path component. So everything under
# `path.to.parameter` will be blacklisted.
- path.to.parameter
# However you could not do "parameter" and have everything under
# `path.to.parameter` be removed.
# Repeats are rendered as an *, e.g.: repeat_name.*.values
To blacklist the results from specific tools appearing in results, just add the
tool ID under the tools
list.
Blacklisting tool parameters is more complex. In a key under the tool_params
key,
supply a list of parameters you wish to blacklist. NB: This will slow down
processing of records associated with that tool. Selecting keys is done
identically to writing test cases, except if you have a repeat element, just
replace the location of the numeric identifier with *
, e.g.
repeat_name.*.some_subkey
Data Collection Process
cd $GALAXY; python scripts/grt/export.py -l debug
export.py
connects to your galaxy database and makes queries against the
database for three primary tables:
job
job_parameter
job_metric_numeric
these are exported with very little processing, as tabular files to the GRT
reports directory, $GALAXY/reports/
. We only collect new job data that we
have not seen since the previous run. The last-seen job ID is stored in
$GALAXY/reports/.checkpoint
. Once the files have been exported, they are
put in a compressed archive, and some metadata about the export process is
written to a json file with the same name as the report archive.
You may wish to inspect these files to be sure that you’re comfortable with the information being sent.
Once you’re happy with the data, you can submit it with the GRT submission tool…
Data Submission
cd $GALAXY; python scripts/grt/upload.py
scripts/grt/upload.py
is a script which will submit your data to the
configured GRT server. You must first be registered with the server which will
also walk you through the setup process.
With your reports, submitting them is very simple. The script will login to the server and determine which reports the server does not have yet. Then it will begin uploading those.
For administrators with firewalled galaxies and no internet access, if you are able to exfiltrate your files to somewhere with internet, then you can still take advantage of GRT. Alternatively you can deploy GRT on your own infrastructure if you don’t want to share your job logs.