Galaxy contains a plugin infrastructure for collecting and displaying metrics for job execution (see the original pull request for implementation details).
The Galaxy distribution includes config/job_metrics_conf.xml.sample, which describes which plugins are enabled and how they are configured. This file is updated for each new plugin added, so it should be considered the most up-to-date and complete source of documentation for what plugins are available and how to configure them.
If Galaxy targets different clusters or just different resources, it may make sense to configure different plugins for different resources, different jobs, and so on. To enable this, individual job destinations may disable metrics, load a different job metrics file, or define metrics directly in job_conf.xml in an embedded fashion.
Job metrics can be collected for Pulsar jobs, but there is added complexity. The deployer is responsible for ensuring that the same plugins are available to both applications and that the Pulsar app is configured with a job_metrics_conf.xml that matches Galaxy's configuration for that destination. In practice the Pulsar and Galaxy code bases should stay in sync, and if one is not using per-destination metric configurations, this is simply a matter of ensuring that Pulsar and Galaxy are configured with identical job_metrics_conf.xml files.
See job_metrics_conf.xml.sample for the most up-to-date information on the available plugins.
The core plugin is enabled by default; it is a cross-platform plugin used to capture job run time and core allocation.
env is the only other cross-platform plugin (it should work with any *nix) and can be used to record all environment variables set at job runtime or just specific variables (e.g. PATH). This can be useful for debugging cluster issues.
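As a rough illustration of the "all variables or just specific ones" choice, the sketch below parses `env`-style output and optionally filters it. The function name `parse_env_output` and the sample text are hypothetical and are not taken from Galaxy's env plugin.

```python
def parse_env_output(text, variables=None):
    """Parse KEY=VALUE lines; keep only names in `variables` if it is given."""
    properties = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        if variables is None or key in variables:
            properties[key] = value
    return properties


# Made-up sample of what `env` might print on a cluster node.
sample = "PATH=/usr/bin:/bin\nHOME=/home/galaxy\nSLURM_JOB_ID=42\n"

everything = parse_env_output(sample)
only_path = parse_env_output(sample, variables={"PATH"})
```

This line-based approach does not handle values that contain newlines, so it should be read as a starting point only.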
The remaining plugins are Linux-only. These include cpuinfo and meminfo, which collect CPU and memory information about the node the Galaxy job runs on, and uname, which likewise collects host and operating system information.
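To give a feel for what collecting memory information can involve, here is a minimal sketch that turns `/proc/meminfo`-style text into a dictionary. The helper `parse_meminfo` and the sample values are illustrative assumptions, not Galaxy's implementation.

```python
def parse_meminfo(text):
    """Turn `/proc/meminfo`-style lines into a {name: integer value} dictionary.

    Most values in /proc/meminfo are reported in kB; a few are plain counts.
    """
    properties = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            properties[name.strip()] = int(fields[0])
    return properties


# Made-up sample in the format Linux uses for /proc/meminfo.
sample = (
    "MemTotal:       16384000 kB\n"
    "MemFree:         8192000 kB\n"
    "SwapTotal:             0 kB\n"
)
info = parse_meminfo(sample)
```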
A more experimental and more powerful plugin providing deep integration with Collectl is also available. This plugin has many possible uses and many configuration options. Two of the more useful possibilities are aggregating statistics across the process tree generated by a job to provide per-job resource statistics (detailed memory usage, a detailed CPU breakdown, coarse I/O information, and so on), or alternatively simply recording a Collectl output file for each job with detailed time-series data for all processes on the runtime system.
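The idea of aggregating statistics across a job's process tree can be sketched independently of Collectl. The example below assumes per-process samples are already available as hypothetical `(pid, ppid, rss_kb)` tuples and sums memory over a root process and all of its descendants; it does not use Collectl's file format or Galaxy's plugin code.

```python
from collections import defaultdict


def aggregate_process_tree(samples, root_pid):
    """Sum rss_kb over root_pid and every process descended from it."""
    children = defaultdict(list)
    rss = {}
    for pid, ppid, rss_kb in samples:
        children[ppid].append(pid)
        rss[pid] = rss_kb

    total = 0
    stack = [root_pid]
    seen = set()
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue  # guard against malformed, cyclic input
        seen.add(pid)
        total += rss.get(pid, 0)
        stack.extend(children[pid])
    return total


# Made-up snapshot: pid 100 is the job's root process, pid 200 is unrelated.
samples = [(100, 1, 2000), (101, 100, 50000), (102, 101, 30000), (200, 1, 99999)]
job_rss_kb = aggregate_process_tree(samples, root_pid=100)
```

Only processes reachable from the job's root process are counted, which is what makes the result a per-job figure rather than a per-node one.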
Viewing Collected Data
After a job completes successfully, all information collected by the job metric plugins should be visible to admin users under the dataset details for all of the datasets produced by the job. See the screenshots on the original pull request for examples.
Developing New Plugins
Galaxy ships with a good variety of plugins, ranging from very simple to very complex. A good place to start is to find the most similar existing plugin and use it as a template for the new one. Galaxy's existing plugins, and any new ones, should be placed in lib/galaxy/jobs/metrics/instrumenters/.
New plugins should subclass galaxy.jobs.metrics.instrumenters:InstrumentPlugin. The general strategy for implementing a plugin is to describe the commands that Galaxy should run on the remote server with pre_execute_instrument. These commands should typically write their output to files in the job's working directory. Produce relative file names with _instrument_file_name or full paths with _instrument_file_path, so that remote Galaxy job runners such as Pulsar can determine what needs to be shipped back to Galaxy for processing. Finally, plugins must implement a job_properties method that parses the files created at runtime, after the fact and back on the Galaxy server, and produces a dictionary of properties summarizing the job.
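The strategy above can be sketched as a toy plugin. To keep the example self-contained it does not import Galaxy; `InstrumentPluginStandIn` is a hypothetical stand-in for the real base class, and the method signatures and file-naming scheme shown are assumptions made for illustration. Consult the existing plugins for the actual interface.

```python
import os
import tempfile


class InstrumentPluginStandIn:
    """Hypothetical stand-in for Galaxy's InstrumentPlugin base class.

    The signatures and naming scheme here are assumptions for this sketch.
    """

    plugin_type = "base"

    def _instrument_file_name(self, name):
        # Relative file name, namespaced by plugin type.
        return f"__instrument_{self.plugin_type}_{name}"

    def _instrument_file_path(self, job_directory, name):
        return os.path.join(job_directory, self._instrument_file_name(name))


class HostnamePlugin(InstrumentPluginStandIn):
    """Toy plugin: record the hostname of the node that ran the job."""

    plugin_type = "hostname_sketch"

    def pre_execute_instrument(self, job_directory):
        # Shell command to be run on the execution node before the job.
        path = self._instrument_file_path(job_directory, "host")
        return f"hostname > '{path}'"

    def job_properties(self, job_id, job_directory):
        # Parse the file written at runtime into a summary dictionary.
        path = self._instrument_file_path(job_directory, "host")
        with open(path) as handle:
            return {"hostname": handle.read().strip()}


# Simulate a run: write the file the command would have produced, then parse it.
job_directory = tempfile.mkdtemp()
plugin = HostnamePlugin()
command = plugin.pre_execute_instrument(job_directory)
with open(plugin._instrument_file_path(job_directory, "host"), "w") as handle:
    handle.write("node01\n")
properties = plugin.job_properties(1, job_directory)
```

The command string is only built here, not executed; in a real deployment it would run on the execution node before the job itself.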
Plugins can also affect how these properties are displayed to the user by overriding the formatter attribute on the InstrumentPlugin (see existing plugins for examples).
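As a hedged illustration of what overriding the formatter might look like, the sketch below defines a formatter object that turns a raw property into a display title and value. The class names and the `format(key, value)` signature returning a `(title, text)` pair are assumptions made for this example, not a confirmed description of Galaxy's formatter interface.

```python
class SecondsFormatter:
    """Hypothetical formatter: maps a raw (key, value) pair to display text."""

    def format(self, key, value):
        seconds = int(value)
        minutes, secs = divmod(seconds, 60)
        hours, minutes = divmod(minutes, 60)
        title = key.replace("_", " ").capitalize()
        return title, f"{hours}h {minutes}m {secs}s"


class RuntimePluginSketch:
    """Hypothetical plugin; only the formatter attribute matters here."""

    # Overriding this attribute is what swaps in the custom display logic.
    formatter = SecondsFormatter()


title, text = RuntimePluginSketch.formatter.format("runtime_seconds", 3725)
```

Keeping display logic in a separate object means job_properties can return raw values while presentation is handled in one place.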