Warning
This document is for an in-development version of Galaxy. You can alternatively view this page in the latest release if it exists or view the top of the latest release's documentation.
Collection Semantics
This document describes the semantics around working with Galaxy dataset collections. In particular it describes how they operate within Galaxy tools and workflows.
You Probably Don’t Need to Read This
Any significantly sophisticated workflow language will have ways to collect data into arrays or vectors or dictionaries and apply operations across this data (mapping) or reduce the dimensionality of this data (reductions). Typically, this is explicitly annotated with map functions or for loops. Galaxy, however, is designed to be a point-and-click interface for connecting steps and running tools. It is important that steps just connect and just do the most natural thing - and this is what Galaxy does. This document just provides a mathematical formalism for that “what should just intuitively work” behavior, which can be used to document test cases and help with implementation. This is reference documentation, not user documentation; Galaxy should just work.
Mapping
If a tool consumes a simple dataset parameter and produces a simple dataset output, then any collection type may be “mapped over” the data input to that tool. The result is that the tool is applied to each element of the collection and “implicit collections” are created from the outputs of those operations. Those implicit collections have the same element identifiers in the same order as the input collection that is mapped over. Each element of the implicit collections corresponds to its own job, so Galaxy naturally parallelizes these jobs without extra work from the user and without any knowledge on the part of the tool.
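The mapping rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a collection is modelled as a plain dict from element identifier to dataset name, and `map_over` and `toy_tool` are hypothetical names, not Galaxy APIs.

```python
from typing import Callable, Dict


def map_over(tool: Callable[[str], str], collection: Dict[str, str]) -> Dict[str, str]:
    """Apply `tool` once per element, keeping identifiers and their order."""
    # Each call stands in for one independent Galaxy job.
    return {identifier: tool(dataset) for identifier, dataset in collection.items()}


def toy_tool(dataset: str) -> str:
    # Hypothetical single-dataset tool used only for this sketch.
    return f"tool({dataset})"


paired = {"forward": "d_f", "reverse": "d_r"}
implicit = map_over(toy_tool, paired)
```

The resulting `implicit` dict plays the role of the implicit collection: it has the identifiers `forward` and `reverse` in the same order as the input.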
Examples
Example: BASIC_MAPPING_PAIRED
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ dataset }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}paired,\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\}\text{>} \)
then
Example: BASIC_MAPPING_PAIRED_OR_UNPAIRED_PAIRED
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ dataset }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}paired\_or\_unpaired,\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\}\text{>} \)
then
Example: BASIC_MAPPING_PAIRED_OR_UNPAIRED_UNPAIRED
Assuming,
\( d_u \) is a dataset
\( tool \text{ is } (i: \text{ dataset }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}paired\_or\_unpaired,\left\{ \text{ unpaired }=d_u \right\}\text{>} \)
then
Example: BASIC_MAPPING_LIST
Assuming,
\( d_1,...,d_n \) are datasets
\( tool \text{ is } (i: \text{ dataset }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list,\left\{ \text{ i1 }=d_1, ..., \text{ in }=d_n \right\}\text{>} \)
then
The above description of mapping over inputs works naturally and as expected for nested collections.
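For nested collections, the same idea can be sketched recursively. The sketch assumes nested collections are modelled as nested dicts whose leaves are dataset names; `map_over_nested` and `toy_tool` are hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict


def map_over_nested(tool: Callable[[str], str], collection: Dict[str, Any]) -> Dict[str, Any]:
    """Apply `tool` to every leaf dataset, preserving the nested structure."""
    result: Dict[str, Any] = {}
    for identifier, element in collection.items():
        if isinstance(element, dict):
            # A sub-collection: recurse and keep its identifier.
            result[identifier] = map_over_nested(tool, element)
        else:
            # A dataset: one tool execution.
            result[identifier] = tool(element)
    return result


def toy_tool(dataset: str) -> str:
    # Hypothetical single-dataset tool used only for this sketch.
    return f"tool({dataset})"


list_of_lists = {"o1": {"inner": "d_1"}, "o2": {"inner": "d_2"}}
nested_output = map_over_nested(toy_tool, list_of_lists)
```

Under these assumptions the output has the same two ranks and the same identifiers at each rank as the input.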
Examples
Example: NESTED_LIST_MAPPING
Assuming,
\( d_1,...,d_n \) are datasets
\( tool \text{ is } (i: \text{ dataset }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list:list,\left\{ \text{ o1 }=\left\{ \text{ inner }=d_1 \right\}, ..., \text{ on }=\left\{ \text{ inner }=d_n \right\} \right\}\text{>} \)
then
Example: BASIC_MAPPING_LIST_PAIRED_OR_UNPAIRED
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ dataset }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list:paired\_or\_unpaired,\left\{ \text{ el1 }=\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\} \right\}\text{>} \)
then
For tools with multiple data inputs, one input can be mapped over a collection while the other inputs are supplied as individual datasets. The dataset that is not mapped over is passed unchanged as an input to every tool execution.
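A minimal sketch of this behavior, assuming collections are plain dicts and using the hypothetical names `map_with_fixed_input` and `toy_tool2`:

```python
from typing import Callable, Dict


def map_with_fixed_input(
    tool: Callable[[str, str], str],
    collection: Dict[str, str],
    fixed: str,
) -> Dict[str, str]:
    """Map `collection` over the first input; reuse `fixed` for the second."""
    return {identifier: tool(dataset, fixed) for identifier, dataset in collection.items()}


def toy_tool2(i: str, i2: str) -> str:
    # Hypothetical two-input tool used only for this sketch.
    return f"tool({i},{i2})"


samples = {"i1": "d_1", "i2": "d_2", "i3": "d_3"}
outputs = map_with_fixed_input(toy_tool2, samples, "d_o")
```

Every execution sees the same `d_o`, and the implicit collection still mirrors the identifiers of the mapped-over list.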
Examples
Example: BASIC_MAPPING_INCLUDING_SINGLE_DATASET
Assuming,
\( d_1,...,d_n \), \( d_o \) are datasets
\( tool \text{ is } (i: \text{ dataset }, i2: \text{ dataset }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list,\left\{ \text{ i1 }=d_1, ..., \text{ in }=d_n \right\}\text{>} \)
then
If a tool consumes two input datasets and produces one output dataset, you can map two collections with identical structure (same element identifiers in the same order) over the respective inputs and the result is an implicit collection with the same structure as the inputs and where each output in the implicit collection corresponds to the tool being executed with the two inputs corresponding to that position in the input collections.
The default behavior here is that the collections are linked, and the act of mapping over inputs to the tool is a sort of flat map or dot product: no extra dimensionality is added to the resulting collections.
From a user perspective this means if you start with a collection and apply a bunch of map over operations on tools - the results will all continue to match and work together very naturally - again without extra work by the user and without extra knowledge by the tool author.
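The linked ("dot product") behavior can be sketched as follows. This assumes collections are plain dicts and treats a structure mismatch as an error; `map_linked` and `toy_tool2` are hypothetical names, not Galaxy functions.

```python
from typing import Callable, Dict


def map_linked(
    tool: Callable[[str, str], str],
    c1: Dict[str, str],
    c2: Dict[str, str],
) -> Dict[str, str]:
    """Pair up elements by position and identifier; n inputs give n outputs."""
    if list(c1) != list(c2):
        raise ValueError("collections must have identical identifiers in the same order")
    return {identifier: tool(c1[identifier], c2[identifier]) for identifier in c1}


def toy_tool2(i: str, i2: str) -> str:
    # Hypothetical two-input tool used only for this sketch.
    return f"tool({i},{i2})"


C1 = {"i1": "d1_1", "i2": "d1_2"}
C2 = {"i1": "d2_1", "i2": "d2_2"}
linked = map_linked(toy_tool2, C1, C2)
```

Note that the result has two elements, not four: elements are matched pairwise rather than combined as a cross product.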
Examples
Example: BASIC_MAPPING_TWO_INPUTS_WITH_IDENTICAL_STRUCTURE
Assuming,
\( d1_1,...,d1_n \), \( d2_1,...,d2_n \) are datasets
\( tool \text{ is } (i: \text{ dataset }, i2: \text{ dataset }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C1 \) is \( \text{CollectionInstance<}list,\left\{ \text{ i1 }=d1_1, ..., \text{ in }=d1_n \right\}\text{>} \)
\( C2 \) is \( \text{CollectionInstance<}list,\left\{ \text{ i1 }=d2_1, ..., \text{ in }=d2_n \right\}\text{>} \)
then
Reduction
Not all tool executions result in implicit collections and mapping over inputs. Tool inputs of type data_collection can consume collections directly and do not necessarily result in mapping over.
Tools that consume collections and output datasets effectively reduce the dimension of the Galaxy data structure. When used at runtime this is often referred to as a “reduction” in the code.
Examples
Example: COLLECTION_INPUT_PAIRED
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ collection<paired> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}paired,\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\}\text{>} \)
then
Example: COLLECTION_INPUT_LIST
Assuming,
\( d_1,...,d_n \) are datasets
\( tool \text{ is } (i: \text{ collection<list> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list,\left\{ \text{ el1 }=d_1, ..., \text{ eln }=d_n \right\}\text{>} \)
then
Example: COLLECTION_INPUT_PAIRED_OR_UNPAIRED
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ collection<paired_or_unpaired> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}paired\_or\_unpaired,\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\}\text{>} \)
then
Example: COLLECTION_INPUT_LIST_PAIRED_OR_UNPAIRED
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ collection<list:paired_or_unpaired> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list:paired\_or\_unpaired,\left\{ \text{ el1 }=\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\} \right\}\text{>} \)
then
For nested collections where each rank is a list or a paired collection, collection inputs must match every part of the collection type input definition.
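Direct consumption by a collection input can be sketched as below. The sketch assumes collection types are colon-separated strings and collections are plain dicts, and it covers only types whose ranks are list or paired; `run_collection_tool` and `count_elements` are hypothetical names.

```python
from typing import Callable, Dict


def run_collection_tool(
    tool: Callable[[Dict[str, str]], str],
    input_type: str,
    collection_type: str,
    collection: Dict[str, str],
) -> str:
    """Run `tool` once on the whole collection if the types match rank by rank."""
    if input_type.split(":") != collection_type.split(":"):
        raise ValueError(f"{input_type} input cannot consume a {collection_type} collection")
    # One execution, one output dataset: no implicit collection is created.
    return tool(collection)


def count_elements(collection: Dict[str, str]) -> str:
    # Hypothetical collection-consuming tool used only for this sketch.
    return f"count={len(collection)}"


pair = {"forward": "d_f", "reverse": "d_r"}
reduced = run_collection_tool(count_elements, "paired", "paired", pair)
```

Passing the same `pair` to an input declared as `list` raises an error in this sketch, mirroring the rule that every rank must match.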
Examples
Example: COLLECTION_INPUT_LIST_NOT_CONSUMES_PAIRS
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ collection<list> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}paired,\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\}\text{>} \)
then
Example: COLLECTION_INPUT_PAIRED_NOT_CONSUMES_LIST
Assuming,
\( d_1,...,d_n \) are datasets
\( tool \text{ is } (i: \text{ collection<paired> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list,\left\{ \text{ i1 }=d_1, ..., \text{ in }=d_n \right\}\text{>} \)
then
In addition to explicit collection inputs, tool inputs of type data where multiple="true" can consume lists directly. This is likewise a “reduction” and does not result in implicit collection creation.
Examples
Example: LIST_REDUCTION
Assuming,
\( d_1,...,d_n \) are datasets
\( tool \text{ is } (i: \text{ dataset<multiple=true> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list,\left\{ \text{ i1 }=d_1, ..., \text{ in }=d_n \right\}\text{>} \)
then
Paired collections cannot be reduced this way. paired is not meant to represent a list/array/vector data structure - it is more like a tuple.
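The two rules for multiple data inputs (lists reduce, pairs do not) can be sketched together. This assumes a collection is a plain dict tagged with a type string; `reduce_multiple` and `concatenate` are hypothetical names for illustration only.

```python
from typing import Callable, Dict, List


def reduce_multiple(
    tool: Callable[[List[str]], str],
    collection_type: str,
    collection: Dict[str, str],
) -> str:
    """Feed all elements of a flat list to one execution of a multiple data input."""
    if collection_type != "list":
        # paired and paired_or_unpaired behave like tuples, not arrays.
        raise ValueError(f"a {collection_type} collection cannot be reduced by a multiple data input")
    return tool(list(collection.values()))


def concatenate(datasets: List[str]) -> str:
    # Hypothetical tool with a multiple="true" data input.
    return "+".join(datasets)


C = {"i1": "d_1", "i2": "d_2", "i3": "d_3"}
merged = reduce_multiple(concatenate, "list", C)
```

The list case yields a single dataset, while calling `reduce_multiple` with type `paired` raises an error in this sketch.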
Examples
Example: PAIRED_REDUCTION_INVALID
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ dataset<multiple=true> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}paired,\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\}\text{>} \)
then
Example: PAIRED_OR_UNPAIRED_REDUCTION_INVALID
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ dataset<multiple=true> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}paired\_or\_unpaired,\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\}\text{>} \)
then
Sub-collection Mapping
Examples
Example: MAPPING_LIST_PAIRED_OVER_PAIRED
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ collection<paired> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list:paired,\left\{ \text{ el1 }=\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\} \right\}\text{>} \)
\( C\_PAIRED \) is \( \text{CollectionInstance<}paired,\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\}\text{>} \)
then
The natural extension of multiple data input parameters consuming list collections, as described above when discussing reductions, is that nested lists of lists (list:list) can be mapped over a multiple data input parameter. Each nested list will be reduced by this operation but the results will be mapped over. The result will be a list with the same structure as the outer list of the input collection.
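A sketch of mapping a list:list over a multiple data input, assuming nested dicts stand in for nested collections; `map_reduce_nested` and `concatenate` are hypothetical names.

```python
from typing import Callable, Dict, List


def map_reduce_nested(
    tool: Callable[[List[str]], str],
    nested: Dict[str, Dict[str, str]],
) -> Dict[str, str]:
    """Reduce each inner list with `tool`; map over the outer list."""
    return {outer_id: tool(list(inner.values())) for outer_id, inner in nested.items()}


def concatenate(datasets: List[str]) -> str:
    # Hypothetical tool with a multiple="true" data input.
    return "+".join(datasets)


C = {
    "o1": {"a": "d_1", "b": "d_2"},
    "o2": {"a": "d_3"},
}
outer_list = map_reduce_nested(concatenate, C)
```

The output is a flat list keyed by the outer identifiers `o1` and `o2`; the inner rank has been consumed by the reduction.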
Examples
Example: NESTED_LIST_REDUCTION
Assuming,
\( d_1,...,d_n \) are datasets
\( tool \text{ is } (i: \text{ dataset<multiple=true> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list:list,\left\{ \text{ o1 }=\left\{ \text{ inner }=d_1 \right\}, ..., \text{ on }=\left\{ \text{ inner }=d_n \right\} \right\}\text{>} \)
then
Just as a paired collection won’t be reduced by a multiple data input, any sort of nested collection ending in a paired collection cannot be mapped over such an input. So a multiple data input parameter cannot be mapped over by a list of pairs (list:paired) for instance.
Examples
Example: LIST_PAIRED_REDUCTION_INVALID
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ dataset<multiple=true> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list:paired,\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\}\text{>} \)
then
Example: LIST_PAIRED_OR_UNPAIRED_REDUCTION_INVALID
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ dataset<multiple=true> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list:paired\_or\_unpaired,\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\}\text{>} \)
then
paired_or_unpaired Collections
The collection type paired_or_unpaired is meant to serve as a stand-in for an entity that can be either a single dataset or what is effectively a paired dataset collection. These collections either have one element with identifier unpaired or two elements with identifiers forward and reverse.
Tools can declare a data_collection input with collection type paired_or_unpaired, and that input will consume an explicit paired_or_unpaired collection normally or can consume a paired input.
Examples
Example: PAIRED_OR_UNPAIRED_CONSUMES_PAIRED
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ collection<paired_or_unpaired> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}paired,\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\}\text{>} \)
\( C\_AS\_MIXED \) is \( \text{CollectionInstance<}paired\_or\_unpaired,\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\}\text{>} \)
then
The inverse of this intentionally does not work. In some ways a paired collection acts as a paired_or_unpaired collection, but a paired_or_unpaired collection is not a paired collection. This makes a lot of sense in terms of tools - a tool consuming a paired dataset expects to find both a forward and a reverse element, but these may not exist in a paired_or_unpaired collection.
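The one-way compatibility between paired and paired_or_unpaired can be written as a small predicate. This sketch handles only the flat collection types discussed here, and `consumes_directly` is a hypothetical helper, not a Galaxy function.

```python
def consumes_directly(input_type: str, collection_type: str) -> bool:
    """Return True if a collection input of `input_type` accepts `collection_type`."""
    if input_type == collection_type:
        return True
    # A paired collection can stand in for paired_or_unpaired, but not the reverse:
    # a paired input needs both forward and reverse, which may be absent.
    return input_type == "paired_or_unpaired" and collection_type == "paired"


accepts_paired = consumes_directly("paired_or_unpaired", "paired")
accepts_mixed = consumes_directly("paired", "paired_or_unpaired")
```

Under these assumptions `accepts_paired` is true and `accepts_mixed` is false, matching the asymmetry described above.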
Examples
Example: PAIRED_OR_UNPAIRED_NOT_CONSUMED_BY_PAIRED
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ collection<paired> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}paired\_or\_unpaired,\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\}\text{>} \)
then
The same logic holds for mapping: lists of paired datasets (list:paired) can be mapped over these paired_or_unpaired inputs, and mixed lists of pairs (list:paired_or_unpaired) cannot be mapped over a paired input. Following the same logic, list:paired_or_unpaired cannot be mapped over a list input or multiple data input.
Examples
Example: MAPPING_LIST_PAIRED_OVER_PAIRED_OR_UNPAIRED
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ collection<paired_or_unpaired> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list:paired,\left\{ \text{ el }=\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\} \right\}\text{>} \)
\( C\_AS\_MIXED \) is \( \text{CollectionInstance<}list:paired\_or\_unpaired,\left\{ \text{ el }=\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\} \right\}\text{>} \)
then
Example: PAIRED_OR_UNPAIRED_NOT_CONSUMED_BY_PAIRED_WHEN_MAPPING
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ collection<paired> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list:paired\_or\_unpaired,\left\{ \text{ el }=\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\} \right\}\text{>} \)
then
Example: PAIRED_OR_UNPAIRED_NOT_CONSUMED_BY_LIST_WHEN_MAPPING
Assuming,
\( d_f \), \( d_r \) are datasets
\( tool \text{ is } (i: \text{ collection<list> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list:paired\_or\_unpaired,\left\{ \text{ el }=\left\{ \text{ forward }=d_f, \text{ reverse }=d_r \right\} \right\}\text{>} \)
then
This logic extends naturally into higher dimensional collections. A list:list:paired can be mapped over either a paired_or_unpaired input to produce a nested list (list:list) or a list:paired_or_unpaired input to produce a flat list (list).
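One way to reason about these cases is to compute the type of the implicit output collection from the input collection type and the tool's declared input type. The sketch below is an assumption-laden illustration, not Galaxy's implementation: types are colon-separated strings, `implicit_output_type` is a hypothetical helper, and mapping flat lists of single datasets over paired_or_unpaired inputs is deliberately left out.

```python
from typing import Optional


def implicit_output_type(collection_type: str, input_type: str) -> Optional[str]:
    """Return the collection type produced by mapping, or None if mapping is invalid."""
    coll = collection_type.split(":")
    inp = input_type.split(":")
    if len(inp) >= len(coll):
        return None  # No outer ranks remain to map over.
    suffix = coll[len(coll) - len(inp):]
    for depth, (have, want) in enumerate(zip(suffix, inp)):
        deepest = depth == len(inp) - 1
        if have == want:
            continue
        # paired may stand in for paired_or_unpaired, only at the deepest rank.
        if deepest and want == "paired_or_unpaired" and have == "paired":
            continue
        return None
    # The outer ranks that were not consumed form the implicit collection type.
    return ":".join(coll[: len(coll) - len(inp)])


nested = implicit_output_type("list:list:paired", "paired_or_unpaired")
flat = implicit_output_type("list:list:paired", "list:paired_or_unpaired")
invalid = implicit_output_type("list:paired_or_unpaired", "paired")
```

Here `nested` is `list:list`, `flat` is `list`, and `invalid` is `None`, in line with the cases described above.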
In order for paired_or_unpaired collections to also act as a single dataset, a flat list can be mapped over such an input with a special sub-collection mapping type of ‘single_datasets’.
Examples
Example: MAPPING_LIST_OVER_PAIRED_OR_UNPAIRED
Assuming,
\( d_1,...,d_n \) are datasets
\( tool \text{ is } (i: \text{ collection<paired_or_unpaired> }) \Rightarrow \{ o: \text{ dataset } \} \)
\( C \) is \( \text{CollectionInstance<}list,\left\{ \text{ i1 }=d_1, ..., \text{ in }=d_n \right\}\text{>} \)
\( C\_AS\_UNPAIRED_i \) is \( \text{CollectionInstance<}paired\_or\_unpaired,\left\{ \text{ unpaired }=d_i \right\}\text{>} \) for \( i \) from \( 1...n \)
then
This treatment of lists without pairing extends to nested structures naturally. For instance, a list of list of datasets (list:list) can be mapped over a paired_or_unpaired input to produce a nested list of lists (list:list) with a structure matching the input. Likewise, the nested list can be mapped over a list:paired_or_unpaired input to produce a flat list with the same structure as the outer list of the input.
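The first of these cases, mapping single datasets over a paired_or_unpaired input, can be sketched by wrapping every leaf dataset in a one-element collection before calling the tool. This assumes nested dicts stand in for collections; `as_unpaired`, `map_single_datasets`, and `describe` are hypothetical names for illustration.

```python
from typing import Any, Callable, Dict


def as_unpaired(dataset: str) -> Dict[str, str]:
    """Wrap a dataset as a paired_or_unpaired collection with one element."""
    return {"unpaired": dataset}


def map_single_datasets(
    tool: Callable[[Dict[str, str]], str],
    collection: Dict[str, Any],
) -> Dict[str, Any]:
    """Map a flat or nested list of datasets over a paired_or_unpaired input."""
    result: Dict[str, Any] = {}
    for identifier, element in collection.items():
        if isinstance(element, dict):
            result[identifier] = map_single_datasets(tool, element)
        else:
            result[identifier] = tool(as_unpaired(element))
    return result


def describe(reads: Dict[str, str]) -> str:
    # Hypothetical tool with a paired_or_unpaired collection input.
    if "unpaired" in reads:
        return f"single({reads['unpaired']})"
    return f"pair({reads['forward']},{reads['reverse']})"


flat_out = map_single_datasets(describe, {"i1": "d_1", "i2": "d_2"})
nested_out = map_single_datasets(describe, {"o1": {"a": "d_1"}, "o2": {"a": "d_2"}})
```

In both cases the output keeps the structure of the input, with each dataset seen by the tool as an unpaired collection.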
Due only to limited implementation time, the special casing that allows paired_or_unpaired to act as both datasets and paired collections only works when it is the deepest collection type. So while list:paired can be consumed by a list:paired_or_unpaired input, a paired:list cannot be consumed by a paired_or_unpaired:list input, though it should be able to for consistency. We have focused our time on data structures more likely to be used in actual Galaxy analyses given current and guessed future usage.