galaxy.datatypes.dataproviders package

Dataproviders are iterators with context managers that provide data to some consumer datum by datum.

As well as subclassing and overriding to get the proper data, Dataproviders can be piped from one to the other.

Note

Be careful NOT to pipe providers into subclasses of those providers. Subclasses provide all the functionality of their superclasses, so there's generally no need.

Note

When using piped providers that accept the same keywords in their __init__ functions (such as limit or offset), be careful to pass those keywords to the proper (often final) provider. The errors that result otherwise can be hard to diagnose.
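For example, a minimal sketch of piping one provider into another (the file name is illustrative, and the direct gzip-to-columnar pipe is an assumption based on the GzipDataProvider notes below); limit and offset go to the final provider in the chain:

from galaxy.datatypes.dataproviders import column, external

# 'data.tsv.gz' is a hypothetical gzipped, tab-separated file
with open("data.tsv.gz", "rb") as f:
    gunzipped = external.GzipDataProvider(f)
    # pass limit/offset to the final provider, not the inner one
    columns = column.ColumnarDataProvider(gunzipped, indeces=[0, 2], limit=10, offset=100)
    for row in columns:
        print(row)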

Submodules

galaxy.datatypes.dataproviders.base module

Base class(es) for all DataProviders.

class galaxy.datatypes.dataproviders.base.HasSettings(name, base_classes, attributes)[source]

Bases: type

Metaclass for data providers that allows defining and inheriting a dictionary named ‘settings’.

Useful for allowing class level access to expected variable types passed to class __init__ functions so they can be parsed from a query string.
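A hypothetical subclass, sketching how a class-level settings dictionary might be declared and merged with the inherited one by this metaclass (the 'uppercase' setting and filter logic are invented for illustration):

from galaxy.datatypes.dataproviders.line import FilteredLineDataProvider

class UpperCaseLineProvider(FilteredLineDataProvider):
    # merged with the inherited settings by the HasSettings metaclass
    settings = {"uppercase": "bool"}

    def __init__(self, source, uppercase=False, **kwargs):
        super().__init__(source, **kwargs)
        self.uppercase = uppercase

    def filter(self, line):
        line = super().filter(line)
        if line is not None and self.uppercase:
            line = line.upper()
        return line

# UpperCaseLineProvider.settings should now also contain the inherited
# 'limit', 'offset', 'comment_char', etc. keys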

class galaxy.datatypes.dataproviders.base.DataProvider(source, **kwargs)[source]

Bases: object

Base class for all data providers. Data providers:

  • have a source (which must be another file-like object)

  • implement both the iterator and context manager interfaces

  • do not allow write methods (but otherwise implement the other file object interface methods)

settings: Dict[str, str] = {}
__init__(source, **kwargs)[source]

Sets up a data provider, validates supplied source.

Parameters:

source – the source that this iterator will loop over. (Should implement the iterable interface and ideally have the context manager interface as well)
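A minimal usage sketch (the file name is assumed), relying on the iterator and context manager interfaces described above:

from galaxy.datatypes.dataproviders import base

with open("example.txt") as f:
    with base.DataProvider(f) as provider:
        # yields one datum at a time (here, one line of the file)
        for datum in provider:
            print(datum)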

validate_source(source)[source]

Is this a valid source for this provider?

Raises:

InvalidDataProviderSource – if the source is considered invalid.

Meant to be overridden in subclasses.

truncate(size)[source]
write(string)[source]
writelines(sequence)[source]
readlines()[source]
class galaxy.datatypes.dataproviders.base.FilteredDataProvider(source, filter_fn=None, **kwargs)[source]

Bases: DataProvider

Passes each datum through a filter function and yields it if that function returns a non-None value.

Also maintains counters:
  • num_data_read: how many data have been consumed from the source.

  • num_valid_data_read: how many data have been returned from filter.

  • num_data_returned: how many data this provider has yielded.

__init__(source, filter_fn=None, **kwargs)[source]
Parameters:

filter_fn – a lambda or function that will be passed a datum and return either the (optionally modified) datum or None.
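A short sketch of a filter_fn (the file name and filter logic are assumed): keep only non-empty lines, returning None to drop a datum:

from galaxy.datatypes.dataproviders import base

def non_empty(datum):
    datum = datum.strip()
    return datum or None  # returning None filters the datum out

with open("data.txt") as f:
    provider = base.FilteredDataProvider(f, filter_fn=non_empty)
    kept = list(provider)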

filter(datum)[source]

When given a datum from the provider’s source, return None if the datum ‘does not pass’ the filter or is invalid. Return the datum if it’s valid.

Parameters:

datum – the datum to check for validity.

Returns:

the datum, a modified datum, or None

Meant to be overridden.

settings: Dict[str, str] = {}
class galaxy.datatypes.dataproviders.base.LimitedOffsetDataProvider(source, offset=0, limit=None, **kwargs)[source]

Bases: FilteredDataProvider

A provider that uses the counters from FilteredDataProvider to limit the number of data and/or skip offset number of data before providing.

Useful for grabbing sections from a source (e.g. pagination).

settings: Dict[str, str] = {'limit': 'int', 'offset': 'int'}
__init__(source, offset=0, limit=None, **kwargs)[source]
Parameters:
  • offset – the number of data to skip before providing.

  • limit – the final number of data to provide.
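A pagination sketch using the parameters above (the file name is illustrative): skip the first 100 data, then provide at most the next 20:

from galaxy.datatypes.dataproviders import base

with open("data.txt") as f:
    page = base.LimitedOffsetDataProvider(f, offset=100, limit=20)
    rows = list(page)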

class galaxy.datatypes.dataproviders.base.MultiSourceDataProvider(source_list, **kwargs)[source]

Bases: DataProvider

A provider that iterates over a list of given sources and provides data from one after another.

An iterator over iterators.

__init__(source_list, **kwargs)[source]
Parameters:

source_list – an iterator of iterables

settings: Dict[str, str] = {}

galaxy.datatypes.dataproviders.chunk module

Chunk (N bytes at offset M from a source's beginning) provider.

Primarily for file sources, but usable by any source that has both seek and read(N).

class galaxy.datatypes.dataproviders.chunk.ChunkDataProvider(source, chunk_index=0, chunk_size=65536, **kwargs)[source]

Bases: DataProvider

Data provider that yields chunks of data from its file.

Note: this version does not account for lines and works with Binary datatypes.

MAX_CHUNK_SIZE = 65536
DEFAULT_CHUNK_SIZE = 65536
settings: Dict[str, str] = {'chunk_index': 'int', 'chunk_size': 'int'}
__init__(source, chunk_index=0, chunk_size=65536, **kwargs)[source]
Parameters:
  • chunk_index – if a source can be divided into N chunk_size sections, this is the index of the section to return.

  • chunk_size – the desired size of the chunks to return (generally in bytes).
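A usage sketch (the file name is illustrative): read 64 KB chunks of a binary file, starting at the third chunk:

from galaxy.datatypes.dataproviders import chunk

with open("archive.bin", "rb") as f:
    provider = chunk.ChunkDataProvider(f, chunk_index=2, chunk_size=65536)
    for piece in provider:
        print(len(piece))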

validate_source(source)[source]

Does the given source have both the methods seek and read?

Raises:

InvalidDataProviderSource – if not.

encode(chunk)[source]

Called on the chunk before returning.

Override to modify, encode, or decode chunks.

class galaxy.datatypes.dataproviders.chunk.Base64ChunkDataProvider(source, chunk_index=0, chunk_size=65536, **kwargs)[source]

Bases: ChunkDataProvider

Data provider that yields chunks of base64 encoded data from its file.

encode(chunk)[source]

Return chunks encoded in base 64.

settings: Dict[str, str] = {'chunk_index': 'int', 'chunk_size': 'int'}

galaxy.datatypes.dataproviders.column module

Providers that generally provide lists of lists, where each line of a source is further subdivided into multiple data (e.g. columns from a line).

class galaxy.datatypes.dataproviders.column.ColumnarDataProvider(source, indeces=None, column_count=None, column_types=None, parsers=None, parse_columns=True, deliminator='\t', filters=None, **kwargs)[source]

Bases: RegexLineDataProvider

Data provider that provides a list of columns from the lines of its source.

Columns are returned in the order given in indeces, so this provider can re-arrange columns.

If any desired index is outside the actual number of columns in the source, this provider will None-pad the output and you are guaranteed the same number of columns as the number of indeces asked for (even if they are filled with None).

settings: Dict[str, str] = {'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
__init__(source, indeces=None, column_count=None, column_types=None, parsers=None, parse_columns=True, deliminator='\t', filters=None, **kwargs)[source]
Parameters:
  • indeces (list or None) – a list of indeces of columns to gather from each row. Optional: will default to None. If None, this provider will return all rows (even when a particular row contains more/fewer elements than others). If a row/line does not contain an element at a given index, the provider will return/fill with a None value as the element.

  • column_count (int) – an alternate means of defining indeces, use an int here to effectively provide the first N columns. Optional: will default to None.

  • column_types (list of strings) – a list of string names of types that the provider will use to look up an appropriate parser for the column. (e.g. ‘int’, ‘float’, ‘str’, ‘bool’) Optional: will default to parsing all columns as strings.

  • parsers (dictionary) – a dictionary keyed with column type strings and with values that are functions to use when parsing those types. Optional: will default to using the function _get_default_parsers.

  • parse_columns (bool) – attempt to parse columns? Optional: defaults to True.

  • deliminator (str) – character(s) used to split each row/line of the source. Optional: defaults to the tab character.

Note

that the subclass constructors are passed kwargs - so their params (limit, offset, etc.) are also applicable here.
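A usage sketch (file name and column layout are assumed): parse columns 0 and 2 of a tab-separated file as a string and a float:

from galaxy.datatypes.dataproviders import column

with open("scores.tsv") as f:
    provider = column.ColumnarDataProvider(
        f, indeces=[0, 2], column_types=["str", "float"], limit=50
    )
    # each datum is a list of the parsed columns
    for name, score in provider:
        print(name, score)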

parse_filter(filter_param_str)[source]
create_numeric_filter(column, op, val)[source]

Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.

The function will compare the column at index column against val using the given op where op is one of:

  • lt: less than

  • le: less than or equal to

  • eq: equal to

  • ne: not equal to

  • ge: greater than or equal to

  • gt: greater than

val is cast to float here and None is returned if there's a parsing error.

create_string_filter(column, op, val)[source]

Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.

The function will compare the column at index column against val using the given op where op is one of:

  • eq: exactly matches

  • has: the column contains the substring val

  • re: the column matches the regular expression in val

create_list_filter(column, op, val)[source]

Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.

The function will compare the column at index column against val using the given op where op is one of:

  • eq: the list val exactly matches the list in the column

  • has: the list in the column contains the sublist val

get_default_parsers()[source]

Return parser dictionary keyed for each columnar type (as defined in datatypes).

Note

primitives only by default (str, int, float, boolean, None). Other (more complex) types are retrieved as strings.

Returns:

a dictionary of the form: { <parser type name> : <function used to parse type> }

filter(line)[source]

Determines whether to provide line or not.

Parameters:

line (str) – the incoming line from the source

Returns:

a line or None

parse_columns_from_line(line)[source]

Returns a list of the desired, parsed columns.

Parameters:

line (str) – the line to parse

parse_column_at_index(columns, parser_index, index)[source]

Get the column type for the parser from self.column_types or None if the type is unavailable.

parse_value(val, type)[source]

Attempt to parse and return the given value based on the given type.

Parameters:
  • val – the column value to parse (often a string)

  • type – the string type ‘name’ used to find the appropriate parser

Returns:

the parsed value, or the original value if no parser is found for type, or None if there was a parsing error (ValueError)

get_column_type(index)[source]

Get the column type for the parser from self.column_types or None if the type is unavailable.

Parameters:

index – the column index

Returns:

string name of type (e.g. ‘float’, ‘int’, etc.)

filter_by_columns(columns)[source]
class galaxy.datatypes.dataproviders.column.DictDataProvider(source, column_names=None, **kwargs)[source]

Bases: ColumnarDataProvider

Data provider that zips column_names and columns from the source’s contents into a dictionary.

A combination use of both column_names and indeces allows ‘picking’ key/value pairs from the source.

Note

The subclass constructors are passed kwargs - so their params (limit, offset, etc.) are also applicable here.

settings: Dict[str, str] = {'column_count': 'int', 'column_names': 'list:str', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
__init__(source, column_names=None, **kwargs)[source]
Parameters:

column_names – an ordered list of strings that will be used as the keys for each column in the returned dictionaries. The number of key/value pairs in each returned dictionary will be at most the number of column names provided.
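A usage sketch (file name and columns are assumed): zip column names onto selected columns to get dictionaries:

from galaxy.datatypes.dataproviders import column

with open("people.tsv") as f:
    provider = column.DictDataProvider(f, column_names=["name", "age"], indeces=[0, 3])
    for record in provider:
        # e.g. {'name': 'Ada', 'age': '36'}
        print(record)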

galaxy.datatypes.dataproviders.dataset module

Dataproviders that either:
  • use the file contents and/or metadata from a Galaxy DatasetInstance as their source, or

  • provide data in some way relevant to bioinformatic data (e.g. parsing genomic regions from their source).

class galaxy.datatypes.dataproviders.dataset.DatasetDataProvider(dataset, **kwargs)[source]

Bases: DataProvider

Class that uses the file contents and/or metadata from a Galaxy DatasetInstance as its source.

DatasetDataProvider can be seen as the intersection between a datatype’s metadata and a dataset’s file contents. It (so far) mainly provides helper and convenience methods for using dataset metadata to set up and control how the data is provided.

__init__(dataset, **kwargs)[source]
Parameters:

dataset (model.DatasetInstance) – the Galaxy dataset whose file will be the source
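A hypothetical sketch (hda stands in for a real model.DatasetInstance): a DatasetDataProvider is typically piped into higher-level providers, much as the Dataset* classes below do internally:

from galaxy.datatypes.dataproviders import column, dataset

# 'hda' is a hypothetical Galaxy model.DatasetInstance
dataset_source = dataset.DatasetDataProvider(hda)
columns = column.ColumnarDataProvider(dataset_source, indeces=[0, 1, 2], limit=10)
first_rows = list(columns)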

classmethod get_column_metadata_from_dataset(dataset)[source]

Convenience class method to get column metadata from a dataset.

Returns:

a dictionary of column_count, column_types, and column_names if they’re available, setting each to None if not.

get_metadata_column_types(indeces=None)[source]

Return the list of column_types for this dataset or None if unavailable.

Parameters:

indeces (list of ints) – the indeces for the columns of which to return the types. Optional: defaults to None (return all types)

get_metadata_column_names(indeces=None)[source]

Return the list of column_names for this dataset or None if unavailable.

Parameters:

indeces (list of ints) – the indeces for the columns of which to return the names. Optional: defaults to None (return all names)

get_indeces_by_column_names(list_of_column_names)[source]

Return the list of column indeces when given a list of column_names.

Parameters:

list_of_column_names (list of strs) – the names of the columns of which to get indeces.

Raises:
  • KeyError – if column_names are not found

  • ValueError – if an entry in list_of_column_names is not in column_names

get_metadata_column_index_by_name(name)[source]

Return the 1-based index of a source’s column with the given name.

get_genomic_region_indeces(check=False)[source]

Return a list of column indeces for ‘chromCol’, ‘startCol’, ‘endCol’ from a source representing a genomic region.

Parameters:

check (bool) – if True will raise a ValueError if any were not found.

Raises:

ValueError – if check is True and one or more indeces were not found.

Returns:

list of column indeces for the named columns.

settings: Dict[str, str] = {}
class galaxy.datatypes.dataproviders.dataset.ConvertedDatasetDataProvider(dataset, **kwargs)[source]

Bases: DatasetDataProvider

Class that uses the file contents of a dataset after conversion to a different format.

__init__(dataset, **kwargs)[source]
Parameters:

dataset (model.DatasetInstance) – the Galaxy dataset whose file will be the source

convert_dataset(dataset, **kwargs)[source]

Convert the given dataset in some way.

settings: Dict[str, str] = {}
class galaxy.datatypes.dataproviders.dataset.DatasetColumnarDataProvider(dataset, **kwargs)[source]

Bases: ColumnarDataProvider

Data provider that uses a DatasetDataProvider as its source and the dataset’s metadata to build settings for the ColumnarDataProvider it’s inherited from.

__init__(dataset, **kwargs)[source]

All kwargs are inherited from ColumnarDataProvider (see also: column.ColumnarDataProvider).

If no kwargs are given, this class will attempt to get those kwargs from the dataset source’s metadata. If any kwarg is given, it will override and be used in place of any metadata available.

settings: Dict[str, str] = {'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
class galaxy.datatypes.dataproviders.dataset.DatasetDictDataProvider(dataset, **kwargs)[source]

Bases: DictDataProvider

Data provider that uses a DatasetDataProvider as its source and the dataset’s metadata to build settings for the DictDataProvider it’s inherited from.

__init__(dataset, **kwargs)[source]

All kwargs are inherited from DictDataProvider (see also: column.DictDataProvider).

If no kwargs are given, this class will attempt to get those kwargs from the dataset source’s metadata. If any kwarg is given, it will override and be used in place of any metadata available.

The relationship between column_names and indeces is more complex:

+-----------------+-------------------------------+-----------------------+
|                 | Indeces given                 | Indeces NOT given     |
+=================+===============================+=======================+
| Names given     | pull indeces, rename w/ names | pull by name          |
+-----------------+-------------------------------+-----------------------+
| Names NOT given | pull indeces, name w/ meta    | pull all, name w/meta |
+-----------------+-------------------------------+-----------------------+

settings: Dict[str, str] = {'column_count': 'int', 'column_names': 'list:str', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
class galaxy.datatypes.dataproviders.dataset.GenomicRegionDataProvider(dataset, chrom_column=None, start_column=None, end_column=None, named_columns=False, **kwargs)[source]

Bases: ColumnarDataProvider

Data provider that parses chromosome, start, and end data from a file using the dataset’s metadata settings.

Is a ColumnarDataProvider that uses a DatasetDataProvider as its source.

If named_columns is true, will return dictionaries with the keys ‘chrom’, ‘start’, ‘end’.

COLUMN_NAMES = ['chrom', 'start', 'end']
settings: Dict[str, str] = {'chrom_column': 'int', 'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'end_column': 'int', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'named_columns': 'bool', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'start_column': 'int', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
__init__(dataset, chrom_column=None, start_column=None, end_column=None, named_columns=False, **kwargs)[source]
Parameters:
  • dataset (model.DatasetInstance) – the Galaxy dataset whose file will be the source

  • chrom_column (int) – optionally specify the chrom column index

  • start_column (int) – optionally specify the start column index

  • end_column (int) – optionally specify the end column index

  • named_columns (bool) – optionally return dictionaries keying each column with ‘chrom’, ‘start’, or ‘end’. Optional: defaults to False
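A hypothetical sketch (hda stands in for an interval/BED-like model.DatasetInstance):

from galaxy.datatypes.dataproviders import dataset

# 'hda' is a hypothetical interval/BED-like model.DatasetInstance
regions = dataset.GenomicRegionDataProvider(hda, named_columns=True, limit=5)
for region in regions:
    # e.g. {'chrom': 'chr1', 'start': 1000, 'end': 1500}
    print(region)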

class galaxy.datatypes.dataproviders.dataset.IntervalDataProvider(dataset, chrom_column=None, start_column=None, end_column=None, strand_column=None, name_column=None, named_columns=False, **kwargs)[source]

Bases: ColumnarDataProvider

Data provider that parses chromosome, start, and end data (as well as strand and name if set in the metadata) using the dataset’s metadata settings.

If named_columns is true, will return dictionaries with the keys ‘chrom’, ‘start’, ‘end’ (and ‘strand’ and ‘name’ if available).

COLUMN_NAMES = ['chrom', 'start', 'end', 'strand', 'name']
settings: Dict[str, str] = {'chrom_column': 'int', 'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'end_column': 'int', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'name_column': 'int', 'named_columns': 'bool', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'start_column': 'int', 'strand_column': 'int', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
__init__(dataset, chrom_column=None, start_column=None, end_column=None, strand_column=None, name_column=None, named_columns=False, **kwargs)[source]
Parameters:
  • dataset (model.DatasetInstance) – the Galaxy dataset whose file will be the source

  • named_columns (bool) – optionally return dictionaries keying each column with ‘chrom’, ‘start’, ‘end’, ‘strand’, or ‘name’. Optional: defaults to False

class galaxy.datatypes.dataproviders.dataset.FastaDataProvider(source, ids=None, **kwargs)[source]

Bases: FilteredDataProvider

Class that returns fasta format data in a list of maps of the form:

{
    id: <fasta header id>,
    sequence: <joined lines of nucleotide/amino data>
}
settings: Dict[str, str] = {'ids': 'list:str'}
__init__(source, ids=None, **kwargs)[source]
Parameters:

ids (list or None) – optionally return only ids (and sequences) that are in this list. Optional: defaults to None (provide all ids)

class galaxy.datatypes.dataproviders.dataset.TwoBitFastaDataProvider(source, ids=None, **kwargs)[source]

Bases: DatasetDataProvider

Class that returns fasta format data in a list of maps of the form:

{
    id: <fasta header id>,
    sequence: <joined lines of nucleotide/amino data>
}
settings: Dict[str, str] = {'ids': 'list:str'}
__init__(source, ids=None, **kwargs)[source]
Parameters:

ids (list or None) – optionally return only ids (and sequences) that are in this list. Optional: defaults to None (provide all ids)

class galaxy.datatypes.dataproviders.dataset.WiggleDataProvider(source, named_columns=False, column_names=None, **kwargs)[source]

Bases: LimitedOffsetDataProvider

Class that returns chrom, pos, data from a wiggle source.

COLUMN_NAMES = ['chrom', 'pos', 'value']
settings: Dict[str, str] = {'column_names': 'list:str', 'limit': 'int', 'named_columns': 'bool', 'offset': 'int'}
__init__(source, named_columns=False, column_names=None, **kwargs)[source]
Parameters:
  • named_columns (bool) – optionally return dictionaries keying each column with ‘chrom’, ‘start’, ‘end’, ‘strand’, or ‘name’. Optional: defaults to False

  • column_names – an ordered list of strings that will be used as the keys for each column in the returned dictionaries. The number of key/value pairs in each returned dictionary will be at most the number of column names provided.

class galaxy.datatypes.dataproviders.dataset.BigWigDataProvider(source, chrom, start, end, named_columns=False, column_names=None, **kwargs)[source]

Bases: LimitedOffsetDataProvider

Class that returns chrom, pos, data from a wiggle source.

COLUMN_NAMES = ['chrom', 'pos', 'value']
settings: Dict[str, str] = {'column_names': 'list:str', 'limit': 'int', 'named_columns': 'bool', 'offset': 'int'}
__init__(source, chrom, start, end, named_columns=False, column_names=None, **kwargs)[source]
Parameters:
  • chrom (str) – which chromosome within the bigwig file to extract data for

  • start (int) – the start of the region from which to extract data

  • end (int) – the end of the region from which to extract data

  • named_columns (bool) – optionally return dictionaries keying each column with ‘chrom’, ‘start’, ‘end’, ‘strand’, or ‘name’. Optional: defaults to False

  • column_names – an ordered list of strings that will be used as the keys for each column in the returned dictionaries. The number of key/value pairs in each returned dictionary will be at most the number of column names provided.

class galaxy.datatypes.dataproviders.dataset.DatasetSubprocessDataProvider(dataset, *args, **kwargs)[source]

Bases: SubprocessDataProvider

Create a source from running a subprocess on a dataset’s file.

Uses a subprocess as its source and has a dataset (generally as an input file for the process).

__init__(dataset, *args, **kwargs)[source]
Parameters:

args (variadic function args) – the list of strings used to build commands.

settings: Dict[str, str] = {}
class galaxy.datatypes.dataproviders.dataset.SamtoolsDataProvider(dataset, options_string='', options_dict=None, regions=None, **kwargs)[source]

Bases: RegexLineDataProvider

Data provider that uses samtools on a Sam or Bam file as its source.

This can be piped through other providers (column, map, genome region, etc.).

Note

that only the samtools ‘view’ command is currently implemented.

FLAGS_WO_ARGS = 'bhHSu1xXcB'
FLAGS_W_ARGS = 'fFqlrs'
VALID_FLAGS = 'bhHSu1xXcBfFqlrs'
__init__(dataset, options_string='', options_dict=None, regions=None, **kwargs)[source]
Parameters:
  • options_string (str) – samtools options in string form (flags separated by spaces). Optional: defaults to ‘’

  • options_dict (dict or None) – dictionary of samtools options Optional: defaults to None

  • regions (list of str or None) – list of samtools regions strings Optional: defaults to None
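A hypothetical sketch (bam_hda stands in for a BAM model.DatasetInstance; the region string is illustrative):

from galaxy.datatypes.dataproviders import dataset

# 'bam_hda' is a hypothetical BAM model.DatasetInstance
provider = dataset.SamtoolsDataProvider(bam_hda, regions=["chr1:1000-2000"], limit=100)
for sam_line in provider:
    print(sam_line)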

build_command_list(subcommand, options_string, options_dict, regions)[source]

Convert all init args to list form.

to_options_list(options_string, options_dict)[source]

Convert both options_string and options_dict to list form while filtering out non-‘valid’ options.

classmethod extract_options_from_dict(dictionary)[source]

Separates valid samtools key/value pair options from a dictionary and returns both as a 2-tuple.

settings: Dict[str, str] = {'comment_char': 'str', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
class galaxy.datatypes.dataproviders.dataset.SQliteDataProvider(source, query=None, **kwargs)[source]

Bases: DataProvider

Data provider that uses a sqlite database file as its source.

Allows any query to be run and returns the resulting rows as sqlite3 row objects

settings: Dict[str, str] = {'query': 'str'}
__init__(source, query=None, **kwargs)[source]

Sets up a data provider, validates supplied source.

Parameters:

source – the source that this iterator will loop over. (Should implement the iterable interface and ideally have the context manager interface as well)

class galaxy.datatypes.dataproviders.dataset.SQliteDataTableProvider(source, query=None, headers=False, limit=9223372036854775807, **kwargs)[source]

Bases: DataProvider

Data provider that uses a sqlite database file as its source. Allows any query to be run and returns the resulting rows as arrays of arrays

settings: Dict[str, str] = {'headers': 'bool', 'limit': 'int', 'query': 'str'}
__init__(source, query=None, headers=False, limit=9223372036854775807, **kwargs)[source]

Sets up a data provider, validates supplied source.

Parameters:

source – the source that this iterator will loop over. (Should implement the iterable interface and ideally have the context manager interface as well)

class galaxy.datatypes.dataproviders.dataset.SQliteDataDictProvider(source, query=None, **kwargs)[source]

Bases: DataProvider

Data provider that uses a sqlite database file as its source. Allows any query to be run and returns the resulting rows as arrays of dicts

settings: Dict[str, str] = {'query': 'str'}
__init__(source, query=None, **kwargs)[source]

Sets up a data provider, validates supplied source.

Parameters:

source – the source that this iterator will loop over. (Should implement the iterable interface and ideally have the context manager interface as well)

galaxy.datatypes.dataproviders.decorators module

DataProvider related decorators.

galaxy.datatypes.dataproviders.decorators.has_dataproviders(cls)[source]

Wraps a class (generally a Datatype), finds methods within that have been decorated with @dataprovider_factory and adds them, by their name, to a map in the class.

This allows a class to maintain a name -> method map, effectively ‘registering’ dataprovider factory methods:

@has_dataproviders
class MyDtype( data.Data ):

    @dataprovider_factory( 'bler' )
    def provide_some_bler( self, dataset, **settings ):
        '''blerblerbler'''
        dataset_source = providers.DatasetDataProvider( dataset )
        # ... chain other, intermediate providers here
        return providers.BlerDataProvider( dataset_source, **settings )

# use the base method in data.Data
provider = dataset.datatype.dataprovider( dataset, 'bler',
                                          my_setting='blah', ... )
# OR directly from the map
provider = dataset.datatype.dataproviders[ 'bler' ]( dataset,
                                                     my_setting='blah', ... )
galaxy.datatypes.dataproviders.decorators.dataprovider_factory(name, settings=None)[source]

Wraps a class method, marks it as a dataprovider factory, and attaches a function that parses query strings into __init__ arguments as the parse_query_string_settings attribute of the factory function.

An example use of the parse_query_string_settings:

kwargs = dataset.datatype.dataproviders[ provider ].parse_query_string_settings( query_kwargs )
return list( dataset.datatype.dataprovider( dataset, provider, **kwargs ) )
Parameters:
  • name (any hashable var) – what name/key to register the factory under in cls.dataproviders

  • settings (dictionary) – dictionary containing key/type pairs for parsing query strings to __init__ arguments
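A hypothetical factory sketch in the style of the has_dataproviders example above (the datatype and method names are invented), registering the settings used to parse query strings:

@has_dataproviders
class MyTabularDtype( data.Data ):

    @dataprovider_factory( 'columns', settings={ 'indeces': 'list:int', 'limit': 'int', 'offset': 'int' } )
    def column_dataprovider( self, dataset, **settings ):
        dataset_source = providers.DatasetDataProvider( dataset )
        return providers.ColumnarDataProvider( dataset_source, **settings )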

galaxy.datatypes.dataproviders.exceptions module

DataProvider related exceptions.

exception galaxy.datatypes.dataproviders.exceptions.InvalidDataProviderSource(source=None, msg='')[source]

Bases: TypeError

Raised when an unusable source is passed to a provider.

__init__(source=None, msg='')[source]
exception galaxy.datatypes.dataproviders.exceptions.NoProviderAvailable(factory_source, format_requested=None, msg='')[source]

Bases: TypeError

Raised when no provider is found for the given format_requested.

Parameters:
  • factory_source – the item that the provider was requested from

  • format_requested – the format_requested (a hashable key to access factory_source.datatypes with)

Both params are attached to this exception and accessible to the exception handler.

Meant to be used within a class that builds dataproviders (e.g. a Datatype)

__init__(factory_source, format_requested=None, msg='')[source]

galaxy.datatypes.dataproviders.external module

Data providers that iterate over a source that is not in memory or not in a file.

class galaxy.datatypes.dataproviders.external.SubprocessDataProvider(*args, **kwargs)[source]

Bases: DataProvider

Data provider that uses the output of an intermediate program, run as a subprocess, as its data source.

__init__(*args, **kwargs)[source]
Parameters:

args (variadic function args) – the list of strings used to build commands.

subprocess(*command_list, **kwargs)[source]
Parameters:

args (variadic function args) – the list of strings used as commands.

settings: Dict[str, str] = {}
class galaxy.datatypes.dataproviders.external.RegexSubprocessDataProvider(*args, **kwargs)[source]

Bases: RegexLineDataProvider

RegexLineDataProvider that uses a SubprocessDataProvider as its data source.

__init__(*args, **kwargs)[source]
Parameters:
  • regex_list (list (of str)) – list of strings or regular expression strings that will be matched against each line. Optional: defaults to None (no matching)

  • invert (bool) – if True will provide only lines that do not match. Optional: defaults to False

settings: Dict[str, str] = {'comment_char': 'str', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
class galaxy.datatypes.dataproviders.external.URLDataProvider(url, method='GET', data=None, **kwargs)[source]

Bases: DataProvider

Data provider that uses the contents of a URL for its data source.

This can be piped through other providers (column, map, genome region, etc.).

VALID_METHODS = ('GET', 'POST')
__init__(url, method='GET', data=None, **kwargs)[source]
Parameters:
  • url – the base URL to open.

  • method – the HTTP method to use. Optional: defaults to ‘GET’

  • data (dict) – any data to pass (either in query for ‘GET’ or as post data with ‘POST’)
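A short sketch (the URL is illustrative):

from galaxy.datatypes.dataproviders import external

provider = external.URLDataProvider("https://example.org/data.tsv", method="GET")
for line in provider:
    print(line)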

settings: Dict[str, str] = {}
class galaxy.datatypes.dataproviders.external.GzipDataProvider(source, **kwargs)[source]

Bases: DataProvider

Data provider that uses g(un)zip on a file as its source.

This can be piped through other providers (column, map, genome region, etc.).

__init__(source, **kwargs)[source]

Sets up a data provider, validates supplied source.

Parameters:

source – the source that this iterator will loop over. (Should implement the iterable interface and ideally have the context manager interface as well)

settings: Dict[str, str] = {}
class galaxy.datatypes.dataproviders.external.TempfileDataProvider(source, **kwargs)[source]

Bases: DataProvider

Writes the data from the given source to a temp file, allowing it to be used as a source where a file_name is needed (e.g. as a parameter to a command line tool: samtools view -t <this_provider.source.get_file_name()>)

__init__(source, **kwargs)[source]

Sets up a data provider, validates supplied source.

Parameters:

source – the source that this iterator will loop over. (Should implement the iterable interface and ideally have the context manager interface as well)

create_file()[source]
settings: Dict[str, str] = {}
write_to_file()[source]

galaxy.datatypes.dataproviders.hierarchy module

Dataproviders that iterate over lines from their sources.

class galaxy.datatypes.dataproviders.hierarchy.HierarchalDataProvider(source, **kwargs)[source]

Bases: BlockDataProvider

Class for formats where a datum may have parent or child data.

e.g. XML, HTML, GFF3, Phylogenetic

__init__(source, **kwargs)[source]
Parameters:
  • new_block_delim_fn (function) – T/F function to determine whether a given line is the start of a new block.

  • block_filter_fn (function) – function that determines if a block is valid and will be provided. Optional: defaults to None (no filtering)

settings: Dict[str, str] = {'limit': 'int', 'offset': 'int'}
class galaxy.datatypes.dataproviders.hierarchy.XMLDataProvider(source, selector=None, max_depth=None, **kwargs)[source]

Bases: HierarchalDataProvider

Data provider that converts selected XML elements to dictionaries.

settings: Dict[str, str] = {'limit': 'int', 'max_depth': 'int', 'offset': 'int', 'selector': 'str'}
ITERPARSE_ALL_EVENTS = ('start', 'end', 'start-ns', 'end-ns')
__init__(source, selector=None, max_depth=None, **kwargs)[source]
Parameters:
  • selector – some partial string in the desired tags to return

  • max_depth – the number of generations of descendants to return
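A usage sketch (file name and tag are assumed): provide dictionaries for elements whose tag contains ‘gene’, descending one generation:

from galaxy.datatypes.dataproviders import hierarchy

with open("annotation.xml") as f:
    provider = hierarchy.XMLDataProvider(f, selector="gene", max_depth=1)
    for element_dict in provider:
        print(element_dict)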

matches_selector(element, selector=None)[source]

Returns true if the element matches the selector.

Parameters:
  • element – an XML Element

  • selector – some partial string in the desired tags to return

Change point for more sophisticated selectors.

element_as_dict(element)[source]

Converts an XML element (its text, tag, and attributes) to dictionary form.

Parameters:

element – an XML Element

get_children(element, max_depth=None)[source]

Yield all children of element (and their children - recursively) in dictionary form.

Parameters:
  • element – an XML Element

  • max_depth – the number of generations of descendants to return

galaxy.datatypes.dataproviders.line module

Dataproviders that iterate over lines from their sources.

class galaxy.datatypes.dataproviders.line.FilteredLineDataProvider(source, strip_lines=True, strip_newlines=False, provide_blank=False, comment_char='#', **kwargs)[source]

Bases: LimitedOffsetDataProvider

Data provider that yields lines of data from its source allowing optional control over which line to start on and how many lines to return.

DEFAULT_COMMENT_CHAR = '#'
settings: Dict[str, str] = {'comment_char': 'str', 'limit': 'int', 'offset': 'int', 'provide_blank': 'bool', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
__init__(source, strip_lines=True, strip_newlines=False, provide_blank=False, comment_char='#', **kwargs)[source]
Parameters:
  • strip_lines (bool) – remove whitespace from the beginning and end of each line (or not). Optional: defaults to True

  • strip_newlines – remove newlines only (only functions when strip_lines is False). Optional: defaults to False

  • provide_blank (bool) – are empty lines considered valid and provided? Optional: defaults to False

  • comment_char (str) – character(s) that indicate a line isn’t data (a comment) and should not be provided. Optional: defaults to ‘#’
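A usage sketch (the file name is assumed): provide stripped, non-comment lines, skipping the first ten:

from galaxy.datatypes.dataproviders import line

with open("notes.txt") as f:
    provider = line.FilteredLineDataProvider(f, comment_char="#", offset=10, limit=100)
    for text_line in provider:
        print(text_line)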

filter(line)[source]

Determines whether to provide line or not.

Parameters:

line (str) – the incoming line from the source

Returns:

a line or None

class galaxy.datatypes.dataproviders.line.RegexLineDataProvider(source, regex_list=None, invert=False, **kwargs)[source]

Bases: FilteredLineDataProvider

Data provider that yields only those lines of data from its source that do (or do not when invert is True) match one or more of the given list of regexs.

Note

the regex matches are effectively OR’d (if any regex matches the line it is considered valid and will be provided).

settings: Dict[str, str] = {'comment_char': 'str', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
__init__(source, regex_list=None, invert=False, **kwargs)[source]
Parameters:
  • regex_list (list (of str)) – list of strings or regular expression strings that will be matched against each line. Optional: defaults to None (no matching)

  • invert (bool) – if True will provide only lines that do not match. Optional: defaults to False
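A usage sketch (file name and patterns assumed): provide only lines matching either regex (matches are OR’d):

from galaxy.datatypes.dataproviders import line

with open("server.log") as f:
    provider = line.RegexLineDataProvider(f, regex_list=[r"^ERROR", r"^WARN"])
    errors_and_warnings = list(provider)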

filter(line)[source]

Determines whether to provide line or not.

Parameters:

line (str) – the incoming line from the source

Returns:

a line or None

filter_by_regex(line)[source]
class galaxy.datatypes.dataproviders.line.BlockDataProvider(source, new_block_delim_fn=None, block_filter_fn=None, **kwargs)[source]

Bases: LimitedOffsetDataProvider

Class that uses formats where multiple lines combine to describe a single datum. The data output will be a list of either map/dicts or sub-arrays.

Uses FilteredLineDataProvider as its source (kwargs not passed).

e.g. Fasta, GenBank, MAF, hg log

Note: memory intensive (gathers a list of lines before output)

__init__(source, new_block_delim_fn=None, block_filter_fn=None, **kwargs)[source]
Parameters:
  • new_block_delim_fn (function) – T/F function to determine whether a given line is the start of a new block.

  • block_filter_fn (function) – function that determines if a block is valid and will be provided. Optional: defaults to None (no filtering)
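A sketch of grouping a FASTA-like source into blocks, one per ‘>’ header line (the file name is assumed; by default a block is the gathered list of lines):

from galaxy.datatypes.dataproviders import line

def starts_record(text_line):
    return text_line.startswith(">")

with open("sequences.fasta") as f:
    provider = line.BlockDataProvider(f, new_block_delim_fn=starts_record)
    for block in provider:
        print(block)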

init_new_block()[source]

Set up internal data for next block.

filter(line)[source]

Line filter here being used to aggregate/assemble lines into a block and determine whether the line indicates a new block.

Parameters:

line (str) – the incoming line from the source

Returns:

a block or None

is_new_block(line)[source]

Returns True if the given line indicates the start of a new block (and the current block should be provided) or False if not.

add_line_to_block(line)[source]

Integrate the given line into the current block.

Called per line.

assemble_current_block()[source]

Build the current data into a block.

Called per block (just before providing).

filter_block(block)[source]

Is the current block a valid/desired datum?

Called per block (just before providing).

handle_last_block()[source]

Handle any blocks remaining after the main loop.

settings: Dict[str, str] = {'limit': 'int', 'offset': 'int'}