galaxy.datatypes.dataproviders package¶
Dataproviders are iterators with context managers that provide data to some consumer datum by datum.
As well as subclassing and overriding to get the proper data, Dataproviders can be piped from one to the other.
Note
Be careful NOT to pipe providers into subclasses of those providers. Subclasses provide all the functionality of their superclasses, so there’s generally no need.
Note
When using piped providers that accept the same keywords in their __init__ functions (such as limit or offset), be careful to pass those keywords to the proper (often final) provider. The errors that result from passing them to the wrong provider can be hard to diagnose.
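The second note can be illustrated without Galaxy itself. The sketch below uses two hypothetical stand-in generators, `filtered` and `limited` (neither is part of the Galaxy API), to show how the result changes depending on which stage of a pipe receives `limit`.

```python
from itertools import islice


def filtered(source, filter_fn):
    """Stand-in for a filtering provider: yield every non-None filter result."""
    for datum in source:
        result = filter_fn(datum)
        if result is not None:
            yield result


def limited(source, offset=0, limit=None):
    """Stand-in for a limit/offset provider: skip `offset` data, yield at most `limit`."""
    stop = None if limit is None else offset + limit
    return islice(source, offset, stop)


def drop_comments(line):
    return None if line.startswith("#") else line


lines = ["# header", "a", "b", "# note", "c"]

# limit handed to the final stage: counts only data that survived the filter
limit_on_final = list(limited(filtered(lines, drop_comments), limit=2))

# limit handed to the inner stage: counts raw lines, comments included
limit_on_inner = list(filtered(limited(lines, limit=2), drop_comments))

print(limit_on_final)  # ['a', 'b']
print(limit_on_inner)  # ['a']
```

Both pipes run without error, which is why a misplaced keyword tends to show up only as unexpectedly short output.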
Submodules¶
galaxy.datatypes.dataproviders.base module¶
Base class(es) for all DataProviders.
-
class
galaxy.datatypes.dataproviders.base.
HasSettings
(name, base_classes, attributes)[source]¶ Bases:
type
Metaclass for data providers that allows defining and inheriting a dictionary named ‘settings’.
Useful for allowing class level access to expected variable types passed to class __init__ functions so they can be parsed from a query string.
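A minimal sketch of how such a metaclass could work, assuming that inheriting `settings` means merging the dictionaries of the base classes with the one declared on the new class. `HasSettingsSketch`, `LimitedSketch` and `LineSketch` are hypothetical names used only for illustration; this is not Galaxy's implementation.

```python
class HasSettingsSketch(type):
    """Metaclass that merges `settings` dictionaries down the class hierarchy."""

    def __new__(mcs, name, base_classes, attributes):
        merged = {}
        # start from the settings of each base class ...
        for base in reversed(base_classes):
            merged.update(getattr(base, "settings", {}))
        # ... then let the class being created add or override entries
        merged.update(attributes.get("settings", {}))
        attributes["settings"] = merged
        return super().__new__(mcs, name, base_classes, attributes)


class LimitedSketch(metaclass=HasSettingsSketch):
    settings = {"limit": "int", "offset": "int"}


class LineSketch(LimitedSketch):
    settings = {"comment_char": "str", "strip_lines": "bool"}


print(sorted(LineSketch.settings))
# ['comment_char', 'limit', 'offset', 'strip_lines']
```

The merged dictionary is available on the class itself, so the expected types can be looked up before any instance is created.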
-
class
galaxy.datatypes.dataproviders.base.
DataProvider
(source, **kwargs)[source]¶ Bases:
object
Base class for all data providers. Data providers:
have a source (which must be another file-like object)
implement both the iterator and context manager interfaces
do not allow write methods (but otherwise implement the other file object interface methods)
-
__init__
(source, **kwargs)[source]¶ Sets up a data provider, validates supplied source.
- Parameters
source – the source that this iterator will loop over. (Should implement the iterable interface and ideally have the context manager interface as well)
-
validate_source
(source)[source]¶ Is this a valid source for this provider?
- Raises
InvalidDataProviderSource – if the source is considered invalid.
Meant to be overridden in subclasses.
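The contract described above (a validated source, plus the iterator and context manager interfaces) can be sketched in a few lines. `MinimalProviderSketch` and `InvalidSourceSketch` are hypothetical stand-ins, and treating any iterable as a valid source is an assumption made for this example only.

```python
import io


class InvalidSourceSketch(TypeError):
    """Hypothetical stand-in for InvalidDataProviderSource."""


class MinimalProviderSketch:
    def __init__(self, source, **kwargs):
        self.source = self.validate_source(source)

    def validate_source(self, source):
        # Assumption: any iterable counts as a valid source in this sketch.
        try:
            iter(source)
        except TypeError:
            raise InvalidSourceSketch(f"not iterable: {source!r}")
        return source

    def __iter__(self):
        yield from self.source

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Close the source if it can be closed; never swallow exceptions.
        close = getattr(self.source, "close", None)
        if callable(close):
            close()
        return False


source = io.StringIO("first\nsecond\n")
with MinimalProviderSketch(source) as provider:
    data = list(provider)

print(data)           # ['first\n', 'second\n']
print(source.closed)  # True
```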
-
class
galaxy.datatypes.dataproviders.base.
FilteredDataProvider
(source, filter_fn=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.DataProvider
Passes each datum through a filter function and yields it if that function returns a non-None value.
- Also maintains counters:
num_data_read: how many data have been consumed from the source.
num_valid_data_read: how many data have been returned from filter.
num_data_returned: how many data this provider has yielded.
-
__init__
(source, filter_fn=None, **kwargs)[source]¶ - Parameters
filter_fn – a lambda or function that will be passed a datum and return either the (optionally modified) datum or None.
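As an illustration of the filter contract and the counters, here is a minimal stand-alone sketch. `FilteredSketch` is a hypothetical class, not the Galaxy one; in this simplified version `num_valid_data_read` and `num_data_returned` always agree because no limit or offset is applied.

```python
class FilteredSketch:
    def __init__(self, source, filter_fn=None):
        self.source = source
        # With no filter given, every datum passes through unchanged.
        self.filter_fn = filter_fn if filter_fn is not None else (lambda datum: datum)
        self.num_data_read = 0
        self.num_valid_data_read = 0
        self.num_data_returned = 0

    def __iter__(self):
        for datum in self.source:
            self.num_data_read += 1
            result = self.filter_fn(datum)
            if result is not None:
                self.num_valid_data_read += 1
                self.num_data_returned += 1
                yield result


def even_times_ten(number):
    """Return the (modified) datum for even numbers, None otherwise."""
    return number * 10 if number % 2 == 0 else None


provider = FilteredSketch([1, 2, 3, 4, 5], filter_fn=even_times_ten)
results = list(provider)

print(results)                       # [20, 40]
print(provider.num_data_read)        # 5
print(provider.num_valid_data_read)  # 2
```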
-
class
galaxy.datatypes.dataproviders.base.
LimitedOffsetDataProvider
(source, offset=0, limit=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.FilteredDataProvider
A provider that uses the counters from FilteredDataProvider to limit the number of data and/or skip offset number of data before providing.
Useful for grabbing sections from a source (e.g. pagination).
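The pagination use case might look like the following sketch. `limited_offset` and `page` are hypothetical helpers written for this example; they only mirror the documented meaning of `offset` and `limit`.

```python
def limited_offset(source, offset=0, limit=None):
    """Skip `offset` data, then yield at most `limit` data (all remaining if None)."""
    returned = 0
    for position, datum in enumerate(source):
        if position < offset:
            continue
        if limit is not None and returned >= limit:
            break
        returned += 1
        yield datum


def page(source, page_index, page_size):
    """Return one zero-indexed page of `page_size` data."""
    return list(limited_offset(source, offset=page_index * page_size, limit=page_size))


rows = [f"row{n}" for n in range(10)]

print(page(rows, 1, 4))  # ['row4', 'row5', 'row6', 'row7']
print(page(rows, 2, 4))  # ['row8', 'row9']
```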
-
class
galaxy.datatypes.dataproviders.base.
MultiSourceDataProvider
(source_list, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.DataProvider
A provider that iterates over a list of given sources and provides data from one after another.
An iterator over iterators.
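The "iterator over iterators" idea corresponds closely to `itertools.chain` in the standard library. The sketch below is only an analogy; `multi_source` is a hypothetical helper, not the Galaxy class.

```python
from itertools import chain


def multi_source(source_list):
    """Yield every datum of the first source, then the second, and so on."""
    return chain.from_iterable(source_list)


combined = list(multi_source([["a", "b"], iter(["c"]), ("d",)]))
print(combined)  # ['a', 'b', 'c', 'd']
```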
galaxy.datatypes.dataproviders.chunk module¶
Chunk (N number of bytes at M offset to a source’s beginning) provider.
Primarily for file sources but usable by any iterator that has both seek and read( N ).
-
class
galaxy.datatypes.dataproviders.chunk.
ChunkDataProvider
(source, chunk_index=0, chunk_size=65536, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.DataProvider
Data provider that yields chunks of data from its file.
Note: this version does not account for lines and works with Binary datatypes.
-
MAX_CHUNK_SIZE
= 65536¶
-
DEFAULT_CHUNK_SIZE
= 65536¶
-
__init__
(source, chunk_index=0, chunk_size=65536, **kwargs)[source]¶ - Parameters
chunk_index – if a source can be divided into N number of chunk_size sections, this is the index of which section to return.
chunk_size – how large the returned chunks should be (generally in bytes).
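A sketch of how `chunk_index` and `chunk_size` could select a section of a seekable source, assuming the chunk starts at byte `chunk_index * chunk_size`. `read_chunk` is a hypothetical helper, not part of the Galaxy API.

```python
import io


def read_chunk(source, chunk_index=0, chunk_size=65536):
    """Seek to the start of the requested chunk and read at most `chunk_size` bytes."""
    source.seek(chunk_index * chunk_size)
    return source.read(chunk_size)


source = io.BytesIO(b"0123456789")

print(read_chunk(source, chunk_index=0, chunk_size=4))  # b'0123'
print(read_chunk(source, chunk_index=2, chunk_size=4))  # b'89'
print(read_chunk(source, chunk_index=3, chunk_size=4))  # b''
```

The last chunk may be shorter than `chunk_size`, and an index past the end of the source yields an empty result.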
-
-
class
galaxy.datatypes.dataproviders.chunk.
Base64ChunkDataProvider
(source, chunk_index=0, chunk_size=65536, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.chunk.ChunkDataProvider
Data provider that yields chunks of base64 encoded data from its file.
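Base64 encoding lets a binary chunk travel safely inside a text format such as JSON. The following sketch shows the idea with the standard library only; `read_base64_chunk` is a hypothetical helper and the offset arithmetic is the same assumption as for plain chunks.

```python
import base64
import io


def read_base64_chunk(source, chunk_index=0, chunk_size=65536):
    """Read one chunk of raw bytes and return it base64 encoded."""
    source.seek(chunk_index * chunk_size)
    return base64.b64encode(source.read(chunk_size))


source = io.BytesIO(b"0123456789")
encoded = read_base64_chunk(source, chunk_index=0, chunk_size=4)

print(encoded)                    # b'MDEyMw=='
print(base64.b64decode(encoded))  # b'0123'
```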
galaxy.datatypes.dataproviders.column module¶
Providers that provide lists of lists generally where each line of a source is further subdivided into multiple data (e.g. columns from a line).
-
class
galaxy.datatypes.dataproviders.column.
ColumnarDataProvider
(source, indeces=None, column_count=None, column_types=None, parsers=None, parse_columns=True, deliminator='\t', filters=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.line.RegexLineDataProvider
Data provider that provides a list of columns from the lines of its source.
Columns are returned in the order given in indeces, so this provider can re-arrange columns.
If any desired index is outside the actual number of columns in the source, this provider will None-pad the output and you are guaranteed the same number of columns as the number of indeces asked for (even if they are filled with None).
-
settings
: Dict[str, str] = {'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}¶
-
__init__
(source, indeces=None, column_count=None, column_types=None, parsers=None, parse_columns=True, deliminator='\t', filters=None, **kwargs)[source]¶ - Parameters
indeces (list or None) – a list of indeces of columns to gather from each row. Optional: defaults to None. If None, this provider will return all columns of each row (even when a particular row contains more or fewer columns than others). If a row/line does not contain an element at a given index, the provider will fill that element with a None value.
column_count (int) – an alternate means of defining indeces, use an int here to effectively provide the first N columns. Optional: will default to None.
column_types (list of strings) – a list of string names of types that the provider will use to look up an appropriate parser for the column. (e.g. ‘int’, ‘float’, ‘str’, ‘bool’) Optional: will default to parsing all columns as strings.
parsers (dictionary) – a dictionary keyed with column type strings and with values that are functions to use when parsing those types. Optional: will default to using the function _get_default_parsers.
parse_columns (bool) – attempt to parse columns? Optional: defaults to True.
deliminator (str) – character(s) used to split each row/line of the source. Optional: defaults to the tab character.
Note
that the superclass constructors are passed kwargs - so their params (limit, offset, etc.) are also applicable here.
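The interplay of `indeces`, `column_types` and `deliminator` can be sketched as a single function. `parse_columns` below is a hypothetical helper; it assumes that `column_types` is aligned with the requested `indeces` (not with the raw column positions) and that a failed parse yields None.

```python
def parse_columns(line, indeces=None, column_types=None, deliminator="\t"):
    """Split a line and return the requested columns, parsed and None-padded."""
    parsers = {"int": int, "float": float, "str": str}
    columns = line.split(deliminator)
    if indeces is None:
        indeces = range(len(columns))

    parsed = []
    for position, index in enumerate(indeces):
        if index >= len(columns):
            # requested column does not exist in this row: pad with None
            parsed.append(None)
            continue
        type_name = "str"
        if column_types and position < len(column_types):
            type_name = column_types[position]
        try:
            parsed.append(parsers.get(type_name, str)(columns[index]))
        except ValueError:
            parsed.append(None)
    return parsed


row = parse_columns("chr1\t10\t20.5", indeces=[2, 0, 5], column_types=["float", "str", "int"])
print(row)  # [20.5, 'chr1', None]
```

The output always has as many entries as `indeces`, and their order follows `indeces` rather than the order in the source.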
-
create_numeric_filter
(column, op, val)[source]¶ Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.
The function will compare the column at index column against val using the given op where op is one of:
lt: less than
le: less than or equal to
eq: equal to
ne: not equal to
ge: greater than or equal to
gt: greater than
val is cast to float here; None is returned if there’s a parsing error.
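A sketch of such a filter factory, built on the standard `operator` module. This stand-alone `create_numeric_filter` function is written for illustration only and is not the Galaxy method; it follows the documented behaviour of returning None for an unknown `op` or an unparsable `val`.

```python
import operator

NUMERIC_OPS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "gt": operator.gt,
}


def create_numeric_filter(column, op, val):
    """Return a function testing columns[column] against val, or None."""
    compare = NUMERIC_OPS.get(op)
    if compare is None:
        return None
    try:
        threshold = float(val)
    except (TypeError, ValueError):
        return None
    return lambda columns: compare(columns[column], threshold)


at_least_ten = create_numeric_filter(1, "ge", "10")
print(at_least_ten(["chr1", 12]))  # True
print(at_least_ten(["chr1", 3]))   # False
```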
-
create_string_filter
(column, op, val)[source]¶ Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.
The function will compare the column at index column against val using the given op where op is one of:
eq: exactly matches
has: the column contains the substring val
re: the column matches the regular expression in val
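The three string operators could be implemented along these lines. The stand-alone `create_string_filter` function below is a sketch, not the Galaxy method; in particular, using `re.match` (anchored at the start of the column) for the `re` operator is an assumption.

```python
import re


def create_string_filter(column, op, val):
    """Return a function testing the string in columns[column], or None."""
    if op == "eq":
        return lambda columns: columns[column] == val
    if op == "has":
        return lambda columns: val in columns[column]
    if op == "re":
        try:
            pattern = re.compile(val)
        except re.error:
            return None
        return lambda columns: pattern.match(columns[column]) is not None
    return None


numbered_chrom = create_string_filter(0, "re", r"chr\d+$")
print(numbered_chrom(["chr12", "100"]))  # True
print(numbered_chrom(["chrX", "100"]))   # False
```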
-
create_list_filter
(column, op, val)[source]¶ Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.
The function will compare the column at index column against val using the given op where op is one of:
eq: the list val exactly matches the list in the column
has: the list in the column contains the sublist val
-
get_default_parsers
()[source]¶ Return parser dictionary keyed for each columnar type (as defined in datatypes).
Note
primitives only by default (str, int, float, boolean, None). Other (more complex) types are retrieved as strings.
- Returns
a dictionary of the form: { <parser type name> : <function used to parse type> }
-
filter
(line)[source]¶ Determines whether to provide line or not.
- Parameters
line (str) – the incoming line from the source
- Returns
a line or None
-
parse_columns_from_line
(line)[source]¶ Returns a list of the desired, parsed columns.
- Parameters
line (str) – the line to parse
-
parse_column_at_index
(columns, parser_index, index)[source]¶ Get the column type for the parser from self.column_types or None if the type is unavailable.
-
parse_value
(val, type)[source]¶ Attempt to parse and return the given value based on the given type.
- Parameters
val – the column value to parse (often a string)
type – the string type ‘name’ used to find the appropriate parser
- Returns
the parsed value, val unchanged if no parser for type is found in parsers, or None if there was a parser error (ValueError)
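The three possible outcomes described under Returns can be shown with a short stand-alone sketch. `PARSERS` and this `parse_value` function are hypothetical and only mirror the documented behaviour.

```python
PARSERS = {"int": int, "float": float, "str": str}


def parse_value(val, type_name, parsers=PARSERS):
    """Parse val with the parser registered for type_name."""
    parser = parsers.get(type_name)
    if parser is None:
        # no parser known for this type: hand the value back unchanged
        return val
    try:
        return parser(val)
    except ValueError:
        return None


print(parse_value("42", "int"))      # 42
print(parse_value("forty", "int"))   # None
print(parse_value("chr1", "chrom"))  # chr1
```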
-
-
class
galaxy.datatypes.dataproviders.column.
DictDataProvider
(source, column_names=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.column.ColumnarDataProvider
Data provider that zips column_names and columns from the source’s contents into a dictionary.
A combination use of both column_names and indeces allows ‘picking’ key/value pairs from the source.
Note
The superclass constructors are passed kwargs - so their params (limit, offset, etc.) are also applicable here.
-
settings
: Dict[str, str] = {'column_count': 'int', 'column_names': 'list:str', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}¶
-
__init__
(source, column_names=None, **kwargs)[source]¶ - Parameters
column_names – an ordered list of strings that will be used as the keys for each column in the returned dictionaries. Each returned dictionary will have at most as many key/value pairs as the number of column names provided.
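Zipping names onto columns behaves much like the built-in `zip`, which stops at the shorter of its inputs. The sketch below uses a hypothetical `columns_to_dict` helper to show both the truncation and the 'picking' of key/value pairs with indeces.

```python
def columns_to_dict(column_names, columns):
    """Pair each name with the column at the same position."""
    return dict(zip(column_names, columns))


row = ["chr1", 100, 200, "+"]

# fewer names than columns: the extra columns are dropped
print(columns_to_dict(["chrom", "start"], row))
# {'chrom': 'chr1', 'start': 100}

# combined with indeces: pick columns 0 and 3, then name them
picked = [row[index] for index in [0, 3]]
print(columns_to_dict(["chrom", "strand"], picked))
# {'chrom': 'chr1', 'strand': '+'}
```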
-
galaxy.datatypes.dataproviders.dataset module¶
Dataproviders that either:
- use the file contents and/or metadata from a Galaxy DatasetInstance as their source, or
- provide data in some way relevant to bioinformatic data (e.g. parsing genomic regions from their source)
-
class
galaxy.datatypes.dataproviders.dataset.
DatasetDataProvider
(dataset, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.DataProvider
Class that uses the file contents and/or metadata from a Galaxy DatasetInstance as its source.
DatasetDataProvider can be seen as the intersection between a datatype’s metadata and a dataset’s file contents. It (so far) mainly provides helper and convenience methods for using dataset metadata to set up and control how the data is provided.
-
__init__
(dataset, **kwargs)[source]¶ - Parameters
dataset (model.DatasetInstance) – the Galaxy dataset whose file will be the source
-
classmethod
get_column_metadata_from_dataset
(dataset)[source]¶ Convenience class method to get column metadata from a dataset.
- Returns
a dictionary of column_count, column_types, and column_names if they’re available, setting each to None if not.
-
get_metadata_column_types
(indeces=None)[source]¶ Return the list of column_types for this dataset or None if unavailable.
- Parameters
indeces (list of ints) – the indeces for the columns of which to return the types. Optional: defaults to None (return all types)
-
get_metadata_column_names
(indeces=None)[source]¶ Return the list of column_names for this dataset or None if unavailable.
- Parameters
indeces (list of ints) – the indeces for the columns of which to return the names. Optional: defaults to None (return all names)
-
get_indeces_by_column_names
(list_of_column_names)[source]¶ Return the list of column indeces when given a list of column_names.
- Parameters
list_of_column_names (list of strs) – the names of the columns of which to get indeces.
- Raises
KeyError – if column_names are not found
ValueError – if an entry in list_of_column_names is not in column_names
-
get_metadata_column_index_by_name
(name)[source]¶ Return the 1-based index of a source’s column with the given name.
-
get_genomic_region_indeces
(check=False)[source]¶ Return a list of column indeces for ‘chromCol’, ‘startCol’, ‘endCol’ from a source representing a genomic region.
- Parameters
check (bool) – if True will raise a ValueError if any were not found.
- Raises
ValueError – if check is True and one or more indeces were not found.
- Returns
list of column indeces for the named columns.
-
-
class
galaxy.datatypes.dataproviders.dataset.
ConvertedDatasetDataProvider
(dataset, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.dataset.DatasetDataProvider
Class that uses the file contents of a dataset after conversion to a different format.
-
__init__
(dataset, **kwargs)[source]¶ - Parameters
dataset (model.DatasetInstance) – the Galaxy dataset whose file will be the source
-
-
class
galaxy.datatypes.dataproviders.dataset.
DatasetColumnarDataProvider
(dataset, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.column.ColumnarDataProvider
Data provider that uses a DatasetDataProvider as its source and the dataset’s metadata to build settings for the ColumnarDataProvider it inherits from.
-
__init__
(dataset, **kwargs)[source]¶ All kwargs are inherited from ColumnarDataProvider.
See also: column.ColumnarDataProvider
If no kwargs are given, this class will attempt to get those kwargs from the dataset source’s metadata. If any kwarg is given, it will override and be used in place of any metadata available.
-
settings
: Dict[str, str] = {'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}¶
-
-
class
galaxy.datatypes.dataproviders.dataset.
DatasetDictDataProvider
(dataset, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.column.DictDataProvider
Data provider that uses a DatasetDataProvider as its source and the dataset’s metadata to build settings for the DictDataProvider it inherits from.
-
__init__
(dataset, **kwargs)[source]¶ All kwargs are inherited from DictDataProvider.
See also: column.DictDataProvider
If no kwargs are given, this class will attempt to get those kwargs from the dataset source’s metadata. If any kwarg is given, it will override and be used in place of any metadata available.
The relationship between column_names and indeces is more complex:

+-----------------+-------------------------------+-----------------------+
|                 | Indeces given                 | Indeces NOT given     |
+=================+===============================+=======================+
| Names given     | pull indeces, rename w/ names | pull by name          |
+-----------------+-------------------------------+-----------------------+
| Names NOT given | pull indeces, name w/ meta    | pull all, name w/meta |
+-----------------+-------------------------------+-----------------------+
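The four combinations of names and indeces can be written out as a small decision function. `resolve_columns` is a hypothetical helper and `metadata_names` stands in for the column names stored in the dataset's metadata; this is one reading of the combinations, not Galaxy's code.

```python
def resolve_columns(metadata_names, column_names=None, indeces=None):
    """Return the (indeces, names) pair implied by the given arguments."""
    if indeces is not None and column_names is not None:
        # pull indeces, rename with the given names
        return list(indeces), list(column_names)
    if column_names is not None:
        # pull by name: look each name up in the metadata
        return [metadata_names.index(name) for name in column_names], list(column_names)
    if indeces is not None:
        # pull indeces, name them from the metadata
        return list(indeces), [metadata_names[index] for index in indeces]
    # pull everything, name from the metadata
    return list(range(len(metadata_names))), list(metadata_names)


meta = ["chrom", "start", "end", "name"]

print(resolve_columns(meta, column_names=["end", "chrom"]))
# ([2, 0], ['end', 'chrom'])
print(resolve_columns(meta, indeces=[1, 2]))
# ([1, 2], ['start', 'end'])
```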
-
settings
: Dict[str, str] = {'column_count': 'int', 'column_names': 'list:str', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}¶
-
-
class
galaxy.datatypes.dataproviders.dataset.
GenomicRegionDataProvider
(dataset, chrom_column=None, start_column=None, end_column=None, named_columns=False, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.column.ColumnarDataProvider
Data provider that parses chromosome, start, and end data from a file using the dataset’s metadata settings.
Is a ColumnarDataProvider that uses a DatasetDataProvider as its source.
If named_columns is true, will return dictionaries with the keys ‘chrom’, ‘start’, ‘end’.
-
COLUMN_NAMES
= ['chrom', 'start', 'end']¶
-
settings
: Dict[str, str] = {'chrom_column': 'int', 'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'end_column': 'int', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'named_columns': 'bool', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'start_column': 'int', 'strip_lines': 'bool', 'strip_newlines': 'bool'}¶
-
__init__
(dataset, chrom_column=None, start_column=None, end_column=None, named_columns=False, **kwargs)[source]¶ - Parameters
dataset (model.DatasetInstance) – the Galaxy dataset whose file will be the source
chrom_column (int) – optionally specify the chrom column index
start_column (int) – optionally specify the start column index
end_column (int) – optionally specify the end column index
named_columns (bool) – optionally return dictionaries keying each column with ‘chrom’, ‘start’, or ‘end’. Optional: defaults to False
-
-
class
galaxy.datatypes.dataproviders.dataset.
IntervalDataProvider
(dataset, chrom_column=None, start_column=None, end_column=None, strand_column=None, name_column=None, named_columns=False, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.column.ColumnarDataProvider
Data provider that parses chromosome, start, and end data (as well as strand and name if set in the metadata) using the dataset’s metadata settings.
If named_columns is true, will return dictionaries with the keys ‘chrom’, ‘start’, ‘end’ (and ‘strand’ and ‘name’ if available).
-
COLUMN_NAMES
= ['chrom', 'start', 'end', 'strand', 'name']¶
-
settings
: Dict[str, str] = {'chrom_column': 'int', 'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'end_column': 'int', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'name_column': 'int', 'named_columns': 'bool', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'start_column': 'int', 'strand_column': 'int', 'strip_lines': 'bool', 'strip_newlines': 'bool'}¶
-
__init__
(dataset, chrom_column=None, start_column=None, end_column=None, strand_column=None, name_column=None, named_columns=False, **kwargs)[source]¶ - Parameters
dataset (model.DatasetInstance) – the Galaxy dataset whose file will be the source
named_columns (bool) – optionally return dictionaries keying each column with ‘chrom’, ‘start’, ‘end’, ‘strand’, or ‘name’. Optional: defaults to False
-
-
class
galaxy.datatypes.dataproviders.dataset.
FastaDataProvider
(source, ids=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.FilteredDataProvider
Class that returns fasta format data in a list of maps of the form:
{ id: <fasta header id>, sequence: <joined lines of nucleotide/amino data> }
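To make the output shape concrete, here is a minimal stand-alone FASTA reader that produces maps of that form. `fasta_records` is a hypothetical function; using the whole header line (minus the leading `>`) as the id is an assumption of this sketch.

```python
def fasta_records(lines):
    """Yield {'id': ..., 'sequence': ...} for each FASTA record in lines."""
    current = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a header starts a new record; emit the previous one first
            if current is not None:
                yield current
            current = {"id": line[1:].strip(), "sequence": ""}
        elif current is not None:
            current["sequence"] += line
    if current is not None:
        yield current


lines = [">seq1 example\n", "ACGT\n", "TTAA\n", ">seq2\n", "GG\n"]
records = list(fasta_records(lines))
print(records[0])  # {'id': 'seq1 example', 'sequence': 'ACGTTTAA'}
```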
-
class
galaxy.datatypes.dataproviders.dataset.
TwoBitFastaDataProvider
(source, ids=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.dataset.DatasetDataProvider
Class that returns fasta format data in a list of maps of the form:
{ id: <fasta header id>, sequence: <joined lines of nucleotide/amino data> }
-
class
galaxy.datatypes.dataproviders.dataset.
WiggleDataProvider
(source, named_columns=False, column_names=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.LimitedOffsetDataProvider
Class that returns chrom, pos, data from a wiggle source.
-
COLUMN_NAMES
= ['chrom', 'pos', 'value']¶
-
settings
: Dict[str, str] = {'column_names': 'list:str', 'limit': 'int', 'named_columns': 'bool', 'offset': 'int'}¶
-
__init__
(source, named_columns=False, column_names=None, **kwargs)[source]¶ - Parameters
named_columns (bool) – optionally return dictionaries keying each column with ‘chrom’, ‘pos’, or ‘value’. Optional: defaults to False
column_names – an ordered list of strings that will be used as the keys for each column in the returned dictionaries. Each returned dictionary will have at most as many key/value pairs as the number of column names provided.
-
-
class
galaxy.datatypes.dataproviders.dataset.
BigWigDataProvider
(source, chrom, start, end, named_columns=False, column_names=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.LimitedOffsetDataProvider
Class that returns chrom, pos, data from a bigwig source.
-
COLUMN_NAMES
= ['chrom', 'pos', 'value']¶
-
settings
: Dict[str, str] = {'column_names': 'list:str', 'limit': 'int', 'named_columns': 'bool', 'offset': 'int'}¶
-
__init__
(source, chrom, start, end, named_columns=False, column_names=None, **kwargs)[source]¶ - Parameters
chrom (str) – which chromosome within the bigwig file to extract data for
start (int) – the start of the region from which to extract data
end (int) – the end of the region from which to extract data
named_columns (bool) – optionally return dictionaries keying each column with ‘chrom’, ‘pos’, or ‘value’. Optional: defaults to False
column_names – an ordered list of strings that will be used as the keys for each column in the returned dictionaries. Each returned dictionary will have at most as many key/value pairs as the number of column names provided.
-
-
class
galaxy.datatypes.dataproviders.dataset.
DatasetSubprocessDataProvider
(dataset, *args, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.external.SubprocessDataProvider
Create a source from running a subprocess on a dataset’s file.
Uses a subprocess as its source and has a dataset (generally as an input file for the process).
-
class
galaxy.datatypes.dataproviders.dataset.
SamtoolsDataProvider
(dataset, options_string='', options_dict=None, regions=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.line.RegexLineDataProvider
Data provider that uses samtools on a Sam or Bam file as its source.
This can be piped through other providers (column, map, genome region, etc.).
Note
that only the samtools ‘view’ command is currently implemented.
-
FLAGS_WO_ARGS
= 'bhHSu1xXcB'¶
-
FLAGS_W_ARGS
= 'fFqlrs'¶
-
VALID_FLAGS
= 'bhHSu1xXcBfFqlrs'¶
-
build_command_list
(subcommand, options_string, options_dict, regions)[source]¶ Convert all init args to list form.
-
to_options_list
(options_string, options_dict)[source]¶ Convert both options_string and options_dict to list form while filtering out non-‘valid’ options.
-
-
class
galaxy.datatypes.dataproviders.dataset.
SQliteDataProvider
(source, query=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.DataProvider
Data provider that uses a sqlite database file as its source.
Allows any query to be run and returns the resulting rows as sqlite3 row objects
-
class
galaxy.datatypes.dataproviders.dataset.
SQliteDataTableProvider
(source, query=None, headers=False, limit=9223372036854775807, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.DataProvider
Data provider that uses a sqlite database file as its source. Allows any query to be run and returns the resulting rows as arrays of arrays
-
__init__
(source, query=None, headers=False, limit=9223372036854775807, **kwargs)[source]¶ Sets up a data provider, validates supplied source.
- Parameters
source – the source that this iterator will loop over. (Should implement the iterable interface and ideally have the context manager interface as well)
-
-
class
galaxy.datatypes.dataproviders.dataset.
SQliteDataDictProvider
(source, query=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.DataProvider
Data provider that uses a sqlite database file as its source. Allows any query to be run and returns the resulting rows as arrays of dicts
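The three result shapes offered by the SQLite providers (row objects, arrays of arrays, and dicts) map directly onto the standard `sqlite3` module. The sketch below uses an in-memory database with a made-up `genes` table; it illustrates the shapes only and does not use the Galaxy classes.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE genes (name TEXT, length INTEGER)")
connection.executemany(
    "INSERT INTO genes VALUES (?, ?)", [("geneA", 1200), ("geneB", 800)]
)
query = "SELECT name, length FROM genes ORDER BY name"

# 1. rows as sqlite3.Row objects (addressable by index or column name)
connection.row_factory = sqlite3.Row
row_objects = connection.execute(query).fetchall()
first_name = row_objects[0]["name"]

# 2. rows as lists of lists, with the column headers as the first row
cursor = connection.execute(query)
headers = [description[0] for description in cursor.description]
table = [headers] + [list(row) for row in cursor]

# 3. rows as dicts keyed by column name
dict_rows = [dict(row) for row in connection.execute(query)]

connection.close()

print(first_name)    # geneA
print(table[0])      # ['name', 'length']
print(dict_rows[1])  # {'name': 'geneB', 'length': 800}
```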
galaxy.datatypes.dataproviders.decorators module¶
DataProvider related decorators.
-
galaxy.datatypes.dataproviders.decorators.
has_dataproviders
(cls)[source]¶ Wraps a class (generally a Datatype), finds methods within that have been decorated with @dataprovider and adds them, by their name, to a map in the class.
This allows a class to maintain a name -> method map, effectively ‘registering’ dataprovider factory methods:
    @has_dataproviders
    class MyDtype( data.Data ):

        @dataprovider_factory( 'bler' )
        def provide_some_bler( self, dataset, **settings ):
            '''blerblerbler'''
            dataset_source = providers.DatasetDataProvider( dataset )
            # ... chain other, intermediate providers here
            return providers.BlerDataProvider( dataset_source, **settings )

    # use the base method in data.Data
    provider = dataset.datatype.dataprovider( dataset, 'bler', my_setting='blah', ... )
    # OR directly from the map
    provider = dataset.datatype.dataproviders[ 'bler' ]( dataset, my_setting='blah', ... )
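The registration mechanism can be sketched independently of Galaxy with a marking decorator and a collecting class decorator. Everything below is hypothetical: the functions reuse the documented names for readability, but the `dataprovider_name` attribute and `MyTypeSketch` are inventions of this example, not Galaxy internals.

```python
def dataprovider_factory(name):
    """Mark a method as the factory registered under `name`."""
    def decorator(func):
        func.dataprovider_name = name  # hypothetical marker attribute
        return func
    return decorator


def has_dataproviders(cls):
    """Collect all marked methods of cls into a name -> method map."""
    registry = dict(getattr(cls, "dataproviders", {}))
    for attribute in vars(cls).values():
        name = getattr(attribute, "dataprovider_name", None)
        if name is not None:
            registry[name] = attribute
    cls.dataproviders = registry
    return cls


@has_dataproviders
class MyTypeSketch:
    @dataprovider_factory("upper")
    def provide_upper(self, source, **settings):
        return (datum.upper() for datum in source)

    def not_a_factory(self):
        return None


provider = MyTypeSketch.dataproviders["upper"](MyTypeSketch(), ["a", "b"])
result = list(provider)
print(result)  # ['A', 'B']
```

Copying the inherited map before adding to it keeps a subclass's registrations from leaking into its parent class.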
-
galaxy.datatypes.dataproviders.decorators.
dataprovider_factory
(name, settings=None)[source]¶ Wraps a class method and marks it as a dataprovider factory and creates a function to parse query strings to __init__ arguments as the parse_query_string_settings attribute of the factory function.
An example use of the parse_query_string_settings:
    kwargs = dataset.datatype.dataproviders[ provider ].parse_query_string_settings( query_kwargs )
    return list( dataset.datatype.dataprovider( dataset, provider, **kwargs ) )
- Parameters
name (any hashable var) – what name/key to register the factory under in cls.dataproviders
settings (dictionary) – dictionary containing key/type pairs for parsing query strings to __init__ arguments
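To show what parsing query strings to `__init__` arguments could involve, here is a stand-alone sketch driven by a `settings` dictionary of the kind listed on the provider classes. The comma-separated list syntax, the accepted truthy strings and the silent skipping of bad values are all assumptions of this example rather than documented Galaxy behaviour.

```python
def parse_query_string_settings(query_kwargs, settings):
    """Convert string values from a query into typed keyword arguments."""
    parsers = {
        "int": int,
        "float": float,
        "str": str,
        "bool": lambda text: text.lower() in ("true", "1", "yes"),
    }
    parsed = {}
    for key, type_name in settings.items():
        if key not in query_kwargs:
            continue
        raw = query_kwargs[key]
        try:
            if type_name.startswith("list:"):
                item_parser = parsers.get(type_name[len("list:"):], str)
                parsed[key] = [item_parser(item) for item in raw.split(",")]
            else:
                parsed[key] = parsers.get(type_name, str)(raw)
        except ValueError:
            # unparsable values are skipped in this sketch
            continue
    return parsed


settings = {"limit": "int", "indeces": "list:int", "invert": "bool"}
query = {"limit": "10", "indeces": "0,2", "invert": "true", "unknown": "x"}
kwargs = parse_query_string_settings(query, settings)
print(kwargs)  # {'limit': 10, 'indeces': [0, 2], 'invert': True}
```

Keys that are not declared in `settings` never reach the provider, which keeps arbitrary query parameters out of `__init__`.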
galaxy.datatypes.dataproviders.exceptions module¶
DataProvider related exceptions.
-
exception
galaxy.datatypes.dataproviders.exceptions.
InvalidDataProviderSource
(source=None, msg='')[source]¶ Bases:
TypeError
Raised when an unusable source is passed to a provider.
-
exception
galaxy.datatypes.dataproviders.exceptions.
NoProviderAvailable
(factory_source, format_requested=None, msg='')[source]¶ Bases:
TypeError
Raised when no provider is found for the given format_requested.
- Parameters
factory_source – the item that the provider was requested from
format_requested – the format_requested (a hashable key to access factory_source.datatypes with)
Both params are attached to this class and accessible to the code that catches the exception.
Meant to be used within a class that builds dataproviders (e.g. a Datatype)
galaxy.datatypes.dataproviders.external module¶
Data providers that iterate over a source that is not in memory or not in a file.
-
class
galaxy.datatypes.dataproviders.external.
SubprocessDataProvider
(*args, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.DataProvider
Data provider that uses the output from an intermediate program and subprocess as its data source.
-
__init__
(*args, **kwargs)[source]¶ - Parameters
args (variadic function args) – the list of strings used to build commands.
-
-
class
galaxy.datatypes.dataproviders.external.
RegexSubprocessDataProvider
(*args, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.line.RegexLineDataProvider
RegexLineDataProvider that uses a SubprocessDataProvider as its data source.
-
class
galaxy.datatypes.dataproviders.external.
URLDataProvider
(url, method='GET', data=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.DataProvider
Data provider that uses the contents of a URL for its data source.
This can be piped through other providers (column, map, genome region, etc.).
-
VALID_METHODS
= ('GET', 'POST')¶
-
-
class
galaxy.datatypes.dataproviders.external.
GzipDataProvider
(source, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.DataProvider
Data provider that uses g(un)zip on a file as its source.
This can be piped through other providers (column, map, genome region, etc.).
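Decompressing on the fly and handing the lines on to the next stage can be sketched with the standard `gzip` module. `gunzipped_lines` is a hypothetical generator and the in-memory source stands in for a real gzipped file.

```python
import gzip
import io


def gunzipped_lines(source):
    """Yield decoded text lines from a gzip-compressed binary file object."""
    with gzip.GzipFile(fileobj=source, mode="rb") as unzipped:
        for line in unzipped:
            yield line.decode("utf-8")


compressed = io.BytesIO(gzip.compress(b"line one\nline two\n"))
lines = list(gunzipped_lines(compressed))
print(lines)  # ['line one\n', 'line two\n']
```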
-
class
galaxy.datatypes.dataproviders.external.
TempfileDataProvider
(source, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.DataProvider
Writes the data from the given source to a temp file, allowing it to be used as a source where a file_name is needed (e.g. as a parameter to a command line tool: samtools view -t <this_provider.source.file_name>)
galaxy.datatypes.dataproviders.hierarchy module¶
Dataproviders that iterate over lines from their sources.
-
class
galaxy.datatypes.dataproviders.hierarchy.
HierarchalDataProvider
(source, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.line.BlockDataProvider
Class that uses formats where a datum may have a parent or children data.
e.g. XML, HTML, GFF3, Phylogenetic
-
class
galaxy.datatypes.dataproviders.hierarchy.
XMLDataProvider
(source, selector=None, max_depth=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.hierarchy.HierarchalDataProvider
Data provider that converts selected XML elements to dictionaries.
-
settings
: Dict[str, str] = {'limit': 'int', 'max_depth': 'int', 'offset': 'int', 'selector': 'str'}¶
-
ITERPARSE_ALL_EVENTS
= ('start', 'end', 'start-ns', 'end-ns')¶
-
__init__
(source, selector=None, max_depth=None, **kwargs)[source]¶ - Parameters
selector – some partial string in the desired tags to return
max_depth – the number of generations of descendants to return
-
matches_selector
(element, selector=None)[source]¶ Returns true if the element matches the selector.
- Parameters
element – an XML Element
selector – some partial string in the desired tags to return
Change point for more sophisticated selectors.
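A sketch of selector matching and depth-limited conversion, using `xml.etree.ElementTree.iterparse` from the standard library. The dictionary layout (`tag`, `attrib`, `text`, `children`) and the helper names are assumptions of this example; the Galaxy provider may shape its output differently.

```python
import io
import xml.etree.ElementTree as ET


def matches_selector(element, selector=None):
    """True if no selector is given or the selector occurs in the tag."""
    return selector is None or selector in element.tag


def element_to_dict(element, max_depth=None, depth=0):
    data = {
        "tag": element.tag,
        "attrib": dict(element.attrib),
        "text": (element.text or "").strip(),
    }
    # only descend while fewer than max_depth generations have been added
    if max_depth is None or depth < max_depth:
        data["children"] = [
            element_to_dict(child, max_depth, depth + 1) for child in element
        ]
    return data


def xml_provider(source, selector=None, max_depth=None):
    for _event, element in ET.iterparse(source, events=("end",)):
        if matches_selector(element, selector):
            yield element_to_dict(element, max_depth)


document = b"<root><gene id='g1'><exon/></gene><gene id='g2'/></root>"
shallow = list(xml_provider(io.BytesIO(document), selector="gene", max_depth=0))
nested = list(xml_provider(io.BytesIO(document), selector="gene", max_depth=1))

print([item["attrib"]["id"] for item in shallow])  # ['g1', 'g2']
print(nested[0]["children"][0]["tag"])             # exon
```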
-
galaxy.datatypes.dataproviders.line module¶
Dataproviders that iterate over lines from their sources.
-
class
galaxy.datatypes.dataproviders.line.
FilteredLineDataProvider
(source, strip_lines=True, strip_newlines=False, provide_blank=False, comment_char='#', **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.LimitedOffsetDataProvider
Data provider that yields lines of data from its source allowing optional control over which line to start on and how many lines to return.
-
DEFAULT_COMMENT_CHAR
= '#'¶
-
settings
: Dict[str, str] = {'comment_char': 'str', 'limit': 'int', 'offset': 'int', 'provide_blank': 'bool', 'strip_lines': 'bool', 'strip_newlines': 'bool'}¶
-
__init__
(source, strip_lines=True, strip_newlines=False, provide_blank=False, comment_char='#', **kwargs)[source]¶ - Parameters
strip_lines (bool) – remove whitespace from the beginning and end of each line (or not). Optional: defaults to True
strip_newlines – remove newlines only (only functions when strip_lines is false). Optional: defaults to False
provide_blank (bool) – are empty lines considered valid and provided? Optional: defaults to False
comment_char (str) – character(s) that indicate a line isn’t data (a comment) and should not be provided. Optional: defaults to ‘#’
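How the four options interact can be sketched as a single filter function that returns the (possibly stripped) line or None. `filter_line` is a hypothetical stand-alone function, and the order of the checks is an assumption of this sketch.

```python
def filter_line(line, strip_lines=True, strip_newlines=False,
                provide_blank=False, comment_char="#"):
    """Return the cleaned line, or None if it should not be provided."""
    if strip_lines:
        line = line.strip()
    elif strip_newlines:
        line = line.strip("\n")
    if not provide_blank and line == "":
        return None
    if comment_char and line.startswith(comment_char):
        return None
    return line


raw_lines = ["  a \n", "# comment\n", "\n", "b\n"]
print([filter_line(line) for line in raw_lines])
# ['a', None, None, 'b']
print(filter_line("  a \n", strip_lines=False, strip_newlines=True))
#   a
```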
-
-
class
galaxy.datatypes.dataproviders.line.
RegexLineDataProvider
(source, regex_list=None, invert=False, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.line.FilteredLineDataProvider
Data provider that yields only those lines of data from its source that do (or do not when invert is True) match one or more of the given list of regexes.
Note
the regex matches are effectively OR’d (if any regex matches the line it is considered valid and will be provided).
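The OR semantics and the effect of `invert` can be sketched as follows. `regex_filter` is a hypothetical function; whether patterns are applied with `re.match` or `re.search` is not stated here, so the use of `re.match` is an assumption.

```python
import re


def regex_filter(line, regex_list, invert=False):
    """Return line if any regex matches it (or if none does, when inverted)."""
    matched = any(re.match(pattern, line) for pattern in regex_list)
    keep = matched != invert
    return line if keep else None


patterns = [r"chr\d+", r"scaffold"]

print(regex_filter("chr1\t10", patterns))            # the line itself
print(regex_filter("# note", patterns))              # None
print(regex_filter("# note", patterns, invert=True)) # # note
```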
-
settings
: Dict[str, str] = {'comment_char': 'str', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}¶
-
-
class
galaxy.datatypes.dataproviders.line.
BlockDataProvider
(source, new_block_delim_fn=None, block_filter_fn=None, **kwargs)[source]¶ Bases:
galaxy.datatypes.dataproviders.base.LimitedOffsetDataProvider
Class that uses formats where multiple lines combine to describe a single datum. The data output will be a list of either map/dicts or sub-arrays.
Uses FilteredLineDataProvider as its source (kwargs not passed).
e.g. Fasta, GenBank, MAF, hg log
Note: memory intensive (gathers list of lines before output)
-
__init__
(source, new_block_delim_fn=None, block_filter_fn=None, **kwargs)[source]¶ - Parameters
new_block_delim_fn (function) – T/F function to determine whether a given line is the start of a new block.
block_filter_fn (function) – function that determines if a block is valid and will be provided. Optional: defaults to None (no filtering)
-
filter
(line)[source]¶ Line filter here being used to aggregate/assemble lines into a block and determine whether the line indicates a new block.
- Parameters
line (str) – the incoming line from the source
- Returns
a block or None
-
is_new_block
(line)[source]¶ Returns True if the given line indicates the start of a new block (and the current block should be provided) or False if not.
-
assemble_current_block
()[source]¶ Build the current data into a block.
Called per block (just before providing).
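The aggregation of lines into blocks can be sketched with a generator that takes the same two callbacks. `blocks` is a hypothetical function; representing a block as a plain list of lines is an assumption of this example, whereas the provider assembles its own block structure in `assemble_current_block`.

```python
def blocks(lines, new_block_delim_fn, block_filter_fn=None):
    """Group lines into blocks, starting a new block at each delimiter line."""
    current = []
    for line in lines:
        if new_block_delim_fn(line) and current:
            # the delimiter closes the previous block, which is provided if valid
            if block_filter_fn is None or block_filter_fn(current):
                yield current
            current = []
        current.append(line)
    # provide the final block once the source is exhausted
    if current and (block_filter_fn is None or block_filter_fn(current)):
        yield current


def is_header(line):
    return line.startswith(">")


lines = [">s1", "ACGT", "TT", ">s2", "GG"]

print(list(blocks(lines, is_header)))
# [['>s1', 'ACGT', 'TT'], ['>s2', 'GG']]
print(list(blocks(lines, is_header, block_filter_fn=lambda block: len(block) > 2)))
# [['>s1', 'ACGT', 'TT']]
```

Each block is held in memory until its closing delimiter arrives, which is the reason for the memory note above.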
-