galaxy.datatypes.dataproviders package
Dataproviders are iterators with context managers that provide data to some consumer datum by datum.
As well as subclassing and overriding to get the proper data, Dataproviders can be piped from one to the other.
Note
Be careful NOT to pipe providers into subclasses of those providers. Subclasses provide all the functionality of their superclasses, so there’s generally no need.
Note
When piping providers that accept the same keywords in their __init__ functions (such as limit or offset), be careful to pass those keywords to the proper (often final) provider. The errors that result can be hard to diagnose.
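The keyword pitfall in the second note can be illustrated with a minimal, self-contained sketch. The two functions below are hypothetical stand-ins, not Galaxy classes; they only mimic a limiting stage and a filtering stage so that the effect of applying `limit` at the wrong stage is visible.

```python
from itertools import islice


def limited(source, limit=None):
    """Hypothetical stand-in for a provider that accepts `limit`."""
    return islice(source, limit)


def only_even(source):
    """Hypothetical stand-in for a downstream provider that filters data."""
    return (datum for datum in source if datum % 2 == 0)


numbers = range(10)

# limit applied by the first provider: 4 data are read, then filtered
limit_first = list(only_even(limited(numbers, limit=4)))

# limit applied by the final provider: data are filtered, then 4 are returned
limit_last = list(limited(only_even(numbers), limit=4))
```

In this sketch the first pipeline returns fewer data than requested because the limit is consumed before filtering, which is the kind of hard-to-diagnose result the note warns about.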
Submodules
galaxy.datatypes.dataproviders.base module
Base class(es) for all DataProviders.
- class galaxy.datatypes.dataproviders.base.HasSettings(name, base_classes, attributes)[source]
Bases:
type
Metaclass for data providers that allows defining and inheriting a dictionary named ‘settings’.
Useful for allowing class level access to expected variable types passed to class __init__ functions so they can be parsed from a query string.
- class galaxy.datatypes.dataproviders.base.DataProvider(source, **kwargs)[source]
Bases:
object
Base class for all data providers. Data providers:
have a source (which must be another file-like object)
implement both the iterator and context manager interfaces
do not allow write methods (but otherwise implement the other file object interface methods)
- __init__(source, **kwargs)[source]
Sets up a data provider, validates supplied source.
- Parameters:
source – the source that this iterator will loop over. (Should implement the iterable interface and ideally have the context manager interface as well)
- validate_source(source)[source]
Is this a valid source for this provider?
- Raises:
InvalidDataProviderSource – if the source is considered invalid.
Meant to be overridden in subclasses.
- class galaxy.datatypes.dataproviders.base.FilteredDataProvider(source, filter_fn=None, **kwargs)[source]
Bases:
DataProvider
Passes each datum through a filter function and yields it if that function returns a non-None value.
- Also maintains counters:
num_data_read: how many data have been consumed from the source.
num_valid_data_read: how many data have been returned from filter.
num_data_returned: how many data has this provider yielded.
- __init__(source, filter_fn=None, **kwargs)[source]
- Parameters:
filter_fn – a lambda or function that will be passed a datum and return either the (optionally modified) datum or None.
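The filter contract and the counters described above can be sketched without Galaxy. `CountingFilter` and `keep_long_words` are hypothetical names used only for illustration; the real class may update its counters differently.

```python
class CountingFilter:
    """Hypothetical sketch of a filtering iterator with the three counters."""

    def __init__(self, source, filter_fn=None):
        self.source = source
        # with no filter_fn, every datum is passed through unchanged
        self.filter_fn = filter_fn or (lambda datum: datum)
        self.num_data_read = 0
        self.num_valid_data_read = 0
        self.num_data_returned = 0

    def __iter__(self):
        for datum in self.source:
            self.num_data_read += 1
            filtered = self.filter_fn(datum)
            if filtered is not None:
                self.num_valid_data_read += 1
                self.num_data_returned += 1
                yield filtered


def keep_long_words(word):
    # return the (optionally modified) datum to keep it, or None to drop it
    return word.upper() if len(word) > 3 else None


provider = CountingFilter(["a", "tree", "of", "words"], filter_fn=keep_long_words)
result = list(provider)
```

In this simplified version `num_valid_data_read` and `num_data_returned` always agree; they would presumably diverge once a subclass skips or limits valid data.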
- class galaxy.datatypes.dataproviders.base.LimitedOffsetDataProvider(source, offset=0, limit=10000, **kwargs)[source]
Bases:
FilteredDataProvider
A provider that uses the counters from FilteredDataProvider to limit the number of data and/or skip offset number of data before providing.
Useful for grabbing sections from a source (e.g. pagination).
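As a rough illustration of the pagination use case, the generator below skips `offset` data and then yields at most `limit` data. `limited_offset` is a hypothetical function written for this example, and its local counters only loosely mirror the ones named in FilteredDataProvider.

```python
def limited_offset(source, offset=0, limit=None):
    """Hypothetical sketch: skip `offset` data, then yield at most `limit`."""
    num_valid_data_read = 0
    num_data_returned = 0
    for datum in source:
        if limit is not None and num_data_returned >= limit:
            break
        num_valid_data_read += 1
        if num_valid_data_read > offset:
            num_data_returned += 1
            yield datum


rows = [f"row{i}" for i in range(10)]
page_size = 3
# page index 1 (the second page) starts after one full page has been skipped
second_page = list(limited_offset(rows, offset=1 * page_size, limit=page_size))
```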
- class galaxy.datatypes.dataproviders.base.MultiSourceDataProvider(source_list, **kwargs)[source]
Bases:
DataProvider
A provider that iterates over a list of given sources and provides data from one after another.
An iterator over iterators.
galaxy.datatypes.dataproviders.chunk module
Chunk (N number of bytes at M offset to a source’s beginning) provider.
Primarily for file sources but usable by any iterator that has both seek and read( N ).
- class galaxy.datatypes.dataproviders.chunk.ChunkDataProvider(source, chunk_index=0, chunk_size=65536, **kwargs)[source]
Bases:
DataProvider
Data provider that yields chunks of data from its file.
Note: this version does not account for lines and works with Binary datatypes.
- MAX_CHUNK_SIZE = 65536
- DEFAULT_CHUNK_SIZE = 65536
- __init__(source, chunk_index=0, chunk_size=65536, **kwargs)[source]
- Parameters:
chunk_index – if a source can be divided into N number of chunk_size sections, this is the index of which section to return.
chunk_size – how large are the desired chunks to return (generally in bytes).
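The relationship between `chunk_index` and `chunk_size` can be sketched as a seek followed by a read. `read_chunk` is a hypothetical helper, and the assumption that the byte offset equals `chunk_index * chunk_size` is inferred from the parameter descriptions rather than taken from the implementation.

```python
import io


def read_chunk(source, chunk_index=0, chunk_size=65536):
    """Hypothetical sketch: return chunk number `chunk_index` of `source`.

    `source` only needs `seek` and `read(N)`.
    """
    # assumed offset: whole chunks that precede the requested one
    source.seek(chunk_index * chunk_size)
    return source.read(chunk_size)


source = io.BytesIO(b"abcdefghij")
first = read_chunk(source, chunk_index=0, chunk_size=4)
third = read_chunk(source, chunk_index=2, chunk_size=4)
past_end = read_chunk(source, chunk_index=5, chunk_size=4)
```

The last chunk may be shorter than `chunk_size`, and an index past the end of the source yields an empty result in this sketch.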
- class galaxy.datatypes.dataproviders.chunk.Base64ChunkDataProvider(source, chunk_index=0, chunk_size=65536, **kwargs)[source]
Bases:
ChunkDataProvider
Data provider that yields chunks of base64 encoded data from its file.
galaxy.datatypes.dataproviders.column module
Providers that provide lists of lists generally where each line of a source is further subdivided into multiple data (e.g. columns from a line).
- class galaxy.datatypes.dataproviders.column.ColumnarDataProvider(source, indeces=None, column_count=None, column_types=None, parsers=None, parse_columns=True, deliminator='\t', filters=None, **kwargs)[source]
Bases:
RegexLineDataProvider
Data provider that provides a list of columns from the lines of its source.
Columns are returned in the order given in indeces, so this provider can re-arrange columns.
If any desired index is outside the actual number of columns in the source, this provider will None-pad the output and you are guaranteed the same number of columns as the number of indeces asked for (even if they are filled with None).
- settings: Dict[str, str] = {'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
- __init__(source, indeces=None, column_count=None, column_types=None, parsers=None, parse_columns=True, deliminator='\t', filters=None, **kwargs)[source]
- Parameters:
indeces (list or None) – a list of indeces of columns to gather from each row. Optional: will default to None. If None, this provider will return all columns of each row (even when a particular row contains more or fewer columns than others). If a row/line does not contain an element at a given index, the provider will fill that position with a None value.
column_count (int) – an alternate means of defining indeces, use an int here to effectively provide the first N columns. Optional: will default to None.
column_types (list of strings) – a list of string names of types that the provider will use to look up an appropriate parser for the column. (e.g. ‘int’, ‘float’, ‘str’, ‘bool’) Optional: will default to parsing all columns as strings.
parsers (dictionary) – a dictionary keyed with column type strings and with values that are functions to use when parsing those types. Optional: will default to using the function _get_default_parsers.
parse_columns (bool) – attempt to parse columns? Optional: defaults to True.
deliminator (str) – character(s) used to split each row/line of the source. Optional: defaults to the tab character.
Note
that the subclass constructors are passed kwargs - so their params (limit, offset, etc.) are also applicable here.
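The re-ordering and None-padding behaviour of `indeces` can be sketched in a few lines. `pick_columns` is a hypothetical function for illustration only; it ignores type parsing and filters.

```python
def pick_columns(line, indeces=None, deliminator="\t"):
    """Hypothetical sketch: split a line and return the requested columns."""
    columns = line.split(deliminator)
    if indeces is None:
        return columns
    # None-pad: an index past the end of the row yields None
    return [columns[i] if i < len(columns) else None for i in indeces]


line = "chr1\t100\t200\tgeneA"
reordered = pick_columns(line, indeces=[3, 0])
padded = pick_columns(line, indeces=[0, 7])
```

The output always has as many entries as `indeces`, which is the guarantee described for ColumnarDataProvider.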
- create_numeric_filter(column, op, val)[source]
Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.
The function will compare the column at index column against val using the given op where op is one of:
lt: less than
le: less than or equal to
eq: equal to
ne: not equal to
ge: greater than or equal to
gt: greater than
val is cast as float here and will return None if there’s a parsing error.
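A minimal sketch of how such a filter factory might look, using the standard `operator` module. `make_numeric_filter` and `NUMERIC_OPS` are hypothetical names; only the op codes and the float cast come from the description above.

```python
import operator

NUMERIC_OPS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "gt": operator.gt,
}


def make_numeric_filter(column, op, val):
    """Hypothetical sketch; returns None when no filter can be built."""
    compare = NUMERIC_OPS.get(op)
    if compare is None:
        return None
    try:
        val = float(val)
    except (TypeError, ValueError):
        return None
    # the returned function receives the list of parsed columns
    return lambda columns: compare(columns[column], val)


rows = [["chr1", 5.0], ["chr2", 42.0], ["chr3", 17.5]]
at_least_17 = make_numeric_filter(1, "ge", "17")
kept = [row for row in rows if at_least_17(row)]
```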
- create_string_filter(column, op, val)[source]
Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.
The function will compare the column at index column against val using the given op where op is one of:
eq: exactly matches
has: the column contains the substring val
re: the column matches the regular expression in val
- create_list_filter(column, op, val)[source]
Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.
The function will compare the column at index column against val using the given op where op is one of:
eq: the list val exactly matches the list in the column
has: the list in the column contains the sublist val
- get_default_parsers()[source]
Return parser dictionary keyed for each columnar type (as defined in datatypes).
Note
primitives only by default (str, int, float, boolean, None). Other (more complex) types are retrieved as strings.
- Returns:
a dictionary of the form: { <parser type name> : <function used to parse type> }
- filter(line)[source]
Determines whether to provide line or not.
- Parameters:
line (str) – the incoming line from the source
- Returns:
a line or None
- parse_columns_from_line(line)[source]
Returns a list of the desired, parsed columns.
- Parameters:
line (str) – the line to parse
- parse_column_at_index(columns, parser_index, index)[source]
Get the column type for the parser from self.column_types or None if the type is unavailable.
- parse_value(val, type)[source]
Attempt to parse and return the given value based on the given type.
- Parameters:
val – the column value to parse (often a string)
type – the string type ‘name’ used to find the appropriate parser
- Returns:
the parsed value; val unchanged if no parser is found for type in parsers; or None if there was a parser error (ValueError)
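The three documented outcomes of parse_value can be sketched as follows. The `PARSERS` dictionary and this standalone `parse_value` function are hypothetical and far smaller than whatever get_default_parsers returns.

```python
PARSERS = {"int": int, "float": float, "str": str}


def parse_value(val, type_name, parsers=PARSERS):
    """Hypothetical sketch of the three documented outcomes."""
    if type_name not in parsers:
        return val  # no parser for this type: value returned unchanged
    try:
        return parsers[type_name](val)
    except ValueError:
        return None  # parser error


parsed = parse_value("42", "int")
unparseable = parse_value("forty-two", "int")
unknown_type = parse_value("forty-two", "roman")
```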
- class galaxy.datatypes.dataproviders.column.DictDataProvider(source, column_names=None, **kwargs)[source]
Bases:
ColumnarDataProvider
Data provider that zips column_names and columns from the source’s contents into a dictionary.
A combination use of both column_names and indeces allows ‘picking’ key/value pairs from the source.
Note
The subclass constructors are passed kwargs - so their params (limit, offset, etc.) are also applicable here.
- settings: Dict[str, str] = {'column_count': 'int', 'column_names': 'list:str', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
- __init__(source, column_names=None, **kwargs)[source]
- Parameters:
column_names – an ordered list of strings that will be used as the keys for each column in the returned dictionaries. The number of key, value pairs each returned dictionary has will be as short as the number of column names provided.
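The zipping behaviour, including the truncation to the number of names provided, can be sketched with the built-in `zip`. `row_to_dict` is a hypothetical helper, not part of the module.

```python
def row_to_dict(columns, column_names):
    """Hypothetical sketch: pair names with column values."""
    # zip stops at the shorter input, so unnamed trailing columns are dropped
    return dict(zip(column_names, columns))


columns = ["chr1", "100", "200", "geneA"]
full = row_to_dict(columns, ["chrom", "start", "end", "name"])
short = row_to_dict(columns, ["chrom", "start"])
```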
galaxy.datatypes.dataproviders.dataset module
Dataproviders that either:
- use the file contents and/or metadata from a Galaxy DatasetInstance as their source, or
- provide data in some way relevant to bioinformatic data (e.g. parsing genomic regions from their source).
- class galaxy.datatypes.dataproviders.dataset.DatasetDataProvider(dataset, **kwargs)[source]
Bases:
DataProvider
Class that uses the file contents and/or metadata from a Galaxy DatasetInstance as its source.
DatasetDataProvider can be seen as the intersection between a datatype’s metadata and a dataset’s file contents. It (so far) mainly provides helper and convenience methods for using dataset metadata to set up and control how the data is provided.
- __init__(dataset, **kwargs)[source]
- Parameters:
dataset (model.DatasetInstance) – the Galaxy dataset whose file will be the source
- classmethod get_column_metadata_from_dataset(dataset)[source]
Convenience class method to get column metadata from a dataset.
- Returns:
a dictionary of column_count, column_types, and column_names if they’re available, setting each to None if not.
- get_metadata_column_types(indeces=None)[source]
Return the list of column_types for this dataset or None if unavailable.
- Parameters:
indeces (list of ints) – the indeces for the columns of which to return the types. Optional: defaults to None (return all types)
- get_metadata_column_names(indeces=None)[source]
Return the list of column_names for this dataset or None if unavailable.
- Parameters:
indeces (list of ints) – the indeces for the columns of which to return the names. Optional: defaults to None (return all names)
- get_indeces_by_column_names(list_of_column_names)[source]
Return the list of column indeces when given a list of column_names.
- Parameters:
list_of_column_names (list of strs) – the names of the columns of which to get indeces.
- Raises:
KeyError – if column_names are not found
ValueError – if an entry in list_of_column_names is not in column_names
- get_metadata_column_index_by_name(name)[source]
Return the 1-based index of a source’s column with the given name.
- get_genomic_region_indeces(check=False)[source]
Return a list of column indeces for ‘chromCol’, ‘startCol’, ‘endCol’ from a source representing a genomic region.
- Parameters:
check (bool) – if True will raise a ValueError if any were not found.
- Raises:
ValueError – if check is True and one or more indeces were not found.
- Returns:
list of column indeces for the named columns.
- class galaxy.datatypes.dataproviders.dataset.ConvertedDatasetDataProvider(dataset, **kwargs)[source]
Bases:
DatasetDataProvider
Class that uses the file contents of a dataset after conversion to a different format.
- __init__(dataset, **kwargs)[source]
- Parameters:
dataset (model.DatasetInstance) – the Galaxy dataset whose file will be the source
- class galaxy.datatypes.dataproviders.dataset.DatasetColumnarDataProvider(dataset, **kwargs)[source]
Bases:
ColumnarDataProvider
Data provider that uses a DatasetDataProvider as its source and the dataset’s metadata to build settings for the ColumnarDataProvider it’s inherited from.
- __init__(dataset, **kwargs)[source]
All kwargs are inherited from ColumnarDataProvider.
See also: column.ColumnarDataProvider
If no kwargs are given, this class will attempt to get those kwargs from the dataset source’s metadata. If any kwarg is given, it will override and be used in place of any metadata available.
- settings: Dict[str, str] = {'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
- class galaxy.datatypes.dataproviders.dataset.DatasetDictDataProvider(dataset, **kwargs)[source]
Bases:
DictDataProvider
Data provider that uses a DatasetDataProvider as its source and the dataset’s metadata to build settings for the DictDataProvider it’s inherited from.
- __init__(dataset, **kwargs)[source]
All kwargs are inherited from DictDataProvider.
See also: column.DictDataProvider
If no kwargs are given, this class will attempt to get those kwargs from the dataset source’s metadata. If any kwarg is given, it will override and be used in place of any metadata available.
The relationship between column_names and indeces is more complex:

+-----------------+-------------------------------+-----------------------+
|                 | Indeces given                 | Indeces NOT given     |
+=================+===============================+=======================+
| Names given     | pull indeces, rename w/ names | pull by name          |
+-----------------+-------------------------------+-----------------------+
| Names NOT given | pull indeces, name w/ meta    | pull all, name w/meta |
+-----------------+-------------------------------+-----------------------+
- settings: Dict[str, str] = {'column_count': 'int', 'column_names': 'list:str', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
- class galaxy.datatypes.dataproviders.dataset.GenomicRegionDataProvider(dataset, chrom_column=None, start_column=None, end_column=None, named_columns=False, **kwargs)[source]
Bases:
ColumnarDataProvider
Data provider that parses chromosome, start, and end data from a file using the dataset’s metadata settings.
Is a ColumnarDataProvider that uses a DatasetDataProvider as its source.
If named_columns is true, will return dictionaries with the keys ‘chrom’, ‘start’, ‘end’.
- COLUMN_NAMES = ['chrom', 'start', 'end']
- settings: Dict[str, str] = {'chrom_column': 'int', 'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'end_column': 'int', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'named_columns': 'bool', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'start_column': 'int', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
- __init__(dataset, chrom_column=None, start_column=None, end_column=None, named_columns=False, **kwargs)[source]
- Parameters:
dataset (model.DatasetInstance) – the Galaxy dataset whose file will be the source
chrom_column (int) – optionally specify the chrom column index
start_column (int) – optionally specify the start column index
end_column (int) – optionally specify the end column index
named_columns (bool) – optionally return dictionaries keying each column with ‘chrom’, ‘start’, or ‘end’. Optional: defaults to False
- class galaxy.datatypes.dataproviders.dataset.IntervalDataProvider(dataset, chrom_column=None, start_column=None, end_column=None, strand_column=None, name_column=None, named_columns=False, **kwargs)[source]
Bases:
ColumnarDataProvider
Data provider that parses chromosome, start, and end data (as well as strand and name if set in the metadata) using the dataset’s metadata settings.
If named_columns is true, will return dictionaries with the keys ‘chrom’, ‘start’, ‘end’ (and ‘strand’ and ‘name’ if available).
- COLUMN_NAMES = ['chrom', 'start', 'end', 'strand', 'name']
- settings: Dict[str, str] = {'chrom_column': 'int', 'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'end_column': 'int', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'name_column': 'int', 'named_columns': 'bool', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'start_column': 'int', 'strand_column': 'int', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
- __init__(dataset, chrom_column=None, start_column=None, end_column=None, strand_column=None, name_column=None, named_columns=False, **kwargs)[source]
- Parameters:
dataset (model.DatasetInstance) – the Galaxy dataset whose file will be the source
named_columns (bool) – optionally return dictionaries keying each column with ‘chrom’, ‘start’, ‘end’, ‘strand’, or ‘name’. Optional: defaults to False
- class galaxy.datatypes.dataproviders.dataset.FastaDataProvider(source, ids=None, **kwargs)[source]
Bases:
FilteredDataProvider
Class that returns fasta format data in a list of maps of the form:
{ id: <fasta header id>, sequence: <joined lines of nucleotide/amino data> }
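The record shape above can be illustrated with a small standalone parser. `parse_fasta` is a hypothetical function; in particular, using the whole header line after `>` as the id is an assumption of this sketch, and the `ids` filtering of the real class is not reproduced.

```python
def parse_fasta(lines):
    """Hypothetical sketch: yield {'id': ..., 'sequence': ...} per record."""
    current_id, sequence_lines = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a header line closes the previous record, if there was one
            if current_id is not None:
                yield {"id": current_id, "sequence": "".join(sequence_lines)}
            current_id, sequence_lines = line[1:], []
        elif current_id is not None:
            sequence_lines.append(line)
    if current_id is not None:
        yield {"id": current_id, "sequence": "".join(sequence_lines)}


fasta = [">seq1 first", "ACGT", "TTAA", "", ">seq2", "GGCC"]
records = list(parse_fasta(fasta))
```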
- class galaxy.datatypes.dataproviders.dataset.TwoBitFastaDataProvider(source, ids=None, **kwargs)[source]
Bases:
DatasetDataProvider
Class that returns fasta format data in a list of maps of the form:
{ id: <fasta header id>, sequence: <joined lines of nucleotide/amino data> }
- class galaxy.datatypes.dataproviders.dataset.WiggleDataProvider(source, named_columns=False, column_names=None, **kwargs)[source]
Bases:
LimitedOffsetDataProvider
Class that returns chrom, pos, data from a wiggle source.
- COLUMN_NAMES = ['chrom', 'pos', 'value']
- settings: Dict[str, str] = {'column_names': 'list:str', 'limit': 'int', 'named_columns': 'bool', 'offset': 'int'}
- __init__(source, named_columns=False, column_names=None, **kwargs)[source]
- Parameters:
named_columns (bool) – optionally return dictionaries keying each column with ‘chrom’, ‘pos’, or ‘value’ (see COLUMN_NAMES). Optional: defaults to False
column_names – an ordered list of strings that will be used as the keys for each column in the returned dictionaries. The number of key, value pairs each returned dictionary has will be as short as the number of column names provided.
- class galaxy.datatypes.dataproviders.dataset.BigWigDataProvider(source, chrom, start, end, named_columns=False, column_names=None, **kwargs)[source]
Bases:
LimitedOffsetDataProvider
Class that returns chrom, pos, data from a bigwig source.
- COLUMN_NAMES = ['chrom', 'pos', 'value']
- settings: Dict[str, str] = {'column_names': 'list:str', 'limit': 'int', 'named_columns': 'bool', 'offset': 'int'}
- __init__(source, chrom, start, end, named_columns=False, column_names=None, **kwargs)[source]
- Parameters:
chrom (str) – which chromosome within the bigwig file to extract data for
start (int) – the start of the region from which to extract data
end (int) – the end of the region from which to extract data
named_columns (bool) – optionally return dictionaries keying each column with ‘chrom’, ‘pos’, or ‘value’ (see COLUMN_NAMES). Optional: defaults to False
column_names – an ordered list of strings that will be used as the keys for each column in the returned dictionaries. The number of key, value pairs each returned dictionary has will be as short as the number of column names provided.
- class galaxy.datatypes.dataproviders.dataset.DatasetSubprocessDataProvider(dataset, *args, **kwargs)[source]
Bases:
SubprocessDataProvider
Create a source from running a subprocess on a dataset’s file.
Uses a subprocess as its source and has a dataset (generally as an input file for the process).
- class galaxy.datatypes.dataproviders.dataset.SamtoolsDataProvider(dataset, options_string='', options_dict=None, regions=None, **kwargs)[source]
Bases:
RegexLineDataProvider
Data provider that uses samtools on a Sam or Bam file as its source.
This can be piped through other providers (column, map, genome region, etc.).
Note
that only the samtools ‘view’ command is currently implemented.
- FLAGS_WO_ARGS = 'bhHSu1xXcB'
- FLAGS_W_ARGS = 'fFqlrs'
- VALID_FLAGS = 'bhHSu1xXcBfFqlrs'
- build_command_list(subcommand, options_string, options_dict, regions)[source]
Convert all init args to list form.
- to_options_list(options_string, options_dict)[source]
Convert both options_string and options_dict to list form while filtering out non-‘valid’ options.
- class galaxy.datatypes.dataproviders.dataset.SQliteDataProvider(source, query=None, **kwargs)[source]
Bases:
DataProvider
Data provider that uses a sqlite database file as its source.
Allows any query to be run and returns the resulting rows as sqlite3 row objects
- class galaxy.datatypes.dataproviders.dataset.SQliteDataTableProvider(source, query=None, headers=False, limit=9223372036854775807, **kwargs)[source]
Bases:
DataProvider
Data provider that uses a sqlite database file as its source. Allows any query to be run and returns the resulting rows as arrays of arrays
- __init__(source, query=None, headers=False, limit=9223372036854775807, **kwargs)[source]
Sets up a data provider, validates supplied source.
- Parameters:
source – the source that this iterator will loop over. (Should implement the iterable interface and ideally have the context manager interface as well)
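The "rows as arrays of arrays" output can be sketched with the standard `sqlite3` module. `query_as_table` is a hypothetical function, and the in-memory database stands in for the sqlite file that the real provider takes as its source; how the real class treats `headers` and `limit` may differ.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE genes (name TEXT, start INTEGER)")
connection.executemany(
    "INSERT INTO genes VALUES (?, ?)", [("geneA", 100), ("geneB", 250)]
)


def query_as_table(connection, query, headers=False, limit=None):
    """Hypothetical sketch: run `query` and return rows as lists."""
    cursor = connection.execute(query)
    table = []
    if headers:
        # cursor.description holds one tuple per column; item 0 is the name
        table.append([column[0] for column in cursor.description])
    rows = cursor.fetchall() if limit is None else cursor.fetchmany(limit)
    table.extend(list(row) for row in rows)
    return table


query = "SELECT name, start FROM genes ORDER BY start"
table = query_as_table(connection, query, headers=True)
first_row_only = query_as_table(connection, query, limit=1)
```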
- class galaxy.datatypes.dataproviders.dataset.SQliteDataDictProvider(source, query=None, **kwargs)[source]
Bases:
DataProvider
Data provider that uses a sqlite database file as its source. Allows any query to be run and returns the resulting rows as arrays of dicts
galaxy.datatypes.dataproviders.decorators module
DataProvider related decorators.
- galaxy.datatypes.dataproviders.decorators.has_dataproviders(cls)[source]
Wraps a class (generally a Datatype), finds methods within that have been decorated with @dataprovider and adds them, by their name, to a map in the class.
This allows a class to maintain a name -> method map, effectively ‘registering’ dataprovider factory methods:
    @has_dataproviders
    class MyDtype( data.Data ):

        @dataprovider_factory( 'bler' )
        def provide_some_bler( self, dataset, **settings ):
            '''blerblerbler'''
            dataset_source = providers.DatasetDataProvider( dataset )
            # ... chain other, intermediate providers here
            return providers.BlerDataProvider( dataset_source, **settings )

    # use the base method in data.Data
    provider = dataset.datatype.dataprovider( dataset, 'bler', my_setting='blah', ... )
    # OR directly from the map
    provider = dataset.datatype.dataproviders[ 'bler' ]( dataset, my_setting='blah', ... )
- galaxy.datatypes.dataproviders.decorators.dataprovider_factory(name, settings=None)[source]
Wraps a class method and marks it as a dataprovider factory and creates a function to parse query strings to __init__ arguments as the parse_query_string_settings attribute of the factory function.
An example use of the parse_query_string_settings:
    kwargs = dataset.datatype.dataproviders[ provider ].parse_query_string_settings( query_kwargs )
    return list( dataset.datatype.dataprovider( dataset, provider, **kwargs ) )
- Parameters:
name (any hashable var) – what name/key to register the factory under in cls.dataproviders
settings (dictionary) – dictionary containing key/type pairs for parsing query strings to __init__ arguments
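To make the idea of a settings dictionary concrete, the sketch below casts query-string values using type names like those in the `settings` attributes listed in this package. `parse_query_settings` is a hypothetical function; the comma separator for lists and the accepted spellings of true are assumptions, and the `list:escaped` type is not handled.

```python
def parse_query_settings(settings, query):
    """Hypothetical sketch: cast query-string values using a settings map."""
    casts = {
        "int": int,
        "float": float,
        "str": str,
        # assumed spellings of a true value
        "bool": lambda value: value.lower() in ("true", "1", "yes"),
    }
    parsed = {}
    for key, type_name in settings.items():
        if key not in query:
            continue
        value = query[key]
        if type_name.startswith("list:"):
            cast = casts[type_name[len("list:"):]]
            # assumed list encoding: comma-separated items
            parsed[key] = [cast(item) for item in value.split(",")]
        else:
            parsed[key] = casts[type_name](value)
    return parsed


settings = {"limit": "int", "indeces": "list:int", "strip_lines": "bool"}
query = {"limit": "10", "indeces": "0,2,5", "strip_lines": "false", "other": "x"}
kwargs = parse_query_settings(settings, query)
```

Keys that are absent from the settings map (`other` here) are ignored, so only expected arguments would reach a provider's __init__.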
galaxy.datatypes.dataproviders.exceptions module
DataProvider related exceptions.
- exception galaxy.datatypes.dataproviders.exceptions.InvalidDataProviderSource(source=None, msg='')[source]
Bases:
TypeError
Raised when an unusable source is passed to a provider.
- exception galaxy.datatypes.dataproviders.exceptions.NoProviderAvailable(factory_source, format_requested=None, msg='')[source]
Bases:
TypeError
Raised when no provider is found for the given format_requested.
- Parameters:
factory_source – the item that the provider was requested from
format_requested – the format_requested (a hashable key to access factory_source.datatypes with)
Both params are attached to this class and accessible to the try-catch receiver.
Meant to be used within a class that builds dataproviders (e.g. a Datatype)
galaxy.datatypes.dataproviders.external module
Data providers that iterate over a source that is not in memory or not in a file.
- class galaxy.datatypes.dataproviders.external.SubprocessDataProvider(*args, **kwargs)[source]
Bases:
DataProvider
Data provider that uses the output from an intermediate program and subprocess as its data source.
- __init__(*args, **kwargs)[source]
- Parameters:
args (variadic function args) – the list of strings used to build commands.
- class galaxy.datatypes.dataproviders.external.RegexSubprocessDataProvider(*args, **kwargs)[source]
Bases:
RegexLineDataProvider
RegexLineDataProvider that uses a SubprocessDataProvider as its data source.
- class galaxy.datatypes.dataproviders.external.URLDataProvider(url, method='GET', data=None, **kwargs)[source]
Bases:
DataProvider
Data provider that uses the contents of a URL for its data source.
This can be piped through other providers (column, map, genome region, etc.).
- VALID_METHODS = ('GET', 'POST')
- class galaxy.datatypes.dataproviders.external.GzipDataProvider(source, **kwargs)[source]
Bases:
DataProvider
Data provider that uses g(un)zip on a file as its source.
This can be piped through other providers (column, map, genome region, etc.).
- class galaxy.datatypes.dataproviders.external.TempfileDataProvider(source, **kwargs)[source]
Bases:
DataProvider
Writes the data from the given source to a temp file, allowing it to be used as a source where a file_name is needed (e.g. as a parameter to a command line tool: samtools view -t <this_provider.source.get_file_name()>)
galaxy.datatypes.dataproviders.hierarchy module
Dataproviders that iterate over lines from their sources.
- class galaxy.datatypes.dataproviders.hierarchy.HierarchalDataProvider(source, **kwargs)[source]
Bases:
BlockDataProvider
Class that uses formats where a datum may have a parent or children data.
e.g. XML, HTML, GFF3, Phylogenetic
- class galaxy.datatypes.dataproviders.hierarchy.XMLDataProvider(source, selector=None, max_depth=None, **kwargs)[source]
Bases:
HierarchalDataProvider
Data provider that converts selected XML elements to dictionaries.
- settings: Dict[str, str] = {'limit': 'int', 'max_depth': 'int', 'offset': 'int', 'selector': 'str'}
- ITERPARSE_ALL_EVENTS = ('start', 'end', 'start-ns', 'end-ns')
- __init__(source, selector=None, max_depth=None, **kwargs)[source]
- Parameters:
selector – some partial string in the desired tags to return
max_depth – the number of generations of descendants to return
- matches_selector(element, selector=None)[source]
Returns true if the element matches the selector.
- Parameters:
element – an XML Element
selector – some partial string in the desired tags to return
Change point for more sophisticated selectors.
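A partial-string selector can be sketched with the standard `xml.etree.ElementTree` module. This standalone `matches_selector` is a hypothetical re-creation based only on the parameter description; the real method and the dictionary layout the provider produces may differ.

```python
import io
import xml.etree.ElementTree as ET


def matches_selector(element, selector=None):
    """Hypothetical sketch: a selector is a substring of the tag name."""
    return selector is None or selector in element.tag


xml_source = io.BytesIO(b"<root><gene id='a'/><exon id='b'/><genes/></root>")
# iterparse yields (event, element) pairs; 'end' fires once an element is complete
matched = [
    dict(tag=element.tag, attrib=dict(element.attrib))
    for _event, element in ET.iterparse(xml_source, events=("end",))
    if matches_selector(element, "gene")
]
```

Because the match is partial, the selector `gene` also picks up the `genes` element in this sketch.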
galaxy.datatypes.dataproviders.line module
Dataproviders that iterate over lines from their sources.
- class galaxy.datatypes.dataproviders.line.FilteredLineDataProvider(source, strip_lines=True, strip_newlines=False, provide_blank=False, comment_char='#', **kwargs)[source]
Bases:
LimitedOffsetDataProvider
Data provider that yields lines of data from its source allowing optional control over which line to start on and how many lines to return.
- DEFAULT_COMMENT_CHAR = '#'
- settings: Dict[str, str] = {'comment_char': 'str', 'limit': 'int', 'offset': 'int', 'provide_blank': 'bool', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
- __init__(source, strip_lines=True, strip_newlines=False, provide_blank=False, comment_char='#', **kwargs)[source]
- Parameters:
strip_lines (bool) – remove whitespace from the beginning and ending of each line (or not). Optional: defaults to True
strip_newlines (bool) – remove newlines only (only functions when strip_lines is false). Optional: defaults to False
provide_blank (bool) – are empty lines considered valid and provided? Optional: defaults to False
comment_char (str) – character(s) that indicate a line isn’t data (a comment) and should not be provided. Optional: defaults to ‘#’
- class galaxy.datatypes.dataproviders.line.RegexLineDataProvider(source, regex_list=None, invert=False, **kwargs)[source]
Bases:
FilteredLineDataProvider
Data provider that yields only those lines of data from its source that do (or do not when invert is True) match one or more of the given list of regexs.
Note
the regex matches are effectively OR’d (if any regex matches the line it is considered valid and will be provided).
- settings: Dict[str, str] = {'comment_char': 'str', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
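The OR semantics of `regex_list` and the effect of `invert` can be sketched as below. `regex_filter` is a hypothetical generator; whether the real provider uses `re.match` or `re.search` is not stated here, so the use of `search` is an assumption.

```python
import re


def regex_filter(lines, regex_list, invert=False):
    """Hypothetical sketch: keep lines matching ANY regex (or none, if invert)."""
    compiled = [re.compile(regex) for regex in regex_list]
    for line in lines:
        matched = any(regex.search(line) for regex in compiled)
        if matched != invert:
            yield line


lines = ["chr1 100", "chrX 5", "scaffold_9 12", "chr2 300"]
patterns = [r"^chr\d", r"^scaffold"]
kept = list(regex_filter(lines, patterns))
dropped = list(regex_filter(lines, patterns, invert=True))
```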
- class galaxy.datatypes.dataproviders.line.BlockDataProvider(source, new_block_delim_fn=None, block_filter_fn=None, **kwargs)[source]
Bases:
LimitedOffsetDataProvider
Class that uses formats where multiple lines combine to describe a single datum. The data output will be a list of either map/dicts or sub-arrays.
Uses FilteredLineDataProvider as its source (kwargs not passed).
e.g. Fasta, GenBank, MAF, hg log
Note: memory intensive (gathers list of lines before output)
- __init__(source, new_block_delim_fn=None, block_filter_fn=None, **kwargs)[source]
- Parameters:
new_block_delim_fn (function) – T/F function to determine whether a given line is the start of a new block.
block_filter_fn (function) – function that determines if a block is valid and will be provided. Optional: defaults to None (no filtering)
- filter(line)[source]
Line filter here being used to aggregate/assemble lines into a block and determine whether the line indicates a new block.
- Parameters:
line (str) – the incoming line from the source
- Returns:
a block or None
- is_new_block(line)[source]
Returns True if the given line indicates the start of a new block (and the current block should be provided) or False if not.
- assemble_current_block()[source]
Build the current data into a block.
Called per block (just before providing).
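The block-assembly idea (gather lines until a delimiter line signals a new block, then provide the finished block) can be sketched as a plain generator. `blocks` is a hypothetical function; it takes the same two callables as the constructor above but does not reproduce the filter/is_new_block/assemble_current_block method split.

```python
def blocks(lines, new_block_delim_fn, block_filter_fn=None):
    """Hypothetical sketch: gather lines into blocks, one block per datum."""
    current = []
    for line in lines:
        # a delimiter line closes the block gathered so far
        if new_block_delim_fn(line) and current:
            if block_filter_fn is None or block_filter_fn(current):
                yield current
            current = []
        current.append(line)
    # provide the final block once the source is exhausted
    if current and (block_filter_fn is None or block_filter_fn(current)):
        yield current


def is_header(line):
    return line.startswith(">")


lines = [">seq1", "ACGT", "TT", ">seq2", ">seq3", "GG"]
all_blocks = list(blocks(lines, is_header))
non_empty = list(blocks(lines, is_header, lambda block: len(block) > 1))
```

Each block is held in memory as a list until it is complete, which is why this style of provider is described as memory intensive.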