The GMQLDataset

Here we present the functions that can be used on a GMQLDataset.

class GMQLDataset(parser=None, index=None, location='local', path_or_name=None, local_sources=None, remote_sources=None, meta_profile=None)[source]

The main abstraction of the library. A GMQLDataset represents a genomic dataset in the GMQL standard and is divided into region data and metadata. The functions that can be applied to a GMQLDataset affect one of these two parts, or both.

For each operator function that can be applied to a GMQLDataset we provide the documentation, some examples, and we specify which GMQL operator the function is a wrapper of.

get_reg_attributes()[source]

Returns the region fields of the dataset

Returns: a list of field names
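An example of usage (the printed field names are illustrative and depend on the dataset schema):

import gmql as gl

dataset = gl.get_example_dataset("Example_Dataset_1")
fields = dataset.get_reg_attributes()
print(fields)  # e.g. ['chr', 'start', 'stop', 'strand', 'name', 'score', ...]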
MetaField(name, t=None)[source]

Creates an instance of a metadata field of the dataset. It can be used to build expressions or conditions for projection or selection. Notice that this function is equivalent to calling:

dataset["name"]

If the MetaField is used in a region projection (reg_project()), the user must also specify the type of the metadata attribute that is selected:

dataset.reg_project(new_field_dict={'new_field': dataset['name', 'string']})
Parameters:
  • name – the name of the metadata that is considered
  • t – the type of the metadata attribute {string, int, boolean, double}
Returns:

a MetaField instance

RegField(name)[source]

Creates an instance of a region field of the dataset. It can be used in building expressions or conditions for projection or selection. Notice that this function is equivalent to:

dataset.name
Parameters: name – the name of the region field that is considered
Returns: a RegField instance
select(meta_predicate=None, region_predicate=None, semiJoinDataset=None, semiJoinMeta=None)[source]

Wrapper of SELECT

Selection operation. It enables filtering datasets on the basis of region features or metadata attributes. In addition, it is possible to perform a selection based on the existence of certain metadata attributes (semiJoinMeta) and the matching of their values with those associated with at least one sample in an external dataset (semiJoinDataset).

Therefore, the selection can be based on:

  • Metadata predicates: selection based on the existence and values of certain metadata attributes in each sample. In predicates, attribute-value conditions can be composed using logical predicates & (and), | (or) and ~ (not)
  • Region predicates: selection based on region attributes. Conditions can be composed using logical predicates & (and), | (or) and ~ (not)
  • SemiJoin clauses: selection based on the existence of certain metadata attributes (semiJoinMeta) and the matching of their values with those associated with at least one sample in an external dataset (semiJoinDataset)

In the following example we select all the samples of Example_Dataset_1 having antibody CTCF; from these samples we keep only the regions on chromosome 6; finally, we keep only the samples having a matching antibody_targetClass value in Example_Dataset_2:

import gmql as gl
d1 = gl.get_example_dataset("Example_Dataset_1")
d2 = gl.get_example_dataset("Example_Dataset_2")

d_select = d1.select(meta_predicate = d1['antibody'] == "CTCF",
                     region_predicate = d1.chr == "chr6",
                     semiJoinDataset=d2, semiJoinMeta=["antibody_targetClass"])
Parameters:
  • meta_predicate – logical predicate on the metadata <attribute, value> pairs
  • region_predicate – logical predicate on the region feature values
  • semiJoinDataset – another GMQLDataset
  • semiJoinMeta – a list of metadata attributes (strings)
Returns:

a new GMQLDataset

meta_select(predicate=None, semiJoinDataset=None, semiJoinMeta=None)[source]

Wrapper of SELECT

Wrapper of the select() function, filtering samples based only on metadata.

Parameters:
  • predicate – logical predicate on the values of the rows
  • semiJoinDataset – another GMQLDataset
  • semiJoinMeta – a list of metadata attributes (strings)
Returns:

a new GMQLDataset

Example 1:

output_dataset = dataset.meta_select(dataset['patient_age'] < 70)
# This statement can be written also as
output_dataset = dataset[ dataset['patient_age'] < 70 ]

Example 2:

output_dataset = dataset.meta_select( (dataset['tissue_status'] == 'tumoral') &
                                      (dataset['tumor_tag'] != 'gbm') | (dataset['tumor_tag'] == 'brca'))
# This statement can be written also as
output_dataset = dataset[ (dataset['tissue_status'] == 'tumoral') &
                          (dataset['tumor_tag'] != 'gbm') | (dataset['tumor_tag'] == 'brca') ]

Example 3:

JUN_POLR2A_TF = HG19_ENCODE_NARROW.meta_select( HG19_ENCODE_NARROW['antibody_target'] == 'JUN',
                                                semiJoinDataset=POLR2A_TF, semiJoinMeta=['cell'])

The meta selection predicate can use all the classical equalities and inequalities {>, <, >=, <=, ==, !=}; predicates can be connected by the classical logical symbols {& (AND), | (OR), ~ (NOT)} plus the isin function.

reg_select(predicate=None, semiJoinDataset=None, semiJoinMeta=None)[source]

Wrapper of SELECT

Wrapper of the select() function, filtering regions based only on region attributes.

Parameters:
  • predicate – logical predicate on the values of the regions
  • semiJoinDataset – another GMQLDataset
  • semiJoinMeta – a list of metadata attributes (strings)
Returns:

a new GMQLDataset

An example of usage:

new_dataset = dataset.reg_select((dataset.chr == 'chr1') | (dataset.pValue < 0.9))

You can also use metadata attributes in the selection:

new_dataset = dataset.reg_select(dataset.score > dataset['size'])

This statement selects all the regions whose score field value is strictly greater than the value of the sample metadata attribute size.

The region selection predicate can use all the classical equalities and inequalities {>, <, >=, <=, ==, !=}; predicates can be connected by the classical logical symbols {& (AND), | (OR), ~ (NOT)} plus the isin function.

In order to be sure about the correctness of the expression, please use parentheses to delimit the various predicates.

project(projected_meta=None, new_attr_dict=None, all_but_meta=None, projected_regs=None, new_field_dict=None, all_but_regs=None)[source]

Wrapper of PROJECT

The PROJECT operator creates, from an existing dataset, a new dataset with all the samples (with their regions and region values) of the input one, but keeping for each sample only those metadata and/or region attributes expressed in the operator parameter list. Region coordinates and the values of the remaining metadata and region attributes are the same as in the input dataset. Differently from the SELECT operator, PROJECT allows the user to:

  • Remove existing metadata and/or region attributes from a dataset;
  • Create new metadata and/or region attributes to be added to the result.
Parameters:
  • projected_meta – list of metadata attributes to project on
  • new_attr_dict – an optional dictionary of the form {'new_meta_1': function1, 'new_meta_2': function2, ...} in which every function computes the new metadata attribute based on the values of the others
  • all_but_meta – list of metadata attributes that must be excluded from the projection
  • projected_regs – list of the region fields to select
  • new_field_dict – an optional dictionary of the form {'new_field_1': function1, 'new_field_2': function2, ...} in which every function computes the new region field based on the values of the others
  • all_but_regs – list of region fields that must be excluded from the projection
Returns:

a new GMQLDataset
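An example of usage, in which we keep only the antibody metadata attribute and the score region field, and derive a new region field from score (the attribute names are those of Example_Dataset_1, used throughout these examples; the new field name is illustrative):

import gmql as gl

dataset = gl.get_example_dataset("Example_Dataset_1")
projected = dataset.project(projected_meta=['antibody'],
                            projected_regs=['score'],
                            new_field_dict={'half_score': dataset.score / 2})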

meta_project(attr_list=None, all_but=None, new_attr_dict=None)[source]

Wrapper of PROJECT

Project the metadata based on a list of attribute names

Parameters:
  • attr_list – list of the metadata fields to select
  • all_but – list of metadata that must be excluded from the projection.
  • new_attr_dict – an optional dictionary of the form {'new_field_1': function1, 'new_field_2': function2, ...} in which every function computes the new field based on the values of the others
Returns:

a new GMQLDataset

Notice that if attr_list is specified, all_but cannot be specified, and vice versa.

Examples:

new_dataset = dataset.meta_project(attr_list=['antibody', 'ID'],
                                   new_attr_dict={'new_meta': dataset['ID'] + 100})
reg_project(field_list=None, all_but=None, new_field_dict=None)[source]

Wrapper of PROJECT

Project the region data based on a list of field names

Parameters:
  • field_list – list of the fields to select
  • all_but – keep only the region fields different from the ones specified
  • new_field_dict – an optional dictionary of the form {'new_field_1': function1, 'new_field_2': function2, ...} in which every function computes the new field based on the values of the others
Returns:

a new GMQLDataset

An example of usage:

new_dataset = dataset.reg_project(['pValue', 'name'],
                                  {'new_field': dataset.pValue / 2})

new_dataset = dataset.reg_project(field_list=['peak', 'pvalue'],
                                  new_field_dict={'new_field': dataset.pvalue * dataset['cell_age', 'float']})

Notice that you can use metadata attributes to build new region fields. The only thing to remember when doing so is to also define the type of the output region field in the metadata attribute definition (for example, dataset['cell_age', 'float'] is required to define the new attribute new_field as float). In particular, the following type names are accepted: 'string', 'char', 'long', 'integer', 'boolean', 'float', 'double'.

extend(new_attr_dict)[source]

Wrapper of EXTEND

For each sample in an input dataset, the EXTEND operator builds new metadata attributes, assigns their values as the result of arithmetic and/or aggregate functions calculated on sample region attributes, and adds them to the existing metadata attribute-value pairs of the sample. Sample number and their genomic regions, with their attributes and values, remain unchanged in the output dataset.

Parameters: new_attr_dict – a dictionary of the type {'new_metadata': AGGREGATE_FUNCTION('field'), ...}
Returns: a new GMQLDataset

An example of usage, in which we count the number of regions in each sample and compute the minimum value of the pValue field, exporting them as the metadata attributes regionCount and minPValue, respectively:

import gmql as gl

dataset = gl.get_example_dataset("Example_Dataset_1")
new_dataset = dataset.extend({'regionCount' : gl.COUNT(),
                              'minPValue' : gl.MIN('pValue')})
cover(minAcc, maxAcc, groupBy=None, new_reg_fields=None, cover_type='normal')[source]

Wrapper of COVER

COVER is a GMQL operator that takes as input a dataset (of usually, but not necessarily, multiple samples) and returns another dataset (with a single sample, if no groupBy option is specified) by "collapsing" the input samples and their regions according to certain rules specified by the COVER parameters. The attributes of the output regions are only the region coordinates, plus, when aggregate functions are specified, new attributes with aggregate values computed over the attribute values of the contributing input regions. Output metadata are the union of the input ones, plus the metadata attributes JaccardIntersect and JaccardResult, representing global Jaccard Indexes for the considered dataset, computed as the corresponding region Jaccard Indexes but on the whole sample regions.

Parameters:
  • cover_type – the kind of cover variant you want ['normal', 'flat', 'summit', 'histogram']
  • minAcc – minimum accumulation value, i.e. the minimum number of overlapping regions to be considered during COVER execution. It can be any positive number or the strings {'ALL', 'ANY'}.
  • maxAcc – maximum accumulation value, i.e. the maximum number of overlapping regions to be considered during COVER execution. It can be any positive number or the strings {'ALL', 'ANY'}.
  • groupBy – optional list of metadata attributes
  • new_reg_fields – dictionary of the type {'new_region_attribute': AGGREGATE_FUNCTION('field'), ...}
Returns:

a new GMQLDataset

An example of usage:

cell_tf = narrow_peak.cover(minAcc=1, maxAcc="ANY",
                            groupBy=['cell', 'antibody_target'],
                            cover_type="normal")
normal_cover(minAcc, maxAcc, groupBy=None, new_reg_fields=None)[source]

Wrapper of COVER

The normal cover operation as described in cover(). Equivalent to calling:

dataset.cover("normal", ...)
flat_cover(minAcc, maxAcc, groupBy=None, new_reg_fields=None)[source]

Wrapper of COVER

Variant of the function cover() that returns the union of all the regions which contribute to the COVER. More precisely, it returns the contiguous regions that start from the first end and stop at the last end of the regions which would contribute to each region of a COVER.

Equivalent to calling:

cover("flat", ...)
summit_cover(minAcc, maxAcc, groupBy=None, new_reg_fields=None)[source]

Wrapper of COVER

Variant of the function cover() that returns only those portions of the COVER result where the maximum number of regions overlap (this is done by returning only regions that start from a position after which the number of overlaps does not increase, and stop at a position where either the number of overlapping regions decreases or violates the maximum accumulation index).

Equivalent to calling:

cover("summit", ...)
histogram_cover(minAcc, maxAcc, groupBy=None, new_reg_fields=None)[source]

Wrapper of COVER

Variant of the function cover() that returns all regions contributing to the COVER divided in different (contiguous) parts according to their accumulation index value (one part for each different accumulation value), which is assigned to the AccIndex region attribute.

Equivalent to calling:

cover("histogram", ...)
join(experiment, genometric_predicate, output='LEFT', joinBy=None, refName='REF', expName='EXP', left_on=None, right_on=None)[source]

Wrapper of JOIN

The JOIN operator takes in input two datasets, respectively known as anchor (the first/left one) and experiment (the second/right one), and returns a dataset of samples consisting of regions extracted from the operands according to the specified condition (known as genometric predicate). The number of generated output samples is the Cartesian product of the number of samples in the anchor and in the experiment dataset (if no joinBy clause is specified). The attributes (and their values) of the regions in the output dataset are the union of the region attributes (with their values) of the input datasets; homonymous attributes are disambiguated by prefixing their name with their dataset name. The output metadata are the union of the input metadata, with their attribute names prefixed with their input dataset name.

Parameters:
  • experiment – an other GMQLDataset
  • genometric_predicate – a list of Genometric atomic conditions. For an explanation of each of them go to the respective page.
  • output

    one of four different values that declare which region is given in output for each input pair of anchor and experiment regions satisfying the genometric predicate:

    • 'LEFT': outputs the anchor regions from the anchor dataset that satisfy the genometric predicate
    • 'RIGHT': outputs the experiment regions from the experiment dataset that satisfy the genometric predicate
    • 'INT': outputs the overlapping part (intersection) of the anchor and experiment regions that satisfy the genometric predicate; if the intersection is empty, no output is produced
    • 'CONTIG': outputs the concatenation between the anchor and experiment regions that satisfy the genometric predicate, i.e. the output region is defined as having left (right) coordinates equal to the minimum (maximum) of the corresponding coordinate values in the anchor and experiment regions satisfying the genometric predicate
  • joinBy – list of metadata attributes
  • refName – name that you want to assign to the reference dataset
  • expName – name that you want to assign to the experiment dataset
  • left_on – list of region fields of the reference on which the join must be performed
  • right_on – list of region fields of the experiment on which the join must be performed
Returns:

a new GMQLDataset

An example of usage, in which we perform the join operation between Example_Dataset_1 and Example_Dataset_2, joining each region of the former with the closest region of the latter at a distance of at least 120kb, and outputting the regions of Example_Dataset_2 that match the criteria:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
d2 = gl.get_example_dataset("Example_Dataset_2")

result_dataset = d1.join(experiment=d2,
                         genometric_predicate=[gl.MD(1), gl.DGE(120000)],
                         output="right")
map(experiment, new_reg_fields=None, joinBy=None, refName='REF', expName='EXP')[source]

Wrapper of MAP

MAP is a non-symmetric operator over two datasets, respectively called reference and experiment. The operation computes, for each sample in the experiment dataset, aggregates over the values of the experiment regions that intersect with a region in a reference sample, for each region of each sample in the reference dataset; we say that experiment regions are mapped to the reference regions. The number of generated output samples is the Cartesian product of the samples in the two input datasets; each output sample has the same regions as the related input reference sample, with their attributes and values, plus the attributes computed as aggregates over experiment region values. Output sample metadata are the union of the related input sample metadata, whose attribute names are prefixed with their input dataset name. For each reference sample, the MAP operation produces a matrix-like structure, called genomic space, where each experiment sample is associated with a row, each reference region with a column, and each matrix row is a vector of numbers: the aggregates computed during MAP execution. When the features of the reference regions are unknown, MAP helps in extracting the most interesting regions out of many candidates.

Parameters:
  • experiment – a GMQLDataset
  • new_reg_fields – an optional dictionary of the form {'new_field_1': AGGREGATE_FUNCTION(field), ...}
  • joinBy – optional list of metadata
  • refName – name that you want to assign to the reference dataset
  • expName – name that you want to assign to the experiment dataset
Returns:

a new GMQLDataset

In the following example, we map the regions of Example_Dataset_2 onto those of Example_Dataset_1 and, for each region of Example_Dataset_1, we output the average pvalue and the number of mapped regions of Example_Dataset_2. In addition, we specify that the output region fields and metadata attributes will be prefixed with D1 and D2, respectively for attributes and fields belonging to Example_Dataset_1 and Example_Dataset_2:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
d2 = gl.get_example_dataset("Example_Dataset_2")

result = d1.map(experiment=d2, refName="D1", expName="D2",
                new_reg_fields={"avg_pValue": gl.AVG("pvalue")})
order(meta=None, meta_ascending=None, meta_top=None, meta_k=None, regs=None, regs_ascending=None, region_top=None, region_k=None)[source]

Wrapper of ORDER

The ORDER operator is used to order either samples, sample regions, or both, in a dataset according to a set of metadata and/or region attributes, and/or region coordinates. The number of samples and their regions in the output dataset is as in the input dataset, as well as their metadata and region attributes and values, but a new ordering metadata and/or region attribute is added with the sample or region ordering value, respectively.

Parameters:
  • meta – list of metadata attributes
  • meta_ascending – list of boolean values (True = ascending, False = descending)
  • meta_top – "top", "topq", "topp" or None
  • meta_k – a number specifying how many results to be retained
  • regs – list of region attributes
  • regs_ascending – list of boolean values (True = ascending, False = descending)
  • region_top – "top", "topq", "topp" or None
  • region_k – a number specifying how many results to be retained
Returns:

a new GMQLDataset

Example of usage. We order the samples of Example_Dataset_1 by ascending antibody and descending antibody_targetClass metadata values, keeping only the first sample. We also order the resulting regions based on the score field in descending order, keeping only the first one also in this case:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")

result = d1.order(meta=["antibody", "antibody_targetClass"],
                  meta_ascending=[True, False], meta_top="top", meta_k=1,
                  regs=['score'], regs_ascending=[False],
                  region_top="top", region_k=1)
difference(other, joinBy=None, exact=False)[source]

Wrapper of DIFFERENCE

DIFFERENCE is a binary, non-symmetric operator that produces one sample in the result for each sample of the first operand, by keeping the same metadata of the first operand sample and only those regions (with their schema and values) of the first operand sample which do not intersect with any region in the second operand sample (also known as negative regions).

Parameters:
  • other – GMQLDataset
  • joinBy – (optional) list of metadata attributes. It is used to extract subsets of samples on which to apply the operator: only those samples in the current and other dataset that have the same value for each specified attribute are considered when performing the operation
  • exact – boolean. If true, the regions are considered as intersecting only if their coordinates are exactly the same
Returns:

a new GMQLDataset

Example of usage. We compute the exact difference between Example_Dataset_1 and Example_Dataset_2, considering only the samples with the same antibody:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
d2 = gl.get_example_dataset("Example_Dataset_2")

result = d1.difference(other=d2, exact=True, joinBy=['antibody'])
union(other, left_name='LEFT', right_name='RIGHT')[source]

Wrapper of UNION

The UNION operation is used to integrate homogeneous or heterogeneous samples of two datasets within a single dataset; for each sample of either one of the input datasets, a sample is created in the result as follows:

  • its metadata are the same as in the original sample;
  • its schema is the schema of the first (left) input dataset; new identifiers are assigned to each output sample;
  • its regions are the same (in coordinates and attribute values) as in the original sample. Region attributes which are missing in an input dataset sample (w.r.t. the merged schema) are set to null.
Parameters:
  • other – a GMQLDataset
  • left_name – name that you want to assign to the left dataset
  • right_name – name that you want to assign to the right dataset
Returns:

a new GMQLDataset

Example of usage:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
d2 = gl.get_example_dataset("Example_Dataset_2")

result = d1.union(other=d2, left_name="D1", right_name="D2")
merge(groupBy=None)[source]

Wrapper of MERGE

The MERGE operator builds a new dataset consisting of a single sample having

  • as regions all the regions of all the input samples, with the same attributes and values
  • as metadata the union of all the metadata attribute-values of the input samples.

A groupBy clause can be specified on metadata: the samples are then partitioned in groups, each with a distinct value of the grouping metadata attributes, and the MERGE operation is applied to each group separately, yielding one sample in the result dataset for each group. Samples without the grouping metadata attributes are disregarded.

Parameters: groupBy – list of metadata attributes
Returns: a new GMQLDataset

Example of usage:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
result = d1.merge(['antibody'])
group(meta=None, meta_aggregates=None, regs=None, regs_aggregates=None, meta_group_name='_group')[source]

Wrapper of GROUP

The GROUP operator is used for grouping regions and/or metadata of the input dataset samples according to distinct values of certain attributes (known as grouping attributes); new grouping attributes are added to the samples in the output dataset, storing the results of aggregate function evaluations over metadata and/or regions in each group of samples. Samples having missing values for any of the grouping attributes are discarded.

Parameters:
  • meta – (optional) a list of metadata attributes
  • meta_aggregates – (optional) {'new_attr': fun}
  • regs – (optional) a list of region fields
  • regs_aggregates – (optional) {'new_attr': fun}
  • meta_group_name – (optional) the name to give to the group attribute in the metadata
Returns:

a new GMQLDataset

Example of usage. We group samples by antibody and we aggregate the region pvalues taking the maximum value, calling the new region field maxPvalue:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
result = d1.group(meta=['antibody'], regs_aggregates={'maxPvalue': gl.MAX("pvalue")})
meta_group(meta, meta_aggregates=None)[source]

Wrapper of GROUP

Group operation only for metadata. For further information check group().

regs_group(regs, regs_aggregates=None)[source]

Wrapper of GROUP

Group operation only for region data. For further information check group().
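An example of usage of the region grouping variant, in which we group the regions of each sample by chromosome and compute the average score of each group (a minimal sketch on Example_Dataset_1; the new attribute name avg_score is illustrative):

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
result = d1.regs_group(regs=['chr'],
                       regs_aggregates={'avg_score': gl.AVG('score')})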

materialize(output_path=None, output_name=None, all_load=True, mode=None)[source]

Wrapper of MATERIALIZE

Starts the execution of the operations for the GMQLDataset. PyGMQL implements lazy execution: no operation is performed until the materialization of the results is requested. This operation can happen both locally and remotely.

  • Local mode: if the GMQLDataset is local (based on local data), the user can specify the output_path where the results of the computation will be stored.
Parameters:
  • output_path – (Optional) If specified, the user can say where to locally save the results of the computations.
  • output_name – (Optional) Can be used only if the dataset is remote. It represents the name that the user wants to give to the resulting dataset on the server
  • all_load – (Optional) It specifies whether the result dataset should be directly converted to a GDataframe (True) or kept as a GMQLDataset (False) for future local queries.
Returns:

A GDataframe or a GMQLDataset
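An example of usage, in which a local query is materialized into a GDataframe:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
filtered = d1.reg_select(d1.chr == 'chr6')

# Triggers the lazy computation; with the default all_load=True the result
# is returned as a GDataframe holding both regions and metadata in memory.
result = filtered.materialize()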

head(n=5)[source]

Returns a small set of regions and metadata from a query. It is supposed to be used for debugging purposes or for data exploration.

Parameters: n – how many samples to retrieve
Returns: a GDataframe
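
An example of usage (a GDataframe exposes the retrieved regions and metadata as the regs and meta pandas DataFrames):

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
preview = d1.head(n=5)
print(preview.regs)   # regions of the retrieved samples
print(preview.meta)   # their metadata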

Loading functions

You can create a GMQLDataset by loading the data using the following functions:

load_from_file(path, parser: gmql.dataset.parsers.RegionParser.RegionParser)[source]

Loads a GDM dataset from a single BED-like file.

Parameters:
  • path – location of the file
  • parser – RegionParser object specifying the parser of the file
Returns:

a GMQLDataset
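A minimal sketch of usage; the file path is hypothetical and the RegionParser constructor arguments (chrPos, startPos, stopPos) are illustrative assumptions about its signature:

import gmql as gl
# Import path as given in the signature above
from gmql.dataset.parsers.RegionParser import RegionParser

# Hypothetical BED-like file whose first three columns are chromosome,
# start and stop (the constructor arguments are assumptions)
parser = RegionParser(chrPos=0, startPos=1, stopPos=2)
dataset = gl.load_from_file("./data/regions.bed", parser=parser)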

load_from_path(local_path, parser=None)[source]

Loads the data from a local path into a GMQLDataset. The loading of the files is "lazy", which means that the files are loaded only when the user performs a materialization (see materialize()). The user can force the materialization of the data (maybe for an initial data exploration on only the metadata) by setting the reg_load (load the region data in memory), meta_load (load the metadata in memory) or all_load (load both region and metadata in memory) parameters. If the user sets this last parameter to True, a GDataframe is returned; otherwise a GMQLDataset is returned.

Parameters:
  • local_path – local path of the dataset
  • parser – the parser to be used for reading the data
  • all_load – if set to True, both region and metadata are loaded in memory and an instance of GDataframe is returned
Returns:

A new GMQLDataset or a GDataframe
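An example of usage; the local path is hypothetical, and we assume the dataset folder carries enough schema information for the parser to be inferred when it is omitted (otherwise pass a parser explicitly):

import gmql as gl

# Hypothetical local dataset folder
dataset = gl.load_from_path("./my_local_dataset/")
result = dataset.materialize()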

load_from_remote(remote_name, owner=None)[source]

Loads the data from a remote repository.

Parameters:
  • remote_name – The name of the dataset in the remote repository
  • owner – (optional) The owner of the dataset. If nothing is provided, the current user is used. For public datasets use 'public'.
Returns:

A new GMQLDataset or a GDataframe
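
A minimal sketch of usage; it assumes that a remote repository address has been configured and that the user has logged in beforehand (here via set_remote_address() and login(); the endpoint URL is illustrative):

import gmql as gl

gl.set_remote_address("http://gmql.eu/gmql-rest/")  # illustrative endpoint
gl.login()
dataset = gl.load_from_remote("Example_Dataset_1", owner="public")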