Welcome to PyGMQL’s documentation!

PyGMQL is a Python module that enables the user to perform operations on genomic data in a scalable way.

This library is part of the larger GMQL project, which aims at designing and developing genomic data management and analysis software on top of big data engines, helping biologists, researchers and data scientists.

GMQL is a declarative language with a SQL-like syntax. PyGMQL brings this paradigm to the interactive and script-oriented world of Python, enabling the integration of genomic data with classical Python packages for machine learning and data science.

Installation

Prerequisites

Here we list the requirements of this library in terms of supported Python versions and of the external programs needed to use it.

Java

In order to use the library you need to have Java installed on your system. In particular, the environment variable JAVA_HOME must be set to your current Java installation.

If JAVA_HOME is not set, an error will be thrown at the first import of the library. In that case the following steps must be performed:

  1. Install the latest version of Java (follow this link)

  2. Set JAVA_HOME. This can be done differently depending on your OS:

    1. Linux:
    echo export "JAVA_HOME=/path/to/java" >> ~/.bash_profile
    source ~/.bash_profile
    
    2. Mac:
    echo export "JAVA_HOME=\$(/usr/libexec/java_home)" >> ~/.bash_profile
    source ~/.bash_profile
    
    3. Windows:
      1. Right click My Computer and select Properties
      2. On the Advanced tab, select Environment Variables, and then edit JAVA_HOME to point to where the JDK software is located, for example, C:\Program Files\Java\jdk1.6.0_02

Python

Currently PyGMQL supports only Python 3.5, 3.6 and 3.7.

Using the GitHub repository

You can install this library by downloading its source code from the GitHub repository:

git clone https://github.com/DEIB-GECO/PyGMQL.git

and then using:

cd PyGMQL/
pip install -e .

This will install the library and its dependencies in your system.

Using PIP

The package can also be downloaded and installed directly into your Python distribution using:

pip install gmql

Installation of the backend

The PyGMQL computational engine is written in Scala. The backend comes as a JAR file, which is downloaded at the first usage of the library:

import gmql

Introduction

In this brief tutorial we will explain the typical workflow of a PyGMQL user. On the GitHub page of the project you can find many more examples of usage of the library.

You can use this library both interactively and programmatically. We strongly suggest using it inside a Jupyter notebook for the best graphical rendering and easiest data exploration.

Importing the library

To import the library simply type:

import gmql as gl

If it is the first time you use PyGMQL, the Scala backend will be downloaded. Therefore we suggest being connected to the internet the first time you use the library.

Loading of data

The first thing we want to do with PyGMQL is to load our data. You can do that by calling gmql.dataset.loaders.Loader.load_from_path(). This method loads a reference to a local GMQL dataset in memory and creates a GMQLDataset. If the dataset in the specified directory is already in the GMQL standard format (i.e. it has the XML schema file), you only need to do the following:

dataset = gl.load_from_path(local_path="path/to/local/dataset")

while, if the dataset has no schema, you need to provide it manually. This can be done by creating a custom parser using RegionParser, like in the following:

custom_parser = gl.parsers.RegionParser(chrPos=0, startPos=1, stopPos=2,
                                        otherPos=[(3, "gene", "string")])
dataset = gl.load_from_path(local_path="path/to/local/dataset", parser=custom_parser)

Writing a GMQL query

Now that we have a dataset, we can perform some GMQL operations on it. For example, we can select the samples that satisfy a specific metadata condition:

selected_samples = dataset[ (dataset['cell'] == 'es') | (dataset['tumor'] == 'brca') ]

Each operation on a GMQLDataset returns another GMQLDataset. You can also perform operations involving two datasets:

other_dataset = gl.load_from_path("path/to/other/local/dataset")

union_dataset = dataset.union(other_dataset)                            # the union
join_dataset = dataset.join(other_dataset, predicate=[gl.MD(10000)])    # a join

Materializing a result

PyGMQL implements a lazy execution strategy. No operation is performed until a materialize operation is requested:

result = join_dataset.materialize()

If nothing is passed to the materialize operation, all the data are directly loaded in memory without writing the result dataset to disk. If you also want to save the data for future computations, you need to specify the output path:

result = join_dataset.materialize("path/to/output/dataset")

The result data structure

When you materialize a result, a GDataframe object is returned. This object is a wrapper of two pandas dataframes, one for the regions and one for the metadata. You can access them in the following way:

result.meta     # for the metadata
result.regs     # for the regions

These dataframes are structured as follows:

  • The region dataframe has one row for each region; the index of the row is the identifier of the sample the region comes from. Each column represents a field of the region data.
  • The metadata dataframe has one row for each sample in the dataset. Each column represents a different metadata attribute, and each cell holds the values of that attribute for that sample. Multiple values are allowed in each cell.
(Figure: from the GMQLDataset to the resulting GDataframe)

The Genomic data model

As we have said, PyGMQL is a Python interface to the GMQL system. In order to understand how the library works, a brief insight into the data model used by GMQL is necessary.

GMQL is based on a representation of the genomic information known as GDM - Genomic Data Model. Datasets are composed of samples, which in turn contain two kinds of data:

  • Region values (or simply regions), aligned w.r.t. a given reference, with specific left-right ends within a chromosome. Regions can store different information regarding the “spot” they mark in a particular sample, such as region length or statistical significance. Regions of the model describe processed data, e.g. mutations, expression or bindings; they have a schema, with 5 common attributes (id, chr, left, right, strand) including the id of the region and the region coordinates along the aligned reference genome, and then arbitrary typed attributes. This provides interoperability across a plethora of genomic data formats
  • Metadata, storing all the knowledge about the particular sample, are arbitrary attribute-value pairs, independent from any standardization attempt; they trace the data provenance, including biological and clinical aspects

This is exemplified by the figure below, showing on the left the regions and on the right the metadata of a dataset sample.

(Figure: regions (left) and metadata (right) of a GDM dataset sample)

Genometric Query Language

GMQL is a declarative language for genomic region and metadata manipulation with a SQL-inspired syntax. With GMQL the user can perform complex queries on the basis of positional, categorical and numeric features of the datasets.

You can find more information about the language in the GMQL documentation.

NB: In order to use PyGMQL one should at least be familiar with the semantics of the GMQL operators; however, the library is designed to be self-contained and can be used without a strong background knowledge of the language.

GMQL engine

The GMQL engine is composed of various sub-systems:
  • A repository, which enables the user to store his/her datasets and the results of the queries, and to access the public datasets shared between the users of the same GMQL instance
  • An engine implementation, which implements the GMQL operators. Currently the Spark engine is the most up-to-date and complete implementation, and it is also the one used by PyGMQL
(Figure: the components of the GMQL system)

GMQL WEB interface

The GMQL system is publicly available at http://gmql.eu/gmql-rest/.

The GMQLDataset

Here we present the functions that can be used on a GMQLDataset.

class GMQLDataset(parser=None, index=None, location='local', path_or_name=None, local_sources=None, remote_sources=None, meta_profile=None)[source]

The main abstraction of the library. A GMQLDataset represents a genomic dataset in the GMQL standard and it is divided into region data and metadata. The functions that can be applied to a GMQLDataset affect one of these two parts, or both.

For each operator function that can be applied to a GMQLDataset we provide the documentation and some examples, and we specify which GMQL operator the function is a wrapper of.

get_reg_attributes()[source]

Returns the region fields of the dataset

Returns:a list of field names
MetaField(name, t=None)[source]

Creates an instance of a metadata field of the dataset. It can be used in building expressions or conditions for projection or selection. Notice that this function is equivalent to calling:

dataset["name"]

If the MetaField is used in a region projection (reg_project()), the user also has to specify the type of the metadata attribute that is selected:

dataset.reg_project(new_field_dict={'new_field': dataset['name', 'string']})
Parameters:
  • name – the name of the metadata that is considered
  • t – the type of the metadata attribute {string, int, boolean, double}
Returns:

a MetaField instance

RegField(name)[source]

Creates an instance of a region field of the dataset. It can be used in building expressions or conditions for projection or selection. Notice that this function is equivalent to:

dataset.name
Parameters:name – the name of the region field that is considered
Returns:a RegField instance
select(meta_predicate=None, region_predicate=None, semiJoinDataset=None, semiJoinMeta=None)[source]

Wrapper of SELECT

Selection operation. It enables filtering datasets on the basis of region features or metadata attributes. In addition, it is possible to perform a selection based on the existence of certain metadata semiJoinMeta attributes and the matching of their values with those associated with at least one sample in an external dataset semiJoinDataset.

Therefore, the selection can be based on:

  • Metadata predicates: selection based on the existence and values of certain metadata attributes in each sample. In predicates, attribute-value conditions can be composed using logical predicates & (and), | (or) and ~ (not)
  • Region predicates: selection based on region attributes. Conditions can be composed using logical predicates & (and), | (or) and ~ (not)
  • SemiJoin clauses: selection based on the existence of certain metadata semiJoinMeta attributes and the matching of their values with those associated with at least one sample in an external dataset semiJoinDataset

In the following example we select all the samples from Example_Dataset_1 regarding the antibody CTCF. From these samples we keep only the regions on chromosome 6. Finally, we keep only the samples which have a matching antibody_targetClass in Example_Dataset_2:

import gmql as gl
d1 = gl.get_example_dataset("Example_Dataset_1")
d2 = gl.get_example_dataset("Example_Dataset_2")

d_select = d1.select(meta_predicate=d1['antibody'] == "CTCF",
                     region_predicate=d1.chr == "chr6",
                     semiJoinDataset=d2, semiJoinMeta=["antibody_targetClass"])
Parameters:
  • meta_predicate – logical predicate on the metadata <attribute, value> pairs
  • region_predicate – logical predicate on the region feature values
  • semiJoinDataset – another GMQLDataset
  • semiJoinMeta – a list of metadata attributes (strings)
Returns:

a new GMQLDataset

meta_select(predicate=None, semiJoinDataset=None, semiJoinMeta=None)[source]

Wrapper of SELECT

Wrapper of the select() function filtering samples only based on metadata.

Parameters:
  • predicate – logical predicate on the values of the rows
  • semiJoinDataset – another GMQLDataset
  • semiJoinMeta – a list of metadata
Returns:

a new GMQLDataset

Example 1:

output_dataset = dataset.meta_select(dataset['patient_age'] < 70)
# This statement can be written also as
output_dataset = dataset[ dataset['patient_age'] < 70 ]

Example 2:

output_dataset = dataset.meta_select((dataset['tissue_status'] == 'tumoral') &
                                     (dataset['tumor_tag'] != 'gbm') | (dataset['tumor_tag'] == 'brca'))
# This statement can be written also as
output_dataset = dataset[(dataset['tissue_status'] == 'tumoral') &
                         (dataset['tumor_tag'] != 'gbm') | (dataset['tumor_tag'] == 'brca')]

Example 3:

JUN_POLR2A_TF = HG19_ENCODE_NARROW.meta_select(HG19_ENCODE_NARROW['antibody_target'] == 'JUN',
                                               semiJoinDataset=POLR2A_TF, semiJoinMeta=['cell'])

The meta selection predicate can use all the classical equalities and inequalities {>, <, >=, <=, ==, !=} and predicates can be connected by the classical logical symbols {& (AND), | (OR), ~ (NOT)} plus the isin function.

reg_select(predicate=None, semiJoinDataset=None, semiJoinMeta=None)[source]

Wrapper of SELECT

Wrapper of the select() function filtering regions only based on region attributes.

Parameters:
  • predicate – logical predicate on the values of the regions
  • semiJoinDataset – another GMQLDataset
  • semiJoinMeta – a list of metadata
Returns:

a new GMQLDataset

An example of usage:

new_dataset = dataset.reg_select((dataset.chr == 'chr1') | (dataset.pValue < 0.9))

You can also use Metadata attributes in selection:

new_dataset = dataset.reg_select(dataset.score > dataset['size'])

This statement selects all the regions whose field score is strictly higher than the sample metadata attribute size.

The region selection predicate can use all the classical equalities and inequalities {>, <, >=, <=, ==, !=} and predicates can be connected by the classical logical symbols {& (AND), | (OR), ~ (NOT)} plus the isin function.

In order to be sure about the correctness of the expression, please use parentheses to delimit the various predicates.

project(projected_meta=None, new_attr_dict=None, all_but_meta=None, projected_regs=None, new_field_dict=None, all_but_regs=None)[source]

Wrapper of PROJECT

The PROJECT operator creates, from an existing dataset, a new dataset with all the samples (with their regions and region values) of the input one, but keeping for each sample only those metadata and/or region attributes expressed in the operator parameter list. Region coordinates and values of the remaining metadata and region attributes remain equal to those in the input dataset. Differently from the SELECT operator, PROJECT allows one to:

  • Remove existing metadata and/or region attributes from a dataset;
  • Create new metadata and/or region attributes to be added to the result.
Parameters:
  • projected_meta – list of metadata attributes to project on
  • new_attr_dict – an optional dictionary of the form {‘new_meta_1’: function1, ‘new_meta_2’: function2, …} in which every function computes the new metadata attribute based on the values of the others
  • all_but_meta – list of metadata attributes that must be excluded from the projection
  • projected_regs – list of the region fields to select
  • new_field_dict – an optional dictionary of the form {‘new_field_1’: function1, ‘new_field_2’: function2, …} in which every function computes the new region field based on the values of the others
  • all_but_regs – list of region fields that must be excluded from the projection
Returns:

a new GMQLDataset
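A sketch of a combined usage, assuming the metadata attributes and region fields of Example_Dataset_1 used elsewhere in this documentation:

import gmql as gl

dataset = gl.get_example_dataset("Example_Dataset_1")
# keep only two metadata attributes and two region fields,
# and derive a new attribute and a new field from the existing ones
projected = dataset.project(projected_meta=['antibody', 'ID'],
                            new_attr_dict={'ID_plus_100': dataset['ID'] + 100},
                            projected_regs=['pvalue', 'peak'],
                            new_field_dict={'half_pvalue': dataset.pvalue / 2})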

meta_project(attr_list=None, all_but=None, new_attr_dict=None)[source]

Wrapper of PROJECT

Project the metadata based on a list of attribute names

Parameters:
  • attr_list – list of the metadata fields to select
  • all_but – list of metadata that must be excluded from the projection.
  • new_attr_dict – an optional dictionary of the form {‘new_field_1’: function1, ‘new_field_2’: function2, …} in which every function computes the new field based on the values of the others
Returns:

a new GMQLDataset

Notice that if attr_list is specified, all_but cannot be specified, and vice versa.

Examples:

new_dataset = dataset.meta_project(attr_list=['antibody', 'ID'],
                                   new_attr_dict={'new_meta': dataset['ID'] + 100})
reg_project(field_list=None, all_but=None, new_field_dict=None)[source]

Wrapper of PROJECT

Project the region data based on a list of field names

Parameters:
  • field_list – list of the fields to select
  • all_but – keep only the region fields different from the ones specified
  • new_field_dict – an optional dictionary of the form {‘new_field_1’: function1, ‘new_field_2’: function2, …} in which every function computes the new field based on the values of the others
Returns:

a new GMQLDataset

An example of usage:

new_dataset = dataset.reg_project(['pValue', 'name'],
                                  {'new_field': dataset.pValue / 2})

new_dataset = dataset.reg_project(field_list=['peak', 'pvalue'],
                                  new_field_dict={'new_field': dataset.pvalue * dataset['cell_age', 'float']})

Notice that you can use metadata attributes to build new region fields. The only thing to remember when doing so is to also declare the type of the metadata attribute used in the definition of the new region field (for example, dataset['cell_age', 'float'] is required for defining the new field new_field as float). In particular, the following type names are accepted: ‘string’, ‘char’, ‘long’, ‘integer’, ‘boolean’, ‘float’, ‘double’.

extend(new_attr_dict)[source]

Wrapper of EXTEND

For each sample in an input dataset, the EXTEND operator builds new metadata attributes, assigns their values as the result of arithmetic and/or aggregate functions calculated on sample region attributes, and adds them to the existing metadata attribute-value pairs of the sample. The number of samples and their genomic regions, with their attributes and values, remain unchanged in the output dataset.

Parameters:new_attr_dict – a dictionary of the type {‘new_metadata’ : AGGREGATE_FUNCTION(‘field’), …}
Returns:new GMQLDataset

An example of usage, in which we count the number of regions in each sample and compute the minimum value of the pValue field, exporting them respectively as the metadata attributes regionCount and minPValue:

import gmql as gl

dataset = gl.get_example_dataset("Example_Dataset_1")
new_dataset = dataset.extend({'regionCount' : gl.COUNT(),
                              'minPValue' : gl.MIN('pValue')})
cover(minAcc, maxAcc, groupBy=None, new_reg_fields=None, cover_type='normal')[source]

Wrapper of COVER

COVER is a GMQL operator that takes as input a dataset (of usually, but not necessarily, multiple samples) and returns another dataset (with a single sample, if no groupBy option is specified) by “collapsing” the input samples and their regions according to certain rules specified by the COVER parameters. The attributes of the output regions are only the region coordinates, plus, when aggregate functions are specified, new attributes with aggregate values over attribute values of the contributing input regions; output metadata are the union of the input ones, plus the metadata attributes JaccardIntersect and JaccardResult, representing global Jaccard Indexes for the considered dataset, computed as the corresponding region Jaccard Indexes but on the whole sample regions.

Parameters:
  • cover_type – the kind of cover variant you want [‘normal’, ‘flat’, ‘summit’, ‘histogram’]
  • minAcc – minimum accumulation value, i.e. the minimum number of overlapping regions to be considered during COVER execution. It can be any positive number or the strings {‘ALL’, ‘ANY’}.
  • maxAcc – maximum accumulation value, i.e. the maximum number of overlapping regions to be considered during COVER execution. It can be any positive number or the strings {‘ALL’, ‘ANY’}.
  • groupBy – optional list of metadata attributes
  • new_reg_fields – dictionary of the type {‘new_region_attribute’ : AGGREGATE_FUNCTION(‘field’), …}
Returns:

a new GMQLDataset

An example of usage:

cell_tf = narrow_peak.cover(minAcc=1, maxAcc="ANY",
                            groupBy=['cell', 'antibody_target'],
                            cover_type="normal")
normal_cover(minAcc, maxAcc, groupBy=None, new_reg_fields=None)[source]

Wrapper of COVER

The normal cover operation as described in cover(). Equivalent to calling:

dataset.cover(..., cover_type="normal")
flat_cover(minAcc, maxAcc, groupBy=None, new_reg_fields=None)[source]

Wrapper of COVER

Variant of the function cover() that returns the union of all the regions which contribute to the COVER. More precisely, it returns the contiguous regions that start from the first end and stop at the last end of the regions which would contribute to each region of a COVER.

Equivalent to calling:

cover(..., cover_type="flat")
summit_cover(minAcc, maxAcc, groupBy=None, new_reg_fields=None)[source]

Wrapper of COVER

Variant of the function cover() that returns only those portions of the COVER result where the maximum number of regions overlap (this is done by returning only regions that start from a position after which the number of overlaps does not increase, and stop at a position where either the number of overlapping regions decreases or violates the maximum accumulation index).

Equivalent to calling:

cover(..., cover_type="summit")
histogram_cover(minAcc, maxAcc, groupBy=None, new_reg_fields=None)[source]

Wrapper of COVER

Variant of the function cover() that returns all regions contributing to the COVER divided in different (contiguous) parts according to their accumulation index value (one part for each different accumulation value), which is assigned to the AccIndex region attribute.

Equivalent to calling:

cover(..., cover_type="histogram")
join(experiment, genometric_predicate, output='LEFT', joinBy=None, refName='REF', expName='EXP', left_on=None, right_on=None)[source]

Wrapper of JOIN

The JOIN operator takes as input two datasets, respectively known as anchor (the first/left one) and experiment (the second/right one), and returns a dataset of samples consisting of regions extracted from the operands according to the specified condition (known as genometric predicate). The number of generated output samples is the Cartesian product of the number of samples in the anchor and in the experiment dataset (if no joinBy clause is specified). The attributes (and their values) of the regions in the output dataset are the union of the region attributes (with their values) of the input datasets; homonymous attributes are disambiguated by prefixing their name with their dataset name. The output metadata are the union of the input metadata, with their attribute names prefixed with their input dataset name.

Parameters:
  • experiment – an other GMQLDataset
  • genometric_predicate – a list of Genometric atomic conditions. For an explanation of each of them go to the respective page.
  • output

    one of four different values that declare which region is given in output for each input pair of anchor and experiment regions satisfying the genometric predicate:

    • ’LEFT’: outputs the anchor regions from the anchor dataset that satisfy the genometric predicate
    • ’RIGHT’: outputs the experiment regions from the experiment dataset that satisfy the genometric predicate
    • ’INT’: outputs the overlapping part (intersection) of the anchor and experiment regions that satisfy the genometric predicate; if the intersection is empty, no output is produced
    • ’CONTIG’: outputs the concatenation between the anchor and experiment regions that satisfy the genometric predicate, i.e. the output region is defined as having left (right) coordinates equal to the minimum (maximum) of the corresponding coordinate values in the anchor and experiment regions satisfying the genometric predicate
  • joinBy – list of metadata attributes
  • refName – name that you want to assign to the reference dataset
  • expName – name that you want to assign to the experiment dataset
  • left_on – list of region fields of the reference on which the join must be performed
  • right_on – list of region fields of the experiment on which the join must be performed
Returns:

a new GMQLDataset

An example of usage, in which we perform the join operation between Example_Dataset_1 and Example_Dataset_2, specifying that we want to join each region of the former with the first region of the latter at a minimum distance of 120kb, and finally we want to output the regions of Example_Dataset_2 matching the criteria:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
d2 = gl.get_example_dataset("Example_Dataset_2")

result_dataset = d1.join(experiment=d2,
                         genometric_predicate=[gl.MD(1), gl.DGE(120000)],
                         output="right")
map(experiment, new_reg_fields=None, joinBy=None, refName='REF', expName='EXP')[source]

Wrapper of MAP

MAP is a non-symmetric operator over two datasets, respectively called reference and experiment. The operation computes, for each sample in the experiment dataset, aggregates over the values of the experiment regions that intersect with a region in a reference sample, for each region of each sample in the reference dataset; we say that experiment regions are mapped to the reference regions. The number of generated output samples is the Cartesian product of the samples in the two input datasets; each output sample has the same regions as the related input reference sample, with their attributes and values, plus the attributes computed as aggregates over experiment region values. Output sample metadata are the union of the related input sample metadata, whose attribute names are prefixed with their input dataset name. For each reference sample, the MAP operation produces a matrix like structure, called genomic space, where each experiment sample is associated with a row, each reference region with a column, and each matrix row is a vector of numbers - the aggregates computed during MAP execution. When the features of the reference regions are unknown, the MAP helps in extracting the most interesting regions out of many candidates.

Parameters:
  • experiment – a GMQLDataset
  • new_reg_fields – an optional dictionary of the form {‘new_field_1’: AGGREGATE_FUNCTION(field), …}
  • joinBy – optional list of metadata
  • refName – name that you want to assign to the reference dataset
  • expName – name that you want to assign to the experiment dataset
Returns:

a new GMQLDataset

In the following example, we map the regions of Example_Dataset_2 onto the ones of Example_Dataset_1 and, for each region of Example_Dataset_1, we output the average pvalue and the number of mapped regions of Example_Dataset_2. In addition we specify that the output region fields and metadata attributes will have the D1 and D2 prefixes, respectively for attributes and fields belonging to Example_Dataset_1 and Example_Dataset_2:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
d2 = gl.get_example_dataset("Example_Dataset_2")

result = d1.map(experiment=d2, refName="D1", expName="D2",
                new_reg_fields={"avg_pValue": gl.AVG("pvalue")})
order(meta=None, meta_ascending=None, meta_top=None, meta_k=None, regs=None, regs_ascending=None, region_top=None, region_k=None)[source]

Wrapper of ORDER

The ORDER operator is used to order either samples, sample regions, or both, in a dataset according to a set of metadata and/or region attributes, and/or region coordinates. The number of samples and their regions in the output dataset is as in the input dataset, as well as their metadata and region attributes and values, but a new ordering metadata and/or region attribute is added with the sample or region ordering value, respectively.

Parameters:
  • meta – list of metadata attributes
  • meta_ascending – list of boolean values (True = ascending, False = descending)
  • meta_top – “top”, “topq” or “topp” or None
  • meta_k – a number specifying how many results to be retained
  • regs – list of region attributes
  • regs_ascending – list of boolean values (True = ascending, False = descending)
  • region_top – “top”, “topq” or “topp” or None
  • region_k – a number specifying how many results to be retained
Returns:

a new GMQLDataset

Example of usage. We order the metadata of Example_Dataset_1 by ascending antibody and descending antibody_targetClass, keeping only the first sample. We also order the resulting regions based on the score field in descending order, keeping only the first one in this case as well:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")

result = d1.order(meta=["antibody", "antibody_targetClass"],
                  meta_ascending=[True, False], meta_top="top", meta_k=1,
                  regs=['score'], regs_ascending=[False],
                  region_top="top", region_k=1)
difference(other, joinBy=None, exact=False)[source]

Wrapper of DIFFERENCE

DIFFERENCE is a binary, non-symmetric operator that produces one sample in the result for each sample of the first operand, by keeping the same metadata of the first operand sample and only those regions (with their schema and values) of the first operand sample which do not intersect with any region in the second operand sample (also known as negative regions)

Parameters:
  • other – GMQLDataset
  • joinBy – (optional) list of metadata attributes. It is used to extract subsets of samples on which to apply the operator: only those samples in the current and other dataset that have the same value for each specified attribute are considered when performing the operation
  • exact – boolean. If true, the regions are considered as intersecting only if their coordinates are exactly the same
Returns:

a new GMQLDataset

Example of usage. We compute the exact difference between Example_Dataset_1 and Example_Dataset_2, considering only the samples with same antibody:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
d2 = gl.get_example_dataset("Example_Dataset_2")

result = d1.difference(other=d2, exact=True, joinBy=['antibody'])
union(other, left_name='LEFT', right_name='RIGHT')[source]

Wrapper of UNION

The UNION operation is used to integrate homogeneous or heterogeneous samples of two datasets within a single dataset; for each sample of either one of the input datasets, a sample is created in the result as follows:

  • its metadata are the same as in the original sample;
  • its schema is the schema of the first (left) input dataset; new identifiers are assigned to each output sample;
  • its regions are the same (in coordinates and attribute values) as in the original sample. Region attributes which are missing in an input dataset sample (w.r.t. the merged schema) are set to null.
Parameters:
  • other – a GMQLDataset
  • left_name – name that you want to assign to the left dataset
  • right_name – name that you want to assign to the right dataset
Returns:

a new GMQLDataset

Example of usage:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
d2 = gl.get_example_dataset("Example_Dataset_2")

result = d1.union(other=d2, left_name="D1", right_name="D2")
merge(groupBy=None)[source]

Wrapper of MERGE

The MERGE operator builds a new dataset consisting of a single sample having

  • as regions all the regions of all the input samples, with the same attributes and values
  • as metadata the union of all the metadata attribute-values of the input samples.

A groupBy clause can be specified on metadata: the samples are then partitioned in groups, each with a distinct value of the grouping metadata attributes, and the MERGE operation is applied to each group separately, yielding one sample in the result dataset for each group. Samples without the grouping metadata attributes are disregarded.

Parameters:groupBy – list of metadata attributes
Returns:a new GMQLDataset

Example of usage:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
result = d1.merge(['antibody'])
group(meta=None, meta_aggregates=None, regs=None, regs_aggregates=None, meta_group_name='_group')[source]

Wrapper of GROUP

The GROUP operator is used for grouping both regions and/or metadata of input dataset samples according to distinct values of certain attributes (known as grouping attributes); new grouping attributes are added to samples in the output dataset, storing the results of aggregate function evaluations over metadata and/or regions in each group of samples. Samples having missing values for any of the grouping attributes are discarded.

Parameters:
  • meta – (optional) a list of metadata attributes
  • meta_aggregates – (optional) {‘new_attr’: fun}
  • regs – (optional) a list of region fields
  • regs_aggregates – {‘new_attr’: fun}
  • meta_group_name – (optional) the name to give to the group attribute in the metadata
Returns:

a new GMQLDataset

Example of usage. We group samples by antibody and we aggregate the region pvalues taking the maximum value, calling the new region field maxPvalue:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
result = d1.group(meta=['antibody'], regs_aggregates={'maxPvalue': gl.MAX("pvalue")})
meta_group(meta, meta_aggregates=None)[source]

Wrapper of GROUP

Group operation only for metadata. For further information check group()

regs_group(regs, regs_aggregates=None)[source]

Wrapper of GROUP

Group operation only for region data. For further information check group()
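A minimal sketch of the two variants, using the same attribute and field names as in the group() example above:

import gmql as gl

d1 = gl.get_example_dataset("Example_Dataset_1")
# group samples using only the 'antibody' metadata attribute
meta_grouped = d1.meta_group(meta=['antibody'])
# group regions by the 'score' field and aggregate their pvalues
regs_grouped = d1.regs_group(regs=['score'],
                             regs_aggregates={'maxPvalue': gl.MAX("pvalue")})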

materialize(output_path=None, output_name=None, all_load=True, mode=None)[source]

Wrapper of MATERIALIZE

Starts the execution of the operations for the GMQLDataset. PyGMQL implements lazy execution and no operation is performed until the materialization of the results is requested. This operation can happen either locally or remotely.

  • Local mode: if the GMQLDataset is local (based on local data), the user can specify the output_path where to save the results; if no path is specified, the results are only loaded in memory as a GDataframe.
Parameters:
  • output_path – (Optional) If specified, the local path where the results of the computation will be saved.
  • output_name – (Optional) Can be used only if the dataset is remote. It represents the name that the user wants to give to the resulting dataset on the server
  • all_load – (Optional) It specifies whether the result dataset should be directly converted to a GDataframe (True) or kept as a GMQLDataset (False) for future local queries.
Returns:

A GDataframe or a GMQLDataset

head(n=5)[source]

Returns a small set of regions and metadata from a query. It is supposed to be used for debugging purposes or for data exploration.

Parameters:n – how many samples to retrieve
Returns:a GDataframe
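For example, a quick preview of one of the example datasets (a sketch; head() triggers a small execution and returns a GDataframe):

import gmql as gl

dataset = gl.get_example_dataset("Example_Dataset_1")
preview = dataset.head(n=3)
preview.regs    # Pandas dataframe with the retrieved regions
preview.meta    # Pandas dataframe with the corresponding metadata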

Loading functions

You can create a GMQLDataset by loading the data using the following functions:

load_from_file(path, parser: gmql.dataset.parsers.RegionParser.RegionParser)[source]

Loads a GDM dataset from a single BED-like file.

Parameters:
  • path – location of the file
  • parser – RegionParser object specifying the parser of the file
Returns:

a GMQLDataset
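For example, one could load a single tab-separated BED-like file by building a RegionParser that matches its columns (the file path and the column layout below are purely illustrative):

import gmql as gl

bed_parser = gl.parsers.RegionParser(chrPos=0, startPos=1, stopPos=2,
                                     otherPos=[(3, "score", "double")])
dataset = gl.load_from_file("path/to/regions.bed", parser=bed_parser)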

load_from_path(local_path, parser=None)[source]

Loads the data from a local path into a GMQLDataset. The loading of the files is “lazy”, which means that the files are loaded only when the user performs a materialization (see materialize()). The user can force the loading of the data (maybe for an initial exploration of only the metadata) by setting reg_load (load the region data in memory), meta_load (load the metadata in memory) or all_load (load both region and meta data in memory). If all_load is set to True, a GDataframe is returned, otherwise a GMQLDataset is returned.

Parameters:
  • local_path – local path of the dataset
  • parser – the parser to be used for reading the data
  • all_load – if set to True, both region and meta data are loaded in memory and an instance of GDataframe is returned
Returns:

A new GMQLDataset or a GDataframe

load_from_remote(remote_name, owner=None)[source]

Loads the data from a remote repository.

Parameters:
  • remote_name – The name of the dataset in the remote repository
  • owner – (optional) The owner of the dataset. If nothing is provided, the current user is used. For public datasets use ‘public’.
Returns:

A new GMQLDataset or a GDataframe
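For example, to load a public dataset from the remote service used elsewhere in this documentation (this assumes that the remote address has been set and that the login has been performed, see the Remote data management section):

import gmql as gl

gl.set_remote_address("http://gmql.eu/gmql-rest/")
gl.login()
annotations = gl.load_from_remote("HG19_ANNOTATION_REFSEQ", owner="public")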

Building expressions

When doing a selection (using meta_select(), reg_select()) or a projection (using meta_project(), reg_project()) you are required to specify an expression on the metadata or region fields.

An expression can therefore use metadata attributes or region fields. Given a GMQLDataset dataset, one can access its region fields by typing:

dataset.field1
dataset.field2
dataset.chr
dataset.start
...

and one can access its metadata attributes by typing:

dataset['metadata_attribute_1']
dataset['metadata_attribute_2']
dataset['metadata_attribute_3']
...

The expressions in PyGMQL can be of two types:

  • Predicate: a logical condition that enables to select a portion of the dataset. This expression is used in selection. Some example of predicates follow:

    # region predicate
    (dataset.chr == 'chr1') | (dataset.pValue < 0.9)
    # region predicate with access to metadata attributes
    dataset.score > dataset['size']
    

    It is possible, depending on the function that requires a predicate, to mix region fields and metadata attributes in a region condition. Of course it is not possible to mix metadata and region conditions in a metadata selection (this is because each metadata attribute can be associated with multiple values for each region).

  • Extension: a mathematical expression describing how to build new metadata or region fields based on the existing ones. Some examples of expressions follow:

    # region expression
    dataset.start + dataset.stop
    dataset.p_value / dataset.q_value
    # metadata expression
    dataset['size'] * 8.9
    dataset['score'] / dataset['size']
    

    It is possible to mix region fields and metadata attributes in region extensions:

    # region expression using metadata attributes
    (dataset.pvalue / 2) + dataset['metadata'] + 1
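As a concrete sketch combining the two kinds of expressions above (the field and attribute names are illustrative and must exist in the dataset):

# predicate used in a region selection
selected = dataset.reg_select((dataset.chr == 'chr1') | (dataset.pValue < 0.9))

# extension used in a region projection: the new field mixes an existing
# region field with a metadata attribute, whose type must be declared
projected = dataset.reg_project(new_field_dict={
    'norm_pvalue': dataset.pValue / dataset['size', 'float']})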
    

GDataframe

class GDataframe(regs=None, meta=None)[source]

Class holding the result of a materialization of a GMQLDataset. It is composed of two data structures:

  • A table with the region data
  • A table with the metadata corresponding to the regions
to_dataset_files(local_path=None, remote_path=None)[source]

Save the GDataframe to a local or remote location

Parameters:
  • local_path – a local path to the folder in which the data must be saved
  • remote_path – the name of a remote dataset to be used for these data
Returns:

None

to_GMQLDataset(local_path=None, remote_path=None)[source]

Converts the GDataframe in a GMQLDataset for later local or remote computation

Returns:a GMQLDataset
project_meta(attributes)[source]

Projects the specified metadata attributes to new region fields

Parameters:attributes – a list of metadata attributes
Returns:a new GDataframe with additional region fields
to_matrix(index_regs=None, index_meta=None, columns_regs=None, columns_meta=None, values_regs=None, values_meta=None, **kwargs)[source]

Transforms the GDataframe to a pivot matrix having as index and columns the ones specified. This function is a wrapper around the pivot_table function of Pandas.

Parameters:
  • index_regs – list of region fields to use as index
  • index_meta – list of metadata attributes to use as index
  • columns_regs – list of region fields to use as columns
  • columns_meta – list of metadata attributes to use as columns
  • values_regs – list of region fields to use as values
  • values_meta – list of metadata attributes to use as values
  • kwargs – other parameters to pass to the pivot_table function
Returns:

a Pandas dataframe having as index the union of index_regs and index_meta, as columns the union of columns_regs and columns_meta and as values the union of values_regs and values_meta
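A sketch of usage, assuming result is a GDataframe obtained from a materialization and that the pvalue region field and the antibody_target metadata attribute are present; the extra keyword arguments are forwarded to the Pandas pivot_table function:

matrix = result.to_matrix(index_meta=['antibody_target'],
                          columns_regs=['chr'],
                          values_regs=['pvalue'],
                          aggfunc='mean', fill_value=0)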

Remote data management

PyGMQL can be used in two different ways. The first one (and the most intuitive and classical one) is to use it like any other computational library.

PyGMQL can also manage the execution through a remote server (or cluster). In order to use this feature, the user needs to log in to the remote service first.

The web service offered by the GeCo group at Politecnico di Milano can be found at http://gmql.eu/gmql-rest/

Logging in

This can be done by first specifying the remote server address:

import gmql as gl
gl.set_remote_address("http://gmql.eu/gmql-rest/")

and then by logging in into the system:

gl.login()

From this point on the user will be logged into the remote system.

Guest users VS Authenticated users

GMQL and PyGMQL enable two different ways to interact with the remote service. The users can be logged as:

  • Guest user: the user doesn’t need to register to the service, and only limited storage and computational power are available for the queries. The access token (which is automatically handled by the library) will expire after a certain period of inactivity.
  • Authenticated user: the user needs to register on the web interface beforehand, providing username, password and other information. The access token is stored and can be used for an unlimited amount of time.

By default, the sequence of operations shown above will log the user in as a guest.

In both cases a folder named .pygmql will be created in the home directory of the user; inside it, a sessions.xml file will store all the active sessions of the user.

Logging as an authenticated user

Once you are registered on the web service with a username and password, in order to use the same credentials also in PyGMQL you have to use the pygmql_login tool. This tool is automatically installed together with the library (both when installing from GitHub and via pip).

On linux/MacOS:

pygmql.sh --login

On Windows:

pygmql_win --login

Once the tool is executed, the following information will be asked:

  • The http address of the remote service you want to access
  • Username
  • Password

Library modes

The library mode can be set in the following way:

gl.set_mode("remote") # remote processing of the following operations
gl.set_mode("local")  # local processing of the following operations

Notice that set_mode() will act only on the subsequent materialize() operations, while the previous ones will be performed with the previously set mode.

The default mode of PyGMQL is local.

The remote execution mode

When the user sets the remote mode and calls the materialize() operation, the following actions will be performed:

  1. The local datasets that are used are uploaded to the remote service. Nothing is done to the remote datasets used in the query (if present) since they are already on the server.
  2. A compressed representation of the query is sent to the remote service, decoded and executed
  3. Once the execution is complete, the results are downloaded, stored and loaded into a GDataframe.
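A minimal sketch of a remote query following the steps above (the dataset names are those of the public service used in the tutorials, while the local path is illustrative):

import gmql as gl

gl.set_remote_address("http://gmql.eu/gmql-rest/")
gl.login()
gl.set_mode("remote")

# remote dataset: it already lives on the server, nothing is uploaded for it
genes = gl.load_from_remote("HG19_ANNOTATION_REFSEQ", owner="public")
# local dataset: it will be uploaded to the service at materialization time
peaks = gl.load_from_path("path/to/local/dataset")

mapped = genes.map(experiment=peaks, new_reg_fields={'count': gl.COUNT()})
result = mapped.materialize()   # results are downloaded and loaded into a GDataframe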
(Figure: the steps of remote processing with PyGMQL)

Library settings

The following functions define the behavior of the library

Logging and progress bars

set_progress(how)[source]

Enables or disables the progress bars for the loading, writing and downloading of datasets

Parameters:how – True if you want the progress bar, False otherwise
Returns:None

Example:

import gmql as gl

gl.set_progress(True)   # enables progress bars
# ....do something...
gl.set_progress(False)  # removes progress bars
# ....do something...

Execution Mode

set_mode(how)[source]

Sets the behavior of the API

Parameters:how – if ‘remote’ all the execution is performed on the remote server; if ‘local’ all it is executed locally. Default = ‘local’
Returns:None

Master Configuration

set_master(master: str)[source]

Set the master of the PyGMQL instance. It accepts any master configuration available in Spark.

Parameters:master – master configuration
Returns:None
get_configuration()[source]

Returns the configurations of the current PyGMQL instance

Returns:a Configuration object
set_spark_configs(d)[source]

Set Spark configurations to be used during the spark-submit. Works only when the master is different from local.

Parameters:d – a dictionary of {key: values}
Returns:None
set_local_java_options(options: list)[source]

When the mode is set to local, this function can be used to add JVM specific options before starting the backend. It accepts any Java option.

Parameters:options – list of string, one for each Java option
Returns:None
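A sketch of how these functions can be combined before running a query; the master, memory and JVM values below are purely illustrative:

import gmql as gl

gl.set_master("local[4]")                               # any Spark master string is accepted
gl.set_spark_configs({"spark.executor.memory": "4g"})   # used only with a non-local master
gl.set_local_java_options(["-Xmx8g"])                   # JVM options for the local backend
conf = gl.get_configuration()                           # inspect the resulting configuration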

Remote Management

get_remote_manager()[source]

Returns the current remote manager

Returns:a RemoteManager
login()[source]

Enables the user to login to the remote GMQL service. If both username and password are None, the user will be connected as guest.

logout()[source]

The user can use this command to logout from the remote service

Returns:None
set_remote_address(address)[source]

Enables the user to set the address of the GMQL remote service

Parameters:address – a string representing the URL of GMQL remote service
Returns:None

Spark and system configurations

The configuration of the Java properties and the Spark environment can be done by getting the singleton instance of the configuration class as follows:

conf = gl.get_configuration()

A description of this object follows:

class Configuration[source]

Class containing all the information regarding the system environment and the Spark environment

set_app_name(name)[source]

Sets the name of the application in Spark. By default it is called “gmql_api”

Parameters:name – string
Returns:None
set_master(master)[source]

Sets the master of the Spark cluster. By default it is “local[*]”

Parameters:master – string
Returns:None
set_spark_conf(key=None, value=None, d=None)[source]

Sets a Spark property as a (‘key’, ‘value’) pair or using a dictionary {‘key’: ‘value’, …}

Parameters:
  • key – string
  • value – string
  • d – dictionary
Returns:

None

set_system_conf(key=None, value=None, d=None)[source]

Sets a Java system property as a (‘key’, ‘value’) pair or using a dictionary {‘key’: ‘value’, …}

Parameters:
  • key – string
  • value – string
  • d – dictionary
Returns:

None
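For example (the application name and the property values are illustrative):

conf = gl.get_configuration()
conf.set_app_name("my_gmql_analysis")
conf.set_master("local[*]")
conf.set_spark_conf(key="spark.driver.memory", value="8g")
conf.set_system_conf(d={"java.io.tmpdir": "/tmp/pygmql"})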

Tutorials

Tutorial 1: Simple example of local processing


Simple example of local processing

In this first tutorial we will show a complete example of usage of the library using some example datasets provided with it.

Importing the library


In [1]:
import gmql as gl
[PyGMQL] Updating backend to latest version
95M [00:06, 14.85M/s]                         

Loading datasets

PyGMQL can work with BED and GTF files with arbitrary fields and schemas. In order to load a dataset into Python the user can use the following functions:

  • load_from_path: lazily loads a dataset into a GMQLDataset variable from the local file system
  • load_from_remote: lazily loads a dataset into a GMQLDataset variable from a remote GMQL service
  • from_pandas: lazily loads a dataset into a GMQLDataset variable from a Pandas DataFrame having at least the chromosome, start and stop columns

In addition to these functions we also provide a function called get_example_dataset, which enables the user to load a sample dataset and play with it in order to get familiar with the library. Currently we provide two example datasets: Example_Dataset_1 and Example_Dataset_2.

In the following we will load two example datasets and play with them.

In [2]:
dataset1 = gl.get_example_dataset("Example_Dataset_1")
dataset2 = gl.get_example_dataset("Example_Dataset_2")

The GMQLDataset

The dataset variables defined above are GMQLDatasets, which represent GMQL variables on which it is possible to apply GMQL operators. It must be noticed that no data has been loaded in memory yet: the computation will only start when the query is triggered. We will see how to start the execution of a query in the following steps.

We can inspect the schema of the dataset with the following:

In [3]:
dataset1.schema
Out[3]:
['source',
 'feature',
 'score',
 'frame',
 'name',
 'signal',
 'pvalue',
 'qvalue',
 'peak',
 'chr',
 'start',
 'stop',
 'strand']
In [4]:
dataset2.schema
Out[4]:
['source',
 'feature',
 'score',
 'frame',
 'name',
 'signal',
 'pvalue',
 'qvalue',
 'peak',
 'chr',
 'start',
 'stop',
 'strand']

Filtering the dataset regions based on a predicate

The first operation we will do on dataset1 will be to select only the genomic regions on chromosome 3 with a start position greater than 30000.

In [5]:
filtered_dataset1 = dataset1.reg_select((dataset1.chr == 'chr3') & (dataset1.start >= 30000))

From this operation we can learn several things about the GMQLDataset data structure. Each GMQLDataset has a set of methods and fields which can be used to build GMQL queries. For example, in the previous statement we have:

  • the reg_select method, which enables us to filter the datasets on the basis of a predicate on the region positions and features
  • the chr and start fields, which enable the user to build predicates on the fields of the dataset.

Every GMQL operator has a relative method accessible from the GMQLDataset data structure, as well as any other field of the dataset.

Filtering a dataset based on a predicate on metadata

The Genomic Data Model enables us to work both with genomic regions and their relative metadata. Therefore we can filter dataset samples on the basis of predicates on metadata attributes. This can be done as follows:

In [6]:
filtered_dataset_2 = dataset2[dataset2['antibody_target'] == 'CTCF']

Notice that the notation for selecting the samples using metadata is the same as the one for filtering Pandas DataFrames.

Joining two datasets

It is not the focus of this tutorial to show all the possible operations which can be done on a GMQLDataset, they can be seen on the documentation page of the library.

For the sake of this example, let's show the JOIN operation between the two datasets. The JOIN operation semantics relies on the concepts of reference and experiment datasets. The reference dataset is the one 'calling' the join function, while the experiment dataset is the one passed as its argument. The semantics of the function is

resulting_dataset = <reference>.join(<experiment>, <genometric predicate>, ...)
In [7]:
dataset_join = dataset1.join(dataset2, [gl.DLE(0)])

To understand the concept of genometric predicate please visit the documentation of the library.

Materialization of the results

As we have already said, no operation has been effectively performed up to this point. What we did up to now is to define the sequence of operations to apply on the data. In order to trigger the execution we have to call the materialize function on the variable we want to compute.

In [8]:
query_result = dataset_join.materialize()
Collecting regions: 12it [00:03,  3.49it/s]
100%|██████████| 190/190 [00:02<00:00, 94.63it/s]

The GDataframe

The query_result variable holds the result of the previous GMQL query in the form of a GDataframe data structure. It holds the information about the regions and the metadata of the result, which can be respectively accessed through the regs and meta attributes.

Regions

In [10]:
query_result.regs.head()
Out[10]:
chr start stop strand REF.source REF.feature REF.score REF.frame REF.name REF.signal ... REF.peak EXP.source EXP.feature EXP.score EXP.frame EXP.name EXP.signal EXP.pvalue EXP.qvalue EXP.peak
id_sample
5758100849671408287 chr4 149 179 - GMQL Region 0.9 . . 15.0 ... -1.0 GMQL Region 240.0 . . 17.0000 11.1675 -1.0 -1.0
5758100849671408287 chr4 199 249 - GMQL Region 0.9 . . 15.0 ... -1.0 GMQL Region 240.0 . . 17.0000 11.1675 -1.0 -1.0
-7728122121608963789 chr2 99 150 * GMQL Region 10.0 . . 9.0 ... -1.0 GMQL Region 12.0 . . 20.0000 -1.0000 -1.0 -1.0
4670800706188226503 chr2 99 150 * GMQL Region 10.0 . . 9.0 ... -1.0 GMQL Region 5.0 . chr2.2538 0.0332 1.3000 -1.0 32.0
-7728122121608963789 chr2 199 240 * GMQL Region 9.0 . . 22.0 ... -1.0 GMQL Region 12.0 . . 20.0000 -1.0000 -1.0 -1.0

5 rows × 22 columns

Metadata

In [11]:
query_result.meta.head()
Out[11]:
EXP.ID EXP.antibody EXP.antibody_antibodyDescription EXP.antibody_lab EXP.antibody_label EXP.antibody_lots EXP.antibody_orderUrl EXP.antibody_tag EXP.antibody_target EXP.antibody_targetClass ... REF.treatment_label REF.treatment_tag REF.treatment_type REF.type REF.url REF.view REF.view_description REF.view_label REF.view_tag REF.view_type
id_sample
-8999039739831780787 [1891] [CTCF] [Rabbit polyclonal. Antibody Target: CTCF] [Myers, Hardison, Snyder] [CTCF (07-729)] [1350637 DAM1472197] [http://www.millipore.com/catalogue/item/07-729] [CTCF] [CTCF] [TFSS] ... [No treatment or prot] [NONE] [control] [narrowPeak] [http://hgdownload.cse.ucsc.edu/goldenPath/hg1... [Peaks] [Regions of enriched signal in experiment] [Peaks] [PKS] [view]
-8744938632702845394 [1460] [] [] [] [] [] [] [] [] [] ... [No treatment or prot] [NONE] [control] [narrowPeak] [http://hgdownload.cse.ucsc.edu/goldenPath/hg1... [Peaks] [Regions of enriched signal in experiment] [Peaks] [PKS] [view]
-8566469212180078812 [1892] [CTCF] [Rabbit polyclonal. Antibody Target: CTCF] [Myers, Hardison, Snyder] [CTCF (07-729)] [1350637 DAM1472197] [http://www.millipore.com/catalogue/item/07-729] [CTCF] [CTCF] [TFSS] ... [No treatment or prot] [NONE] [control] [narrowPeak] [http://hgdownload.cse.ucsc.edu/goldenPath/hg1... [Peaks] [Regions of enriched signal in experiment] [Peaks] [PKS] [view]
-7728122121608963789 [42] [] [] [] [] [] [] [] [] [] ... [No treatment or prot] [NONE] [control] [narrowPeak] [http://hgdownload.cse.ucsc.edu/goldenPath/hg1... [Peaks] [Regions of enriched signal in experiment] [Peaks] [PKS] [view]
-7571531587018517458 [1892] [CTCF] [Rabbit polyclonal. Antibody Target: CTCF] [Myers, Hardison, Snyder] [CTCF (07-729)] [1350637 DAM1472197] [http://www.millipore.com/catalogue/item/07-729] [CTCF] [CTCF] [TFSS] ... [No treatment or prot] [NONE] [control] [narrowPeak] [http://hgdownload.cse.ucsc.edu/goldenPath/hg1... [Peaks] [Regions of enriched signal in experiment] [Peaks] [PKS] [view]

5 rows × 189 columns

Tutorial 2: Mixing local and remote processing


Mixing local and remote processing

In this tutorial you will learn:

  1. How to connect to a remote GMQL instance and run remote queries
  2. How to interleave remote and local GMQL processing
  3. How to manipulate the GDataframe data structure to plot a heatmap
  4. How to perform a Principal Component Analysis using the GDataframe data structure
  5. How to transform a pure Pandas Dataframe back to a GMQLDataset to perform a new query

Connecting to a remote GMQL instance

PyGMQL can interact with a remote server hosting a GMQL instance and repository.

In order to connect to the remote instance the user has to:

  1. Set the IP address of the service
  2. Login to the server

It must be noticed that GMQL offers both authenticated and un-authenticated login: by default PyGMQL executes an un-authenticated login. This means that the remote instance will provide PyGMQL with limited resources and no permanent result storage. In order to log in to the GMQL server as an authenticated user, one has to:

  1. Register on the remote service by going to its IP address from the web browser
  2. Use the pygmql_login tool, which is installed together with the library

In the following, we are going to use the publicly accessible GMQL service offered by the Genomic Computing group at Politecnico di Milano (accessible at http://gmql.eu/gmql-rest/). Since we did not register to the service, when the login call is encountered the library will log in as a GUEST user with limited capabilities (in terms of computation and disk space).

In [1]:
import gmql as gl

gl.set_remote_address("http://gmql.eu/gmql-rest/")
gl.login()
[PyGMQL] Logging using stored authentication token

From this point on, all the remote calls (explained later) will be forwarded to the GMQL service hosted at the specified address as a guest user.

Remote execution

With PyGMQL the user can decide whether the next query (sequence of statements on GMQLDatasets ended with a materialization) will be executed locally or on the remote service. This can be done using the set_mode function. If the user sets the mode to remote, the following will happen:

  1. The local datasets that are used are uploaded to the remote service. Nothing is done to the remote datasets used in the query (if present) since they are already on the server.
  2. A compressed representation of the query is sent to the remote service, decoded and executed
  3. Once the execution is complete, the results are downloaded, stored and loaded into a GDataframe.

The following image describes the steps of remote processing with PyGMQL:
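
In code, the same flow can be summarized as follows. This is only a compact sketch built from calls shown in this tutorial; the dataset name is a placeholder:

import gmql as gl

gl.set_remote_address("http://gmql.eu/gmql-rest/")
gl.login()

gl.set_mode("remote")                                    # the next query runs on the server
remote_ds = gl.load_from_remote("DATASET_NAME", owner="public")   # placeholder dataset name
result = remote_ds.materialize("./result", all_load=False)        # results are downloaded locally

gl.set_mode("local")                                     # subsequent queries run on the local machine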

The example

We will showcase these features through a simple example.

We want to understand the enrichment of histone modifications on the promoter regions of RefSeq genes. In particular:

  • The RefSeq gene annotation dataset holds the information about gene symbols and coordinates and is stored on the remote server
  • The ChIP-Seq data about histone modifications come from ENCODE and are likewise stored on the remote system.

The result of the analysis will be a heatmap $genes \times histone\_mark$ displaying the average signal value of each HM for each of the selected genes.

Finding the promoter regions

We will use as definition of promoter the region situated at $\left[gene_{start} - 2000; gene_{start} + 2000\right]$.

We will perform this operation remotely.

In [2]:
gl.set_mode("remote")

We "load" the remote RefSeq dataset using the load_from_remote function. Since this dataset is available to all the users of the system, we specify public as the owner of the dataset. When loading our own private datasets stored on the service, we can omit the owner parameter.

In [3]:
genes = gl.load_from_remote("HG19_ANNOTATION_REFSEQ", owner="public")

Now we have to:

  1. Select only the samples of the HG19_ANNOTATION_REFSEQ dataset that refer to genes
  2. Select only the regions related to the genes (TP53, MYC, MAX, MAZ, PTEN)

This operation can be done using the select function on the genes GMQLDataset variable.

In [4]:
genes = genes.select(meta_predicate=genes["annotation_type"] == 'gene')

To extract the promoter regions we will use the reg_project (region projection) operation. In particular the following call will:

  1. Return a dataset with only the chr, start, stop, gene_symbol region fields
  2. For each gene, return the genomic interval $\left[gene_{start} - 2000; gene_{start} + 2000\right]$, taking the strand of the gene into account as well.
In [6]:
promoters = genes.reg_project(field_list=['gene_symbol'], 
                              new_field_dict={'start': genes.start - 2000, 
                                              'stop': genes.start + 2000})

Since we are interested in keeping the result of this part of the query for future usage, we will materialize the result locally.

In [7]:
promoters = promoters.materialize("./promoters", all_load=False)
RUNNING
100%|██████████| 1/1 [00:08<00:00,  8.94s/it]

Selecting the histone marks

Next we want to select the H3K27me3, H3K36me3, H3K4me1, H3K79me2, H3K9ac and H3K9me3 marks from the HG19_ENCODE_BROAD_NOV_2017 dataset, also in this case stored remotely. As before, we load the dataset using load_from_remote. The following operation is a meta selection, which means that the selection is performed on the metadata attributes of the dataset. The resulting dataset will have all the samples whose metadata satisfy the specified predicate.

In PyGMQL the metadata selection follows the same syntax as the Pandas Python data analysis library

selected_dataset = dataset[dataset['metadata_attribute'] == 'value']

Since we want to test for multiple histone marks we can use the isin() operator.

In [8]:
hms = gl.load_from_remote("HG19_ENCODE_BROAD_NOV_2017", owner="public")
hms = hms[hms['experiment_target'].isin(['H3K27me3-human', 'H3K36me3-human', 
                                         "H3K4me1-human", "H3K79me2-human", 
                                         "H3K9ac-human", "H3K9me3-human"])]

Grouping replicates

The dataset resulting from the previous selection will have several duplicates for each region, due to the several biological and technical replicates provided by ENCODE. Therefore we need a way to group together the overlapping regions and to aggregate their signal values.

This can be done using the normal_cover operation. In particular this function enables you to specify:

  1. The minimum number of overlaps required for a region to be kept
  2. The maximum number of overlaps allowed for a region to be kept
  3. Whether to group the resulting samples by some metadata attribute
  4. The aggregation functions to apply to the region fields of each group of overlapping regions

In our case we will keep all the regions (minimum overlapping = 1, maximum overlapping = "ANY") and group by the experiment_target metadata attribute. The signals of overlapping regions will be averaged.

In [9]:
hms = hms.normal_cover(1, "ANY", groupBy=['experiment_target'], 
                       new_reg_fields={'avg_signal': gl.AVG("signal")})

Also in this case we want to keep the resulting dataset for later processing.

In [10]:
hms = hms.materialize("./hms", all_load=False)
RUNNING
100%|██████████| 6/6 [01:21<00:00, 20.57s/it]

Mapping the histone mark signals to the promoter regions

Now that we have our set of gene promoters and our set of histone marks, we want to map the signal of the latter onto the regions of the former. We can do that using the map operation. In particular, since multiple HM regions can map onto the same promoter region, we aggregate the signal by averaging it.

We decide to perform the following operations on the local machine.

In [11]:
gl.set_mode("local")
In [12]:
mapping = promoters.map(hms, refName='prom', 
                        expName="hm", 
                        new_reg_fields={'avg_signal': gl.AVG("avg_signal")})

We finally materialize the result and load it in a GDataframe.

In [13]:
mapping = mapping.materialize("./mapping")
Collecting regions: 282180it [00:09, 29036.10it/s]
100%|██████████| 46/46 [00:00<00:00, 405.64it/s]
In [14]:
mapping.regs.head()
Out[14]:
chr start stop strand gene_symbol count_prom_hm avg_signal
id_sample
-4961556509145986263 chr15 35690830 35694830 + HNRNPA1P45 0.0 NaN
-4961556509145986263 chrX 79588817 79592817 - CHMP1B2P 0.0 NaN
-4961556509145986263 chr1 152626807 152630807 + LCEP3 0.0 NaN
-4961556509145986263 chr12 112017097 112021097 + LOC100101246 1.0 7.43431
-4961556509145986263 chr1 154472694 154476694 + TDRD10 0.0 NaN
In [15]:
mapping.meta.head()
Out[15]:
hm.antibody_accession hm.assay hm.assembly hm.audit_error hm.audit_internal_action hm.audit_not_compliant hm.audit_warning hm.biological_replicate_s_ hm.biosample_age hm.biosample_life_stage ... prom.data_format_description prom.download_link prom.full_name prom.genome_build prom.genome_build_accession prom.original_format prom.original_url prom.provider prom.release_date prom.release_version
id_sample
-8047918399829123478 [ENCAB000ANB] [ChIP-seq] [hg19] [] [missing derived_from, missing unfiltered alig... [control insufficient read depth, insufficient... [inconsistent platforms, low read length, mild... [1, 2] [31 year, 53 year, unknown] [adult, embryonic] ... [http://biomirror.aarnet.edu.au/biomirror/ncbi... [ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/a... [RefSeq Reference Genome Annotation] [GRCh37.p13] [GCF_000001405.25] [gff] [https://www.ncbi.nlm.nih.gov/projects/genome/... [RefSeq] [2013-06-28] [13]
-4961556509145986263 [ENCAB000ANH] [ChIP-seq] [hg19] [] [missing derived_from, missing unfiltered alig... [control insufficient read depth, insufficient... [inconsistent platforms, low read length] [] [53 year] [adult] ... [http://biomirror.aarnet.edu.au/biomirror/ncbi... [ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/a... [RefSeq Reference Genome Annotation] [GRCh37.p13] [GCF_000001405.25] [gff] [https://www.ncbi.nlm.nih.gov/projects/genome/... [RefSeq] [2013-06-28] [13]
-4790703078920733789 [ENCAB000ADW] [ChIP-seq] [hg19] [extremely low read depth] [biological replicates with identical biosampl... [control insufficient read depth, insufficient... [low read length, mild to moderate bottlenecking] [] [58 year, unknown] [adult, newborn, unknown] ... [http://biomirror.aarnet.edu.au/biomirror/ncbi... [ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/a... [RefSeq Reference Genome Annotation] [GRCh37.p13] [GCF_000001405.25] [gff] [https://www.ncbi.nlm.nih.gov/projects/genome/... [RefSeq] [2013-06-28] [13]
-4315937178640546765 [ENCAB000ADU] [ChIP-seq] [hg19] [control extremely low read depth, extremely l... [biological replicates with identical biosampl... [control insufficient read depth, insufficient... [inconsistent platforms, low read length, mild... [] [31 year, 53 year, unknown] [adult, newborn, unknown] ... [http://biomirror.aarnet.edu.au/biomirror/ncbi... [ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/a... [RefSeq Reference Genome Annotation] [GRCh37.p13] [GCF_000001405.25] [gff] [https://www.ncbi.nlm.nih.gov/projects/genome/... [RefSeq] [2013-06-28] [13]
-4215283275073074168 [ENCAB000ANX] [ChIP-seq] [hg19] [] [biological replicates with identical biosampl... [control insufficient read depth, insufficient... [inconsistent platforms, low read depth, low r... [] [53 year, unknown] [adult, embryonic] ... [http://biomirror.aarnet.edu.au/biomirror/ncbi... [ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/a... [RefSeq Reference Genome Annotation] [GRCh37.p13] [GCF_000001405.25] [gff] [https://www.ncbi.nlm.nih.gov/projects/genome/... [RefSeq] [2013-06-28] [13]

5 rows × 45 columns

Visualizing the HM signal heatmap on the gene promoters

Finally we visualize the heatmap (in the form of clusters) with the assistance of the Seaborn visualization package. In order to visualize the data we have to transform the GDataframe stored in the mapping variable into a matrix with $genes$ on the rows and $histone\_marks$ on the columns. This can be done with the to_matrix function offered by the GDataframe.

The to_matrix method is particularly versatile since it enables the user to project region and metadata values at the same time. For example, in our case the name of the histone modification is stored as a metadata attribute hm.experiment_target, while we want to use the avg_signal region field as the matrix value. The to_matrix method therefore enables us to combine region and metadata information in a single structure.

Under the hood this method relies on the Pandas pivot_table function and accepts all of its parameters in addition to its own.
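
To give an intuition of what to_matrix computes, the sketch below builds a comparable matrix directly with Pandas. It assumes the region and metadata layouts shown above (id_sample as index, metadata values stored as lists) and is only an approximation of the library's internal logic:

import pandas as pd

# region data: one row per mapped promoter region, with the sample id as a column
regs = mapping.regs.reset_index()

# metadata: the histone mark name of each sample (metadata values are stored as lists)
marks = mapping.meta['hm.experiment_target'].apply(
    lambda v: v[0] if isinstance(v, list) else v)

# join regions with the per-sample mark name, then pivot genes x histone marks;
# multiple samples per mark are averaged (pivot_table's default aggregation)
joined = regs.merge(marks.rename('hm.experiment_target'),
                    left_on='id_sample', right_index=True)
matrix = pd.pivot_table(joined, index='gene_symbol',
                        columns='hm.experiment_target',
                        values='avg_signal', fill_value=0)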

In [16]:
!pip install seaborn
!pip install scikit-learn
Successfully installed cycler-0.10.0 kiwisolver-1.1.0 matplotlib-3.1.0 pyparsing-2.4.0 scipy-1.3.0 seaborn-0.9.0
Successfully installed joblib-0.13.2 scikit-learn-0.21.1
In [17]:
import matplotlib.pyplot as plt
import seaborn as sns
[PyGMQL] generated new fontManager
In [18]:
plt.figure()
sns.clustermap(mapping.to_matrix(index_regs=['gene_symbol'], 
                                 columns_meta=['hm.experiment_target'], 
                                 values_regs=['avg_signal'], fill_value=0, dropna=False).sample(100))
plt.show()
<Figure size 432x288 with 0 Axes>

Performing Principal Component Analysis

In [19]:
from sklearn.decomposition import PCA
import pandas as pd
In [20]:
pca = PCA(n_components=2)
In [21]:
promoter_region_VS_hm = mapping.to_matrix(index_regs=['chr', 'start', 'stop', 'gene_symbol'], 
                                                 columns_meta=['hm.experiment_target'], 
                                                 values_regs=['avg_signal'], fill_value=0)
In [22]:
components = pca.fit_transform(promoter_region_VS_hm.values)
In [23]:
PCA_regions = pd.DataFrame(index=promoter_region_VS_hm.index, columns=['1st_component', '2nd_component'], data=components).reset_index()
In [24]:
plt.figure(figsize=(10, 10))
sns.regplot(data=PCA_regions, x='1st_component', y='2nd_component', fit_reg=False)
plt.show()

Tutorial 3: GWAS on a cloud


Finding the most representative GWAS associated with cell-specific enhancers

(Execution on Hadoop)

In this tutorial we are going to use a GWAS dataset (accessible from this link) together with the whole ENCODE BroadPeak dataset to find which mutations (and their associated traits) are most represented in enhancer regions which are present in a limited set of cell lines.

First, let's download the data and reformat it into a BED-like file.

In [1]:
%%bash

wget -q https://www.ebi.ac.uk/gwas/api/search/downloads/full -O tmp.tsv
cat tmp.tsv | awk 'BEGIN {FS="\t";OFS="\t"} {chrom=$12; gsub(chrom,"chr"chrom,$12)}{print $0}' | sed s/,//g > gwas.tsv
rm tmp.tsv

In order to run the query on HDFS, we have to move the file there.

In [2]:
!hdfs dfs -put ./gwas.tsv hdfs:///
put: `hdfs:///gwas.tsv': File exists

Library imports

In [3]:
import gmql as gl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

Setting the master of the cluster

In this example, the data reside in the HDFS of a cluster. Let's say that the cluster is managed by the YARN resource manager. We therefore have to tell PyGMQL to use it.

In [4]:
gl.set_master("yarn")

Loading of the GWAS dataset

In this example, we have loaded the GMQL repository on HDFS. It is convenient to store the path of the repository in a variable.

In [5]:
gmql_repository = "hdfs:///"

The GWAS data come as a single TSV file, which we can import using the load_from_file function. Notice that we have to specify a parser to properly load the data, so it is wise to take a look at the schema of the downloaded file first.

In [6]:
! head -1 gwas.tsv | tr "\t" "\n"
DATE ADDED TO CATALOG
PUBMEDID
FIRST AUTHOR
DATE
JOURNAL
LINK
STUDY
DISEASE/TRAIT
INITIAL SAMPLE SIZE
REPLICATION SAMPLE SIZE
REGION
chrCHR_ID
CHR_POS
REPORTED GENE(S)
MAPPED_GENE
UPSTREAM_GENE_ID
DOWNSTREAM_GENE_ID
SNP_GENE_IDS
UPSTREAM_GENE_DISTANCE
DOWNSTREAM_GENE_DISTANCE
STRONGEST SNP-RISK ALLELE
SNPS
MERGED
SNP_ID_CURRENT
CONTEXT
INTERGENIC
RISK ALLELE FREQUENCY
P-VALUE
PVALUE_MLOG
P-VALUE (TEXT)
OR or BETA
95% CI (TEXT)
PLATFORM [SNPS PASSING QC]
CNV

We are mainly interested in the mutation position (the chrCHR_ID and CHR_POS columns, i.e. 0-based positions 11 and 12) and the associated trait (DISEASE/TRAIT, position 7).

In [7]:
gwas = gl.load_from_file(gmql_repository + "gwas.tsv", 
                         parser=gl.parsers.RegionParser(chrPos=11, 
                                                        startPos=12, 
                                                        stopPos=12, 
                                                        otherPos=[(7, "trait", 'string')]))

Inspecting the dataset

We can load a tiny part of the dataset to make sense of the data types and schema. You can inspect the dataset using the head function. This function returns a GDataframe object, which gives access to regions (regs) and metadata (meta).

In [8]:
gwas.head().regs
Collecting regions: 5it [00:00, 511.69it/s]
100%|██████████| 2/2 [00:00<00:00, 868.03it/s]
Out[8]:
chr start stop strand trait
id_sample
8292134920041699140 chr14 75539214 75539214 * Multiple sclerosis
8292134920041699140 chr1 200912467 200912467 * Multiple sclerosis
8292134920041699140 chr3 119501087 119501087 * Multiple sclerosis
8292134920041699140 chr11 61064810 61064810 * Multiple sclerosis
8292134920041699140 chr1 116558335 116558335 * Multiple sclerosis

We can also simply look at the schema

In [9]:
gwas.schema
Out[9]:
['trait', 'chr', 'start', 'stop', 'strand']

Plotting the traits

We want to get an idea of the trait distribution. In order to do that we have to load the data in memory. Therefore we can call the materialize function and take the regions.

In [10]:
gwas_data = gwas.materialize().regs
Collecting regions: 134578it [00:02, 51530.17it/s]
100%|██████████| 2/2 [00:00<00:00, 972.59it/s]

We now plot the number of regions for each of the top 30 represented traits.

In [11]:
plt.figure(figsize=(20,5))
sns.countplot(data=gwas_data[gwas_data.trait.isin(gwas_data.trait.value_counts().iloc[:30].index)], x='trait')
plt.xticks(rotation=90)
plt.title("Top represented GWAS traits", fontsize=20)
plt.show()

Loading of the ENCODE BroadPeak dataset

We now load the ENCODE BroadPeak dataset

If the data come already in the GDM format, they can be loaded using the load_from_path function. A GDM dataset is stored as a folder having the following structure:

/path/to/dataset/:
    - sample1.gdm
    - sample1.gdm.meta
    - sample2.gdm
    - sample2.gdm.meta
    - ...
    - schema.xml

In this case, the parser for the data is automatically inferred by the library from the schema.xml file.

In [12]:
broad = gl.load_from_path(gmql_repository + "HG19_ENCODE_BROAD")
In [13]:
broad.schema
Out[13]:
['name',
 'score',
 'signal',
 'pvalue',
 'qvalue',
 'chr',
 'start',
 'stop',
 'strand']

Getting the enhancers

We next identify active enhancers for each cell line as regions of the genome having a peak of H3K27ac. Therefore, we first select all the tracks of interest from the broad dataset filtering on the experiment_target metadata attribute.

In [14]:
acetyl = broad[broad['experiment_target'] == 'H3K27ac-human']

We get the peak region of the ChIP-Seq using the reg_project function. The peak position (peak) is given by the center of the region.

$$ peak = \frac{right + left}{2} $$

In [15]:
peaked = acetyl.reg_project(new_field_dict={'peak': (acetyl.right + acetyl.left)/2})

Once we have the peak, we extend the search region to $\pm 1500 bp$. We use reg_project again.

In [16]:
enlarge = peaked.reg_project(new_field_dict={'left': peaked.peak - 1500, 'right': peaked.peak + 1500})

Grouping by cell line and aggregating the signals

We are interested in enhancers which are cell line specific. Therefore it is important to group our data by cell line. In addition to this we merge the signals coming from different tracks for the same cell line. We can do both of these actions using the normal_cover function.

As output of the following command, we have a dataset that contains a single track for each cell line. The track is computed by merging the replicates of the H3K27ac experiments for the same cell line.

In [17]:
enhancers_by_cell_line = enlarge.normal_cover(1, "ANY", groupBy=['biosample_term_name'])

To select only the cell line specific enhancers we can now apply normal_cover again, constraining the maximum number of overlaps between regions to a selected threshold. In this case we choose a threshold of 2.

In this case the output contains a single sample with all the enhancers which are present in at most max_overlapping cell lines.

In [18]:
max_overlapping = 2
cell_specific_enhancers = enhancers_by_cell_line.normal_cover(1, max_overlapping)
In [19]:
cell_specific_enhancers.schema
Out[19]:
['AccIndex',
 'JaccardIntersect',
 'JaccardResult',
 'chr',
 'start',
 'stop',
 'strand']

Finally, we need to re-associate every cell specific enhancer in cell_specific_enhancers to all the (at most max_overlapping) cell lines in which it is present.

Therefore, we use a join to select, for each cell line, only those enhancers that overlap a region in cell_specific_enhancers.

In [20]:
cell_specific_enhancers_by_cell_line = enhancers_by_cell_line.join(cell_specific_enhancers, [gl.DLE(0)], 'left', refName="en", expName="csen")

Mapping mutations to cell specific enhancers

We now map the mutations in the GWAS dataset onto the enhancer regions. We store the list of traits associated with each enhancer using the gl.BAG expression.

In [21]:
gwas.schema
Out[21]:
['trait', 'chr', 'start', 'stop', 'strand']
In [22]:
enhancer_gwas = cell_specific_enhancers_by_cell_line.map(gwas, refName="csen", expName="gwas", new_reg_fields={'traits': gl.BAG('trait')})
enhancer_gwas = enhancer_gwas.reg_project(["count_csen_gwas", "traits"],new_field_dict={'cell_line': enhancer_gwas['csen.en.biosample_term_name', 'string']})

Materializing the result

We can now call the materialize function to execute the full query. The result will be collected in a GDataframe object.

In [23]:
enhancer_gwas = enhancer_gwas.materialize()
Collecting regions: 840000it [01:32, 32088.41it/s]
100%|██████████| 70/70 [00:00<00:00, 278.28it/s]

The traits column of the resulting region is the list of traits associated with the cell specific enhancer. The data comes in the form of a string of trait names. We convert the string to a list.

In [24]:
enhancer_gwas.regs['traits'] = enhancer_gwas.regs.traits.map(lambda x: x.split(",") if pd.notnull(x) else x)

Analysis

The final part of the analysis concerns the matching of cell lines and traits. We want to understand whether a cell line (represented by its specific enhancers) has particular mutation traits associated with it. The analysis is performed in Pandas using the traits and cell_line attributes of the result regions.

We build an association matrix between cell lines and traits by first converting the result into a list of (cell_line, trait) pairs, turning it into a Pandas DataFrame, and finally using the Pandas crosstab function to extract the matrix.

In [25]:
cell_trait = pd.DataFrame.from_records([(k, v) for k, vs in enhancer_gwas.regs[enhancer_gwas.regs.count_csen_gwas > 0].groupby("cell_line").traits.sum().to_dict().items() for v in vs], 
                                       columns=['cell_line', 'trait'])
Collecting regions: 843859it [01:45, 32088.41it/s]
In [26]:
cross = pd.crosstab(cell_trait.cell_line, cell_trait.trait)

We finally plot the result as a heatmap.

In [27]:
plt.figure(figsize=(50, 15))
sns.heatmap(cross[cross.sum(0).sort_values(ascending=False).iloc[:100].index], cmap='Reds', vmax=70, linewidths=1, annot=True, cbar=False)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)

plt.xlabel("Trait", fontsize=30)
plt.ylabel("Cell line", fontsize=30)
plt.show()

Data structures and functions

Dataset structures

GMQLDataset.GMQLDataset The main abstraction of the library.
GDataframe.GDataframe Class holding the result of a materialization of a GMQLDataset.

Dataset loading functions

load_from_path Loads the data from a local path into a GMQLDataset.
load_from_remote Loads the data from a remote repository.

Parsing

For the list of the available parsers go to:

Parsers

Predefined parsers

class BedParser[source]

Standard Full BED Parser of 10 Columns

class ANNParser[source]

Annotation Parser, 6 columns

class BasicParser[source]

Parser for Chr, Start, Stop only (no Strand)

class NarrowPeakParser[source]

Narrow Peaks Parser. 10 columns

class RnaSeqParser[source]

Standard Full BED Parser of 10 Columns

class BedScoreParser[source]

Standard Full BED Parser of 10 Columns

Customizable parser

All the parsers in PyGMQL extend the RegionParser

class RegionParser(gmql_parser=None, chrPos=None, startPos=None, stopPos=None, strandPos=None, otherPos=None, delimiter='\t', coordinate_system='0-based', schema_format='del', parser_name='parser')[source]

Creates a custom region dataset

Parameters:
  • chrPos – position of the chromosome column
  • startPos – position of the start column
  • stopPos – position of the stop column
  • strandPos – (optional) position of the strand column. Default is None
  • otherPos – (optional) list of tuples of the type [(pos, attr_name, typeFun), …]. Default is None
  • delimiter – (optional) delimiter of the columns of the file. Default is '\t' (tab)
  • coordinate_system – can be {‘0-based’, ‘1-based’, ‘default’}. Default is ‘0-based’
  • schema_format – (optional) type of file. Can be {‘tab’, ‘gtf’, ‘vcf’, ‘del’}. Default is ‘del’
  • parser_name – (optional) name of the parser. Default is ‘parser’
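
For example, a parser for a hypothetical tab-separated file whose first three columns are chromosome, start and stop, followed by a name column, could be built as follows (a minimal sketch; the file layout and the attribute name are illustrative):

import gmql as gl

custom_parser = gl.parsers.RegionParser(
    chrPos=0,                          # chromosome in the first column
    startPos=1,                        # start coordinate in the second column
    stopPos=2,                         # stop coordinate in the third column
    otherPos=[(3, "name", "string")],  # extra region attribute: (position, name, type)
    coordinate_system="0-based")

dataset = gl.load_from_file("path/to/file.tsv", parser=custom_parser)
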
get_gmql_parser()[source]

Gets the Scala implementation of the parser

Returns: a Java Object
static parse_strand(strand)[source]

Defines how to parse the strand column

Parameters: strand – a string representing the strand
Returns: the parsed result
parse_regions(path)[source]

Given a file path, it loads it into memory as a Pandas dataframe

Parameters: path – file path
Returns: a Pandas Dataframe
get_attributes()[source]

Returns the unordered list of attributes

Returns: list of strings
get_ordered_attributes()[source]

Returns the ordered list of attributes

Returns: list of strings
get_types()[source]

Returns the unordered list of data types

Returns: list of data types
get_name_type_dict()[source]

Returns a dictionary of the type {‘column_name’: data_type, …}

Returns: dict
get_ordered_types()[source]

Returns the ordered list of data types

Returns: list of data types

Aggregates operators

COUNT() Counts the number of regions in the group.
SUM(argument) Computes the sum of the values of the specified attribute
MIN(argument) Gets the minimum value in the aggregation group for the specified attribute
MAX(argument) Gets the maximum value in the aggregation group for the specified attribute
AVG(argument) Gets the average value in the aggregation group for the specified attribute
BAG(argument) Creates space-separated string of attribute values for the specified attribute.
STD(argument) Gets the standard deviation of the aggregation group for the specified attribute
MEDIAN(argument) Gets the median value of the aggregation group for the specified attribute
Q1(argument) Gets the first quartile for the specified attribute
Q2(argument) Gets the second quartile for the specified attribute
Q3(argument) Gets the third quartile for the specified attribute
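
As an illustration of how the aggregates are used, the following sketch attaches aggregate values to the regions produced by a normal_cover, in the same way as in the tutorials above. The dataset path and the signal attribute are assumptions for the sake of the example:

import gmql as gl

dataset = gl.load_from_path("path/to/dataset")

# one output region per group of overlapping regions, annotated with the average
# and maximum of the input "signal" values and the number of merged regions
covered = dataset.normal_cover(1, "ANY",
                               new_reg_fields={'avg_signal': gl.AVG("signal"),
                                               'max_signal': gl.MAX("signal"),
                                               'n_regions': gl.COUNT()})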

Genometric predicates

MD(number) Denotes the minimum distance clause, which selects the first K regions of an experiment sample at minimal distance from an anchor region of an anchor dataset sample.
DLE(limit) Denotes the less-equal distance clause, which selects all the regions of the experiment such that their distance from the anchor region is less than, or equal to, N bases.
DL(limit) Less than distance clause, which selects all the regions of the experiment such that their distance from the anchor region is less than N bases
DGE(limit) Greater-equal distance clause, which selects all the regions of the experiment such that their distance from the anchor region is greater than, or equal to, N bases
DG(limit) Greater than distance clause, which selects all the regions of the experiment such that their distance from the anchor region is greater than N bases
UP() Upstream.
DOWN() Downstream.
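
For example, combining MD and DLE in a join associates to each reference region the closest experiment region, provided it lies within 10 kb. This is a minimal sketch with hypothetical dataset paths, following the join syntax used in the GWAS tutorial above; combining several clauses in the same predicate list is assumed to be supported as in GMQL:

import gmql as gl

reference = gl.load_from_path("path/to/reference")
experiment = gl.load_from_path("path/to/experiment")

# keep, for each reference region, the single closest experiment region
# whose distance from the anchor is at most 10000 bases
joined = reference.join(experiment,
                        [gl.MD(1), gl.DLE(10000)],
                        'left',
                        refName="ref", expName="exp")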

Mathematical operators

SQRT(argument) Computes the square root of the argument