Tutorial 2: Mixing local and remote processing


Mixing local and remote processing

In this tutorial you will learn:

  1. How to connect to a remote GMQL instance and run remote queries
  2. How to interleave remote and local GMQL processing
  3. How to manipulate the GDataframe data structure to plot an heatmap
  4. How to perform a Principal Component Analysis using the GDataframe data structure
  5. How to transform a pure Pandas Dataframe back to a GMQLDataset to perform a new query

Connecting to a remote GMQL instance

PyGMQL can interact with a remote server hosting a GMQL instance and repository.

In order to connect to the remote instance the user has to:

  1. Set the IP address of the service
  2. Login to the server

It must be noticed that GMQL offers both authenticated and un-authenticated login: by default PyGMQL executes an un-authenticated login. This means that the remote instance will provide PyGMQL with limited resources and not permanent result storage. In order to login to the GMQL server as an authenticated user, one has to:

  1. Register on the remote service by going to its IP address from the web browser
  2. Use the pygmql_login tool, which is installed together with the library

In the following, we are going to use the publicly accessible GMQL service offered by the Genomic Computing group at Politecnico di Milano (accessible at http://gmql.eu/gmql-rest/). Since we did not registered to the service, when we the login call is encountered, the library will login as GUEST user with limited capabilities (in terms of computation and disk space).

In [1]:
import gmql as gl

[PyGMQL] Logging using stored authentication token

From this point on, all the remote calls (explained later) will be forwarded to the GMQL service hosted at the specified address as a guest user.

Remote execution

With PyGMQL the user can decide if the next query (sequence of statements on GMQLDatasets ended with a materialization) will be executed locally or on the remote service. This can be done using the set_mode function. If the user sets the mode to remote the following will happen:

  1. The local datasets that are used are uploaded to the remote service. Nothing is done to the remote datasets used in the query (if present) since they are already on the server.
  2. A compressed representation of the query is sent to the remote service, decoded and executed
  3. Once the execution is complete, the results are downloaded, stored and loaded into a GDataframe .

The following image describes the steps of remote processing with PyGMQL:

The example

We will showcase these features through a simple example.

We want to understand the enrichment of histone modifications on promotorial regions of REFSEQ genes. In particular:

  • The gene RefSeq annotation dataset holds the information about gene symbols and coordinates and it is stored in the remote server
  • The Chip-Seq data about histone modifications will come from ENCODE and it is as well stored in the remote system.

The result of the analysis will be an heatmap $genes \times histone\_mark$ displaying the average signal value of each HM for each of the selected genes.

Finding the promotorial regions

We will use as definition of promoter the region situated at $\left[gene_{start} - 2000; gene_{start} + 2000\right]$.

We will perform this operation remotely.

In [2]:

We "load" the remote RefSeq dataset using the load_from_remote function. Since this dataset is available to all the users of the system, we will specify public as the ownership of the datasets. In the case in which we are loading our private datasets stored on the service, we can omit the owner parameter.

In [3]:
genes = gl.load_from_remote("HG19_ANNOTATION_REFSEQ", owner="public")

Now we have to:

  1. Select only the samples from the HG19_ANNOTATION_REFSEQ dataset specific of genes
  2. Select only the regions related to the genes (TP53, MYC, MAX, MAZ, PTEN)

This operation can be done using the select function on the gene GMQLDataset variable.

In [4]:
genes = genes.select(meta_predicate=genes["annotation_type"] == 'gene')

To extract the promotorial regions we will use the reg_project (region projection) operation. In particular the following call will:

  1. Return a dataset with only the chr, start, stop, gene_symbol region fields
  2. For each gene, return the genomic position at $\left[gene_{start} - 2000; gene_{start} + 2000\right]$ taking also into account the strand of the gene.
In [6]:
promoters = genes.reg_project(field_list=['gene_symbol'], 
                              new_field_dict={'start': genes.start - 2000, 
                                              'stop': genes.start + 2000})

Since we are interested in keeping the result of this part of the query for future usage, we will materialize the result locally.

In [7]:
promoters = promoters.materialize("./promoters", all_load=False)
100%|██████████| 1/1 [00:08<00:00,  8.94s/it]

Selecting the histone marks

Next we want to select the H3K27me3, H3K36me3, H3K4me1, H3K79me2, H3K9ac and H3K9me3 from the HG19_ENCODE_BROAD_NOV_2017 dataset, also in this case stored remotely. As before, we will load the dataset using load_from_remote. The following operation is a meta selection, which means that the selection will be done on the metadata attributes of the dataset. The resulting dataset will have all the samples for which the metadata satisfy the specified predicate.

In PyGMQL the metadata selection follows the same syntax as the Pandas Python data analysis library

selected_dataset = dataset[dataset['metadata_attribute'] == 'value']

Since we want to test for multiple histone marks we can use the isin() operator.

In [8]:
hms = gl.load_from_remote("HG19_ENCODE_BROAD_NOV_2017", owner="public")
hms = hms[hms['experiment_target'].isin(['H3K27me3-human', 'H3K36me3-human', 
                                         "H3K4me1-human", "H3K79me2-human", 
                                         "H3K9ac-human", "H3K9me3-human"])]

Grouping replicates

The dataset resulting from the previous selection will have several duplicates for each region, due to the several biological and technical replicates offered by ENCODE. Therefore we need a way to group together the overlapping regions and to aggregate the relative signal values.

This can be done using the normal_cover operation. In particular this function enables you to specify:

  1. The number of minimum overlappings for a region to be kept
  2. The maximum number of overlappings for a region to be kept
  3. If you want to groupby the resulting samples by some metadata attribute
  4. The aggregation function for the input region fields to be applied to the region groupings

In our case we will keep all the regions (minimum overallpping = 1, maximum overlapping = "ANY") and group by the experiment_target metadata attribute. The signals of overlapping regions will be averaged.

In [9]:
hms = hms.normal_cover(1, "ANY", groupBy=['experiment_target'], 
                       new_reg_fields={'avg_signal': gl.AVG("signal")})

Also in this case we want to keep the resulting dataset for later processing.

In [10]:
hms = hms.materialize("./hms", all_load=False)
100%|██████████| 6/6 [01:21<00:00, 20.57s/it]

Mapping the histone marks signal to the promotorial regions

Now that we have our set of gene promoters and our set of histone marks, we want to map the signal of the latter to the regions of the former. We can do that using the map operation. In particular, since on a promotorial region can be mapped multiple HMs, we will aggregate the signal by averaging it.

We decide to perform the following operations on the local machine.

In [11]:
In [12]:
mapping = promoters.map(hms, refName='prom', 
                        new_reg_fields={'avg_signal': gl.AVG("avg_signal")})

We finally materialize the result and load it in a GDataframe.

In [13]:
mapping = mapping.materialize("./mapping")
Collecting regions: 282180it [00:09, 29036.10it/s]
100%|██████████| 46/46 [00:00<00:00, 405.64it/s]
In [14]:
chr start stop strand gene_symbol count_prom_hm avg_signal
-4961556509145986263 chr15 35690830 35694830 + HNRNPA1P45 0.0 NaN
-4961556509145986263 chrX 79588817 79592817 - CHMP1B2P 0.0 NaN
-4961556509145986263 chr1 152626807 152630807 + LCEP3 0.0 NaN
-4961556509145986263 chr12 112017097 112021097 + LOC100101246 1.0 7.43431
-4961556509145986263 chr1 154472694 154476694 + TDRD10 0.0 NaN
In [15]:
hm.antibody_accession hm.assay hm.assembly hm.audit_error hm.audit_internal_action hm.audit_not_compliant hm.audit_warning hm.biological_replicate_s_ hm.biosample_age hm.biosample_life_stage ... prom.data_format_description prom.download_link prom.full_name prom.genome_build prom.genome_build_accession prom.original_format prom.original_url prom.provider prom.release_date prom.release_version
-8047918399829123478 [ENCAB000ANB] [ChIP-seq] [hg19] [] [missing derived_from, missing unfiltered alig... [control insufficient read depth, insufficient... [inconsistent platforms, low read length, mild... [1, 2] [31 year, 53 year, unknown] [adult, embryonic] ... [http://biomirror.aarnet.edu.au/biomirror/ncbi... [ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/a... [RefSeq Reference Genome Annotation] [GRCh37.p13] [GCF_000001405.25] [gff] [https://www.ncbi.nlm.nih.gov/projects/genome/... [RefSeq] [2013-06-28] [13]
-4961556509145986263 [ENCAB000ANH] [ChIP-seq] [hg19] [] [missing derived_from, missing unfiltered alig... [control insufficient read depth, insufficient... [inconsistent platforms, low read length] [] [53 year] [adult] ... [http://biomirror.aarnet.edu.au/biomirror/ncbi... [ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/a... [RefSeq Reference Genome Annotation] [GRCh37.p13] [GCF_000001405.25] [gff] [https://www.ncbi.nlm.nih.gov/projects/genome/... [RefSeq] [2013-06-28] [13]
-4790703078920733789 [ENCAB000ADW] [ChIP-seq] [hg19] [extremely low read depth] [biological replicates with identical biosampl... [control insufficient read depth, insufficient... [low read length, mild to moderate bottlenecking] [] [58 year, unknown] [adult, newborn, unknown] ... [http://biomirror.aarnet.edu.au/biomirror/ncbi... [ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/a... [RefSeq Reference Genome Annotation] [GRCh37.p13] [GCF_000001405.25] [gff] [https://www.ncbi.nlm.nih.gov/projects/genome/... [RefSeq] [2013-06-28] [13]
-4315937178640546765 [ENCAB000ADU] [ChIP-seq] [hg19] [control extremely low read depth, extremely l... [biological replicates with identical biosampl... [control insufficient read depth, insufficient... [inconsistent platforms, low read length, mild... [] [31 year, 53 year, unknown] [adult, newborn, unknown] ... [http://biomirror.aarnet.edu.au/biomirror/ncbi... [ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/a... [RefSeq Reference Genome Annotation] [GRCh37.p13] [GCF_000001405.25] [gff] [https://www.ncbi.nlm.nih.gov/projects/genome/... [RefSeq] [2013-06-28] [13]
-4215283275073074168 [ENCAB000ANX] [ChIP-seq] [hg19] [] [biological replicates with identical biosampl... [control insufficient read depth, insufficient... [inconsistent platforms, low read depth, low r... [] [53 year, unknown] [adult, embryonic] ... [http://biomirror.aarnet.edu.au/biomirror/ncbi... [ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/a... [RefSeq Reference Genome Annotation] [GRCh37.p13] [GCF_000001405.25] [gff] [https://www.ncbi.nlm.nih.gov/projects/genome/... [RefSeq] [2013-06-28] [13]

5 rows × 45 columns

Visualizing the HM signal heatmap on the gene promoters

Finally we visualize the heatmap (in the form of clusters) with the assistance of the Seaborn visualization package. In order to visualize the data we have to transform the GDataframe data structure in the mapping variable in a matricial form where we have $genes$ on the rows and $histone\_marks$ on the columns. This can be done with the to_matrix function offered by the GDataframe.

The to_matrix method is particularly versatile since it enables the user to project region and metadata at the same time. For example, in our case the information about the histone modification name is stored as a metadata attribute hm.experiment_target while we want to use the avg_signal region field as matrix value. The to_matrix methods therefore enable us to aggregate the information of both regions and metadata in a single structure.

Underneeth this method leverages on the pivot_table Pandas function and can accept all its parameter together with its specific ones.

In [16]:
!pip install seaborn
!pip install scikit-learn
Collecting seaborn
  Downloading https://files.pythonhosted.org/packages/a8/76/220ba4420459d9c4c9c9587c6ce607bf56c25b3d3d2de62056efe482dadc/seaborn-0.9.0-py3-none-any.whl (208kB)
    100% |████████████████████████████████| 215kB 11.2MB/s 
Requirement already satisfied: numpy>=1.9.3 in /home/nanni/software/miniconda3/envs/pygmql/lib/python3.7/site-packages (from seaborn) (1.16.3)
Requirement already satisfied: pandas>=0.15.2 in /home/nanni/software/miniconda3/envs/pygmql/lib/python3.7/site-packages (from seaborn) (0.24.2)
Collecting scipy>=0.14.0 (from seaborn)
  Downloading https://files.pythonhosted.org/packages/5d/bd/c0feba81fb60e231cf40fc8a322ed5873c90ef7711795508692b1481a4ae/scipy-1.3.0-cp37-cp37m-manylinux1_x86_64.whl (25.2MB)
    100% |████████████████████████████████| 25.2MB 269kB/s 
Collecting matplotlib>=1.4.3 (from seaborn)
  Downloading https://files.pythonhosted.org/packages/dc/cb/a34046e75c9a4ecaf426ae0d0eada97078c8ce4bbe3250940b1a312a1385/matplotlib-3.1.0-cp37-cp37m-manylinux1_x86_64.whl (13.1MB)
    100% |████████████████████████████████| 13.1MB 551kB/s 
Requirement already satisfied: pytz>=2011k in /home/nanni/software/miniconda3/envs/pygmql/lib/python3.7/site-packages (from pandas>=0.15.2->seaborn) (2019.1)
Requirement already satisfied: python-dateutil>=2.5.0 in /home/nanni/software/miniconda3/envs/pygmql/lib/python3.7/site-packages (from pandas>=0.15.2->seaborn) (2.8.0)
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib>=1.4.3->seaborn)
  Downloading https://files.pythonhosted.org/packages/dd/d9/3ec19e966301a6e25769976999bd7bbe552016f0d32b577dc9d63d2e0c49/pyparsing-2.4.0-py2.py3-none-any.whl (62kB)
    100% |████████████████████████████████| 71kB 23.9MB/s 
Collecting cycler>=0.10 (from matplotlib>=1.4.3->seaborn)
  Downloading https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl
Collecting kiwisolver>=1.0.1 (from matplotlib>=1.4.3->seaborn)
  Downloading https://files.pythonhosted.org/packages/93/f8/518fb0bb89860eea6ff1b96483fbd9236d5ee991485d0f3eceff1770f654/kiwisolver-1.1.0-cp37-cp37m-manylinux1_x86_64.whl (90kB)
    100% |████████████████████████████████| 92kB 29.7MB/s 
Requirement already satisfied: six>=1.5 in /home/nanni/software/miniconda3/envs/pygmql/lib/python3.7/site-packages (from python-dateutil>=2.5.0->pandas>=0.15.2->seaborn) (1.12.0)
Requirement already satisfied: setuptools in /home/nanni/software/miniconda3/envs/pygmql/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib>=1.4.3->seaborn) (41.0.0)
Installing collected packages: scipy, pyparsing, cycler, kiwisolver, matplotlib, seaborn
Successfully installed cycler-0.10.0 kiwisolver-1.1.0 matplotlib-3.1.0 pyparsing-2.4.0 scipy-1.3.0 seaborn-0.9.0
Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/ef/52/3254e511ef1fc88d31edf457d90ecfd531931d4202f1b8ee0c949e9478f6/scikit_learn-0.21.1-cp37-cp37m-manylinux1_x86_64.whl (6.7MB)
    100% |████████████████████████████████| 6.7MB 1.1MB/s 
Requirement already satisfied: scipy>=0.17.0 in /home/nanni/software/miniconda3/envs/pygmql/lib/python3.7/site-packages (from scikit-learn) (1.3.0)
Collecting joblib>=0.11 (from scikit-learn)
  Downloading https://files.pythonhosted.org/packages/cd/c1/50a758e8247561e58cb87305b1e90b171b8c767b15b12a1734001f41d356/joblib-0.13.2-py2.py3-none-any.whl (278kB)
    100% |████████████████████████████████| 286kB 23.8MB/s 
Requirement already satisfied: numpy>=1.11.0 in /home/nanni/software/miniconda3/envs/pygmql/lib/python3.7/site-packages (from scikit-learn) (1.16.3)
Installing collected packages: joblib, scikit-learn
Successfully installed joblib-0.13.2 scikit-learn-0.21.1
In [17]:
import matplotlib.pyplot as plt
import seaborn as sns
[PyGMQL] generated new fontManager
In [18]:
                                 values_regs=['avg_signal'], fill_value=0, dropna=False).sample(100))
<Figure size 432x288 with 0 Axes>

Performing Principal Component Analysis

In [19]:
from sklearn.decomposition import PCA
import pandas as pd
In [20]:
pca = PCA(n_components=2)
In [21]:
promoter_region_VS_hm = mapping.to_matrix(index_regs=['chr', 'start', 'stop', 'gene_symbol'], 
                                                 values_regs=['avg_signal'], fill_value=0)
In [22]:
components = pca.fit_transform(promoter_region_VS_hm.values)
In [23]:
PCA_regions = pd.DataFrame(index=promoter_region_VS_hm.index, columns=['1st_component', '2nd_component'], data=components).reset_index()
In [24]:
plt.figure(figsize=(10, 10))
sns.regplot(data=PCA_regions, x='1st_component', y='2nd_component', fit_reg=False)