Tutorial 2: Mixing local and remote processing¶
In this tutorial you will learn:
- How to connect to a remote GMQL instance and run remote queries
- How to interleave remote and local GMQL processing
- How to manipulate the GDataframe data structure to plot a heatmap
- How to perform a Principal Component Analysis using the GDataframe data structure
- How to transform a pure Pandas Dataframe back to a GMQLDataset to perform a new query
Connecting to a remote GMQL instance¶
PyGMQL can interact with a remote server hosting a GMQL instance and repository.
In order to connect to the remote instance the user has to:
- Set the IP address of the service
- Login to the server
Note that GMQL offers both authenticated and unauthenticated login; by default PyGMQL performs an unauthenticated login. This means that the remote instance will provide PyGMQL with limited resources and no permanent result storage. To log in to the GMQL server as an authenticated user, one has to:
- Register on the remote service by going to its IP address from the web browser
- Use the pygmql_login tool, which is installed together with the library
In the following, we are going to use the publicly accessible GMQL service offered by the Genomic Computing group at Politecnico di Milano (accessible at http://gmql.eu/gmql-rest/). Since we have not registered with the service, when the login call is encountered the library will log in as a GUEST user with limited capabilities (in terms of computation and disk space).
import gmql as gl
gl.set_remote_address("http://gmql.eu/gmql-rest/")
gl.login()
From this point on, all the remote calls (explained later) will be forwarded to the GMQL service hosted at the specified address as a guest user.
Remote execution¶
With PyGMQL the user can decide whether the next query (a sequence of statements on GMQLDatasets ended by a materialization) will be executed locally or on the remote service.
This can be done using the set_mode function. If the user sets the mode to remote, the following will happen:
- The local datasets used in the query are uploaded to the remote service. Nothing is done for the remote datasets used in the query (if present), since they already reside on the server.
- A compressed representation of the query is sent to the remote service, where it is decoded and executed
- Once the execution is complete, the results are downloaded, stored and loaded into a GDataframe.
The following image describes the steps of remote processing with PyGMQL:
The example¶
We will showcase these features through a simple example.
We want to understand the enrichment of histone modifications on promoter regions of RefSeq genes. In particular:
- The RefSeq gene annotation dataset holds the information about gene symbols and coordinates, and is stored on the remote server
- The ChIP-seq data about histone modifications comes from ENCODE and is likewise stored on the remote system
The result of the analysis will be a heatmap $genes \times histone\_marks$ displaying the average signal value of each histone mark (HM) for each of the selected genes.
Finding the promoter regions¶
We will use as definition of promoter the region situated at $\left[gene_{start} - 2000; gene_{start} + 2000\right]$.
We will perform this operation remotely.
gl.set_mode("remote")
We "load" the remote RefSeq dataset using the load_from_remote function. Since this dataset is available to all the users of the system, we specify public as the owner of the dataset. When loading our own private datasets stored on the service, the owner parameter can be omitted.
genes = gl.load_from_remote("HG19_ANNOTATION_REFSEQ", owner="public")
Now we have to:
- Select only the samples of the HG19_ANNOTATION_REFSEQ dataset that describe genes
- Select only the regions related to the genes TP53, MYC, MAX, MAZ and PTEN
This operation can be done using the select function on the genes GMQLDataset variable.
genes = genes.select(meta_predicate=genes["annotation_type"] == 'gene')
To extract the promoter regions we will use the reg_project (region projection) operation. In particular, the following call will:
- Return a dataset with only the chr, start, stop and gene_symbol region fields
- For each gene, return the genomic interval $\left[gene_{start} - 2000; gene_{start} + 2000\right]$, taking also into account the strand of the gene
promoters = genes.reg_project(field_list=['gene_symbol'],
new_field_dict={'start': genes.start - 2000,
'stop': genes.start + 2000})
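The coordinate arithmetic performed by the projection above can be sketched in plain Python. Note that in GMQL the start coordinate is strand-aware, so the transcription start site of a reverse-strand gene is its rightmost coordinate; the function below is a hypothetical, illustrative version of that logic:

```python
# Sketch of the promoter-window arithmetic done by reg_project.
# For a "+" strand gene the TSS is the leftmost coordinate; for a "-"
# strand gene it is the rightmost one (GMQL's strand-aware `start`).
def promoter_window(left, right, strand, flank=2000):
    """Return (start, stop) of the +/- `flank` bp window around the TSS."""
    tss = left if strand == "+" else right
    return tss - flank, tss + flank

# A gene on the forward strand spanning [10000, 15000]
print(promoter_window(10000, 15000, "+"))  # (8000, 12000)
# The same coordinates on the reverse strand: the TSS is at 15000
print(promoter_window(10000, 15000, "-"))  # (13000, 17000)
```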
Since we are interested in keeping the result of this part of the query for future usage, we will materialize the result locally.
promoters = promoters.materialize("./promoters", all_load=False)
Selecting the histone marks¶
Next we want to select the H3K27me3, H3K36me3, H3K4me1, H3K79me2, H3K9ac and H3K9me3 marks from the HG19_ENCODE_BROAD_NOV_2017 dataset, which is also stored remotely.
As before, we load the dataset using load_from_remote. The following operation is a meta selection: the selection acts on the metadata attributes of the dataset, and the resulting dataset keeps all the samples whose metadata satisfy the specified predicate.
In PyGMQL the metadata selection follows the same syntax as the Pandas Python data analysis library:
selected_dataset = dataset[dataset['metadata_attribute'] == 'value']
Since we want to test for multiple histone marks, we can use the isin() operator.
hms = gl.load_from_remote("HG19_ENCODE_BROAD_NOV_2017", owner="public")
hms = hms[hms['experiment_target'].isin(['H3K27me3-human', 'H3K36me3-human',
"H3K4me1-human", "H3K79me2-human",
"H3K9ac-human", "H3K9me3-human"])]
Grouping replicates¶
The dataset resulting from the previous selection will contain several duplicates of each region, due to the multiple biological and technical replicates offered by ENCODE. Therefore we need a way to group together the overlapping regions and aggregate their signal values.
This can be done using the normal_cover operation. In particular, this function lets you specify:
- The minimum number of overlapping regions required for a region to be kept
- The maximum number of overlapping regions allowed for a region to be kept
- An optional grouping of the resulting samples by metadata attributes
- The aggregation functions to apply to the region fields of each group of overlapping regions
In our case we will keep all the regions (minimum overlap = 1, maximum overlap = "ANY") and group by the experiment_target metadata attribute. The signals of overlapping regions will be averaged.
hms = hms.normal_cover(1, "ANY", groupBy=['experiment_target'],
new_reg_fields={'avg_signal': gl.AVG("signal")})
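The merge-and-average behaviour described above can be sketched in plain Python (an illustrative simplification, not the library's actual implementation, and ignoring chromosomes and metadata grouping):

```python
# Sketch of the cover step: overlapping regions across replicates are
# merged into one region and their signal values are averaged.
def merge_and_average(regions):
    """regions: list of (start, stop, signal) tuples."""
    merged = []
    for start, stop, signal in sorted(regions):
        if merged and start <= merged[-1][1]:  # overlaps the previous region
            prev_start, prev_stop, signals = merged[-1]
            merged[-1] = (prev_start, max(prev_stop, stop), signals + [signal])
        else:
            merged.append((start, stop, [signal]))
    # Collapse each group of overlapping regions into its average signal
    return [(s, e, sum(v) / len(v)) for s, e, v in merged]

replicates = [(100, 200, 4.0), (150, 250, 6.0), (300, 400, 2.0)]
print(merge_and_average(replicates))  # [(100, 250, 5.0), (300, 400, 2.0)]
```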
Also in this case we want to keep the resulting dataset for later processing.
hms = hms.materialize("./hms", all_load=False)
Mapping the histone mark signals to the promoter regions¶
Now that we have our set of gene promoters and our set of histone marks, we want to map the signal of the latter onto the regions of the former. We can do that using the map operation. In particular, since multiple HM peaks can map onto a single promoter region, we aggregate the signal by averaging it.
We decide to perform the following operations on the local machine.
gl.set_mode("local")
mapping = promoters.map(hms, refName='prom',
expName="hm",
new_reg_fields={'avg_signal': gl.AVG("avg_signal")})
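The semantics of this map step can be sketched in plain Python (again an illustrative simplification, ignoring chromosomes and sample pairing):

```python
# Sketch of the map operation: for each reference region (promoter),
# average the signal of all experiment regions (HM peaks) intersecting it.
def map_avg(reference, experiment):
    """reference: list of (start, stop); experiment: list of (start, stop, signal)."""
    result = []
    for ref_start, ref_stop in reference:
        hits = [sig for start, stop, sig in experiment
                if start < ref_stop and stop > ref_start]  # intersection test
        avg = sum(hits) / len(hits) if hits else 0.0
        result.append((ref_start, ref_stop, avg))
    return result

promoter_list = [(8000, 12000), (20000, 24000)]
peaks = [(9000, 9500, 3.0), (11000, 13000, 5.0), (30000, 31000, 9.0)]
print(map_avg(promoter_list, peaks))  # [(8000, 12000, 4.0), (20000, 24000, 0.0)]
```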
We finally materialize the result and load it in a GDataframe.
mapping = mapping.materialize("./mapping")
mapping.regs.head()
mapping.meta.head()
Visualizing the HM signal heatmap on the gene promoters¶
Finally we visualize the heatmap (in the form of clusters) with the assistance of the Seaborn visualization package.
In order to visualize the data we have to transform the GDataframe held by the mapping variable into a matrix with $genes$ on the rows and $histone\_marks$ on the columns. This can be done with the to_matrix method offered by the GDataframe.
The to_matrix method is particularly versatile, since it enables the user to project region fields and metadata attributes at the same time. For example, in our case the histone modification name is stored in the metadata attribute hm.experiment_target, while we want to use the avg_signal region field as the matrix value. The to_matrix method therefore enables us to combine information from both regions and metadata in a single structure.
Underneath, this method leverages the pivot_table Pandas function and accepts all of its parameters in addition to its own.
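To see the pivoting that to_matrix performs, here is the equivalent pure-Pandas operation on a toy long-format table (column names are illustrative, not the library's internal layout):

```python
import pandas as pd

# Toy long-format table mimicking the joined regions + metadata of `mapping`
long_form = pd.DataFrame({
    "gene_symbol": ["TP53", "TP53", "MYC", "MYC"],
    "experiment_target": ["H3K9ac", "H3K27me3", "H3K9ac", "H3K27me3"],
    "avg_signal": [5.0, 1.0, 7.0, 2.0],
})

# Pivot: genes on the rows, histone marks on the columns, signal as values
matrix = long_form.pivot_table(index="gene_symbol",
                               columns="experiment_target",
                               values="avg_signal",
                               fill_value=0)
print(matrix)
```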
!pip install seaborn
!pip install scikit-learn
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure()
sns.clustermap(mapping.to_matrix(index_regs=['gene_symbol'],
columns_meta=['hm.experiment_target'],
values_regs=['avg_signal'], fill_value=0, dropna=False).sample(100))
plt.show()
Performing Principal Component Analysis¶
from sklearn.decomposition import PCA
import pandas as pd
pca = PCA(n_components=2)
promoter_region_VS_hm = mapping.to_matrix(index_regs=['chr', 'start', 'stop', 'gene_symbol'],
columns_meta=['hm.experiment_target'],
values_regs=['avg_signal'], fill_value=0)
components = pca.fit_transform(promoter_region_VS_hm.values)
PCA_regions = pd.DataFrame(index=promoter_region_VS_hm.index, columns=['1st_component', '2nd_component'], data=components).reset_index()
plt.figure(figsize=(10, 10))
sns.regplot(data=PCA_regions, x='1st_component', y='2nd_component', fit_reg=False)
plt.show()
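Before trusting a two-dimensional scatter of the promoters, it is worth checking how much variance the two components actually capture. A minimal sketch on synthetic data (a stand-in for the real promoter-by-histone-mark matrix) using the explained_variance_ratio_ attribute of the fitted PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 100 "promoters" x 6 "histone marks"
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X[:, 0] *= 5.0  # make one mark dominate the variance

pca = PCA(n_components=2).fit(X)
# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the summed ratio is low, the 2D plot hides most of the structure and more components (or a different embedding) should be considered.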