Tutorial 1: Simple example of local processing¶

01_A_simple_example

Simple example of local processing¶

In this first tutorial we will show a complete example of usage of the library using some example datasets provided with it.

Importing the library¶

In [1]:

import gmql as gl

[PyGMQL] Updating backend to latest version
95M [00:06, 14.85M/s]

Loading datasets¶

PyGMQL can work with BED and GTF files with arbitrary fields and schemas. In order to load a dataset into Python the user can use the following functions:

load_from_path: lazily loads a dataset into a GMQLDataset variable from the local file system
load_from_remote: lazily loads a dataset into a GMQLDataset variable from a remote GMQL service
from_pandas: lazily loads a dataset into a GMQLDataset variable from a Pandas DataFrame having at least the chromosome, start and stop columns

In addition to these functions we also provide a function called get_example_dataset which enables the user to load a sample dataset and play with it in order to get confidence with the library. Currently we provide two example datasets: Example_Dataset_1 and Example_Dataset_2.

In the following we will load two example datasets and play with them.

In [2]:

dataset1 = gl.get_example_dataset("Example_Dataset_1")
dataset2 = gl.get_example_dataset("Example_Dataset_2")

The `GMQLDataset`¶

The dataset variable defined above is a GMQLDataset, which represents a GMQL variable and on which it is possible to apply GMQL operators. It must be noticed that no data has been loaded in memory yet and the computation will only start when the query is triggered. We will see how to start the execution of a query in the following steps.

We can inspect the schema of the dataset with the following:

In [3]:

dataset1.schema

Out[3]:

['source',
 'feature',
 'score',
 'frame',
 'name',
 'signal',
 'pvalue',
 'qvalue',
 'peak',
 'chr',
 'start',
 'stop',
 'strand']

In [4]:

dataset2.schema

Out[4]:

['source',
 'feature',
 'score',
 'frame',
 'name',
 'signal',
 'pvalue',
 'qvalue',
 'peak',
 'chr',
 'start',
 'stop',
 'strand']

Filtering the dataset regions based on a predicate¶

The first operation we will do on dataset will be selecting only the genomic regions on the 3rd chromosome and with a start position greater than 30000.

In [5]:

filtered_dataset1 = dataset1.reg_select((dataset1.chr == 'chr3') & (dataset1.start >= 30000))

From this operation we can learn several things about the GMQLDataset data structure. Each GMQLDataset has a set of methods and fields which can be used to build GMQL queries. For example, in the previous statement we have:

the reg_select method, which enables us to filter the datasets on the basis of a predicate on the region positions and features
the chr and start fields, which enable the user to build predicates on the fields of the dataset.

Every GMQL operator has a relative method accessible from the GMQLDataset data structure, as well as any other field of the dataset.

Filtering a dataset based on a predicate on metadata¶

The Genomic Data Model enables us to work both with genomic regions and their relative metadata. Therefore we can filter dataset samples on the basis of predicates on metadata attributes. This can be done as follows:

In [6]:

filtered_dataset_2 = dataset2[dataset2['antibody_target'] == 'CTCF']

Notice that the notation for selecting the samples using metadata is the same as the one for filtering Pandas DataFrames.

Joining two datasets¶

It is not the focus of this tutorial to show all the possible operations which can be done on a GMQLDataset, they can be seen on the documentation page of the library.

For the sake of this example, let's show the JOIN operation between the two filtered datasets defined in the previous two steps. The JOIN operation semantics relies on the concept of reference and experiment datasets. The reference dataset is the one 'calling' the join function while the experiment dataset is the one 'on which' the function is called. The semantics of the function is

resulting_dataset = <reference>.join(<experiment>, <genometric predicate>, ...)

In [7]:

dataset_join = dataset1.join(dataset2, [gl.DLE(0)])

To understand the concept of genometric predicate please visit the documentation of the library.

Materialization of the results¶

As we have already said, no operation has beed effectively done up to this point. What we did up to now is to define the sequence of operations to apply on the data. In order to trigger the execution we have to apply the materialize function on the variable we want to compute.

In [8]:

query_result = dataset_join.materialize()

Collecting regions: 12it [00:03,  3.49it/s]
100%|██████████| 190/190 [00:02<00:00, 94.63it/s]

The `GDataframe`¶

The query_result variable holds the result of the previous GMQL query in the form of a GDataframe data structure. It holds the information about the regions and the metadata of the result, which can be respectively accessed through the regs and meta attributes.

Regions¶

In [10]:

query_result.regs.head()

Out[10]:

	chr	start	stop	strand	REF.source	REF.feature	REF.score	REF.frame	REF.name	REF.signal	...	REF.peak	EXP.source	EXP.feature	EXP.score	EXP.frame	EXP.name	EXP.signal	EXP.pvalue	EXP.qvalue	EXP.peak
id_sample
5758100849671408287	chr4	149	179	-	GMQL	Region	0.9	.	.	15.0	...	-1.0	GMQL	Region	240.0	.	.	17.0000	11.1675	-1.0	-1.0
5758100849671408287	chr4	199	249	-	GMQL	Region	0.9	.	.	15.0	...	-1.0	GMQL	Region	240.0	.	.	17.0000	11.1675	-1.0	-1.0
-7728122121608963789	chr2	99	150	*	GMQL	Region	10.0	.	.	9.0	...	-1.0	GMQL	Region	12.0	.	.	20.0000	-1.0000	-1.0	-1.0
4670800706188226503	chr2	99	150	*	GMQL	Region	10.0	.	.	9.0	...	-1.0	GMQL	Region	5.0	.	chr2.2538	0.0332	1.3000	-1.0	32.0
-7728122121608963789	chr2	199	240	*	GMQL	Region	9.0	.	.	22.0	...	-1.0	GMQL	Region	12.0	.	.	20.0000	-1.0000	-1.0	-1.0

5 rows × 22 columns

Metadata¶

In [11]:

query_result.meta.head()

Out[11]:

	EXP.ID	EXP.antibody	EXP.antibody_antibodyDescription	EXP.antibody_lab	EXP.antibody_label	EXP.antibody_lots	EXP.antibody_orderUrl	EXP.antibody_tag	EXP.antibody_target	EXP.antibody_targetClass	...	REF.treatment_label	REF.treatment_tag	REF.treatment_type	REF.type	REF.url	REF.view	REF.view_description	REF.view_label	REF.view_tag	REF.view_type
id_sample
-8999039739831780787	[1891]	[CTCF]	[Rabbit polyclonal. Antibody Target: CTCF]	[Myers, Hardison, Snyder]	[CTCF (07-729)]	[1350637 DAM1472197]	[http://www.millipore.com/catalogue/item/07-729]	[CTCF]	[CTCF]	[TFSS]	...	[No treatment or prot]	[NONE]	[control]	[narrowPeak]	[http://hgdownload.cse.ucsc.edu/goldenPath/hg1...	[Peaks]	[Regions of enriched signal in experiment]	[Peaks]	[PKS]	[view]
-8744938632702845394	[1460]	[]	[]	[]	[]	[]	[]	[]	[]	[]	...	[No treatment or prot]	[NONE]	[control]	[narrowPeak]	[http://hgdownload.cse.ucsc.edu/goldenPath/hg1...	[Peaks]	[Regions of enriched signal in experiment]	[Peaks]	[PKS]	[view]
-8566469212180078812	[1892]	[CTCF]	[Rabbit polyclonal. Antibody Target: CTCF]	[Myers, Hardison, Snyder]	[CTCF (07-729)]	[1350637 DAM1472197]	[http://www.millipore.com/catalogue/item/07-729]	[CTCF]	[CTCF]	[TFSS]	...	[No treatment or prot]	[NONE]	[control]	[narrowPeak]	[http://hgdownload.cse.ucsc.edu/goldenPath/hg1...	[Peaks]	[Regions of enriched signal in experiment]	[Peaks]	[PKS]	[view]
-7728122121608963789	[42]	[]	[]	[]	[]	[]	[]	[]	[]	[]	...	[No treatment or prot]	[NONE]	[control]	[narrowPeak]	[http://hgdownload.cse.ucsc.edu/goldenPath/hg1...	[Peaks]	[Regions of enriched signal in experiment]	[Peaks]	[PKS]	[view]
-7571531587018517458	[1892]	[CTCF]	[Rabbit polyclonal. Antibody Target: CTCF]	[Myers, Hardison, Snyder]	[CTCF (07-729)]	[1350637 DAM1472197]	[http://www.millipore.com/catalogue/item/07-729]	[CTCF]	[CTCF]	[TFSS]	...	[No treatment or prot]	[NONE]	[control]	[narrowPeak]	[http://hgdownload.cse.ucsc.edu/goldenPath/hg1...	[Peaks]	[Regions of enriched signal in experiment]	[Peaks]	[PKS]	[view]

5 rows × 189 columns

Tutorial 1: Simple example of local processing¶

Simple example of local processing¶

Importing the library¶

Importing the library¶

Loading datasets¶

The GMQLDataset¶

Filtering the dataset regions based on a predicate¶

Filtering a dataset based on a predicate on metadata¶

Joining two datasets¶

Materialization of the results¶

The GDataframe¶

Regions¶

Metadata¶

The `GMQLDataset`¶

The `GDataframe`¶