Tutorial 1: Simple example of local processing

01_A_simple_example

Simple example of local processing

In this first tutorial we will show a complete example of usage of the library using some example datasets provided with it.

Importing the library

Importing the library

In [1]:
import gmql as gl
[PyGMQL] Updating backend to latest version
95M [00:06, 14.85M/s]                         

Loading datasets

PyGMQL can work with BED and GTF files with arbitrary fields and schemas. In order to load a dataset into Python the user can use the following functions:

  • load_from_path: lazily loads a dataset into a GMQLDataset variable from the local file system
  • load_from_remote: lazily loads a dataset into a GMQLDataset variable from a remote GMQL service
  • from_pandas: lazily loads a dataset into a GMQLDataset variable from a Pandas DataFrame having at least the chromosome, start and stop columns

In addition to these functions we also provide a function called get_example_dataset which enables the user to load a sample dataset and play with it in order to get confidence with the library. Currently we provide two example datasets: Example_Dataset_1 and Example_Dataset_2.

In the following we will load two example datasets and play with them.

In [2]:
dataset1 = gl.get_example_dataset("Example_Dataset_1")
dataset2 = gl.get_example_dataset("Example_Dataset_2")

The GMQLDataset

The dataset variable defined above is a GMQLDataset, which represents a GMQL variable and on which it is possible to apply GMQL operators. It must be noticed that no data has been loaded in memory yet and the computation will only start when the query is triggered. We will see how to start the execution of a query in the following steps.

We can inspect the schema of the dataset with the following:

In [3]:
dataset1.schema
Out[3]:
['source',
 'feature',
 'score',
 'frame',
 'name',
 'signal',
 'pvalue',
 'qvalue',
 'peak',
 'chr',
 'start',
 'stop',
 'strand']
In [4]:
dataset2.schema
Out[4]:
['source',
 'feature',
 'score',
 'frame',
 'name',
 'signal',
 'pvalue',
 'qvalue',
 'peak',
 'chr',
 'start',
 'stop',
 'strand']

Filtering the dataset regions based on a predicate

The first operation we will do on dataset will be selecting only the genomic regions on the 3rd chromosome and with a start position greater than 30000.

In [5]:
filtered_dataset1 = dataset1.reg_select((dataset1.chr == 'chr3') & (dataset1.start >= 30000))

From this operation we can learn several things about the GMQLDataset data structure. Each GMQLDataset has a set of methods and fields which can be used to build GMQL queries. For example, in the previous statement we have:

  • the reg_select method, which enables us to filter the datasets on the basis of a predicate on the region positions and features
  • the chr and start fields, which enable the user to build predicates on the fields of the dataset.

Every GMQL operator has a relative method accessible from the GMQLDataset data structure, as well as any other field of the dataset.

Filtering a dataset based on a predicate on metadata

The Genomic Data Model enables us to work both with genomic regions and their relative metadata. Therefore we can filter dataset samples on the basis of predicates on metadata attributes. This can be done as follows:

In [6]:
filtered_dataset_2 = dataset2[dataset2['antibody_target'] == 'CTCF']

Notice that the notation for selecting the samples using metadata is the same as the one for filtering Pandas DataFrames.

Joining two datasets

It is not the focus of this tutorial to show all the possible operations which can be done on a GMQLDataset, they can be seen on the documentation page of the library.

For the sake of this example, let's show the JOIN operation between the two filtered datasets defined in the previous two steps. The JOIN operation semantics relies on the concept of reference and experiment datasets. The reference dataset is the one 'calling' the join function while the experiment dataset is the one 'on which' the function is called. The semantics of the function is

resulting_dataset = <reference>.join(<experiment>, <genometric predicate>, ...)
In [7]:
dataset_join = dataset1.join(dataset2, [gl.DLE(0)])

To understand the concept of genometric predicate please visit the documentation of the library.

Materialization of the results

As we have already said, no operation has beed effectively done up to this point. What we did up to now is to define the sequence of operations to apply on the data. In order to trigger the execution we have to apply the materialize function on the variable we want to compute.

In [8]:
query_result = dataset_join.materialize()
Collecting regions: 12it [00:03,  3.49it/s]
100%|██████████| 190/190 [00:02<00:00, 94.63it/s]

The GDataframe

The query_result variable holds the result of the previous GMQL query in the form of a GDataframe data structure. It holds the information about the regions and the metadata of the result, which can be respectively accessed through the regs and meta attributes.

Regions

In [10]:
query_result.regs.head()
Out[10]:
chr start stop strand REF.source REF.feature REF.score REF.frame REF.name REF.signal ... REF.peak EXP.source EXP.feature EXP.score EXP.frame EXP.name EXP.signal EXP.pvalue EXP.qvalue EXP.peak
id_sample
5758100849671408287 chr4 149 179 - GMQL Region 0.9 . . 15.0 ... -1.0 GMQL Region 240.0 . . 17.0000 11.1675 -1.0 -1.0
5758100849671408287 chr4 199 249 - GMQL Region 0.9 . . 15.0 ... -1.0 GMQL Region 240.0 . . 17.0000 11.1675 -1.0 -1.0
-7728122121608963789 chr2 99 150 * GMQL Region 10.0 . . 9.0 ... -1.0 GMQL Region 12.0 . . 20.0000 -1.0000 -1.0 -1.0
4670800706188226503 chr2 99 150 * GMQL Region 10.0 . . 9.0 ... -1.0 GMQL Region 5.0 . chr2.2538 0.0332 1.3000 -1.0 32.0
-7728122121608963789 chr2 199 240 * GMQL Region 9.0 . . 22.0 ... -1.0 GMQL Region 12.0 . . 20.0000 -1.0000 -1.0 -1.0

5 rows × 22 columns

Metadata

In [11]:
query_result.meta.head()
Out[11]:
EXP.ID EXP.antibody EXP.antibody_antibodyDescription EXP.antibody_lab EXP.antibody_label EXP.antibody_lots EXP.antibody_orderUrl EXP.antibody_tag EXP.antibody_target EXP.antibody_targetClass ... REF.treatment_label REF.treatment_tag REF.treatment_type REF.type REF.url REF.view REF.view_description REF.view_label REF.view_tag REF.view_type
id_sample
-8999039739831780787 [1891] [CTCF] [Rabbit polyclonal. Antibody Target: CTCF] [Myers, Hardison, Snyder] [CTCF (07-729)] [1350637 DAM1472197] [http://www.millipore.com/catalogue/item/07-729] [CTCF] [CTCF] [TFSS] ... [No treatment or prot] [NONE] [control] [narrowPeak] [http://hgdownload.cse.ucsc.edu/goldenPath/hg1... [Peaks] [Regions of enriched signal in experiment] [Peaks] [PKS] [view]
-8744938632702845394 [1460] [] [] [] [] [] [] [] [] [] ... [No treatment or prot] [NONE] [control] [narrowPeak] [http://hgdownload.cse.ucsc.edu/goldenPath/hg1... [Peaks] [Regions of enriched signal in experiment] [Peaks] [PKS] [view]
-8566469212180078812 [1892] [CTCF] [Rabbit polyclonal. Antibody Target: CTCF] [Myers, Hardison, Snyder] [CTCF (07-729)] [1350637 DAM1472197] [http://www.millipore.com/catalogue/item/07-729] [CTCF] [CTCF] [TFSS] ... [No treatment or prot] [NONE] [control] [narrowPeak] [http://hgdownload.cse.ucsc.edu/goldenPath/hg1... [Peaks] [Regions of enriched signal in experiment] [Peaks] [PKS] [view]
-7728122121608963789 [42] [] [] [] [] [] [] [] [] [] ... [No treatment or prot] [NONE] [control] [narrowPeak] [http://hgdownload.cse.ucsc.edu/goldenPath/hg1... [Peaks] [Regions of enriched signal in experiment] [Peaks] [PKS] [view]
-7571531587018517458 [1892] [CTCF] [Rabbit polyclonal. Antibody Target: CTCF] [Myers, Hardison, Snyder] [CTCF (07-729)] [1350637 DAM1472197] [http://www.millipore.com/catalogue/item/07-729] [CTCF] [CTCF] [TFSS] ... [No treatment or prot] [NONE] [control] [narrowPeak] [http://hgdownload.cse.ucsc.edu/goldenPath/hg1... [Peaks] [Regions of enriched signal in experiment] [Peaks] [PKS] [view]

5 rows × 189 columns