Tutorial 1: Simple example of local processing¶
In this first tutorial we walk through a complete example of using the library with the example datasets that ship with it.
Importing the library¶
import gmql as gl
Loading datasets¶
PyGMQL can work with BED and GTF files with arbitrary fields and schemas. In order to load a dataset into Python the user can use the following functions:
- load_from_path: lazily loads a dataset into a GMQLDataset variable from the local file system
- load_from_remote: lazily loads a dataset into a GMQLDataset variable from a remote GMQL service
- from_pandas: lazily loads a dataset into a GMQLDataset variable from a Pandas DataFrame having at least the chromosome, start and stop columns
In addition to these functions, we also provide a function called get_example_dataset, which lets the user load a sample dataset and experiment with it to gain confidence with the library. Currently we provide two example datasets: Example_Dataset_1 and Example_Dataset_2.
In the following we will load two example datasets and play with them.
dataset1 = gl.get_example_dataset("Example_Dataset_1")
dataset2 = gl.get_example_dataset("Example_Dataset_2")
The GMQLDataset¶
Each of the dataset1 and dataset2 variables defined above is a GMQLDataset, which represents a GMQL variable and to which GMQL operators can be applied. Note that no data has been loaded into memory yet: the computation only starts when the query is triggered. We will see how to start the execution of a query in the following steps.
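The deferred execution described above can be illustrated with a small, library-independent sketch. The LazyDataset class, its select method and the load_regions helper are hypothetical names invented for this illustration. They are not part of PyGMQL and say nothing about how PyGMQL is implemented internally. The sketch only shows the general idea that building a query records steps, and that work happens when materialize is called.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (NOT PyGMQL's implementation)."""

    def __init__(self, loader, steps=()):
        self._loader = loader        # zero-argument callable that produces the rows
        self._steps = tuple(steps)   # deferred filters, in the order they were added

    def select(self, predicate):
        # Record the filter and return a new plan; nothing is loaded or computed here.
        return LazyDataset(self._loader, self._steps + (predicate,))

    def materialize(self):
        # Only now is the data loaded and every recorded step applied.
        rows = self._loader()
        for predicate in self._steps:
            rows = [row for row in rows if predicate(row)]
        return rows


load_calls = []  # lets us observe when loading actually happens


def load_regions():
    # Hypothetical loader returning two made-up regions.
    load_calls.append("loaded")
    return [
        {"chr": "chr1", "start": 100, "stop": 200},
        {"chr": "chr3", "start": 40000, "stop": 40500},
    ]


dataset = LazyDataset(load_regions)
query = dataset.select(lambda row: row["chr"] == "chr3")
calls_before = len(load_calls)   # the query is only a plan at this point
result = query.materialize()     # loading and filtering happen here
```

In this toy model, load_regions is not invoked until materialize runs, which mirrors the behaviour the tutorial describes for GMQLDataset variables.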
We can inspect the schema of the dataset with the following:
dataset1.schema
dataset2.schema
Filtering the dataset regions based on a predicate¶
The first operation we will perform on dataset1 is to select only the genomic regions on chromosome 3 with a start position greater than or equal to 30000.
filtered_dataset1 = dataset1.reg_select((dataset1.chr == 'chr3') & (dataset1.start >= 30000))
From this operation we can learn several things about the GMQLDataset data structure. Each GMQLDataset has a set of methods and fields which can be used to build GMQL queries. For example, in the previous statement we have:
- the reg_select method, which enables us to filter the dataset on the basis of a predicate on the region positions and features
- the chr and start fields, which enable the user to build predicates on the fields of the dataset.
Every GMQL operator has a corresponding method accessible from the GMQLDataset data structure, and every field of the dataset is accessible from it as well.
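To make the predicate syntax less mysterious, here is a minimal sketch of how comparison and `&` operators on field objects can build a predicate instead of evaluating to a plain boolean. The Field and Predicate classes are hypothetical and written only for this illustration. They are an assumption about the general technique (operator overloading), not a description of PyGMQL's actual classes.

```python
class Predicate:
    """A deferred test on a region, plus a readable description of it."""

    def __init__(self, test, text):
        self.test = test   # callable: region dict -> bool
        self.text = text   # human-readable form of the condition

    def __and__(self, other):
        # `p & q` builds a new predicate that requires both conditions.
        return Predicate(
            lambda region: self.test(region) and other.test(region),
            f"({self.text} AND {other.text})",
        )


class Field:
    """A named region field whose comparisons return Predicate objects."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        return Predicate(lambda region: region[self.name] == value,
                         f"{self.name} == {value!r}")

    def __ge__(self, value):
        return Predicate(lambda region: region[self.name] >= value,
                         f"{self.name} >= {value!r}")


chrom = Field("chr")
start = Field("start")

# Same shape as the reg_select argument in the tutorial.
predicate = (chrom == "chr3") & (start >= 30000)

# Made-up regions, used only to exercise the predicate.
regions = [
    {"chr": "chr3", "start": 45000},
    {"chr": "chr3", "start": 1000},
    {"chr": "chr1", "start": 50000},
]
selected = [region for region in regions if predicate.test(region)]
```

The parentheses around each comparison matter: in Python, `&` binds more tightly than `==` and `>=`, so omitting them changes how the expression is parsed.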
Filtering a dataset based on a predicate on metadata¶
The Genomic Data Model enables us to work both with genomic regions and with their associated metadata. Therefore we can filter the samples of a dataset on the basis of predicates on metadata attributes. This can be done as follows:
filtered_dataset_2 = dataset2[dataset2['antibody_target'] == 'CTCF']
Notice that the notation for selecting the samples using metadata is the same as the one for filtering Pandas DataFrames.
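A metadata predicate keeps or discards whole samples, whereas a region predicate filters the regions inside each sample. The following sketch illustrates that difference on made-up data in plain Python. The sample layout, the select_by_metadata helper and all values are assumptions for illustration only. In particular, storing each metadata attribute as a list of values is a modelling choice of this sketch, not a statement about PyGMQL's internals.

```python
# Three hypothetical samples, each with metadata and a list of regions.
samples = [
    {"id": "S1",
     "meta": {"antibody_target": ["CTCF"]},
     "regs": [("chr1", 100, 200), ("chr3", 40000, 40500)]},
    {"id": "S2",
     "meta": {"antibody_target": ["POLR2A"]},
     "regs": [("chr3", 35000, 35200)]},
    {"id": "S3",
     "meta": {"cell": ["K562"]},   # no antibody_target attribute at all
     "regs": [("chr2", 10, 50)]},
]


def select_by_metadata(samples, attribute, value):
    """Keep every sample whose metadata attribute contains the given value."""
    return [s for s in samples if value in s["meta"].get(attribute, [])]


kept = select_by_metadata(samples, "antibody_target", "CTCF")
kept_ids = [s["id"] for s in kept]
```

Sample S1 is retained with all of its regions untouched, while S2 and S3 are dropped entirely.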
Joining two datasets¶
Showing all the operations that can be applied to a GMQLDataset is beyond the scope of this tutorial; they are listed on the documentation page of the library.
For the sake of this example, let's show the JOIN operation between the two filtered datasets defined in the previous two steps. The semantics of the JOIN operation relies on the concepts of reference and experiment datasets. The reference dataset is the one on which the join method is called, while the experiment dataset is the one passed to it as an argument. The general form of the call is
resulting_dataset = <reference>.join(<experiment>, <genometric predicate>, ...)
dataset_join = filtered_dataset1.join(filtered_dataset_2, [gl.DLE(0)])
To understand the concept of genometric predicate please visit the documentation of the library.
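As a rough intuition for a distance-based genometric predicate such as DLE(0), the sketch below pairs reference and experiment regions whose distance is less than or equal to a limit. It assumes a common definition of genometric distance: the number of bases between the closest ends of two regions on the same chromosome, negative when they overlap and zero when they are adjacent. The functions genometric_distance, dle and join_regions are hypothetical helpers written for this illustration. The real JOIN operator has more parameters and also decides which output region to emit, so treat this as a simplified model and consult the library documentation for the exact semantics.

```python
def genometric_distance(a, b):
    """Distance between two (chrom, start, stop) regions, half-open coordinates.

    Returns None when the regions lie on different chromosomes.
    Negative values mean overlap; zero means the regions are adjacent.
    """
    if a[0] != b[0]:
        return None
    return max(a[1], b[1]) - min(a[2], b[2])


def dle(limit):
    """Build a 'distance less than or equal to limit' check (toy version)."""
    def check(a, b):
        distance = genometric_distance(a, b)
        return distance is not None and distance <= limit
    return check


def join_regions(reference, experiment, predicate):
    """Pair every reference region with each experiment region that satisfies predicate."""
    return [(r, e) for r in reference for e in experiment if predicate(r, e)]


# Made-up coordinates, chosen to cover the four interesting cases.
reference = [("chr3", 30000, 30100)]
experiment = [
    ("chr3", 30050, 30200),  # overlaps the reference region
    ("chr3", 30100, 30150),  # adjacent to the reference region
    ("chr3", 30200, 30300),  # 100 bases away
    ("chr1", 30000, 30100),  # different chromosome
]

pairs = join_regions(reference, experiment, dle(0))
matched = [e for _, e in pairs]
```

Under these assumptions, a limit of 0 keeps the overlapping and the adjacent experiment regions and rejects the other two.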
Materialization of the results¶
As we have already said, no operation has actually been executed up to this point. So far we have only defined the sequence of operations to apply to the data. To trigger the execution we have to call the materialize function on the variable we want to compute.
query_result = dataset_join.materialize()
The GDataframe¶
The query_result variable holds the result of the previous GMQL query in the form of a GDataframe data structure. It holds the regions and the metadata of the result, which can be accessed through the regs and meta attributes, respectively.
Regions¶
query_result.regs.head()
Metadata¶
query_result.meta.head()