Welcome to PyGMQL’s documentation!

PyGMQL is a Python module that enables the user to perform operations on genomic data in a scalable way.

This library is part of the larger GMQL project, which aims to design and develop genomic data management and analysis software on top of big data engines, in support of biologists, researchers and data scientists.

GMQL is a declarative language with a SQL-like syntax. PyGMQL translates this paradigm to the interactive and script-oriented world of Python, enabling the integration of genomic data with classical Python packages for machine learning and data science.

Data structures and functions

Dataset structures

GMQLDataset.GMQLDataset The main abstraction of the library.
GDataframe.GDataframe Class holding the result of a materialization of a GMQLDataset.

Dataset loading functions

load_from_path Loads the data from a local path into a GMQLDataset.
load_from_remote Loads the data from a remote repository.

Parsing

For the list of the available parsers go to:

Aggregate operators

COUNT() Counts the number of regions in the group.
SUM(argument) Computes the sum of the values of the specified attribute.
MIN(argument) Gets the minimum value in the aggregation group for the specified attribute.
MAX(argument) Gets the maximum value in the aggregation group for the specified attribute.
AVG(argument) Gets the average value in the aggregation group for the specified attribute.
BAG(argument) Creates a space-separated string of the values of the specified attribute.
STD(argument) Gets the standard deviation of the aggregation group for the specified attribute.
MEDIAN(argument) Gets the median value of the aggregation group for the specified attribute.
Q1(argument) Gets the first quartile for the specified attribute.
Q2(argument) Gets the second quartile for the specified attribute.
Q3(argument) Gets the third quartile for the specified attribute.
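
To make the meaning of these aggregates concrete, the following sketch computes each of them over one group of attribute values using only the Python standard library. It does not call PyGMQL: the function name `aggregate` and the sample `scores` are hypothetical, and the choice of population standard deviation and of the "inclusive" quartile method is an assumption, since the list above does not say which conventions the library uses.

```python
import statistics


def aggregate(values):
    """Illustrative aggregates over one group of numeric attribute values.

    Assumptions (not taken from PyGMQL): STD is the population standard
    deviation and quartiles use the 'inclusive' interpolation method.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "COUNT": len(values),                      # number of regions in the group
        "SUM": sum(values),
        "MIN": min(values),
        "MAX": max(values),
        "AVG": statistics.mean(values),
        "BAG": " ".join(str(v) for v in values),   # space-separated string
        "STD": statistics.pstdev(values),
        "MEDIAN": statistics.median(values),
        "Q1": q1,
        "Q2": q2,
        "Q3": q3,
    }


# Hypothetical attribute values of the regions in one aggregation group.
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
result = aggregate(scores)
for name, value in result.items():
    print(name, value)
```

Under these assumptions Q2 coincides with MEDIAN, which is why both report the same value for the sample group.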

Genometric predicates

MD(number) Denotes the minimum distance clause, which selects the first K regions of an experiment sample at minimal distance from an anchor region of an anchor dataset sample.
DLE(limit) Denotes the less-equal distance clause, which selects all the regions of the experiment such that their distance from the anchor region is less than, or equal to, N bases.
DL(limit) Denotes the less-than distance clause, which selects all the regions of the experiment such that their distance from the anchor region is less than N bases.
DGE(limit) Denotes the greater-equal distance clause, which selects all the regions of the experiment such that their distance from the anchor region is greater than, or equal to, N bases.
DG(limit) Denotes the greater-than distance clause, which selects all the regions of the experiment such that their distance from the anchor region is greater than N bases.
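UP() Denotes the upstream clause, which restricts the selection to experiment regions upstream of the anchor region.
DOWN() Denotes the downstream clause, which restricts the selection to experiment regions downstream of the anchor region.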
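
The distance clauses above can be illustrated with a small pure-Python sketch. It does not use PyGMQL: the helper names (`distance`, `dle`, `dl`, `dge`, `dg`, `md`) are hypothetical, regions are reduced to `(start, stop)` pairs on a single chromosome, and the distance is assumed to be the number of bases between the closest ends of the two regions (negative when they overlap). Strand, and therefore UP and DOWN, is not modeled here.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, stop) on one chromosome; a simplification


def distance(anchor: Region, region: Region) -> int:
    """Gap in bases between the closest ends (assumed definition).

    The value is negative when the two regions overlap.
    """
    return max(anchor[0], region[0]) - min(anchor[1], region[1])


def dle(anchor: Region, regions: List[Region], limit: int) -> List[Region]:
    """Regions whose distance from the anchor is <= limit."""
    return [r for r in regions if distance(anchor, r) <= limit]


def dl(anchor: Region, regions: List[Region], limit: int) -> List[Region]:
    """Regions whose distance from the anchor is < limit."""
    return [r for r in regions if distance(anchor, r) < limit]


def dge(anchor: Region, regions: List[Region], limit: int) -> List[Region]:
    """Regions whose distance from the anchor is >= limit."""
    return [r for r in regions if distance(anchor, r) >= limit]


def dg(anchor: Region, regions: List[Region], limit: int) -> List[Region]:
    """Regions whose distance from the anchor is > limit."""
    return [r for r in regions if distance(anchor, r) > limit]


def md(anchor: Region, regions: List[Region], k: int) -> List[Region]:
    """The k regions at minimal distance from the anchor."""
    return sorted(regions, key=lambda r: distance(anchor, r))[:k]


# Hypothetical anchor region and experiment regions.
anchor = (100, 200)
experiment = [(250, 300), (210, 220), (1000, 1100)]

print(dle(anchor, experiment, 50))
print(dl(anchor, experiment, 50))
print(dge(anchor, experiment, 50))
print(dg(anchor, experiment, 50))
print(md(anchor, experiment, 2))
```

The only difference between DLE and DL, and between DGE and DG, is whether a region at exactly N bases from the anchor is kept, which the region `(250, 300)` at distance 50 makes visible.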

Mathematical operators

SQRT(argument) Computes the square root of the argument.