Welcome to PyGMQL’s documentation!
PyGMQL is a Python module that enables the user to perform operations on genomic data in a scalable way.
This library is part of the larger GMQL project, which aims to design and develop genomic data management and analysis software on top of big data engines to help biologists, researchers, and data scientists.
GMQL is a declarative language with a SQL-like syntax. PyGMQL translates this paradigm to the interactive, script-oriented world of Python, enabling the integration of genomic data with standard Python packages for machine learning and data science.
Tutorials
Data structures and functions
Dataset structures
GMQLDataset.GMQLDataset | The main abstraction of the library.
GDataframe.GDataframe | Class holding the result of a materialization of a GMQLDataset.
Dataset loading functions
load_from_path | Loads the data from a local path into a GMQLDataset.
load_from_remote | Loads the data from a remote repository.
Parsing
For the list of the available parsers go to:
Aggregate operators
COUNT() | Counts the number of regions in the group.
SUM(argument) | Computes the sum of the values of the specified attribute.
MIN(argument) | Gets the minimum value in the aggregation group for the specified attribute.
MAX(argument) | Gets the maximum value in the aggregation group for the specified attribute.
AVG(argument) | Gets the average value in the aggregation group for the specified attribute.
BAG(argument) | Creates a space-separated string of the values of the specified attribute.
STD(argument) | Gets the standard deviation of the aggregation group for the specified attribute.
MEDIAN(argument) | Gets the median value of the aggregation group for the specified attribute.
Q1(argument) | Gets the first quartile for the specified attribute.
Q2(argument) | Gets the second quartile for the specified attribute.
Q3(argument) | Gets the third quartile for the specified attribute.
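To make the aggregate descriptions above concrete, the following is a minimal sketch in plain Python of what each operator computes over one aggregation group. It does not use the PyGMQL API: the `scores` list is hypothetical data, and the choices of the population form for the standard deviation and the "inclusive" quantile method are assumptions made for illustration, since the descriptions above do not specify them.

```python
import statistics

# Hypothetical values of one region attribute within a single aggregation group.
scores = [2.0, 4.0, 4.0, 5.0, 10.0]

count = len(scores)                        # COUNT: number of regions in the group
total = sum(scores)                        # SUM
minimum = min(scores)                      # MIN
maximum = max(scores)                      # MAX
average = statistics.mean(scores)          # AVG
bag = " ".join(str(v) for v in scores)     # BAG: space-separated string of values
std = statistics.pstdev(scores)            # STD (population form is an assumption)
median = statistics.median(scores)         # MEDIAN

# Q1, Q2, Q3: the three quartile cut points (interpolation method is an assumption).
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

print(count, total, minimum, maximum, average)
print(bag)
print(std, median, q1, q2, q3)
```

With this quantile method the second quartile coincides with the median, which is the usual reading of Q2.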
Genometric predicates
MD(number) | Minimum distance clause, which selects the first K regions (K given by number) of an experiment sample at minimal distance from an anchor region of an anchor dataset sample.
DLE(limit) | Less-equal distance clause, which selects all the regions of the experiment whose distance from the anchor region is less than or equal to N bases (N given by limit).
DL(limit) | Less-than distance clause, which selects all the regions of the experiment whose distance from the anchor region is less than N bases.
DGE(limit) | Greater-equal distance clause, which selects all the regions of the experiment whose distance from the anchor region is greater than or equal to N bases.
DG(limit) | Greater-than distance clause, which selects all the regions of the experiment whose distance from the anchor region is greater than N bases.
UP() | Upstream clause, which selects the regions of the experiment located upstream of the anchor region.
DOWN() | Downstream clause, which selects the regions of the experiment located downstream of the anchor region.
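The predicates above can be illustrated with a small sketch in plain Python. It is not the PyGMQL implementation: the `Region` type, the helper names, and the sample coordinates are hypothetical. It also assumes half-open coordinates, a distance defined as the gap between the closest ends of two regions (negative when they overlap), and an anchor on the positive strand, so that "upstream" means lower coordinates.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # inclusive (assumption)
    stop: int   # exclusive (assumption)


def distance(anchor: Region, region: Region) -> int:
    """Gap in bases between the closest ends; negative if the regions overlap."""
    return max(anchor.start, region.start) - min(anchor.stop, region.stop)


def same_chrom(anchor: Region, regions: List[Region]) -> List[Region]:
    """Only regions on the anchor's chromosome can be compared by distance."""
    return [r for r in regions if r.chrom == anchor.chrom]


def dle(anchor, regions, limit):  # distance <= limit
    return [r for r in same_chrom(anchor, regions) if distance(anchor, r) <= limit]


def dl(anchor, regions, limit):   # distance < limit
    return [r for r in same_chrom(anchor, regions) if distance(anchor, r) < limit]


def dge(anchor, regions, limit):  # distance >= limit
    return [r for r in same_chrom(anchor, regions) if distance(anchor, r) >= limit]


def dg(anchor, regions, limit):   # distance > limit
    return [r for r in same_chrom(anchor, regions) if distance(anchor, r) > limit]


def md(anchor, regions, k):       # the k regions at minimal distance
    return sorted(same_chrom(anchor, regions), key=lambda r: distance(anchor, r))[:k]


def up(anchor, regions):          # entirely before the anchor (positive strand assumed)
    return [r for r in same_chrom(anchor, regions) if r.stop <= anchor.start]


def down(anchor, regions):        # entirely after the anchor (positive strand assumed)
    return [r for r in same_chrom(anchor, regions) if r.start >= anchor.stop]


# Hypothetical anchor region and experiment regions.
anchor = Region("chr1", 100, 200)
experiment = [
    Region("chr1", 10, 50),
    Region("chr1", 150, 250),
    Region("chr1", 260, 300),
    Region("chr1", 1000, 1100),
    Region("chr2", 100, 200),
]

print(dle(anchor, experiment, 60))
print(md(anchor, experiment, 2))
print(up(anchor, experiment))
```

In this sketch each clause is an independent filter; combining several of them would amount to intersecting their results for the same anchor region.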