Welcome to PyGMQL’s documentation!

PyGMQL is a Python module that enables the user to perform operations on genomic data in a scalable way.

This library is part of the larger GMQL project, which aims to design and develop genomic data management and analysis software on top of big data engines, to help biologists, researchers, and data scientists.

GMQL is a declarative language with a SQL-like syntax. PyGMQL translates this paradigm to the interactive and script-oriented world of Python, enabling the integration of genomic data with classical Python packages for machine learning and data science.

Contents:

Tutorials

Contents:

Data structures and functions

Dataset structures

`GMQLDataset.GMQLDataset`	The main abstraction of the library.
`GDataframe.GDataframe`	Class holding the result of a materialization of a GMQLDataset.

Dataset loading functions

`load_from_path`	Loads the data from a local path into a GMQLDataset.
`load_from_remote`	Loads the data from a remote repository.

Parsing

For the list of the available parsers go to:

Contents:

Parsers
- Predefined parsers
- Customizable parser

Aggregate operators

`COUNT`()	Counts the number of regions in the group.
`SUM`(argument)	Computes the sum of the values of the specified attribute.
`MIN`(argument)	Gets the minimum value in the aggregation group for the specified attribute.
`MAX`(argument)	Gets the maximum value in the aggregation group for the specified attribute.
`AVG`(argument)	Gets the average value in the aggregation group for the specified attribute.
`BAG`(argument)	Creates a space-separated string of the values of the specified attribute.
`STD`(argument)	Gets the standard deviation of the aggregation group for the specified attribute.
`MEDIAN`(argument)	Gets the median value of the aggregation group for the specified attribute.
`Q1`(argument)	Gets the first quartile for the specified attribute.
`Q2`(argument)	Gets the second quartile for the specified attribute.
`Q3`(argument)	Gets the third quartile for the specified attribute.
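
To make the meaning of these operators concrete, the following sketch computes the same quantities over one aggregation group using only the Python standard library. It does not call PyGMQL. The `scores` values and the `bag` helper are hypothetical, and the choice of population standard deviation and of the quartile interpolation method are assumptions made for illustration; PyGMQL's own results may differ in those details.

```python
import statistics

# Hypothetical values of one region attribute inside a single aggregation group.
scores = [2.0, 4.0, 4.0, 5.0, 10.0]


def bag(values):
    """Join the values into a space-separated string, as described for BAG."""
    return " ".join(str(v) for v in values)


summary = {
    "COUNT": len(scores),
    "SUM": sum(scores),
    "MIN": min(scores),
    "MAX": max(scores),
    "AVG": statistics.mean(scores),
    "BAG": bag(scores),
    # Assumption: population standard deviation; a sample estimate
    # (statistics.stdev) would give a different number.
    "STD": statistics.pstdev(scores),
    "MEDIAN": statistics.median(scores),
}

# Assumption: default "exclusive" interpolation of statistics.quantiles.
q1, q2, q3 = statistics.quantiles(scores, n=4)
summary.update({"Q1": q1, "Q2": q2, "Q3": q3})

for name, value in summary.items():
    print(name, value)
```

In this sketch the second quartile coincides with the median, which is why `Q2` and `MEDIAN` print the same value.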

Genometric predicates

`MD`(number)	Minimum distance clause, which selects the first `number` regions of an experiment sample at minimal distance from an anchor region of an anchor dataset sample.
`DLE`(limit)	Less-equal distance clause, which selects all the regions of the experiment such that their distance from the anchor region is less than, or equal to, `limit` bases.
`DL`(limit)	Less than distance clause, which selects all the regions of the experiment such that their distance from the anchor region is less than `limit` bases.
`DGE`(limit)	Greater-equal distance clause, which selects all the regions of the experiment such that their distance from the anchor region is greater than, or equal to, `limit` bases.
`DG`(limit)	Greater than distance clause, which selects all the regions of the experiment such that their distance from the anchor region is greater than `limit` bases.
`UP`()	Upstream.
`DOWN`()	Downstream.
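
The distance clauses above can be read as filters on the distance between an anchor region and each experiment region. The sketch below illustrates that reading in plain Python and does not use PyGMQL. The `Region` type, the helper functions, and the sample coordinates are hypothetical. It also assumes that distance is the number of bases between the closest ends of two regions on the same chromosome (negative when they overlap), which may not match the library's exact definition. The `UP` and `DOWN` predicates depend on strand and are not covered here.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # inclusive
    stop: int   # exclusive


def distance(anchor: Region, other: Region) -> int:
    """Gap between the closest ends; negative if the regions overlap (assumption)."""
    return max(anchor.start, other.start) - min(anchor.stop, other.stop)


def same_chrom(anchor: Region, regions: List[Region]) -> List[Region]:
    """Keep only regions on the anchor's chromosome."""
    return [r for r in regions if r.chrom == anchor.chrom]


def dle(anchor, regions, limit):
    return [r for r in same_chrom(anchor, regions) if distance(anchor, r) <= limit]


def dl(anchor, regions, limit):
    return [r for r in same_chrom(anchor, regions) if distance(anchor, r) < limit]


def dge(anchor, regions, limit):
    return [r for r in same_chrom(anchor, regions) if distance(anchor, r) >= limit]


def dg(anchor, regions, limit):
    return [r for r in same_chrom(anchor, regions) if distance(anchor, r) > limit]


def md(anchor, regions, number):
    """The `number` regions closest to the anchor."""
    ranked = sorted(same_chrom(anchor, regions), key=lambda r: distance(anchor, r))
    return ranked[:number]


anchor = Region("chr1", 100, 200)
experiment = [
    Region("chr1", 250, 300),    # 50 bases after the anchor
    Region("chr1", 1000, 1100),  # 800 bases after the anchor
    Region("chr1", 0, 40),       # 60 bases before the anchor
    Region("chr2", 150, 180),    # different chromosome, never selected
]

print(dle(anchor, experiment, 60))
print(dg(anchor, experiment, 60))
print(md(anchor, experiment, 1))
```

Keeping each clause as a separate small function mirrors the one-predicate-per-clause layout of the table, at the cost of some repetition.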

Mathematical operators

`SQRT`(argument)	Computes the square root of the argument.