03b_GWAS_HDFS

# Finding the most representative GWAS associated with cell-specific enhancers¶

In this tutorial we are going to use a GWAS dataset (accessible from this link) together with the whole ENCODE BroadPeak dataset to find which mutations (and their associated traits) are most represented in enhancer regions which are present in a limited set of cell lines.

As first thing let's download the data and reformat into a bed-like file.

In [1]:
%%bash

cat tmp.tsv | awk 'BEGIN {FS="\t";OFS="\t"} {chrom=$12; gsub(chrom,"chr"chrom,$12)}{print $0}' | sed s/,//g > gwas.tsv rm tmp.tsv  In order to run the query on HDFS, we have to move the file there. In [2]: !hdfs dfs -put ./gwas.tsv hdfs:///  put: hdfs:///gwas.tsv': File exists  ## Library imports¶ In [3]: import gmql as gl import matplotlib.pyplot as plt import seaborn as sns import pandas as pd  ## Setting the master of the cluster¶ In this example, the data reside in the HDFS of a cluster. Let's say that the cluster is managed by the YARN resource manager. We have therefore to tell PyGMQL to use it. In [4]: gl.set_master("yarn")  ## Loading of the GWAS dataset¶ In this example, we have loaded the GMQL repository on the HDFS. It is convenient to store in a variable the path of the repository. In [5]: gmql_repository = "hdfs:///"  The GWAS data comes from a single TSV file. Therefore we can import it using the load_from_file function. Notice that we have to specify a parser to properly load our data. Therefore it is wise to take a look at the schema of the downloaded file. In [6]: ! head -1 gwas.tsv | tr "\t" "\n"  DATE ADDED TO CATALOG PUBMEDID FIRST AUTHOR DATE JOURNAL LINK STUDY DISEASE/TRAIT INITIAL SAMPLE SIZE REPLICATION SAMPLE SIZE REGION chrCHR_ID CHR_POS REPORTED GENE(S) MAPPED_GENE UPSTREAM_GENE_ID DOWNSTREAM_GENE_ID SNP_GENE_IDS UPSTREAM_GENE_DISTANCE DOWNSTREAM_GENE_DISTANCE STRONGEST SNP-RISK ALLELE SNPS MERGED SNP_ID_CURRENT CONTEXT INTERGENIC RISK ALLELE FREQUENCY P-VALUE PVALUE_MLOG P-VALUE (TEXT) OR or BETA 95% CI (TEXT) PLATFORM [SNPS PASSING QC] CNV  We are mainly interested in the mutation position (11-th and 12-th columns) and the associated trait (7-th column). In [7]: gwas = gl.load_from_file(gmql_repository + "gwas.tsv", parser=gl.parsers.RegionParser(chrPos=11, startPos=12, stopPos=12, otherPos=[(7, "trait", 'string')]))  ### Inspecting the dataset¶ We can load a tiny part of the dataset to make sense of the data types and schema. You can inspect the dataset using the head function. This function returns a GDataframe object, which enables the access to regions (regs) and metadata (meta) In [8]: gwas.head().regs  Collecting regions: 5it [00:00, 511.69it/s] 100%|██████████| 2/2 [00:00<00:00, 868.03it/s]  Out[8]: chr start stop strand trait id_sample 8292134920041699140 chr14 75539214 75539214 * Multiple sclerosis 8292134920041699140 chr1 200912467 200912467 * Multiple sclerosis 8292134920041699140 chr3 119501087 119501087 * Multiple sclerosis 8292134920041699140 chr11 61064810 61064810 * Multiple sclerosis 8292134920041699140 chr1 116558335 116558335 * Multiple sclerosis We can also simply look at the schema In [9]: gwas.schema  Out[9]: ['trait', 'chr', 'start', 'stop', 'strand'] ### Plotting the traits¶ We want to get an idea of the trait distribution. In order to do that we have to load the data in memory. Thereofre we can call the materialize function and take the regions. In [10]: gwas_data = gwas.materialize().regs  Collecting regions: 134578it [00:02, 51530.17it/s] 100%|██████████| 2/2 [00:00<00:00, 972.59it/s]  We now plot the number of regions for each of the top 30 represented traits. In [11]: plt.figure(figsize=(20,5)) sns.countplot(data=gwas_data[gwas_data.trait.isin(gwas_data.trait.value_counts().iloc[:30].index)], x='trait') plt.xticks(rotation=90) plt.title("Top represented GWAS traits", fontsize=20) plt.show()  ## Loading of the ENCODE BroadPeak dataset¶ We now load the ENCODE BroadPeak dataset If the data come already in the GDM format, they can be loaded using the load_from_path function. A GDM dataset is stored as a folder having the following structure: /path/to/dataset/: - sample1.gdm - sample1.gdm.meta - sample2.gdm - sample2.gdm.meta - ... - schema.xml In this case, the parser for the data is automatically inferred by the library from the schema.xml file. In [12]: broad = gl.load_from_path(gmql_repository + "HG19_ENCODE_BROAD")  In [13]: broad.schema  Out[13]: ['name', 'score', 'signal', 'pvalue', 'qvalue', 'chr', 'start', 'stop', 'strand'] ### Getting the enhancers¶ We next identify active enhancers for each cell line as regions of the genome having a peak of H3K27ac. Therefore, we first select all the tracks of interest from the broad dataset filtering on the experiment_target metadata attribute. In [14]: acetyl = broad[broad['experiment_target'] == 'H3K27ac-human']  We get the peak region of the Chip-Seq using the reg_project function. The peak position (peak) is given by the center of the region. $$peak = \frac{right + left}{2}$$ In [15]: peaked = acetyl.reg_project(new_field_dict={'peak': (acetyl.right + acetyl.left)/2})  Once we have the peak, we extend the search region to$\pm 1500 bp\$. We use again reg_project

In [16]:
enlarge = peaked.reg_project(new_field_dict={'left': peaked.peak - 1500, 'right': peaked.peak + 1500})


### Grouping by cell line and aggregating the signals¶

We are interested in enhancers which are cell line specific. Therefore it is important to group our data by cell line. In addition to this we merge the signals coming from different tracks for the same cell line. We can do both of these actions using the normal_cover function.

As output of the following command, we have a dataset that contains a single track for each cell line. The track is computed merging the replicas of the experiment targeting H3K9ac for the same cell line.

In [17]:
enhancers_by_cell_line = enlarge.normal_cover(1, "ANY", groupBy=['biosample_term_name'])


To select only the cell line specific enhancers we can now apply again normal_cover and constraining the maximum number of overlaps between the regions to be a selected threshold. In this case we select a threshold of 2.

In this case the output contains a single sample with all the enhancers which are present in at most max_overlapping cell lines.

In [18]:
max_overlapping = 2
cell_specific_enhancers = enhancers_by_cell_line.normal_cover(1, max_overlapping)

In [19]:
cell_specific_enhancers.schema

Out[19]:
['AccIndex',
'JaccardIntersect',
'JaccardResult',
'chr',
'start',
'stop',
'strand']

Finally, we need to re-associate every cell specific enhancers in cell_specific_enhancers to all the max_overlapping cell lines in which it is present.

Therefore, we used a join to select, for each cell line, only those enchancers that overlap a region in the cell_specific_enhancers.

In [20]:
cell_specific_enhancers_by_cell_line = enhancers_by_cell_line.join(cell_specific_enhancers, [gl.DLE(0)], 'left', refName="en", expName="csen")


## Mapping mutations to cell specific enhancers¶

We now map the mutations in the GWAS dataset on the enhancer regions. We store the list of traits associated to each enhancer using the gl.BAG expression.

In [21]:
gwas.schema

Out[21]:
['trait', 'chr', 'start', 'stop', 'strand']
In [22]:
enhancer_gwas = cell_specific_enhancers_by_cell_line.map(gwas, refName="csen", expName="gwas", new_reg_fields={'traits': gl.BAG('trait')})
enhancer_gwas = enhancer_gwas.reg_project(["count_csen_gwas", "traits"],new_field_dict={'cell_line': enhancer_gwas['csen.en.biosample_term_name', 'string']})


## Materializing the result¶

We now can call the materialize function to execute the full query. The result will be collected in a GDataframe object.

In [23]:
enhancer_gwas = enhancer_gwas.materialize()

Collecting regions: 840000it [01:32, 32088.41it/s]
0%|          | 0/70 [00:00<?, ?it/s]
40%|████      | 28/70 [00:00<00:00, 275.93it/s]
80%|████████  | 56/70 [00:00<00:00, 275.65it/s]
100%|██████████| 70/70 [00:00<00:00, 278.28it/s]

The traits column of the resulting region is the list of traits associated with the cell specific enhancer. The data comes in the form of a string of trait names. We convert the string to a list.

In [24]:
enhancer_gwas.regs['traits'] = enhancer_gwas.regs.traits.map(lambda x: x.split(",") if pd.notnull(x) else x)


## Analysis¶

The final part of the analysis regards the matching of cell lines and traits. We want to understand if a cell line (which is represented by its specific enhancers) has some particular mutation trait associated. The analysis is performed in Pandas using the result region attributes traits and cell_line.

We build an association matrix between cell lines and traits by firstly converting the result to a list of (cell_line, trait), converting it to a Pandas DataFrame, and finally using the crosstab Pandas function to extract the matrix.

In [25]:
cell_trait = pd.DataFrame.from_records([(k, v) for k, vs in enhancer_gwas.regs[enhancer_gwas.regs.count_csen_gwas > 0].groupby("cell_line").traits.sum().to_dict().items() for v in vs],
columns=['cell_line', 'trait'])

Collecting regions: 843859it [01:45, 32088.41it/s]
In [26]:
cross = pd.crosstab(cell_trait.cell_line, cell_trait.trait)


We finally plot the result as an heatmap.

In [27]:
plt.figure(figsize=(50, 15))
sns.heatmap(cross[cross.sum(0).sort_values(ascending=False).iloc[:100].index], cmap='Reds', vmax=70, linewidths=1, annot=True, cbar=False)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)

plt.xlabel("Trait", fontsize=30)
plt.ylabel("Cell line", fontsize=30)
plt.show()
`