Welcome to MICTI’s documentation!¶
Recent advances in single-cell gene expression profiling technology have revolutionized the understanding of molecular processes underlying developmental cell and tissue differentiation, enabling the discovery of novel cell types and molecular markers that characterize developmental trajectories. Common approaches for identifying marker genes are based on pairwise statistical testing for differential gene expression between cell types in heterogeneous cell populations, which is challenging due to unequal sample sizes and variance between groups resulting in little statistical power and inflated type I errors.
We developed an alternative feature extraction method, Marker gene Identification for Cell Type Identity (MICTI), that encodes the cell-type specific expression information to each gene in every single cell. This approach identifies features (genes) that are cell-type specific for a given cell-type in heterogeneous cell population.
Contents:
Installation¶
- Obtain Python 3.5 and virturalenv.
- Create a virtual environment somewhere on your disk, and then activate it.
$ virtualenv --no-site-packages --python=python3.5 micti_env $ source micti_env/bin/activate
- Download the source code and install the requirements.
$ pip install MICTIpip will install the following packages:
Import MICTI.
$from MICTI import *
User’s Guide¶
Creat MICTI Object¶
$MICTI(data,geneNames,cellNames,k=None,cluster_label=None,cluster_assignment=None, th=0,seed=None, ensembel=False, organisum=”hsapiens”)
Input¶
data¶
Input data as sparce or dense matrix where the rows are cells and the columns are genes
geneNames¶
List of gene names
cellNames¶
List of cell names
k¶
The number of clusters or cell types
cluster_label¶
List of cluster lablees /cell types names
cluster_assignment¶
An aaray of cluster assignment for each of cells
th¶
The treshold gene expression value to consider a certain gene is expressed or not
ensembel¶
A boolian value indicating the given gene name is ENSEBEL gene Id or not
organisum¶
The organisum where dataset belong eg. hsapiens or mmusculus
Output¶
The output is the MICTI object
Data visualisation¶
$MICTI.get_Visualization(dim=2,method="tsne")
Input¶
dim¶
The number of dimension for visualisation dim=2 or dim=3
method¶
The method used for low dimensional visualisation, method=”PCA” or method=”tsne”
Output¶
Returns none. Desplays the lower dimensional representation of the dataset
Clustering cells¶
$MICTI.cluster_cells(numberOfCluster, method="kmeans", maxiter=500)
Input¶
numberOfCluster¶
The number of clusters
method¶
The method used for clustering. There are two options, ie. method=”kmeans” for kmeans clustering or method=”GM” gaussian mixture model for clustering
maxiter¶
The maximum iteration that the k-means or Gaussian mixture algorithm takes in the clustering process.
Output¶
Returns None, assigning each cells into k clusters
Cell-type marker genes¶
$MICTI.marker_gene_FDR_p_value(clusterNo)
Input¶
clusterNo¶
The cluster number. Each clusters are identified by number. For example, if there are six clusters/cell-types, the cluster numbers are from 0-5.
Output¶
Returns a table with Z-score, p-value and FDR p-value for each of the genes.
significant cluster markers¶
$MICTI.get_markers_by_Pvalues_and_Zscore(clusterNo,threshold_pvalue=.01, threshold_z_score=0)
Input¶
clusterNo¶
The cluster number. Each clusters are identified by number. For example, if there are six clusters/cell-types, the cluster numbers are from 0-5.
threshold_pvalue¶
The threshold FDR p-value. Genes/Markers with less than the threshold FDR p-value are selected.
threshold_z_score¶
The threshold Z-scores. Genes/markers with greater than the threshold z-score are selected.
Output¶
Returns a table with Z-score, p-value and FDR p-value of significantlly cell-type/cluster marker genes filtered by FDR Pvalue and Z-score.
Tutorials¶
We developed an alternative feature extraction method, Marker gene Identification for Cell Type Identity (MICTI), that encodes the cell-type specific expression information to each gene in every single cell. This approach identifies features (genes) that are cell-type specific for a given cell-type in heterogeneous cell population.
Import MICTI module¶
$from MICTI import *
Import data¶
We collected single-cell RNA-Seq dataset from six different immune cell types. We performed TPM normaization for each of samples.
$import pandas as pa
$datamatrix=pa.read_csv("dataset.txt", sep="\t", index_col="genes")
Genes | GSM2181141 | GSM2181122 | GSM2181113 | GSM2180862 | GSM2181258 | GSM2181201 | GSM2180840 | GSM2181133 | GSM2181089 | GSM2180853 |
---|---|---|---|---|---|---|---|---|---|---|
A1BG | 0.000000 | 0.043549 | 0.054509 | 0.000000 | 0.000000 | 0.066542 | 0.605715 | 0.651164 | 0.095305 | 0.000000 |
A1CF | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
A2M | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
A2ML1 | 0.046830 | 0.071208 | 0.018045 | 0.000000 | 0.000000 | 0.023222 | 0.531418 | 0.050903 | 0.098627 | 0.000000 |
A4GALT | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
AAAS | 39.244719 | 4.173193 | 28.947780 | 0.000000 | 67.050516 | 97.502654 | 0.000000 | 2.375844 | 88.972850 | 341.262077 |
AACS | 0.623697 | 0.401357 | 0.362420 | 0.777686 | 0.270946 | 0.893264 | 0.860927 | 0.546757 | 1.002484 | 0.000000 |
AADACL3 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
AADAT | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
AAED1 | 8.078604 | 8.696563 | 6.825583 | 4.692559 | 0.904554 | 0.456029 | 6.191677 | 12.625448 | 11.592398 | 10.103919 |
More information about the samples can be found from the metadata information. Metadata information contains disease stages, tissue catagory, sample source and other important information about the sample/cell. From the metadata table we extracted cell types/sample source in order to classify our cells according to cell type.
$metadata=pa.read_csv("metadata.txt", sep="\t", index_col="SampleID")
SampleID | SubjectID | DiseaseCategory | TissueCategory | BamFileName | CellType | Description | DiseaseStage | DiseaseState | Ethnicity |
---|---|---|---|---|---|---|---|---|---|
GSM2181141 | No Info | hematologic cancer | hematopoietic system | EGAX00001437341.bam | lymphoblast | processed data file = cell_line_FPKM.csv | No Info | chronic myeloid leukemia (CML) | No Info |
GSM2181122 | No Info | hematologic cancer | hematopoietic system | EGAX00001437284.bam | lymphoblast | processed data file = cell_line_FPKM.csv | No Info | chronic myeloid leukemia (CML) | No Info |
GSM2181113 | No Info | hematologic cancer | hematopoietic system | EGAX00001437257.bam | lymphoblast | processed data file = cell_line_FPKM.csv | No Info | chronic myeloid leukemia (CML) | No Info |
GSM2180862 | No Info | hematologic cancer | hematopoietic system | EGAX00001437608.bam | B cell | processed data file = cell_line_FPKM.csv | No Info | B-cell lymphoma | No Info |
GSM2181258 | No Info | hematologic cancer | hematopoietic system | EGAX00001439870.bam | B cell | processed data file = cell_line_FPKM.csv | No Info | B-cell lymphoma |No Info |
Now we have cell-type information for each of our samples/cells from the metadata table. So we wanted to get markers for each of the cell-types using MICTI
$cell_type=list(metadata["CellType"])
$set(cell-type)
{'B cell',
'CD4+ memory T cell',
'CD8+ memory T cell',
'conventional dendritic cell',
'fibroblast',
'lymphoblast'}
$geneName=list(datamatrix.index)
$print(geneName[:10])
['A1BG', 'A1CF', 'A2M', 'A2ML1', 'A4GALT', 'AAAS', 'AACS', 'AADACL3', 'AADAT', 'AAED1']
$cellName=list(datamatrix.columns)
Creating MICTI object for known cell-types¶
$mictiObject=MICTI(datamatrix, geneName, cellName, cluster_assignment=cell_type, k=None, th=0, ensembel=False, organisum="hsapiens")
Lower dimensional data visualization¶
$mictiObject.get_Visualization(method="tsne")
Marker genes for each cluster¶
$mictiObject.marker_gene_FDR_p_value(0)
Genes | Z_scores | fdr | p_value |
---|---|---|---|
HLA-DRA | 40.605319 | 0.000000e+00 | 0.000000e+00 |
MS4A1 | 40.199070 | 0.000000e+00 | 0.000000e+00 |
TUBB | 15.099339 | 0.000000e+00 | 0.000000e+00 |
HLA-DPA1 | 14.701781 | 0.000000e+00 | 0.000000e+00 |
RPS18 | 61.131416 | 0.000000e+00 | 0.000000e+00 |
Marker genes for each cluster by P-value and Z-Score threshold¶
$mictiObject.get_markers_by_Pvalues_and_Zscore(1, threshold_pvalue=.01,threshold_z_score=0)
Genes | Z_scores | fdr | p_value |
CSF2 | 20.313988 | 0.000000e+00 | 0.000000e+00 |
IL2RG | 12.560409 | 0.000000e+00 | 0.000000e+00 |
ATP9B | 28.123272 | 0.000000e+00 | 0.000000e+00 |
HIST1H2BK | 9.118146 | 0.000000e+00 | 0.000000e+00 |
PATL2 | 9.055203 | 0.000000e+00 | 0.000000e+00 |
CTLA4 | 8.523849 | 0.000000e+00 | 0.000000e+00 |
CCL20 | 11.984467 | 0.000000e+00 | 0.000000e+00 |
MAP3K14 | 32.571130 | 0.000000e+00 | 0.000000e+00 |
GZMB | 17.080777 | 0.000000e+00 | 0.000000e+00 |
GPR171 | 10.677701 | 0.000000e+00 | 0.000000e+00 |
Enrichment analysis for identified marker genes¶
Get gene-over representation enrichmentlysis result for cel-type marker genes in all clusters of cell type
$enrechment_table=mictiObject.get_sig_gene_over_representation()
$enrechment_table[1]
#CD4+ cells
Creating MICTI object for clustering cells into pre-defined k clusters¶
In case, if the cell-type information for each cells is not known, we can perform unsupervided clustering to differentiate cells into predifined k clusters. Here, we use K-means and Gaussian mexture mode for clustering.
Creat MICTI object¶
$mictiObject_1=MICTI(datamatrix, geneName, cellName, cluster_assignment=None, th=0, ensembel=False, organisum="hsapiens")
Cluster cells into k clusters¶
Cluster cells into k=6 clusters using Gaussian mixture model- method=”GM”, and k-means - method=”kmeans”
$mictiObject_1.cluster_cells(6, method="GM", maxiter=10e3)
Marker genes per each cluster¶
#markers for the third cluster
$mictiObject_1.get_markers_by_Pvalues_and_Zscore(2, threshold_pvalue=.01, threshold_z_score=0)
Gene-list Enrichment analysis for cluster marker genes¶
$enrechment_table=mictiObject_1.get_sig_gene_over_representation()
$enrechment_table[0]# Enrichment result for the first cluster