Software

Bayesian Analysis for Spatial Segmentation (BASS)

BASS is a method for multi-scale and multi-sample analysis in spatial transcriptomics. BASS performs multi-scale transcriptomic analyses in the form of joint cell type clustering and spatial domain detection, with the two analytic tasks carried out simultaneously within a Bayesian hierarchical modeling framework. For both analyses, BASS properly accounts for the spatial correlation structure and seamlessly integrates gene expression information with spatial localization information to improve their performance. In addition, BASS is capable of multi-sample analysis that jointly models multiple tissue sections/samples, facilitating the integration of spatial transcriptomic data across tissue samples.

Conditional AutoRegressive model based Deconvolution (CARD)

CARD is the software that leverages cell type specific expression information from single cell RNA sequencing (scRNA-seq) for the deconvolution of spatial transcriptomics. A unique feature of CARD is its ability to model the spatial correlation in cell type composition across tissue locations, thus enabling spatially informed cell type deconvolution. Modeling spatial correlation allows us to borrow the cell type composition information across locations on the entire tissue to accurately infer the cell type composition on each individual location, achieve robust deconvolution performance in the presence of mismatched scRNA-seq reference, impute cell type compositions and gene expression levels on unmeasured tissue locations, and facilitate the construction of a refined spatial tissue map with a resolution much higher than that measured in the original study.

CELl type-specific spatially variable gene IdentificatioN Analysis (CELINA)

CELINA is software for systematically identifying cell type specific spatially variable genes (ct-SVGs) across a variety of spatial transcriptomics platforms. CELINA examines one gene at a time and uses a spatially varying coefficient model to explicitly and accurately model each gene’s spatial expression pattern in relation to the cell type distribution across tissue locations. As a result, CELINA provides effective type I error control and high statistical power in both single cell and spot resolution spatial transcriptomics.

COmposite likelihood-based COvariance regression NETwork model (CoCoNet)

CoCoNet is a composite likelihood-based covariance regression network model for identifying trait-relevant tissues or cell types. CoCoNet integrates tissue-specific gene co-expression networks constructed from either bulk or single cell RNA sequencing studies with association summary statistics from genome-wide association studies. CoCoNet relies on a covariance regression network model to express gene-level effect sizes for the given GWAS trait as a function of the tissue-specific co-expression adjacency matrix. With a composite likelihood-based inference algorithm, CoCoNet is scalable to tens of thousands of genes.

Deterministic Bayesian Sparse Linear Mixed Model (DBSLMM)

DBSLMM is an accurate and scalable method for constructing polygenic scores in large biobank scale data sets. DBSLMM relies on a flexible modeling assumption on the effect size distribution to achieve robust and accurate prediction performance across a range of genetic architectures. DBSLMM also relies on a simple deterministic search algorithm to yield an approximate analytic estimation solution using summary statistics only, which, when paired with further algebraic innovations, results in substantial computational savings.

latent Dirichlet Process Regression (DPR)

DPR is a software package implementing the latent Dirichlet process regression method for genetic prediction of complex traits. DPR relies on the Dirichlet process to assign a prior on the effect size distribution itself and is thus capable of inferring an effect size distribution from the data at hand. Effectively, DPR uses infinitely many parameters a priori to characterize the effect size distribution, and with such a flexible modeling assumption, DPR is capable of adapting to a broad spectrum of genetic architectures and achieves robust predictive performance across a wide range of complex traits.

Effect size Correlation for COnfounding determination (ECCO)

ECCO is a computationally efficient approach for determining the optimal number of PEER factors for eQTL mapping analysis. ECCO requires the availability of an outcome phenotype in addition to the usual genotype and expression data required for eQTL mapping studies. With the outcome phenotype, ECCO estimates the gene expression effect on the phenotype for one gene at a time through two different analyses: a differential expression regression analysis and a Mendelian randomization (MR) analysis. By computing and examining the correlation between the estimated effect sizes from the two different analyses, ECCO can subsequently determine the optimal number of PEER factors for eQTL mapping analysis.
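
The correlation step described above can be illustrated with a minimal Python sketch. It is not the ECCO implementation: the function names, the toy effect sizes, and in particular the rule of picking the number of PEER factors with the highest correlation are assumptions made for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def choose_num_factors(de_effects, mr_effects):
    """Correlate per-gene DE and MR effect estimates for each candidate
    number of PEER factors and return the candidate with the highest
    correlation (an assumed selection rule, used here for illustration)."""
    correlations = {k: pearson(de_effects[k], mr_effects[k]) for k in de_effects}
    best_k = max(correlations, key=correlations.get)
    return best_k, correlations

# Hypothetical per-gene effect estimates, keyed by number of PEER factors.
de_effects = {
    0: [0.9, -0.2, 0.4, 0.1],
    5: [0.5, -0.4, 0.3, 0.0],
    10: [0.3, -0.5, 0.2, -0.1],
}
mr_effects = {
    0: [0.1, -0.6, 0.5, -0.3],
    5: [0.4, -0.5, 0.3, -0.1],
    10: [0.3, -0.5, 0.2, -0.1],
}

best_k, correlations = choose_num_factors(de_effects, mr_effects)
print(best_k, correlations)
```

In this toy input the two sets of estimates agree most closely with 10 factors, so that candidate is returned; real data would require the actual differential expression and MR estimates for every gene.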

subcellular Expression LocaLization Analysis (ELLA)

ELLA is a statistical method for modeling the subcellular localization of mRNAs and detecting genes that display spatial variation within cells in high-resolution spatial transcriptomics. ELLA utilizes a nonhomogeneous Poisson process to model the spatial count data within cells, creates a unified cellular coordinate system to anchor diverse shapes and morphologies across cells, and relies on an expression intensity function to capture the subcellular spatial distribution of mRNAs. ELLA can be applied to an arbitrary number of cells and detect a wide variety of subcellular localization patterns across diverse spatial transcriptomic techniques, while producing effective control of type I error and yielding high statistical power. With a computationally efficient algorithm, ELLA is scalable to tens of thousands of genes across tens of thousands of cells.

Fine-mApping of causal genes for BInary Outcomes (FABIO)

FABIO is a transcriptome-wide association study (TWAS) fine-mapping method specifically designed for binary traits that is capable of modeling all genes jointly on an entire chromosome. FABIO relies on a probit model to directly relate multiple GReX to a binary outcome. Additionally, it jointly models all genes located on a chromosome to account for the correlation among GReX arising from cis-SNP LD and expression correlation across genomic regions. As a result, FABIO effectively controls false discoveries while offering substantial power gains over existing TWAS fine-mapping approaches.

Genetic and Environmental Covariance estimation by composite-liKelihood Optimization (GECKO)

GECKO is a computational method for estimating both genetic and environmental covariances using GWAS summary statistics. GECKO improves estimation accuracy of method of moments algorithms while keeping computation in check. GECKO relies on composite likelihood, is scalable computationally, uses only summary statistics, provides accurate genetic and environmental covariance estimates across a range of scenarios, and accommodates SNP annotation stratified covariance estimation.

Genome-wide Efficient Mixed Model Association (GEMMA: LMM, mvLMM, BSLMM, and MQS)

GEMMA is the software implementing the Genome-wide Efficient Mixed Model Association algorithm for a standard linear mixed model and some of its close relatives for genome-wide association studies (GWAS).

It is computationally efficient for large scale GWAS and uses freely available open-source numerical libraries.

Gene-based Integrative Fine-mapping through conditional TWAS (GIFT)

GIFT is a gene-based integrative fine-mapping method for performing conditional TWAS analysis. GIFT examines one genomic region at a time, jointly models the GReX of all genes residing in the focal region, and carries out TWAS conditional analysis in a maximum likelihood framework. In the process, GIFT explicitly models the gene expression correlation and cis-SNP LD across different genes in the region and accounts for the uncertainty in the constructed GReX. As a result, GIFT provides effective type I error control, refines marginal TWAS associations into a much smaller set of putatively causal associations, and yields high statistical power with reduced false discoveries.

integrative Differential expression and gene set Enrichment Analysis (iDEA)

iDEA is a method for performing joint differential expression (DE) and gene set enrichment (GSE) analysis. iDEA builds upon a hierarchical Bayesian model for joint modeling of DE and GSE analyses. It uses only summary statistics as input, allowing for effective data modeling through complementing and pairing with various existing DE methods. It relies on an efficient expectation-maximization algorithm with internal Markov Chain Monte Carlo steps for scalable inference. By integrating DE and GSE analyses, iDEA can improve the power and consistency of DE analysis and the accuracy of GSE analysis over common existing approaches.

Integrative Methylation Association with GEnotypes (IMAGE)

IMAGE is a method that performs methylation quantitative trait locus (mQTL) mapping in bisulfite sequencing studies. IMAGE jointly accounts for both allele-specific methylation information from heterozygous individuals and non-allele-specific methylation information across all individuals, enabling powerful ASM-assisted mQTL mapping. In addition, IMAGE relies on an over-dispersed binomial mixed model to directly model count data, which naturally accounts for sample non-independence resulting from individual relatedness, population stratification, or batch effects that are commonly observed in sequencing studies. IMAGE relies on a penalized quasi-likelihood (PQL) approximation-based algorithm for scalable model inference.

integrative MApping of Pleiotropic association (iMAP)

iMAP performs integrative mapping of pleiotropic association and functional annotations using penalized Gaussian mixture models. iMAP relies on a multinomial logistic regression model to incorporate a large number of binary and continuous SNP annotations, and, with a sparsity-inducing penalty term, is capable of selecting a small, informative set of annotations. In addition, iMAP directly models summary statistics from GWASs and uses a multivariate Gaussian distribution to account for phenotypic correlation between traits. As a result, iMAP is capable of integrating both binary and continuous SNP annotations, selecting informative annotations from a large set of potentially non-informative ones and using GWAS summary statistics while simultaneously accounting for phenotypic correlation between traits.

Integrative and Reference-Informed tissue Segmentation (IRIS)

IRIS (Integrative and Reference-Informed tissue Segmentation) is a method for spatial domain detection in spatially resolved transcriptomics (SRT). IRIS models multiple tissue slices jointly and segments each tissue slice into multiple spatial domains. IRIS also accounts for the spatial correlation structure commonly observed across locations on each tissue slice and explicitly models the similarity in cell type composition underlying similar spatial domains across tissue slices. A unique feature of IRIS is its ability to incorporate scRNA-seq data as the reference for domain detection, which allows IRIS to seamlessly integrate the cell type specific transcriptomic profiles from the scRNA-seq reference into the SRT dataset to substantially improve the accuracy of spatial domain detection. As a result, IRIS is accurate, scalable, and robust for spatial domain detection across a range of SRT technologies with distinct spatial resolutions.

Mixed model Association for Count data via data AUgmentation (MACAU)

MACAU is the software implementing the Mixed model Association for Count data via data AUgmentation algorithm. It fits a binomial mixed model to perform differential methylation analysis for bisulfite sequencing studies. It fits a Poisson mixed model to perform differential expression analysis for RNA sequencing studies. It is computationally efficient for large scale sequencing studies and uses freely available open-source numerical libraries.

MArginal ePIstasis Test (MAPIT)

MAPIT is software implementing a strategy for mapping epistasis: instead of directly identifying individual pairwise or higher-order interactions, MAPIT focuses on mapping variants that have non-zero marginal epistatic effects, that is, the combined pairwise interaction effects between a given variant and all other variants. By testing marginal epistatic effects, MAPIT can identify candidate variants that are involved in epistasis without the need to identify the exact partners with which the variants interact, thus potentially alleviating much of the statistical and computational burden associated with standard epistatic mapping procedures. MAPIT is based on a variance component model, and relies on a recently developed variance component estimation method for efficient parameter inference and p-value computation.
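
To make the notion of a marginal epistatic effect concrete, the following Python sketch computes the combined pairwise interaction term for one focal variant and the covariance that term would have if the interaction coefficients were independent with equal variance. The function names, the toy genotypes, and that equal-variance assumption are illustrative choices, not MAPIT's code or API, and the sketch does not perform the variance component test itself.

```python
def marginal_epistatic_term(genotypes, k, alpha):
    """For each individual i, sum x[i][k] * x[i][l] * alpha[l] over all l != k:
    the combined pairwise interaction between variant k and every other variant."""
    return [
        sum(row[k] * row[l] * alpha[l] for l in range(len(row)) if l != k)
        for row in genotypes
    ]

def interaction_kernel(genotypes, k):
    """Covariance structure implied for the term above when the alpha[l] are
    independent with a common variance (an assumption of this sketch).
    Entry (i, j) is x[i][k] * x[j][k] * mean over l != k of x[i][l] * x[j][l]."""
    n = len(genotypes)
    p = len(genotypes[0])
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(
                genotypes[i][l] * genotypes[j][l] for l in range(p) if l != k
            )
            kernel[i][j] = genotypes[i][k] * genotypes[j][k] * shared / (p - 1)
    return kernel

# Hypothetical data: 3 individuals (rows) by 3 variants (columns).
genotypes = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [2.0, 1.0, 0.0],
]
alpha = [0.0, 0.5, -0.25]  # interaction coefficients; alpha[0] is unused for k = 0

g0 = marginal_epistatic_term(genotypes, k=0, alpha=alpha)
kernel0 = interaction_kernel(genotypes, k=0)
print(g0)
print(kernel0)
```

A variance component test would then ask whether the variance attached to such a kernel is non-zero for the focal variant, which avoids enumerating the individual interaction partners.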

Multi-ancestry Sum of the Single Effects Model (MESuSiE)

MESuSiE is a method for multi-ancestry fine-mapping analysis in genome-wide association studies. MESuSiE explicitly models both shared and ancestry-specific causal variants across ancestries, properly accounts for the diverse LD pattern observed in different ancestries, relies on GWAS marginal summary statistics as input, and extends the recent scalable variational inference algorithm SuSiE, which was developed for ancestry-specific fine-mapping, towards scalable multi-ancestry fine-mapping.

Multi-ancEstry TRanscriptOme-wide analysis (METRO)

METRO is a method that leverages expression data collected from multiple genetic ancestries to enhance the power of TWAS. METRO incorporates expression prediction models constructed from multiple ancestries through a joint likelihood-based inference framework, allowing us to account for the uncertainty in the prediction models constructed in each expression study. In addition, METRO is capable of inferring the contribution of expression prediction models in different genetic ancestries towards explaining and informing the gene-trait association, allowing us to interrogate the ancestry-dependent transcriptomic mechanisms underlying gene-trait association.

Mendelian Randomization with Automated Instrument Determination (MRAID)

MRAID is software for carrying out Mendelian randomization analysis. MRAID borrows ideas from fine-mapping approaches to model an initial set of candidate SNP instruments that are in potentially high LD with each other and automatically selects among them the suitable instruments for MR analysis. MRAID also explicitly models two types of horizontal pleiotropic effects that are either uncorrelated or correlated with the instrumental effects on the exposure to ensure effective control of horizontal pleiotropy. MRAID achieves both analytic tasks through a joint likelihood inference framework and relies on a scalable sampling-based algorithm to compute calibrated p-values for causal inference. As a result, MRAID provides calibrated type I error control for causal effect testing in the presence of horizontal pleiotropy, reduces false positives and, as a by-product, estimates the proportion of SNPs exhibiting uncorrelated or correlated horizontal pleiotropy.

Multi-trait assisted Polygenic Scores (mtPGS)

mtPGS is a method that constructs accurate PGS for a target trait of interest through leveraging multiple traits relevant to the target trait. Specifically, mtPGS borrows SNP effect size similarity information between the target trait and its relevant traits to improve the effect size estimation on the target trait, thus achieving accurate PGS. In the process, mtPGS flexibly models the shared genetic architecture between the target and the relevant traits to achieve robust performance, while explicitly accounting for the environmental covariance among them to accommodate different study designs with various sample overlap patterns. In addition, mtPGS uses only summary statistics as input and relies on a deterministic algorithm with several algebraic techniques for scalable computation.

Omnigenic Mendelian Randomization (OMR)

OMR is the software that explores the benefits of the omnigenic architecture for MR analysis. OMR builds upon the omnigenic modeling assumption on SNP effect sizes and uses all genome-wide SNPs as instrumental variables without any instrumental variable pre-selection. OMR imposes a general modeling assumption on the horizontal pleiotropic effects and relies on a scalable composite likelihood framework for causal effect inference.

Probabilistic Mendelian Randomization for TWAS (PMR-Egger and moPMR-Egger)

The PMR software package implements two methods:

PMR-Egger, which is a method that fits probabilistic Mendelian randomization with an Egger regression assumption on horizontal pleiotropy for transcriptome-wide association studies (TWASs). PMR-Egger relies on a new MR likelihood framework that unifies many existing TWAS and MR methods, accommodates multiple correlated instruments, tests the causal effect of gene on trait in the presence of horizontal pleiotropy, directly performs genome-wide test of horizontal pleiotropy, and, with a newly developed parameter expansion version of the expectation maximization algorithm, is scalable to hundreds of thousands of individuals.

moPMR-Egger, which extends PMR-Egger towards analyzing multiple outcome traits in TWAS applications. moPMR-Egger examines one gene at a time, relies on its cis-SNPs that are in potential linkage disequilibrium with each other to serve as instrumental variables, and tests its causal effects on multiple traits jointly. A key feature of moPMR-Egger is its ability to test and control for potential horizontal pleiotropic effects from instruments, thus maximizing power while minimizing false associations for TWASs. moPMR-Egger provides calibrated type I error control for both causal effects testing and horizontal pleiotropic effects testing and is more powerful than existing univariate TWAS approaches in detecting causal associations.

Penalized QuasiLikelihood for genomic Sequencing count data (PQLseq)

PQLseq is a method that fits generalized linear mixed models for analyzing RNA sequencing and bisulfite sequencing data. It estimates gene expression or methylation heritability for count data. It performs differential expression analysis in the presence of individual relatedness or population stratification.

PGS-based phenotype prediction interval (PredInterval)

PredInterval is designed to quantify phenotype prediction uncertainty through the construction of well-calibrated prediction intervals. PredInterval is non-parametric in nature and extracts information based on quantiles of phenotypic residuals through cross-validation, thus achieving well-calibrated coverage of true phenotypic values across a range of settings and traits with distinct genetic architecture. In addition, the PredInterval framework is general and can be paired with any PGS method.
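
The idea of building an interval from quantiles of held-out residuals can be sketched in a few lines of Python. This is a generic conformal-style construction under stated assumptions (symmetric intervals, absolute residuals, and the ceil((n + 1)(1 - miscoverage)) rank rule), with hypothetical function names and toy numbers; it is not the PredInterval procedure itself.

```python
import math

def residual_quantile(abs_residuals, miscoverage):
    """Return the ceil((n + 1) * (1 - miscoverage))-th smallest absolute residual.
    If that rank exceeds n, there are too few residuals for the requested
    coverage and the half-width is infinite."""
    ordered = sorted(abs_residuals)
    n = len(ordered)
    rank = math.ceil((n + 1) * (1 - miscoverage))
    if rank > n:
        return math.inf
    return ordered[rank - 1]

def prediction_interval(pgs_prediction, abs_residuals, miscoverage):
    """Symmetric interval around a PGS-based prediction."""
    half_width = residual_quantile(abs_residuals, miscoverage)
    return pgs_prediction - half_width, pgs_prediction + half_width

# Hypothetical held-out phenotypes and their cross-validated PGS predictions.
cv_observed = [1.5, 0.25, 3.0, 1.125, 0.25, 2.375, 0.5]
cv_predicted = [1.0, 0.5, 2.0, 1.0, 1.0, 2.0, 2.0]
abs_residuals = [abs(o - p) for o, p in zip(cv_observed, cv_predicted)]

# Interval for a new individual whose PGS-based prediction is 2.0,
# targeting 75% coverage (miscoverage = 0.25).
interval = prediction_interval(2.0, abs_residuals, miscoverage=0.25)
print(interval)
```

Because the construction only needs predictions and residuals, any PGS method can supply `cv_predicted`, which mirrors the point that the framework is not tied to a particular PGS model.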

Scalable Multiple Annotation integration for trait-Relevant Tissue identification (SMART)

SMART is software implementing Scalable Multiple Annotation integration for trait-Relevant Tissue identification. It extends the commonly used linear mixed model to relate variant effect sizes to variant annotations by introducing variant specific variance components that are functions of multiple annotations. It quantifies and evaluates the joint contribution of multiple annotations to genetic effect sizes by performing parameter inference using the widely used generalized estimating equations (GEE). The GEE-based algorithm in SMART allows for the use of summary statistics and naturally accounts for the correlation among summary statistics due to linkage disequilibrium. With GEE statistics, SMART applies mixture models to classify tissues into two categories (those that are relevant to the trait and those that are not), thus formulating the task of identifying trait-relevant tissues as a classification problem.

Spatially aware probabilistic Principal Component Analysis (SpatialPCA)

SpatialPCA is the software that performs spatially aware dimension reduction for spatial transcriptomics. SpatialPCA explicitly models the spatial correlation structure across tissue locations to preserve the neighboring similarity of the original data in the low dimensional manifold. The low dimensional components obtained from SpatialPCA thus contain valuable spatial correlation information and can be directly paired with existing computational tools developed in scRNA-seq for effective and novel downstream analysis in spatial transcriptomics. In particular, the low-dimensional components from SpatialPCA can be paired with scRNA-seq clustering methods to enable effective de novo tissue domain detection and can be paired with scRNA-seq trajectory inference methods to enable effective developmental trajectory inference on the tissue. Because of the data generative nature of SpatialPCA and its explicit modeling of spatial correlation, it can also be used to impute the low dimensional components on new and unmeasured spatial locations, facilitating the construction of a refined spatial map with a resolution much higher than that measured in the original study.

Spatial PAttern Recognition via Kernels (SPARK and SPARK-X)

SPARK and SPARK-X are methods for detecting genes with spatial expression patterns in spatially resolved transcriptomic studies. SPARK directly models count data generated from various spatially resolved transcriptomic techniques through generalized spatial linear models, while SPARK-X relies on a scalable non-parametric testing framework to model a wide variety of spatial transcriptomics data collected through different technologies. Both SPARK and SPARK-X rely on algebraic innovations for scalable computation as well as newly developed statistical formulas for hypothesis testing, producing well-calibrated p-values and yielding high statistical power. Both SPARK and SPARK-X are implemented in the SPARK software.

Spatially Resolved Transcriptomics simulator (SRTsim)

SRTsim is a software simulator for generating synthetic spatially resolved transcriptomic (SRT) data based on a wide variety of SRT techniques. SRTsim incorporates spatial localization information to simulate SRT expression count data in a reproducible and scalable fashion, thus facilitating SRT experimental design and methodology development. A key benefit of SRTsim is its ability to not only maintain various location-wise and gene-wise SRT count properties but also preserve the spatial expression patterns of the SRT data on the tissue, thus making it feasible to evaluate SRT method performance for various SRT-specific analytic tasks using the synthetic data.

Variational Inference based Probabilistic Canonical Correlation Analysis (VIPCCA)

VIPCCA is software based on non-linear probabilistic canonical correlation analysis for effective and scalable single cell data alignment. VIPCCA leverages cutting-edge techniques from deep neural networks for non-linear modeling of single cell data, thus allowing users to capture the complex biological structures from integration of multiple single-cell datasets across technologies, data types, conditions, and modalities. In addition, VIPCCA relies on variational inference for scalable computation, enabling efficient integration of large-scale single cell datasets with millions of cells. Importantly, VIPCCA can transform multiple modalities into a lower dimensional space without any post-hoc data processing, a unique and desirable feature that is in direct contrast to existing alignment methods.

Variability preserving ImPutation for Expression Recovery (VIPER)

VIPER is a method that performs Variability Preserving ImPutation for Expression Recovery in single cell RNA sequencing studies. VIPER is based on nonnegative sparse regression models and is capable of progressively inferring a sparse set of local neighborhood cells that are most predictive of the expression levels of the cell of interest for imputation. A key feature of VIPER is its ability to preserve gene expression variability across cells after imputation.

Variant-set test INtegrative TWAS for GEne-based analysis (VINTAGE)

VINTAGE is a unified statistical framework for integrative analysis of GWAS and eQTL mapping studies to identify and decipher gene-trait associations. VINTAGE explicitly quantifies and tests the proportion of genetic effects on a trait potentially mediated through gene expression using a local genetic correlation test, and further leverages such information to guide the integration of gene expression mapping study towards gene association mapping in GWAS through a genetic variance test. The explicit quantification of local genetic correlation in VINTAGE allows its gene association test to unify two seemingly unrelated methods, SKAT and TWAS, into the same analytic framework and include both as special cases, thus achieving robust performance across a range of scenarios.

Paternity Inference from Low-Coverage Sequencing Data (WHODAD)

WHODAD is a software package implementing the WHODAD method for paternity inference from low-coverage sequencing data.