All API

normalisr.association

normalisr.association.association_test_1(vx, vy, dx, dy, dc, dci, dcr, dimreduce=0, lowmem=False)

Fast linear association testing in single-cell non-cohort settings with covariates.

Single threaded version to allow for parallel computing wrapper. Mainly used for naive differential expression and co-expression. Computes exact P-value and effect size (gamma) with the model for linear association testing between each vector x and vector y:

y=gamma*x+alpha*C+epsilon,

epsilon~i.i.d. N(0,sigma**2).

Test statistic: conditional R**2 (or proportion of variance explained) between x and y.

Null hypothesis: gamma=0.

Parameters
  • vx (any) – Starting indices of dx. Only used for information passing.

  • vy (any) – Starting indices of dy. Only used for information passing.

  • dx (numpy.ndarray(shape=(n_x,n_cell))) – Predictor matrix for a list of vector x to be tested, e.g. gene expression or grouping.

  • dy (numpy.ndarray(shape=(n_y,n_cell))) – Target matrix for a list of vector y to be tested, e.g. gene expression.

  • dc (numpy.ndarray(shape=(n_cov,n_cell))) – Covariate matrix as C.

  • dci (numpy.ndarray(shape=(n_cov,n_cov))) – Low-rank inverse matrix of dc*dc.T.

  • dcr (int) – Rank of dci.

  • dimreduce (numpy.ndarray(shape=(ny,),dtype='uint') or int.) – If each vector y doesn’t have full rank in the first place, this parameter is the loss of degree of freedom to allow for accurate P-value computation.

  • lowmem (bool) – Whether to save memory by neither computing nor returning alpha.

Returns

  • vx (any) – vx from input for information passing.

  • vy (any) – vy from input for information passing.

  • pv (numpy.ndarray(shape=(n_x,n_y))) – P-values of association testing (gamma==0).

  • gamma (numpy.ndarray(shape=(n_x,n_y))) – Maximum likelihood estimator of gamma in model.

  • alpha (numpy.ndarray(shape=(n_x,n_y,n_cov)) or None) – Maximum likelihood estimator of alpha in model if not lowmem else None.

  • var_x (numpy.ndarray(shape=(n_x,))) – Variance of dx unexplained by covariates C.

  • var_y (numpy.ndarray(shape=(n_y,))) – Variance of dy unexplained by covariates C.

normalisr.association.association_test_2(vx, vy, dx, dy, dc, sselectx, dimreduce=0, lowmem=False)

Like association_test_1, but takes a different subset of samples for each x.

See association_test_1 for additional details.

Parameters
  • vx (any) – Starting indices of dx. Only used for information passing.

  • vy (any) – Starting indices of dy. Only used for information passing.

  • dx (numpy.ndarray(shape=(n_x,n_cell))) – Predictor matrix for a list of vector x to be tested, e.g. gene expression or grouping.

  • dy (numpy.ndarray(shape=(n_y,n_cell))) – Target matrix for a list of vector y to be tested, e.g. gene expression.

  • dc (numpy.ndarray(shape=(n_cov,n_cell))) – Covariate matrix as C.

  • sselectx (numpy.ndarray(shape=(n_x,n_cell),dtype=bool)) – Subset of samples to use for each x.

  • dimreduce (numpy.ndarray(shape=(ny,),dtype='uint') or int.) – If each vector y doesn’t have full rank in the first place, this parameter is the loss of degree of freedom to allow for accurate P-value computation.

  • lowmem (bool) – Whether to save memory by neither computing nor returning alpha.

Returns

  • vx (any) – vx from input for information passing.

  • vy (any) – vy from input for information passing.

  • pv (numpy.ndarray(shape=(n_x,n_y))) – P-values of association testing (gamma==0).

  • gamma (numpy.ndarray(shape=(n_x,n_y))) – Maximum likelihood estimator of gamma in model.

  • alpha (numpy.ndarray(shape=(n_x,n_y,n_cov)) or None) – Maximum likelihood estimator of alpha in model if not lowmem else None.

  • var_x (numpy.ndarray(shape=(n_x,))) – Variance of dx unexplained by covariates C.

  • var_y (numpy.ndarray(shape=(n_x,n_y))) – Variance of dy unexplained by covariates C.

normalisr.association.association_test_4(vx, vy, prod, prody, prodyy, na, dimreduce=0, lowmem=False, **ka)

Like association_test_1, but regards all other (untested) x’s as covariates when testing each x. Also allows for dx==dy setting, where neither tested x or y is regarded as a covariate.

See association_test_1 for additional details. Other x’s are treated as covariates but their coefficients (alpha) would not be returned to reduce memory footprint.

Parameters
  • vx (any) – Starting indices of dx.

  • vy (any) – Starting indices of dy. Only used for information passing.

  • prod (numpy.ndarray(shape=(n_x+n_cov,n_x+n_cov))) – A@A.T, where A=numpy.block([dx,dc]).

  • prody (numpy.ndarray(shape=(n_x+n_cov,n_y)) or None) – A@dy.T, where A=numpy.block([dx,dc]). If None, indicating dx==dy and skipping tested y as a covariate.

  • prodyy (numpy.ndarray(shape=(n_y,)) or None) – (dy**2).sum(axis=1). If None, indicating dx==dy and skipping tested y as a covariate.

  • na (tuple) – (n_x,n_y,n_cov,n_cell,lenx). Numbers of (x’s, y’s, covariates, cells, x’s to compute association for)

  • dimreduce (numpy.ndarray(shape=(ny,),dtype='uint') or int.) – If each vector y doesn’t have full rank in the first place, this parameter is the loss of degree of freedom to allow for accurate P-value computation.

  • lowmem (bool) – Whether to save memory by neither computing nor returning alpha.

  • ka (dict) – Keyword arguments passed to inv_rank.

Returns

  • vx (any) – vx from input for information passing.

  • vy (any) – vy from input for information passing.

  • pv (numpy.ndarray(shape=(n_x,n_y))) – P-values of association testing (gamma==0).

  • gamma (numpy.ndarray(shape=(n_x,n_y))) – Maximum likelihood estimator of gamma in model.

  • alpha (numpy.ndarray(shape=(n_x,n_y,n_cov)) or None) – Maximum likelihood estimator of alpha in model if not lowmem else None.

  • var_x (numpy.ndarray(shape=(lenx,)) or None) – Variance of dx unexplained by covariates C if prody is not None else None.

  • var_y (numpy.ndarray(shape=(lenx,n_y))) – Variance of dy unexplained by covariates C or untested x.

normalisr.association.association_test_5(vx, vy, prod, prody, prodyy, na, mask, dimreduce=0, lowmem=False, **ka)

Like association_test_4, but uses mask to determine which X can affect which Y. Under development.

Parameters
  • vx (any) – Starting indices of dx.

  • vy (any) – Starting indices of dy. Only used for information passing.

  • prod (numpy.ndarray(shape=(n_x+n_cov,n_x+n_cov))) – A@A.T, where A=numpy.block([dx,dc]).

  • prody (numpy.ndarray(shape=(n_x+n_cov,n_y)) or None) – A@dy.T, where A=numpy.block([dx,dc]). If None, indicating dx==dy and skipping tested y as a covariate.

  • prodyy (numpy.ndarray(shape=(n_y,)) or None) – (dy**2).sum(axis=1). If None, indicating dx==dy and skipping tested y as a covariate.

  • na (tuple) – (n_x,n_y,n_cov,n_cell,lenx). Numbers of (x’s, y’s, covariates, cells, x’s to compute association for)

  • mask (numpy.ndarray(shape=(n_x,n_y),dtype=bool)) – Whether each X can affect each Y.

  • dimreduce (numpy.ndarray(shape=(ny,),dtype='uint') or int.) – If each vector y doesn’t have full rank in the first place, this parameter is the loss of degree of freedom to allow for accurate P-value computation.

  • lowmem (bool) – Whether to save memory by neither computing nor returning alpha.

  • ka (dict) – Keyword arguments passed to inv_rank.

Returns

  • vx (any) – vx from input for information passing.

  • vy (any) – vy from input for information passing.

  • pv (numpy.ndarray(shape=(n_x,n_y))) – P-values of association testing (gamma==0).

  • gamma (numpy.ndarray(shape=(n_x,n_y))) – Maximum likelihood estimator of gamma in model.

  • alpha (numpy.ndarray(shape=(n_x,n_y,n_cov)) or None) – Maximum likelihood estimator of alpha in model if not lowmem else None.

  • var_x (numpy.ndarray(shape=(lenx,n_y))) – Variance of dx unexplained by covariates C or untested x.

  • var_y (numpy.ndarray(shape=(lenx,n_y))) – Variance of dy unexplained by covariates C or untested x.

normalisr.association.association_tests(dx, dy, dc, bsx=0, bsy=0, nth=1, lowmem=True, return_dot=True, single=0, bs4=500, **ka)

Performs association tests between all pairs of two (or one) variables. Performs parallel computation with multiple processes on the same machine.

Allows multiple options to treat other/untested dx when testing on one (see parameter single).

Performs parallel computation with multiple processes on the same machine.

Model for differential expression between X and Y:

Y=gamma*X+alpha*C+epsilon,

epsilon~i.i.d. N(0,sigma**2).

Test statistic: conditional R**2 (or proportion of variance explained) between Y and X.

Null hypothesis: gamma=0.

Parameters
  • dx (numpy.ndarray(shape=(n_x,n_cell))) – Normalized matrix X.

  • dy (numpy.ndarray(shape=(n_y,n_cell),dtype=float) or None) – Normalized matrix Y. If None, indicates dy=dx, i.e. self-association between pairs within X.

  • dc (numpy.ndarray(shape=(n_cov,n_cell),dtype=float)) – Normalized covariate matrix C.

  • bsx (int) – Batch size, i.e. number of Xs in each computing batch. Defaults to 500.

  • bsy (int) – Batch size, i.e. number of Xs in each computing batch. Defaults to 500. Ignored if dy is None.

  • nth (int) – Number of parallel processes. Set to 0 for using automatically detected CPU counts.

  • lowmem (bool) – Whether to replace alpha in return value with None to save memory

  • return_dot (bool) – Whether to return dot product betwen dx and dy instead of coefficient gamma

  • single (int) –

    Type of association test to perform that determines which cells and covariates are used for each association test between X and Y. Accepts the following values:

    • 0: Simple pairwise association test between each X and Y across all cells.

    • 1: Association test for each X uses only cells that have all zeros in dx for all other Xs. A typical application is low-MOI CRISPR screen.

    • 4: Association test for each X uses all cells but regarding all other Xs as covariates that confound mean expression levels. This is suitable for high-MOI CRISPR screen.

    • 5: Similar with 4 but uses mask to determine which X can affect which Y. Under development.

  • bs4 (int) – Batch size for matrix product when single=4. Defaults to 500.

  • ka (dict) – Keyword arguments passed to normalisr.association.association_test_X. See below.

Returns

  • P-values (numpy.ndarray(shape=(n_x,n_y))) – Differential expression P-value matrix.

  • dot/gamma (numpy.ndarray(shape=(n_x,n_y))) – If return_dot, inner product between X and Y pairs after removing covariates. Otherwise, matrix gamma.

  • alpha (numpy.ndarray(shape=(n_x,n_y,n_cov)) or None) – Maximum likelihood estimators of alpha, separatly tested for each grouping if not lowmem else None.

  • varx (numpy.ndarray(shape=(n_x)) or numpy.ndarray(shape=(n_x,n_y)) or None) – Variance of grouping after removing covariates if dy is not None and single!=5 else None

  • vary (numpy.ndarray(shape=(n_y)) if single==0 else numpy.ndarray(shape=(n_x,n_y))) – Variance of gene expression after removing covariates. Its shape depends on parameter single.

Keyword Arguments
  • dimreduce (numpy.ndarray(shape=(n_y,),dtype=int) or int) – If dy doesn’t have full rank, such as due to prior covariate removal (although the recommended method is to leave covariates in dc), this parameter allows to specify the loss of ranks/degrees of freedom to allow for accurate P-value computation. Default is 0, indicating no rank loss.

  • mask (numpy.ndarray(shape=(n_x,n_y),dtype=bool)) – Whether each X can affect each Y. Only active for single==5.

normalisr.association.inv_rank(m, tol=1e-08, method='auto', logger=None, mpc=0, qr=0, **ka)

Computes matrix (pseudo-)inverse and rank with SVD.

Eigenvalues smaller than tol*largest eigenvalue are set to 0. Rank of inverted matrix is also returned. Provides to limit the number of eigenvalues to speed up computation. Broadcasts to the last 2 dimensions of the matrix.

Parameters
  • m (numpy.ndarray(shape=(..,n,n),dtype=float)) – 2-D or higher matrix to be inverted

  • tol (float) – Eigenvalues < tol*maximum eigenvalue are treated as zero.

  • method (str) –

    Method to compute eigenvalues:

    • auto: Uses scipy for n<mpc or mpc==0 and sklearn otherwise

    • scipy: Uses scipy.linalg.svd

    • scipys: NOT IMPLEMENTED. Uses scipy.sparse.linalg.svds

    • sklearn: Uses sklearn.decomposition.TruncatedSVD

  • logger (object) – Logger to output warning. Defaults (None) to logging module

  • mpc (int) – Maximum rank or number of eigenvalues/eigenvectors to consider. Defaults to 0 to disable limit. For very large input matrix, use a small value (e.g. 500) to save time at the cost of accuracy.

  • qr (int) – Whether to use QR decomposition for matrix inverse. Only effective when method=sklearn, or =auto that defaults to sklearn. * 0: No * 1: Yes with default settings * 2+: Yes with n_iter=qr for sklearn.utils.extmath.randomized_svd

  • ka (Keyword args passed to method) –

Returns

  • mi (numpy.ndarray(shape=(…,n,n),dtype=float)) – Pseudo-inverse matrices

  • r (numpy.ndarray(shape=(…),dtype=int) or int) – Matrix ranks

normalisr.association.prod1(vx, vy, dx, dy)

Pickleable function for matrix product that keeps information

Parameters
  • vx (any) – Information passed

  • vy (any) – Information passed

  • dx (numpy.ndarray(shape=(..,n))) – Matrix for multiplication

  • dy (numpy.ndarray(shape=(..,n))) – Matrix for multiplication

Returns

  • vx (any) – vx

  • vy (any) – vy

  • product (numpy.ndarray(shape=(…))) – dx@dy.T

normalisr.binnet

normalisr.binnet.bh(pv, weight=None)

Converts P-values to Q-values using Benjamini–Hochberg procedure.

Parameters
  • pv (numpy.ndarray(shape=(n,))) – P-values.

  • weight (numpy.ndarray(shape=(n,)) or None) – Weight of each P-value. Defaults (None) to equal.

Returns

Q-values.

Return type

numpy.ndarray(shape=(n,))

References

Controlling the false discovery rate: a practical and powerful approach to multiple testing, Benjamini and Hochberg. 1995

normalisr.binnet.binnet(net, qcut)

Binarizes P-value co-expresion network to thresholded Q-value network.

Q-values are computed separately per row to account for differences in the number of genes co-expressed, especially by master regulators, using Benjamini–Hochberg procedure. Co-expression Q-value matrix is thresholded for return.

Parameters
  • net (numpy.ndarray(shape=(n_gene,n_gene),dtype=float)) – Symmetric co-expression P-value matrix.

  • qcut (float) – Cutoff for Q-value network.

Returns

Binarized, assymmetric co-expression matrix.

Return type

numpy.ndarray(shape=(n_gene,n_gene),dtype=bool)

normalisr.binnet.nodiag(d, split=False)

Removes diagonals from 2D matrix.

Parameters
  • d (numpy.ndarray(shape=[n1,n2])) – Input matrix

  • split (bool) – Whether to output list of 1D arrays for each row instead of a single concatenated 1D array.

Returns

  • If split (List of numpy.ndarray(shape=[n2 or n2-1]) for each row.)

  • Else (numpy.ndarray(shape=(n1*n2-min(n1,n2),))) – Reduced 1D array

normalisr.binnet.rediag(d, fill=0, shape=None)

Converts a ‘nodiag’ed vector back to matrix of original shape.

Parameters
  • d (numpy.ndarray(shape=(n,))) – Output array from nodiag(split=False)

  • fill (any) – Values to fill in the diagonal entries

  • shape (tuple or None) – Shape of original matrix. If omitted (None), assume original is a square matrix.

Returns

Recovered original matrix.

Return type

numpy.ndarray(shape=shape)

normalisr.coex

normalisr.coex.coex(dt, dc, **ka)

Performs co-expression analyses for all gene pairs.

Performs parallel computation with multiple processes on the same machine.

Model for co-expression between genes i & j:

X_i=gamma*X_j+alpha*C+epsilon,

epsilon~i.i.d. N(0,sigma**2).

Test statistic: conditional R**2 (or proportion of variance explained) between X_i and X_j.

Null hypothesis: gamma=0.

Parameters
  • dt (numpy.ndarray(shape=(n_gene,n_cell),dtype=float)) – Normalized expression matrix X.

  • dc (numpy.ndarray(shape=(n_cov,n_cell),dtype=float)) – Normalized covariate matrix C.

  • ka (dict) – Keyword arguments passed to normalisr.association.association_tests. See below.

Returns

  • P-values (numpy.ndarray(shape=(n_gene,n_gene))) – Co-expression P-value matrix.

  • dot (numpy.ndarray(shape=(n_gene,n_gene))) – Inner product of expression between gene pairs, after removing covariates.

  • var (numpy.ndarray(shape=(n_gene))) – Variance of gene expression after removing covariates. Pearson R=(((dot/numpy.sqrt(var)).T)/numpy.sqrt(var)).T.

Keyword Arguments
  • bs (int) – Batch size, i.e. number of genes in each computing batch. Use 0 for default: Data transfer limited to 1GB, capped at bs=500.

  • nth (int) – Number of parallel processes. Set to 0 for using automatically detected CPU counts.

  • dimreduce (numpy.ndarray(shape=(n_gene,),dtype=int) or int) – If dt doesn’t have full rank, such as due to prior covariate removal (although the recommended method is to leave covariates in dc), this parameter allows to specify the loss of ranks/degrees of freedom to allow for accurate P-value computation. Default is 0, indicating no rank loss.

normalisr.de

normalisr.de.de(dg, dt, dc, bs=0, **ka)

Performs differential expression analyses for all genes against all groupings.

Allows multiple options to treat other groupings when testing on one grouping.

Performs parallel computation with multiple processes on the same machine.

Model for differential expression between gene Y and grouping X:

Y=gamma*X+alpha*C+epsilon,

epsilon~i.i.d. N(0,sigma**2).

Test statistic: conditional R**2 (or proportion of variance explained) between Y and X.

Null hypothesis: gamma=0.

Parameters
  • dg (numpy.ndarray(shape=(n_group,n_cell))) – Grouping matrix for a list of X to be tested, e.g. grouping by gene knock-out.

  • dt (numpy.ndarray(shape=(n_gene,n_cell),dtype=float)) – Normalized expression matrix Y.

  • dc (numpy.ndarray(shape=(n_cov,n_cell),dtype=float)) – Normalized covariate matrix C.

  • bs (int) – Batch size, i.e. number of groupings and genes in each computing batch. For single=0,1, splits groupings & genes. Defaults to 500. For single=4, defaults to splitting grouping by 10 and not on genes.

  • ka (dict) – Keyword arguments passed to normalisr.association.association_test_*. See below.

Returns

  • P-values (numpy.ndarray(shape=(n_group,n_gene))) – Differential expression P-value matrix.

  • gamma (numpy.ndarray(shape=(n_group,n_gene))) – Differential expression log fold change matrix.

  • alpha (numpy.ndarray(shape=(n_group,n_gene,n_cov)) or None) – Maximum likelihood estimators of alpha, separatly tested for each grouping if not lowmem else None.

  • varg (numpy.ndarray(shape=(n_group))) – Variance of grouping after removing covariates.

  • vart (numpy.ndarray(shape=(n_group,n_gene))) – Variance of gene expression after removing covariates. It can depend on the grouping being tested depending on parameter single.

Keyword Arguments
  • single (int) –

    Option to deal with other groupings when testing one groupings v.s. gene expression.

    • 0: Ignores other groupings (default).

    • 1: Excludes all cells belonging to any other grouping (value==1), assuming dg=0,1 only.

      This is suitable for low-MOI CRISPR screens.

    • 4: Treats other groupings as covariates for mean expression.

      This is suitable for high-MOI CRISPR screens.

  • lowmem (bool) – Whether to replace alpha in return value with None to save memory.

  • nth (int) – Number of parallel processes. Set to 0 for using automatically detected CPU counts.

  • dimreduce (numpy.ndarray(shape=(n_gene,),dtype=int) or int) – If dt doesn’t have full rank, such as due to prior covariate removal (although the recommended method is to leave covariates in dc), this parameter allows to specify the loss of ranks/degrees of freedom to allow for accurate P-value computation. Default is 0, indicating no rank loss.

  • method (str, only for single=4) –

    Method to compute eigenvalues in SVD-based matrix inverse (for removal of covariates):

    • auto: Uses scipy for n_matrix<mpc or mpc==0 and sklearn otherwise. Default.

    • scipy: Uses scipy.linalg.svd.

    • scipys: NOT IMPLEMENTED. Uses scipy.sparse.linalg.svds.

    • sklearn: Uses sklearn.decomposition.TruncatedSVD.

  • mpc (int, only for single=4) – Uses only the top mpc singular values as non-zero in SVD-based matrix inverse. Here effectively reduces covariates to their top principal components. This reduction is performed after including other groupings as additional covariates. Defaults to 0 to disable dimension reduction. For very large grouping matrix, use a small value (e.g. 100) to save time at the cost of accuracy.

  • qr (int, only for single=4) –

    Whether to use QR decomposition method for SVD in matrix inverse. Only effective when method=sklearn, or =auto and defaults to sklearn. Takes the following values:

    • 0: No (default).

    • 1: Yes with default settings.

    • 2+: Yes with n_iter=qr for sklearn.utils.extmath.randomized_svd.

  • tol (float, only for single=4) – Eigenvalues < tol*(maximum eigenvalue) are treated as zero in SVD-based matrix inverse. Default is 1E-8.

normalisr.gocovt

normalisr.gocovt.goe(genelist, go_file, goa_file, bg=None, nmin=5, conversion=None, evidence_set={'EXP', 'HDA', 'HGI', 'HMP', 'HTP', 'IBA', 'IBD', 'IDA', 'IGI', 'IKR', 'IMP', 'IPI', 'IRD', 'ISA', 'ISM', 'ISO', 'ISS'})

Finds GO enrichment with goatools (0.7.11 tested).

WARNING: This method is inexact for multi-maps in gene name conversion. However, it has a negligible effect in top GO component removal in single-cell co-expression.

Parameters
  • genelist (list of str) – Genes to search for enrichment.

  • go_file (str) – File path for GO DAG (downloadable at http://geneontology.org/docs/download-ontology/)).

  • goa_file (str) – File path for GO associations. See parameter conversion.

  • bg (list of str) – Background genes.

  • nmin (int) – Minimum number of principal genes required in GO.

  • conversion (tuple) –

    Conversion of gene ID system from gene list to the GO annotation.

  • evidence_set (set of str) – GO evidences to include. Defaults to non-expression based results to avoid circular reasoning bias.

Returns

  • goe (pandas.DataFrame) – GO enrichment.

  • gotop (str) – Top enriched GO ID

  • genes (list of str or None) – Intersection list of genes in gotop and also bg. None if bg is None.

normalisr.gocovt.gotop(net, namet, go_file, goa_file, n=100, **ka)

Finds the top variable GO enrichment of top principal genes in the binary co-expression network.

Principal genes are those with most co-expressed genes. They reflect the most variable pathways in the dataset. When the variable pathways are housekeeping related, they may conceal cell-type-specific co-expression patterns from being observed and understood. This function identifies the most variable pathway with gene ontology enrichment study of the top principal genes. Background genes are all genes provided.

Parameters
  • net (numpy.ndarray(shape=(n_gene,n_gene),dtype=bool)) – Binary co-expression network matrix.

  • namet (list of str) – Gene names matching the rows and columns of net.

  • go_file (str) – File path for GO DAG (downloadable at http://geneontology.org/docs/download-ontology/)).

  • goa_file (str) – File path of GO annotation. See parameter conversion in normalisr.gocovt.goe.

  • n (int) – Number of top principal genes to include for GO enrichment. Default is 100, giving good performance in general.

  • ka (dict) – IMPORTANT: Keyword arguments passed to normalisr.gocovt.goe to determine how to perform GO enrichment study. If you see no gene mapped, check your gene name conversion rule in conversion parameter of normalisr.gocovt.goe. GO annotation have a specific gene ID system.

Returns

  • principals (list of str) – List of principal genes.

  • goe (pandas.DataFrame) – GO enrichment results.

  • gotop (str) – Top enriched GO ID.

  • genes (list of str) – List of genes in the gotop GO ID.

normalisr.gocovt.pc1(d)

Computes top principal component.

Parameters

d (numpy.ndarray(shape=(n1,n2))) –

Returns

Return type

numpy.ndarray(shape=(n2,))

normalisr.gocovt.pccovt(dt, dc, namet, genes, condcov=True)

Introduces an extra covariate from the top principal component of given genes.

The extra covariate is the top principal component of normalized expressions of the selected genes. Adding a covariate from housekeeping pathway can reveal cell-type-specific activities in co-expression networks.

Parameters
  • dt (numpy.ndarray(shape=(n_gene,n_cell))) – Normalized expression matrix.

  • dc (numpy.ndarray(shape=(n_cov,n_cell))) – Existing normalized covariate matrix.

  • namet (list of str) – List of gene names for rows in dt.

  • genes (list of str) – List of gene names to include in finding top PC of their expression as an extra covariate.

  • condcov (bool) – Whether to condition on existing covariates before computing top PC. Default: True.

Returns

New normalized covariate matrix.

Return type

numpy.ndarray(shape=(n_cov+1,n_cell))

normalisr.gocovt.toratio(s)

Converts a pandas series of string to numeric ratios.

Parameters

s (pandas.Series of string) –

Returns

Return type

panda.Series of float

normalisr.lcpm

normalisr.lcpm.lcpm(reads, normalize=True, nth=0, ntot=None, varscale=0, seed=None, lowmem=True, nocov=False)

Computes Bayesian log CPM from raw read counts.

The technical sampling process is modelled as a Binomial distribution. The logCPM given read counts is a Bayesian inference problem and follows (shifted) Beta distribution. We use the expectation of posterior logCPM as the estimated expression levels. Resampling function is also provided to account for variances in the posterior distribution.

Warning: Modifying keyword arguments other than nth or seed is neither recommended nor supported for function ‘lcpm’. Do so at your own risk.

Parameters
  • reads (numpy.ndarray(shape=(n_gene,n_cell),dtype='uint')) – Read count matrix.

  • normalize (bool) – Whether to normalize output to logCPM per cell. Default: True.

  • nth (int) – Number of threads to use. Defaults to 0 to use all cores automatically detected.

  • ntot (int) – Manually sets value of total number of reads in binomial distribution. Since the posterior distribution stablizes quickly as ntot increases, a large number, e.g. 1E9 is good for general use. Defaults to None to disable manual value.

  • varscale (float) – Resamples estimated expression using the posterior Beta distribution. varscale sets the scale of variance than its actual value from the posterior distribution. Defaults to 0, to compute expectation with no variance.

  • seed (int) – Initial random seed if set.

  • lowmem (bool) – Low memory mode disable mean and var in Returns and therefore saves memory.

  • nocov (bool) – Whether to skip producing covariate variables. If True, output cov=None

Returns

  • lcpm (numpy.ndarray(shape=(n_gene,n_cell))) – Estimated expression as logCPM from read counts.

  • mean (numpy.ndarray(shape=(n_gene,n_cell)) or None) – Mean/Expectation of lcpm’s every entry’s posterior distribution. None if lowmem=True.

  • var (numpy.ndarray(shape=(n_gene,n_cell)) or None) – Variance of lcpm’s every entry’s posterior distribution. None if lowmem=True.

  • cov (numpy.ndarray(shape=(3,n_cell))) – Cellular summary covariates computed from read count matrix that may confound lcpm. Contains:

    • cov[0]: Log total read count per cell

    • cov[1]: Number of 0-read genes per cell

    • cov[2]: cov[0]**2

normalisr.lcpm.scaling_factor(dt, varname='nt0mean', v0=0, v1='max')

Computes scaling factor of variance normalization for every gene.

Lowly expressed genes need full variance normalization because of technical confounding from sequencing depth. Highly expressed genes do not need variance normalization because they are already accurately measured. The scaling factor operates as a exponential factor on the variance normalization scale for each gene. It should be maximum/minimum for genes with lowest/highest expression.

Warning: Modifying keyword arguments is neither recommended nor supported for function ‘scaling_factor’. Do so at your own risk.

Parameters
  • dt (numpy.ndarray(shape=(n_gene,n_cell),dtype='uint')) – Read count matrix.

  • varname (str) –

    Variable used to compute scaling factor for each gene. Can be:

    • logtpropmean: log(dt.mean(axis=1)/dt.mean(axis=1).sum())

    • logtmeanprop: log((dt/dt.sum(axis=0)).mean(axis=1))

    • nt0mean: (dt==0).mean(axis=1)

    • lognt0mean: log((dt==0).mean(axis=1))

    • log1-nt0mean: log(1-(dt==0).mean(axis=1))

    Defaults to nt0mean.

  • v0 (float) –

    Variable values to set scaling factor to 0 (for v0) and 1 (for v1). Linear assignment is applied for values inbetween. Can be:

    • max: max

    • min: min

    • any float: that float

  • v1 (float) –

    Variable values to set scaling factor to 0 (for v0) and 1 (for v1). Linear assignment is applied for values inbetween. Can be:

    • max: max

    • min: min

    • any float: that float

Returns

Scaling factor of variance normalization for each gene

Return type

numpy.ndarray(shape=(n_gene,))

normalisr.lcpm.trigamma(x)

Tri-gamma function .

Parameters

x (float or numpy.ndarray) – Input value(s)

Returns

Return type

float or numpy.ndarray

normalisr.norm

normalisr.norm.compute_var(dt, dc, stepmax=1, eps=1e-06)

Computes variance normalization scale for each cell.

Performs a log-linear fit of the variance of each cell with covariates. Optionally use Expectation-Maximization(EM)-like method to iteratively fit mean and variance. For EM-like method, early-stopping is suggested because of overfitting issues.

Parameters
  • dt (numpy.ndarray(shape=(n_gene,n_cell))) – Bayesian LogCPM expression level matrix.

  • dc (numpy.ndarray(shape=(n_cov,n_cell))) – Covariate matrix.

  • stepmax (int) – Maximum number of EM-like iterations of mean and variance normalization. Stop of iteration is also possible when relative accuracy target is reached. Defaults to 1, indicating no iterative normalization.

  • eps (float) – Relative accuracy target for early stopping. Constrains the maximum relative difference of fitted variance across cells compared to the last step. Defaults to 1E-6.

Returns

Inverse sqrt of fitted variance for each cell, i.e. the multiplier for variance normalization. For iterative normalization, the optimal step will be returned, defined as having the minimal max relative change across cells.

Return type

numpy.ndarray(shape=(n_cell,))

normalisr.norm.normcov(dc, c=True)

Normalizes each continuous covariate to 0 mean and unit variance.

Optionally introduces constant 1 covariate as intercept. Categorical covariates should be in binary/one-hot form, and will be left unchanged.

Parameters
  • dc (numpy.ndarray(shape=(n_cov,n_cell))) – Current covariate matrix. Use empty matrix with n_cov=0 if no covariate.

  • c (bool) – Whether to introduce a constant 1 covariate.

Returns

Processed covariate matrix.

Return type

numpy.ndarray(shape=(n_cov+1 if c else n_cov,n_cell))

normalisr.norm.normvar(dt, dc, w, wt, dextra=None, cat=1, nth=1, bs=500, keepvar=True, normmean=False)

Performs mean and variance normalizations.

Expression levels are normalized at mean and then at variance levels. Effectively each gene x is multiplied by w**wt[x] before removing covariates as dc*(w**wt[x]). Continuous covariates are normalized at variance levels. Effectively covariates are transformed to dc*w. Therefore, variance normalization for expression are scaled differently for each gene.

Parameters
  • dt (numpy.ndarray(shape=(n_gene,n_cell))) – Bayesian logCPM matrix.

  • dc (numpy.ndarray(shape=(n_cov,n_cell))) – Covariate matrix.

  • w (numpy.ndarray(shape=(n_cell,))) – Computed variance normalization multiplier.

  • wt (numpy.ndarray(shape=(n_gene,))) – Computed scaling factor for each gene.

  • dextra (numpy.ndarray(shape=(n_extra,n_cell))) – Extra data matrix also to be normalized like continuous covariates.

  • cat (int) –

    Whether to normalize categorical/binary covariates (those with only 0 or 1s). Defaults to 1.

    • 0: No

    • 1: No except constant-1 covariate (intercept)

    • 2: Yes

  • nth (int) – Number of parallel threads.

  • bs (int) – Batch size for each job.

  • keepvar (bool) – Whether to maintain the variance of each gene invariant in mean normalization step. If so, expression variances are scaled back to original after mean normalization and before variance normalization. This function only affects overall variance level and its downstreams (e.g. differential expression log fold change). This function would not affect P-value computation. Default: True.

  • normmean (bool) – Whether to remove covariates from expression at mean level. This is accounted for in hypothesis testing with linear models so this option makes no difference here. However, this can be helpful for other purposes of analyses.

Returns

  • (dtn,dcn) or (dtn,dcn,dextran) if dextra is not None

  • dtn (numpy.ndarray(shape=(n_gene,n_cell))) – Normalized gene expression matrix.

  • dcn (numpy.ndarray(shape=(n_cov,n_cell))) – Normalized covariate matrix.

  • dextran (numpy.ndarray(shape=(n_extra,n_cell))) – Normalized extra data matrix.

normalisr.norm.normvar1(dt, dc, w2=None)

A single-threaded function to normalize a subset of transcriptome.

Should not be invoked directly.

Parameters
  • dt (numpy.ndarray(shape=(n_gene,n_cell))) – Bayesian logCPM matrix subset.

  • dc (numpy.ndarray(shape=(n_cov,n_cell))) – Covariate matrix.

  • w2 (numpy.ndarray(shape=(n_gene,))) – Intemediate weight computed in normvar for subset.

Returns

Normalized gene expression matrix subset.

Return type

numpy.ndarray(shape=(n_gene,n_cell))

normalisr.parallel

normalisr.parallel.autopooler(n, it, *a, chunksize=1, dummy=False, return_iter=False, unordered=False, **ka)

Uses multiprocessing.Pool or multiprocessing.dummy.Pool to run iterator in parallel.

Parameters
  • n (int) – Number of parallel processes. Set to 0 to use auto detected CPU count.

  • it (iterator of (function,tuple,dict)) – Each iteration computes function(*tuple,**dict). function must be picklable, i.e. a base level function in a module or file.

  • a (tuple) – Arguments passed to Pool.

  • chunksize (int) – Number of iterations passed to each process each time.

  • dummy (bool) – Whether to use multiprocessing.dummy instead

  • return_iter (bool) – Not Implemented. Whether to return iterator of results instead. If not, return list of results.

  • unordered (bool) – Whether the order of output matters.

  • ka (dict) – Keyword arguments passed to Pool

Returns

Results returned by function(*tuple,**dict), in same order of the iterator if not unordered.

Return type

list (or iterator if return_iter) of any

normalisr.qc

normalisr.qc.qc_outlier(dw, pcut=1e-10, outrate=0.02)

Quality control by removing cell outliers by variance.

Fit normal distribution on the inverse sqrt variance to detect outliers. This is performed by iterative estimation of normal distribution with non-outliers and then determination of outliers with the normal distribution.

Parameters
  • dw (numpy.ndarray(shape=(n_cell,))) – Fitted inverse sqrt cell variance.

  • pcut (float) – Bonferroni P-value cutoff for asserting outliers in a normal distribution of fitted cell variance. Default: 1E-10.

  • outrate (float) – Upper bound of proportion of outliers on either side of variance distribution. Used for initial outlier assignment and final validity check. Default: 0.02.

Returns

Whether each cell passed QC

Return type

numpy.ndarray(shape=(n_cell,),dtype=bool)

normalisr.qc.qc_reads(reads, n_gene, nc_gene, ncp_gene, n_cell, nt_cell, ntp_cell)

Quality control by bounding read counts.

Quality control is perform separately on genes based on their cell statisics and on cells based on their gene statistics, iteratively until dataset remains unchanged. A gene or cell is removed if any of the QC criteria is violated at any time in the iteration. All QC parameters can be set to 0 to disable QC filtering for that criterion.

Parameters
  • reads (numpy.ndarray((n_gene,n_cell),dtype='uint')) – Read count matrix.

  • n_gene (int) – Lower bound on total read counts for gene QC.

  • nc_gene (int) – Lower bound on number of expressed cells for gene QC.

  • ncp_gene (float) – Lower bound on proportion of expressed cells for gene QC.

  • n_cell (int) – Lower bound on total read counts for cell QC.

  • nt_cell (int) – Lower bound on number of expressed genes for cell QC.

  • ntp_cell (float) – Lower bound on proportion of expressed genes for cell QC.

Returns

  • genes_select (numpy.ndarray(dtype=’uint’)) – Array of indices of genes passed QC.

  • cells_select (numpy.ndarray(dtype=’uint’)) – Array of indices of cells passed QC.