Operations

We do not recommend interacting with these functions directly. The core Pycytominer API uses these operations internally.

pycytominer.operations.correlation_threshold module

Returns list of features such that no two features have a correlation greater than a specified threshold

pycytominer.operations.correlation_threshold.correlation_threshold(population_df: DataFrame, features: str | list[str] = 'infer', samples: str = 'all', threshold: float = 0.9, method: str = 'pearson') → list[str]

Exclude features that have correlations above a certain threshold

Parameters:

population_df (pd.DataFrame) – DataFrame that includes metadata and observation features.
features (list, default "infer") – A list of strings corresponding to feature measurement column names in the population_df DataFrame. All features listed must be found in population_df. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.
samples (str, default "all") – Query string selecting the samples to operate on. The string is passed to pd.DataFrame.query(), so write it in that syntax. An example is “Metadata_treatment == ‘control’” (include all quotes). If “all”, use all samples to calculate.
threshold (float, default 0.9) – Must be between (0, 1) to exclude features
method (str, default "pearson") – indicating which correlation metric to use to test cutoff

Returns:

excluded_features – List of features to exclude from the population_df.

Return type:

list of str
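The idea behind this operation can be illustrated with a minimal pandas sketch. It is not the pycytominer implementation: the helper name sketch_correlation_exclusions, the toy column names, and the tie-breaking rule (drop the member of a highly correlated pair that has the larger total absolute correlation) are assumptions made for illustration.

```python
import pandas as pd


def sketch_correlation_exclusions(df, features, threshold=0.9, method="pearson"):
    """Hypothetical helper: list features to drop so that no retained
    pair has an absolute correlation above `threshold`."""
    corr = df[features].corr(method=method).abs()
    # Total absolute correlation of each feature with all features.
    total = corr.sum()
    excluded = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if corr.loc[a, b] > threshold:
                # Assumption: drop the feature that is more correlated overall.
                excluded.add(a if total[a] >= total[b] else b)
    return sorted(excluded)


# Toy data: Cells_b is a monotone function of Cells_a, Nuclei_c is unrelated.
df = pd.DataFrame({
    "Cells_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Cells_b": [1.0, 4.0, 9.0, 16.0, 25.0, 36.0],
    "Nuclei_c": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
})
excluded = sketch_correlation_exclusions(df, list(df.columns))
```

Because every pair above the threshold loses at least one member, the features that remain after dropping excluded cannot contain a pair above the threshold.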

pycytominer.operations.correlation_threshold.determine_high_cor_pair(correlation_row: Series, sorted_correlation_pairs: Index) → str

Select highest correlated variable given a correlation row with columns: [“pair_a”, “pair_b”, “correlation”]. For use in a pandas.apply().

Parameters:

correlation_row (pd.Series) – Pandas series of the specific feature in the pairwise_df
sorted_correlation_pairs (pd.Index) – A sorted object by total correlative sum to all other features

Returns:

The feature that has a lower total correlation sum with all other features

Return type:

str

pycytominer.operations.get_na_columns module

Remove variables with a specified threshold of NA values. Note: This was called drop_na_columns in cytominer for R.

pycytominer.operations.get_na_columns.get_na_columns(population_df: DataFrame, features: str | list[str] = 'infer', samples: str = 'all', cutoff: float = 0.05) → list[str]

Get features that have more NA values than cutoff defined

Parameters:

population_df (pd.DataFrame) – DataFrame that includes metadata and observation features.
features (list, default "infer") – A list of strings corresponding to feature measurement column names in the population_df DataFrame. All features listed must be found in population_df. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.
samples (str, default "all") – Query string selecting the samples to operate on. The string is passed to pd.DataFrame.query(), so write it in that syntax. An example is “Metadata_treatment == ‘control’” (include all quotes). If “all”, use all samples to calculate.
cutoff (float, default 0.05) – Exclude features whose proportion of missing (NA) values is greater than this cutoff.

Returns:

excluded_features – List of features to exclude from the population_df.

Return type:

list of str
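As a rough illustration of the cutoff logic, the sketch below computes the fraction of NA values per feature with plain pandas and keeps the names that exceed the cutoff. The toy DataFrame and its column names are made up, and the strict "greater than" comparison is an assumption based on the description above rather than a statement about the library's code. The query line shows the string format expected by the samples parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy profiles: one metadata column and two feature columns.
df = pd.DataFrame({
    "Metadata_treatment": ["control", "control", "drug", "drug"],
    "Cells_area": [1.0, np.nan, 3.0, 4.0],   # 1 of 4 values missing
    "Nuclei_area": [1.0, 2.0, 3.0, 4.0],     # no missing values
})
features = ["Cells_area", "Nuclei_area"]
cutoff = 0.05

# Sample selection uses pd.DataFrame.query() syntax.
controls = df.query("Metadata_treatment == 'control'")

# Fraction of NA values per feature, then the features above the cutoff.
na_fraction = df[features].isna().mean()
excluded = na_fraction[na_fraction > cutoff].index.tolist()
```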

pycytominer.operations.transform module

Transform observation variables by specified groups.

References

class pycytominer.operations.transform.RobustMAD(epsilon: float = 1e-18)

Bases: BaseEstimator, TransformerMixin

Class to perform a “Robust” normalization with respect to median and mad

scaled = (x - median) / mad

epsilon

fudge factor parameter

Type: float

fit(X, y=None)

Compute the median and mad to be used for later scaling.

Parameters:

X (pd.DataFrame) – dataframe to fit RobustMAD transform
y (None) – Has no effect; only used for consistency in sklearn transform API

Returns:

With computed median and mad attributes

Return type:

self

set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') → RobustMAD

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.
Returns: self – The updated object.
Return type: object

transform(X: DataFrame, copy: bool | None = None) → DataFrame

Apply the RobustMAD calculation

Parameters: X (pd.DataFrame) – dataframe to fit RobustMAD transform
Returns: RobustMAD transformed dataframe
Return type: pd.DataFrame

class pycytominer.operations.transform.Spherize(epsilon: float = 1e-06, center: bool = True, method: str = 'ZCA', return_numpy: bool = False)

Bases: BaseEstimator, TransformerMixin

Class to apply a sphering (aka whitening) transform to data using the base sklearn transform API. This implementation is modified from, or inspired by, the following sources:

1) A custom function written by Juan C. Caicedo
2) A custom ZCA function at https://github.com/mwv/zca
3) Notes from Niranj Chandrasekaran (https://github.com/cytomining/pycytominer/issues/90)
4) The R package “whitening” written by Strimmer et al (http://strimmerlab.org/software/whitening/)
5) Kessy et al. 2016 “Optimal Whitening and Decorrelation” [1]

epsilon

fudge factor parameter

Type: float

center

option to center the input X matrix

Type: bool

method

a string indicating which class of sphering to perform

Type: str

fit(X, y=None)

Identify the sphering transform given self.X

Parameters:

X (pd.DataFrame) – dataframe to fit sphering transform
y (None) – Has no effect; only used for consistency in sklearn transform API

Returns:

With computed weights attribute

Return type:

self

transform(X, y=None)

Perform the sphering transform

Parameters:

X (pd.DataFrame) – Profile dataframe to be transformed using the precompiled weights
y (None) – Has no effect; only used for consistency in sklearn transform API

Returns:

Spherized dataframe

Return type:

pd.DataFrame
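For readers unfamiliar with sphering, the following NumPy sketch shows one textbook form of ZCA whitening based on the eigendecomposition of the covariance matrix. It is an illustration under stated assumptions and not the Spherize code: the function name zca_whiten is hypothetical, and adding epsilon to the eigenvalues before taking the inverse square root is one common regularization choice that may differ from what the class does.

```python
import numpy as np


def zca_whiten(X, epsilon=1e-6, center=True):
    """Hypothetical helper: return X transformed so that its features are
    decorrelated with (approximately) unit variance."""
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Covariance matrices are symmetric, so eigh is appropriate.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # ZCA weights: rotate, rescale by 1/sqrt(eigenvalue), rotate back.
    weights = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ eigvecs.T
    return X @ weights


# Correlated toy data built by mixing independent normal columns.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
raw = rng.normal(size=(500, 3)) @ mixing
white = zca_whiten(raw)
white_cov = np.cov(white, rowvar=False)
```

After the transform, white_cov is close to the identity matrix. Rotating back with the eigenvectors is what distinguishes ZCA from PCA whitening and keeps the whitened features aligned with the original ones.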

pycytominer.operations.variance_threshold module

Remove variables with near-zero variance. Modified from caret::nearZeroVar()

pycytominer.operations.variance_threshold.calculate_frequency(feature_column: Series, freq_cut: float) → str | float

Calculate the frequency ratio of the second most common to the most common feature value. Used in pandas.apply()

Parameters:

feature_column (pd.Series) – Pandas series of the specific feature in the population_df
freq_cut (float, (suggested but unenforced default of 0.05)) – Ratio (2nd most common feature val / most common). Must range between 0 and 1. Remove features lower than freq_cut. A low freq_cut will remove features that have large difference between the most common feature and second most common feature. (e.g. this will remove a feature: [1, 1, 1, 1, 0.01, 0.01, …])

Returns:

Feature name if it passes threshold, “NA” otherwise

Return type:

Union[str, float]

pycytominer.operations.variance_threshold.variance_threshold(population_df: DataFrame, features: str | list[str] = 'infer', samples: str = 'all', freq_cut: float = 0.05, unique_cut: float = 0.01) → list[str]

Exclude features that have low variance (low information content)

Parameters:

population_df (pd.DataFrame) – DataFrame that includes metadata and observation features.
features (list, default "infer") – A list of strings corresponding to feature measurement column names in the population_df DataFrame. All features listed must be found in population_df. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.
samples (str, default "all") – List of samples to perform operation on. The function uses a pd.DataFrame.query() function, so you should structure samples in this fashion. An example is “Metadata_treatment == ‘control’” (include all quotes). If “all”, use all samples to calculate.
freq_cut (float, default 0.05) – Ratio (2nd most common feature val / most common). Must range between 0 and 1. Remove features lower than freq_cut. A low freq_cut will remove features that have large difference between the most common feature value and second most common feature value. (e.g. this will remove a feature: [1, 1, 1, 1, 0.01, 0.01, …])
unique_cut (float, default 0.01) – Ratio (num unique features / num samples). Must range between 0 and 1. Remove features less than unique cut. A low unique_cut will remove features that have very few different measurements compared to the number of samples.

Returns:

excluded_features – List of features to exclude from the population_df.

Return type:

list of str
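The two ratios described by freq_cut and unique_cut can be sketched with pandas as follows. The helper names frequency_ratio and unique_ratio and the toy columns are hypothetical. Treating a single-valued column as having a ratio of 0, and excluding a feature when either ratio falls below its cut, are assumptions made for this illustration rather than a description of the library's code.

```python
import pandas as pd


def frequency_ratio(column):
    """Count of the 2nd most common value divided by count of the most common."""
    counts = column.value_counts()  # sorted from most to least frequent
    if len(counts) < 2:
        return 0.0  # assumption: a single-valued column gets the lowest ratio
    return counts.iloc[1] / counts.iloc[0]


def unique_ratio(column):
    """Number of distinct values divided by number of samples."""
    return column.nunique() / len(column)


# Toy data: one nearly constant feature and one fully varied feature.
df = pd.DataFrame({
    "Cells_constantish": [1.0] * 99 + [0.01],
    "Cells_varied": [float(i) for i in range(100)],
})
freq_cut, unique_cut = 0.05, 0.01

excluded = [
    name
    for name in df.columns
    if frequency_ratio(df[name]) < freq_cut or unique_ratio(df[name]) < unique_cut
]
```

The nearly constant column has a frequency ratio of 1/99, which is below freq_cut, so it is flagged. The varied column has both ratios equal to 1 and is kept.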
