Operations¶
We do not recommend interacting with these functions directly. The core Pycytominer API uses these operations internally.
pycytominer.operations.correlation_threshold module¶
Identify features to exclude so that no two remaining features have a correlation greater than a specified threshold.
- pycytominer.operations.correlation_threshold.correlation_threshold(population_df: DataFrame, features: str | list[str] = 'infer', samples: str = 'all', threshold: float = 0.9, method: str = 'pearson') list[str]¶
Exclude features that have correlations above a certain threshold
- Parameters:
population_df (pd.DataFrame) – DataFrame that includes metadata and observation features.
features (list, default "infer") – A list of strings corresponding to feature measurement column names in the population_df DataFrame. All features listed must be found in population_df. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.
samples (str, default "all") – Samples to perform the operation on, given as a query string. The function passes this string to pd.DataFrame.query(), so it must follow that syntax. An example is “Metadata_treatment == ‘control’” (include all quotes). If “all”, use all samples to calculate.
threshold (float, default 0.9) – Must be between (0, 1) to exclude features.
method (str, default "pearson") – String indicating which correlation metric to use to test cutoff.
- Returns:
excluded_features – List of features to exclude from the population_df.
- Return type:
list of str
- pycytominer.operations.correlation_threshold.determine_high_cor_pair(correlation_row: Series, sorted_correlation_pairs: Index) str¶
Select highest correlated variable given a correlation row with columns: [“pair_a”, “pair_b”, “correlation”]. For use in a pandas.apply().
- Parameters:
correlation_row (pd.Series) – Pandas series of the specific feature in the pairwise_df
sorted_correlation_pairs (pd.Index) – A sorted object by total correlative sum to all other features
- Returns:
The feature that has a lower total correlation sum with all other features
- Return type:
str
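The correlation-based exclusion described above can be illustrated with plain pandas. This is a minimal sketch, not pycytominer's implementation: the helper name `sketch_correlation_threshold` is hypothetical, and the rule for choosing which feature of a correlated pair to drop (the one with the larger total absolute correlation) is an assumption made for this example. It also shows how a `samples`-style query string selects rows before the calculation.

```python
import numpy as np
import pandas as pd


def sketch_correlation_threshold(df, features, threshold=0.9, method="pearson"):
    """Hypothetical helper: return features to exclude so that no remaining
    pair of features has an absolute correlation above `threshold`."""
    corr = df[features].corr(method=method)
    # Total absolute correlation of each feature with all features
    total_abs_corr = corr.abs().sum()

    excluded = set()
    for i, feat_a in enumerate(features):
        for feat_b in features[i + 1:]:
            if abs(corr.loc[feat_a, feat_b]) > threshold:
                # Assumption of this sketch: drop the feature of the pair
                # that is more correlated with everything else overall
                if total_abs_corr[feat_a] > total_abs_corr[feat_b]:
                    excluded.add(feat_a)
                else:
                    excluded.add(feat_b)
    return sorted(excluded)


rng = np.random.default_rng(0)
base = rng.normal(size=100)
df = pd.DataFrame(
    {
        "Metadata_treatment": ["control"] * 50 + ["drug"] * 50,
        "Cells_a": base,
        # Nearly a rescaled copy of Cells_a, so the two are highly correlated
        "Cells_b": 2 * base + 0.01 * rng.normal(size=100),
        "Nuclei_c": rng.normal(size=100),
    }
)

# Restrict to a subset of samples using pd.DataFrame.query() syntax
controls = df.query("Metadata_treatment == 'control'")
excluded = sketch_correlation_threshold(controls, ["Cells_a", "Cells_b", "Nuclei_c"])
print(excluded)
```

Exactly one of the two near-duplicate features is flagged here, while the independent feature is kept.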
pycytominer.operations.get_na_columns module¶
Remove variables with specified threshold of NA values. Note: This was called drop_na_columns in cytominer for R.
- pycytominer.operations.get_na_columns.get_na_columns(population_df: DataFrame, features: str | list[str] = 'infer', samples: str = 'all', cutoff: float = 0.05) list[str]¶
Get features that have more NA values than cutoff defined
- Parameters:
population_df (pd.DataFrame) – DataFrame that includes metadata and observation features.
features (list, default "infer") – A list of strings corresponding to feature measurement column names in the population_df DataFrame. All features listed must be found in population_df. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.
samples (str, default "all") – Samples to perform the operation on, given as a query string. The function passes this string to pd.DataFrame.query(), so it must follow that syntax. An example is “Metadata_treatment == ‘control’” (include all quotes). If “all”, use all samples to calculate.
cutoff (float, default 0.05) – Exclude features whose proportion of missing values exceeds this cutoff.
- Returns:
excluded_features – List of features to exclude from the population_df.
- Return type:
list of str
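The idea behind this filter can be sketched in a few lines of pandas. The helper `sketch_get_na_columns` below is hypothetical and only mirrors the documented behavior (flag features with a larger fraction of NA values than `cutoff`); it is not the library's code.

```python
import numpy as np
import pandas as pd


def sketch_get_na_columns(df, features, cutoff=0.05):
    """Hypothetical helper: return features whose fraction of missing
    values is greater than `cutoff`."""
    # isna() gives a boolean frame; its column mean is the NA fraction
    na_fraction = df[features].isna().mean()
    return na_fraction[na_fraction > cutoff].index.tolist()


df = pd.DataFrame(
    {
        "Metadata_well": ["A01", "A02", "A03", "A04"],
        "Cells_area": [1.0, 2.0, 3.0, 4.0],           # no missing values
        "Nuclei_intensity": [np.nan, 0.5, 0.7, 0.9],  # 25% missing
    }
)

excluded = sketch_get_na_columns(df, ["Cells_area", "Nuclei_intensity"])
print(excluded)  # ['Nuclei_intensity']
```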
pycytominer.operations.transform module¶
Transform observation variables by specified groups.
References
- class pycytominer.operations.transform.RobustMAD(epsilon: float = 1e-18)¶
Bases: BaseEstimator, TransformerMixin
Class to perform a “Robust” normalization with respect to median and mad
scaled = (x - median) / mad
- epsilon¶
fudge factor parameter
- Type:
float
- fit(X: DataFrame, y: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None) RobustMAD_type¶
Compute the median and mad to be used for later scaling.
- Parameters:
X (pd.DataFrame) – dataframe to fit RobustMAD transform
y (None) – Has no effect; only used for consistency in sklearn transform API
- Returns:
With computed median and mad attributes
- Return type:
self
- set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') RobustMAD¶
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.
- Returns:
self – The updated object.
- Return type:
object
- transform(X: DataFrame, copy: bool | None = None) DataFrame¶
Apply the RobustMAD calculation
- Parameters:
X (pd.DataFrame) – dataframe to apply the RobustMAD transform to
- Returns:
RobustMAD transformed dataframe
- Return type:
pd.DataFrame
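The formula scaled = (x - median) / mad given for RobustMAD can be worked through on a small example. The sketch below is a simplified illustration, not the class itself: `sketch_robust_mad` is a hypothetical helper, it uses the unscaled median absolute deviation, and adding `epsilon` to the denominator to avoid division by zero is an assumption about how the fudge factor is applied.

```python
import pandas as pd


def sketch_robust_mad(df, epsilon=1e-18):
    """Hypothetical helper: scale each column by its median and
    median absolute deviation (MAD)."""
    median = df.median()
    # Unscaled MAD: median of absolute deviations from the column median
    mad = (df - median).abs().median()
    # Assumption: epsilon guards against a zero MAD in the denominator
    return (df - median) / (mad + epsilon)


df = pd.DataFrame({"Cells_area": [1.0, 2.0, 3.0, 4.0, 100.0]})
scaled = sketch_robust_mad(df)
print(scaled["Cells_area"].tolist())  # [-2.0, -1.0, 0.0, 1.0, 97.0]
```

Here the median is 3 and the MAD is 1, so the outlier at 100 does not distort the scaling of the other values the way a mean and standard deviation would.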
- class pycytominer.operations.transform.Spherize(epsilon: float = 1e-06, center: bool = True, method: str = 'ZCA', return_numpy: bool = False)¶
Bases: BaseEstimator, TransformerMixin
Class to apply a sphering transform (aka whitening) to data in the base sklearn transform API. Note, this implementation is modified from or inspired by the following sources:
1) A custom function written by Juan C. Caicedo
2) A custom ZCA function at https://github.com/mwv/zca
3) Notes from Niranj Chandrasekaran (https://github.com/cytomining/pycytominer/issues/90)
4) The R package “whitening” written by Strimmer et al (http://strimmerlab.org/software/whitening/)
5) Kessy et al. 2016 “Optimal Whitening and Decorrelation” [1]
- epsilon¶
fudge factor parameter
- Type:
float
- center¶
option to center the input X matrix
- Type:
bool
- method¶
a string indicating which class of sphering to perform
- Type:
str
- fit(X: DataFrame, y: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None) Spherize_type¶
Identify the sphering transform given self.X
- Parameters:
X (pd.DataFrame) – dataframe to fit sphering transform
y (None) – Has no effect; only used for consistency in sklearn transform API
- Returns:
With computed weights attribute
- Return type:
self
- transform(X: DataFrame, y: _Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None = None) DataFrame¶
Perform the sphering transform
- Parameters:
X (pd.DataFrame) – Profile dataframe to be transformed using the precomputed weights
y (None) – Has no effect; only used for consistency in sklearn transform API
- Returns:
Spherized dataframe
- Return type:
pd.DataFrame
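To make the sphering (whitening) idea concrete, here is a minimal NumPy sketch of a ZCA transform. It is not the Spherize implementation: `sketch_zca_whiten` is a hypothetical function, it covers only a covariance-based ZCA, and adding `epsilon` to the eigenvalues for numerical stability is an assumption of this example.

```python
import numpy as np


def sketch_zca_whiten(X, epsilon=1e-6, center=True):
    """Hypothetical helper: ZCA-whiten the columns of X so that their
    covariance is approximately the identity matrix."""
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)
    # Covariance between features (columns)
    cov = np.cov(X, rowvar=False)
    # Symmetric eigendecomposition: cov = U diag(eigvals) U^T
    eigvals, eigvecs = np.linalg.eigh(cov)
    # ZCA weights: U diag(1 / sqrt(eigvals + epsilon)) U^T
    weights = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ eigvecs.T
    return X @ weights


rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
# Correlated features built by mixing independent normal columns
X = rng.normal(size=(200, 3)) @ mixing

X_white = sketch_zca_whiten(X)
print(np.round(np.cov(X_white, rowvar=False), 3))
```

The printed covariance of the whitened data should be close to the identity matrix, which is the defining property of a sphering transform.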
pycytominer.operations.variance_threshold module¶
Remove variables with near-zero variance. Modified from caret::nearZeroVar().
- pycytominer.operations.variance_threshold.calculate_frequency(feature_column: Series, freq_cut: float) str | float¶
Calculate the frequency ratio of the second most common to the most common feature value. Used in pandas.apply().
- Parameters:
feature_column (pd.Series) – Pandas series of the specific feature in the population_df
freq_cut (float, (suggested but unenforced default of 0.05)) – Ratio (2nd most common feature val / most common). Must range between 0 and 1. Remove features whose ratio is lower than freq_cut. A low freq_cut will remove features that have a large difference between the most common feature value and the second most common feature value. (e.g. this will remove a feature: [1, 1, 1, 1, 0.01, 0.01, …])
- Returns:
Feature name if it passes threshold, “NA” otherwise
- Return type:
Union[str, float]
- pycytominer.operations.variance_threshold.variance_threshold(population_df: DataFrame, features: str | list[str] = 'infer', samples: str = 'all', freq_cut: float = 0.05, unique_cut: float = 0.01) list[str]¶
Exclude features that have low variance (low information content)
- Parameters:
population_df (pd.DataFrame) – DataFrame that includes metadata and observation features.
features (list, default "infer") – A list of strings corresponding to feature measurement column names in the population_df DataFrame. All features listed must be found in population_df. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.
samples (str, default "all") – Samples to perform the operation on, given as a query string. The function passes this string to pd.DataFrame.query(), so it must follow that syntax. An example is “Metadata_treatment == ‘control’” (include all quotes). If “all”, use all samples to calculate.
freq_cut (float, default 0.05) – Ratio (2nd most common feature val / most common). Must range between 0 and 1. Remove features lower than freq_cut. A low freq_cut will remove features that have large difference between the most common feature value and second most common feature value. (e.g. this will remove a feature: [1, 1, 1, 1, 0.01, 0.01, …])
unique_cut (float, default 0.01) – Ratio (num unique feature values / num samples). Must range between 0 and 1. Remove features whose ratio is less than unique_cut. A low unique_cut will remove features that have very few different measurements compared to the number of samples.
- Returns:
excluded_features – List of features to exclude from the population_df.
- Return type:
list of str
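The two ratios described for freq_cut and unique_cut can be computed directly with pandas. The sketch below is illustrative only: `sketch_variance_threshold` is a hypothetical helper, and treating a feature as excluded when it fails either criterion is an assumption of this example rather than a statement about the library's internals.

```python
import pandas as pd


def sketch_variance_threshold(df, features, freq_cut=0.05, unique_cut=0.01):
    """Hypothetical helper: return features that look near-constant
    under a frequency ratio test or a unique value ratio test."""
    excluded = []
    n_samples = len(df)
    for feature in features:
        # value_counts() sorts counts from most to least common
        counts = df[feature].value_counts()
        if len(counts) > 1:
            # Second most common count divided by most common count
            freq_ratio = counts.iloc[1] / counts.iloc[0]
        else:
            # Constant (or empty) columns carry no variation at all
            freq_ratio = 0.0
        unique_ratio = df[feature].nunique() / n_samples
        # Assumption: failing either test is enough to exclude a feature
        if freq_ratio < freq_cut or unique_ratio < unique_cut:
            excluded.append(feature)
    return excluded


df = pd.DataFrame(
    {
        # 99 identical values and one different: frequency ratio is 1/99
        "Cells_nearly_constant": [1.0] * 99 + [0.01],
        # Every value is distinct: both ratios equal 1
        "Cells_varied": [float(i) for i in range(100)],
    }
)

excluded = sketch_variance_threshold(df, ["Cells_nearly_constant", "Cells_varied"])
print(excluded)  # ['Cells_nearly_constant']
```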