Normalize

Normalize observation features using a specified normalization method.

pycytominer.normalize.normalize(profiles: str | DataFrame, features: str | list[str] = 'infer', image_features: bool = False, meta_features: str | list[str] = 'infer', samples: str = 'all', method: str = 'standardize', drop_cosmicqc_rows: bool = False, output_file: str | None = None, output_type: Literal['csv', 'parquet', 'anndata_h5ad', 'anndata_zarr'] | None = 'csv', compression_options: str | dict[str, Any] | None = None, float_format: str | None = None, mad_robustize_epsilon: float | None = 1e-18, spherize_center: bool = True, spherize_method: str = 'ZCA-cor', spherize_epsilon: float = 1e-06) → DataFrame | str

Normalize profiling features

Parameters:
  • profiles (pd.DataFrame or path) – Either a pandas DataFrame or a file that stores profile data

  • features (list) – A list of strings corresponding to feature measurement column names in the profiles DataFrame. All features listed must be found in profiles. Defaults to “infer”. If “infer”, then assume features are from CellProfiler output and prefixed with “Cells”, “Nuclei”, or “Cytoplasm”. Selected feature columns must be numeric. Missing values are allowed as long as the column remains numeric. As a temporary compatibility measure, Pycytominer also treats common missing-value strings such as "nan" and "None" as missing values in selected feature columns before numeric validation. If you are working with mixed profile and image payload data, pass explicit feature columns when needed to avoid selecting non-profile content.

  • image_features (bool, default False) – Whether to include inferred Image_* feature columns alongside the default CellProfiler compartments. This preserves support for numeric image-level measurements while avoiding non-numeric Image_* payload columns from OME-Arrow-backed or similarly mixed tables. Non-normalized image payload columns are preserved in the output.

  • meta_features (list) – A list of strings corresponding to metadata column names in the profiles DataFrame. All columns listed must be found in profiles. Defaults to “infer”. If “infer”, then assume CellProfiler metadata features, identified by column names that begin with the Metadata_ prefix.

  • samples (str) – A query string that selects the rows to use as the normalization reference. We often use control samples. The function passes this string to pd.DataFrame.query(), so write it in that syntax. An example is "Metadata_treatment == 'control'" (include the inner quotes around the value). Defaults to “all”, which uses every row as the reference.

  • method (str) – How to normalize the dataframe. Defaults to “standardize”. Check avail_methods for available normalization methods.

  • drop_cosmicqc_rows (bool) – Whether to drop rows that are flagged as QC failures. The function looks for coSMicQC columns with the “Metadata_cqc_” prefix and drops rows flagged True. Defaults to False. Suggested use after a prior call to pycytominer.annotate(external_metadata=qc.parquet).

  • output_file (str, optional) – If provided, will write normalized profiles to file. If not specified, will return the normalized profiles as output. We recommend that this output file be suffixed with “_normalized.csv”.

  • output_type (str, optional) – The file type to write when output_file is provided. The signature accepts “csv”, “parquet”, “anndata_h5ad”, and “anndata_zarr”. Defaults to “csv”.

  • compression_options (str or dict, optional) – Contains compression options as input to pd.DataFrame.to_csv(compression=compression_options). pandas version >= 1.2.

  • float_format (str, optional) – Format string for floating-point values in the output file, passed as pd.DataFrame.to_csv(float_format=float_format). For example, use “%.3g” for three significant digits.

  • mad_robustize_epsilon (float, optional) – The mad_robustize fudge factor parameter. The function only uses this variable if method = “mad_robustize”. Set this to 0 if mad_robustize generates features with large values.

  • spherize_center (bool) – If the function should center data before sphering (aka whitening). The function only uses this variable if method = “spherize”. Defaults to True.

  • spherize_method (str) – The sphering (aka whitening) normalization selection. The function only uses this variable if method = “spherize”. Defaults to “ZCA-cor”. See pycytominer.operations.transform() for available spherize methods.

  • spherize_epsilon (float, default 1e-6) – The sphering (aka whitening) fudge factor parameter. The function only uses this variable if method = “spherize”.
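To make the role of mad_robustize_epsilon concrete, here is a minimal standard-library sketch of median/MAD scaling. It is not Pycytominer's implementation: the helper name mad_robustize_sketch, the 1.4826 consistency constant, and the exact formula (value minus median, divided by scaled MAD plus epsilon) are assumptions made for illustration. It shows why a feature whose MAD is zero can end up with very large values when epsilon is tiny.

```python
from statistics import median


def mad_robustize_sketch(values, epsilon=1e-18):
    """Hypothetical helper: (x - median) / (1.4826 * MAD + epsilon)."""
    center = median(values)
    # Median absolute deviation, scaled so it is comparable to a standard
    # deviation for normally distributed data (assumed constant: 1.4826).
    mad = 1.4826 * median(abs(v - center) for v in values)
    return [(v - center) / (mad + epsilon) for v in values]


# A feature with spread: MAD is non-zero, so epsilon has no visible effect.
scaled = mad_robustize_sketch([1.0, 2.0, 3.0, 4.0, 5.0])

# A nearly constant feature: MAD is 0, so the denominator is epsilon alone
# and the single deviating value is divided by 1e-18.
blown_up = mad_robustize_sketch([5.0, 5.0, 5.0, 5.0, 6.0])

# A larger epsilon damps the same feature.
damped = mad_robustize_sketch([5.0, 5.0, 5.0, 5.0, 6.0], epsilon=1e-6)
```

In this pure-Python sketch, epsilon=0 would raise ZeroDivisionError for the nearly constant feature; array libraries such as NumPy and pandas typically produce inf or NaN instead, which is easier to detect and filter than an arbitrarily large finite number.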

Returns:

  • pd.DataFrame – The normalized profile DataFrame, returned when output_file is None. If you specify output_file, the profiles are written to file and the DataFrame is not returned.

  • str – If output_file is provided, then the function returns the path to the output file.

Raises:

ValueError – Raised when inferred or manually selected feature columns are non-numeric, because Pycytominer normalization methods operate on numeric features only. In that case, select numeric features explicitly before calling normalize(), for example by passing a curated feature list or by running feature_select() first.

Examples

import pandas as pd
from pycytominer import normalize

data_df = pd.DataFrame(
    {
        "Metadata_plate": ["a", "a", "a", "a", "b", "b", "b", "b"],
        "Metadata_treatment": [
            "drug",
            "drug",
            "control",
            "control",
            "drug",
            "drug",
            "control",
            "control",
        ],
        "x": [1, 2, 8, 2, 5, 5, 5, 1],
        "y": [3, 1, 7, 4, 5, 9, 6, 1],
        "z": [1, 8, 2, 5, 6, 22, 2, 2],
        "zz": [14, 46, 1, 6, 30, 100, 2, 2],
    }
).reset_index(drop=True)

normalized_df = normalize(
    profiles=data_df,
    features=["x", "y", "z", "zz"],
    meta_features=["Metadata_plate", "Metadata_treatment"],
    samples="Metadata_treatment == 'control'",
    method="standardize",
)
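The effect of the samples argument in the example above can be sketched with the standard library alone. This is an illustration under stated assumptions, not Pycytominer's code: it assumes “standardize” is a z-score whose mean and standard deviation are estimated from the reference rows only, and that the standard deviation is the population form (ddof=0). Only the x feature is shown for brevity.

```python
from statistics import fmean, pstdev

# Same toy values as the example above, restricted to one feature.
treatment = ["drug", "drug", "control", "control",
             "drug", "drug", "control", "control"]
x = [1, 2, 8, 2, 5, 5, 5, 1]

# Reference rows: the ones "Metadata_treatment == 'control'" would select.
reference = [v for v, t in zip(x, treatment) if t == "control"]

# Assumption: population standard deviation (ddof=0) of the reference rows.
center = fmean(reference)
scale = pstdev(reference)

# Every row is transformed, not only the reference rows.
x_normalized = [(v - center) / scale for v in x]

# After scaling, the control rows have mean 0 and unit spread by construction.
control_normalized = [
    v for v, t in zip(x_normalized, treatment) if t == "control"
]
```

The design point is that the reference rows define the scale while all rows are reported on it, so treated samples are expressed as deviations from the control distribution.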