Introduction to image-based profiling with Pycytominer

Welcome! This tutorial introduces Pycytominer, a Python library for processing image-based profiling data from high-content microscopy experiments.

What You Will Learn

By the end of this tutorial, you will know how to:

Aggregate thousands of single-cell measurements into one representative profile per experimental well
Annotate profiles with experimental metadata, such as which compound was applied to each well
Normalize feature values to remove plate-to-plate technical variation
Select features to remove uninformative or redundant measurements
Build consensus profiles that collapse replicate experiments into a single representative vector

Background: What Is High-Content Microscopy?

High-content microscopy measures hundreds to thousands of informative phenotypic features that represent the morphological state of cells under different biological conditions (e.g., healthy vs. disease). High-content microscopy is often paired with high-throughput screening experiments that perturb cells with small-molecule compounds or genetic perturbations.

In a typical experiment:

Cells are grown in multi-well plates and treated with a panel of perturbations.
Fluorescent dyes are optionally applied to stain distinct cellular compartments.
Automated microscopes capture hundreds of images per plate.

Image analysis software (such as CellProfiler) extracts several thousand numerical features per detected cell, describing each compartment’s and channel’s shape, texture, and fluorescence intensity.

Figure: From microscopy images to single-cell feature measurements.

A single experiment can generate measurements from millions of individual cells, spanning hundreds to thousands of features. The central challenge is transforming this raw, high-dimensional data into clean, interpretable image-based profiles — compact, comparable vectors that summarise how each condition changed the appearance of cells.

That is exactly what Pycytominer does (Serrano et al., 2025). Its design is grounded in image-based profiling methods established over the past decade (Caicedo et al., 2017; Serrano et al., 2026).

Prerequisites

This tutorial assumes you have:

Installed Pycytominer
Familiarity with pandas DataFrames
(Optional) Completed the CytoTable tutorial, which shows how to convert raw CellProfiler output into the Parquet format that Pycytominer reads as input.
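If you are unsure whether the prerequisites are in place, a quick check like the one below can help. It is a minimal sketch that uses only the Python standard library; the helper name `is_installed` is our own and is not part of Pycytominer.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of the packages this tutorial relies on
for name in ("numpy", "pandas", "pycytominer"):
    if is_installed(name):
        print(f"{name:12s} found")
    else:
        print(f"{name:12s} missing (try: pip install {name})")
```

The check only confirms that a package can be located; it does not verify its version.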

The Pycytominer Pipeline at a Glance

Raw single-cell data travels through five sequential steps:

Step	Pycytominer function	What changes
Aggregate	`aggregate()`	One row per cell → one row per well
Annotate	`annotate()`	Well positions → biological treatment labels
Normalize	`normalize()`	Raw feature values → z-scores relative to controls
Feature Select	`feature_select()`	Hundreds of features → only the informative ones
Consensus	`consensus()`	One row per well → one row per treatment condition

At the end, you have a compact, analysis-ready table where each row is a unique biological condition and each column is an informative morphological measurement.

flowchart TD
    input["🔬 Single-cell data 1,200 cells, 11 features"]
    agg["🪣 aggregate() Pool cells per well, 12 profiles"]
    ann["🏷️ annotate() Join plate map, add treatment labels"]
    nor["⚖️ normalize() Z-score vs DMSO controls"]
    fea["✂️ feature_select() Remove redundant, 9 of 11 features kept"]
    con["🤝 consensus() Median across plates, 3 conditions"]
    output["📊 Morphological profiles 3 conditions, 9 features"]
    input --> agg --> ann --> nor --> fea --> con --> output
    style input fill:#f0d9fa,stroke:#88239A,color:#111
    style output fill:#f0d9fa,stroke:#88239A,color:#111
    style agg fill:#ffffff,stroke:#88239A,color:#111
    style ann fill:#ffffff,stroke:#88239A,color:#111
    style nor fill:#ffffff,stroke:#88239A,color:#111
    style fea fill:#ffffff,stroke:#88239A,color:#111
    style con fill:#ffffff,stroke:#88239A,color:#111

[1]:

import numpy as np
import pandas as pd

from pycytominer import aggregate, annotate, consensus, feature_select, normalize

# Fix the random seed so this tutorial produces identical results every time it is run
np.random.seed(42)

Tutorial Data

In a real workflow, you would start from the Parquet file produced by CytoTable:

from pycytominer.cyto_utils import load_profiles

# Load single-cell measurements exported by CytoTable
single_cells = load_profiles("outputs/examplehuman.parquet")

For this tutorial we generate a small synthetic dataset that mirrors the exact structure of a real high-content microscopy experiment. The column names, data types, and naming conventions are identical to what CellProfiler and CytoTable produce — only the numerical values are simulated.

Experiment design:

Property	Value
Plates (biological replicates)	2
Wells per plate	6 (2 × DMSO vehicle control, 2 × Compound A, 2 × Compound B)
Cells per well	~100
Total single-cell measurements	~1,200
Morphological features	11 (across three compartments)

Note on the features: Two of the eleven features are intentionally designed to be uninformative — one is constant across all cells, and one is nearly perfectly correlated with another. You will see these removed automatically in Step 4 (Feature Selection).
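To see why these two features carry no extra information, it can help to measure them directly. The sketch below is independent of the tutorial's simulation function: it draws its own toy values with NumPy, reusing the same spread (standard deviation 120), scaling factor (1.3), and noise level (standard deviation 4) as the simulated area features. The variable names and the seed are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 1_000

# A treatment-independent "cell area" and a bounding-box area derived from it
area = rng.normal(500, 120, n_cells)
bounding_box_area = area * 1.3 + rng.normal(0, 4, n_cells)

# A feature that takes the same value for every cell
euler_number = np.ones(n_cells)

variance_of_constant = float(np.var(euler_number))
pearson_r = float(np.corrcoef(area, bounding_box_area)[0, 1])

print(f"Variance of the constant feature: {variance_of_constant}")
print(f"Pearson r of the correlated pair: {pearson_r:.4f}")
```

A variance of zero means the feature cannot separate any two samples, and a correlation close to 1 means the second feature is almost a rescaled copy of the first.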

The simulation function is available in the expandable block below if you’d like to inspect it — you can skip it and go straight to Step 1.

def simulate_single_cells(plate_id, n_cells_per_well=100):
    """
    Generate synthetic single-cell morphology measurements for one plate.

    Column naming follows the CellProfiler convention:
      Metadata_*  — experimental context (plate, well, object identity)
      Cells_*     — measurements of the whole-cell boundary
      Cytoplasm_* — measurements of the cytoplasmic region
      Nuclei_*    — measurements of the nuclear region

    To keep this tutorial focused on the pipeline rather than biology,
    only the cell-area features respond to treatment. All other features
    are independent noise sampled from realistic distributions.
    In a real experiment every feature may carry some biological signal.
    """
    well_treatments = {
        'B02': 'DMSO',
        'C02': 'DMSO',
        'B03': 'Compound_A',
        'C03': 'Compound_A',
        'B04': 'Compound_B',
        'C04': 'Compound_B',
    }

    rows = []
    for image_number, (well, treatment) in enumerate(well_treatments.items(), start=1):
        is_a = float(treatment == 'Compound_A')
        is_b = float(treatment == 'Compound_B')

        # Only the Area family of features responds to treatment.
        # This ensures only the intentionally correlated pair is removed in Step 4.
        cell_area_base = np.random.normal(500 + 180 * is_a - 90 * is_b, 120, n_cells_per_well)

        for obj_num in range(1, n_cells_per_well + 1):
            cell_area = cell_area_base[obj_num - 1]
            rows.append({
                # ── Metadata columns ──────────────────────────────────────────
                'Metadata_Plate':        plate_id,
                'Metadata_Well':         well,
                'Metadata_ImageNumber':  image_number,
                'Metadata_ObjectNumber': obj_num,
                # ── Cell-level features ───────────────────────────────────────
                # Area and BoundingBoxArea are intentionally correlated (r ≈ 0.99);
                # one of the pair will be removed during feature selection.
                'Cells_AreaShape_Area':            cell_area,
                'Cells_AreaShape_BoundingBoxArea': cell_area * 1.3 + np.random.normal(0, 4),
                # EulerNumber = 1 for virtually all cells (topological invariant);
                # zero variance → removed during feature selection.
                'Cells_AreaShape_EulerNumber':     1,
                # All remaining features: independent noise with realistic distributions
                'Cells_AreaShape_Eccentricity':          float(np.clip(np.random.normal(0.55, 0.12), 0, 1)),
                'Cells_Intensity_MeanIntensity_Mito':    np.random.normal(0.30, 0.06),
                'Cells_Texture_Correlation_RNA_3_0_256': np.random.normal(0.22, 0.06),
                # ── Cytoplasm features ────────────────────────────────────────
                'Cytoplasm_AreaShape_Area':              np.random.normal(310, 80),
                'Cytoplasm_Intensity_MeanIntensity_AGP': np.random.normal(0.25, 0.07),
                # ── Nuclei features ───────────────────────────────────────────
                'Nuclei_AreaShape_Area':                 np.random.normal(195, 55),
                'Nuclei_AreaShape_Eccentricity':         float(np.clip(np.random.normal(0.40, 0.10), 0, 1)),
                'Nuclei_Intensity_MeanIntensity_DNA':    np.random.normal(0.50, 0.08),
            })
    return pd.DataFrame(rows)


# Generate data for two plates to simulate biological replicates
plate1 = simulate_single_cells('Plate_1')
plate2 = simulate_single_cells('Plate_2')
single_cells = pd.concat([plate1, plate2], ignore_index=True)

print(f'Single-cell dataset: {single_cells.shape[0]:,} rows (cells) x {single_cells.shape[1]} columns')
print(f'Plates:  {single_cells["Metadata_Plate"].unique().tolist()}')
print(f'Wells:   {sorted(single_cells["Metadata_Well"].unique().tolist())}')
print()
single_cells.head()

Step 1: Aggregate — From Cells to Wells

The single-cell table contains one row for every detected cell — in a real experiment this can easily reach hundreds of thousands of rows. However, biological interpretation happens at the level of the well (which treatment was applied), not the individual cell.

aggregate() summarises all cells within the same well into a single representative profile by computing the median of each feature across all cells in that well.

Parameter	Description
`population_df`	The single-cell DataFrame
`strata`	Columns that identify each well — cells sharing the same strata values are pooled together
`features='infer'`	Automatically detect feature columns (any column whose name starts with a compartment prefix such as `Cells_`, `Cytoplasm_`, or `Nuclei_`)
`operation`	Summary statistic: `'median'` (default) or `'mean'`
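Conceptually, the aggregation described above resembles a pandas group-by followed by a per-feature median. The sketch below is an approximation written with plain pandas, not Pycytominer's implementation; the helper `infer_feature_columns` and the toy values are our own. It also shows why the median is a sensible default: one extreme cell barely moves it, whereas the mean shifts substantially.

```python
import pandas as pd

# Assumed compartment prefixes, matching the ones named in the table above
COMPARTMENT_PREFIXES = ("Cells_", "Cytoplasm_", "Nuclei_")


def infer_feature_columns(df: pd.DataFrame) -> list[str]:
    """Pick columns whose names start with a compartment prefix."""
    return [c for c in df.columns if c.startswith(COMPARTMENT_PREFIXES)]


toy_cells = pd.DataFrame({
    "Metadata_Plate": ["P1"] * 6,
    "Metadata_Well": ["B02", "B02", "B02", "B03", "B03", "B03"],
    "Metadata_ObjectNumber": [1, 2, 3, 1, 2, 3],
    # The third cell in B02 is an outlier (for example, a segmentation error)
    "Cells_AreaShape_Area": [480.0, 500.0, 2000.0, 650.0, 670.0, 690.0],
    "Nuclei_AreaShape_Area": [190.0, 200.0, 210.0, 180.0, 195.0, 205.0],
})

feature_columns = infer_feature_columns(toy_cells)
strata = ["Metadata_Plate", "Metadata_Well"]

# One row per (plate, well); object-level metadata is dropped
toy_wells_median = toy_cells.groupby(strata, as_index=False)[feature_columns].median()
toy_wells_mean = toy_cells.groupby(strata, as_index=False)[feature_columns].mean()

print(toy_wells_median)
print(toy_wells_mean)
```

In well B02 the median area stays at the typical value while the mean is pulled far upward by the single outlier.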

[3]:

well_profiles = aggregate(
    population_df=single_cells,
    strata=["Metadata_Plate", "Metadata_Well"],
    features="infer",
    operation="median",
)

print(
    f"Single cells:  {single_cells.shape[0]:,} rows  →  Well profiles: {well_profiles.shape[0]} rows"
)
print(
    f"Columns:       {single_cells.shape[1]}          →               {well_profiles.shape[1]}"
)
print()
well_profiles

Single cells:  1,200 rows  →  Well profiles: 12 rows
Columns:       15          →               13

[3]:

	Metadata_Plate	Metadata_Well	Cells_AreaShape_Area	Cells_AreaShape_BoundingBoxArea	Cells_AreaShape_EulerNumber	Cells_AreaShape_Eccentricity	Cells_Intensity_MeanIntensity_Mito	Cells_Texture_Correlation_RNA_3_0_256	Cytoplasm_AreaShape_Area	Cytoplasm_Intensity_MeanIntensity_AGP	Nuclei_AreaShape_Area	Nuclei_AreaShape_Eccentricity	Nuclei_Intensity_MeanIntensity_DNA
0	Plate_1	B02	484.765245	634.635906	1.0	0.532796	0.300860	0.216580	315.443258	0.263187	193.043658	0.418260	0.517625
1	Plate_1	B03	669.260838	870.994323	1.0	0.536427	0.304651	0.210215	314.055342	0.256479	190.671898	0.405648	0.498834
2	Plate_1	B04	399.970485	523.433308	1.0	0.533963	0.306359	0.222698	304.712780	0.252233	196.243981	0.394866	0.495179
3	Plate_1	C02	523.730623	685.309911	1.0	0.552298	0.307376	0.214065	328.921311	0.247123	197.812467	0.418402	0.489309
4	Plate_1	C03	671.001479	874.309123	1.0	0.557177	0.297985	0.212211	327.148733	0.235335	195.742017	0.406123	0.504755
5	Plate_1	C04	411.397559	535.575562	1.0	0.537410	0.296999	0.211021	307.751566	0.237890	187.333217	0.406370	0.501055
6	Plate_2	B02	510.859835	663.499468	1.0	0.560717	0.279879	0.210651	297.559100	0.245555	186.979153	0.395230	0.509640
7	Plate_2	B03	671.687004	869.272541	1.0	0.539517	0.310271	0.223312	323.749429	0.256434	193.420235	0.392844	0.503184
8	Plate_2	B04	420.512564	542.359150	1.0	0.556893	0.309685	0.215100	297.933806	0.239380	185.984035	0.389872	0.498350
9	Plate_2	C02	515.312157	667.242484	1.0	0.571428	0.292990	0.219456	304.836628	0.249090	195.434170	0.387314	0.501398
10	Plate_2	C03	677.517456	882.379424	1.0	0.548610	0.296525	0.225144	295.410839	0.250994	197.704177	0.397884	0.514447
11	Plate_2	C04	407.760144	528.536944	1.0	0.514893	0.294377	0.209433	333.205131	0.251384	201.713063	0.417950	0.493228

Step 2: Annotate — Adding Experimental Context

After aggregation, each row represents a well, but the DataFrame only records where the measurement came from (plate and well position) — not what biological condition was in that well.

The connection between well positions and experimental conditions is stored in a plate map — a lookup table prepared by the researcher that records which compound, genetic perturbation, concentration, or other variable was assigned to each well.

annotate() merges the plate map onto the well profiles, adding a Metadata_ column for each piece of experimental information.

In real experiments, plate maps are usually supplied as CSV files from a Laboratory Information Management System (LIMS) or prepared manually. Here we create one directly as a DataFrame to show its structure.

First, let us define the plate map:

[4]:

# The plate map records the biological condition in each well position.
# The same layout was used for both plates in this experiment.
platemap = pd.DataFrame({
    # 'well_position' is the standard column name expected by annotate()
    "well_position": ["B02", "C02", "B03", "C03", "B04", "C04"],
    "treatment": [
        "DMSO",
        "DMSO",
        "Compound_A",
        "Compound_A",
        "Compound_B",
        "Compound_B",
    ],
    "cell_line": ["HeLa"] * 6,
    "concentration_um": [0.0, 0.0, 10.0, 10.0, 10.0, 10.0],
})

print("Plate map:")
platemap

Plate map:

[4]:

	well_position	treatment	cell_line	concentration_um
0	B02	DMSO	HeLa	0.0
1	C02	DMSO	HeLa	0.0
2	B03	Compound_A	HeLa	10.0
3	C03	Compound_A	HeLa	10.0
4	B04	Compound_B	HeLa	10.0
5	C04	Compound_B	HeLa	10.0

[5]:

# annotate() joins the plate map onto the well profiles.
#
# join_on specifies [platemap_column, profiles_column] used for matching wells.
# add_metadata_id_to_platemap=True prepends 'Metadata_' to all plate map column names,
# following the pycytominer convention that all non-feature columns start with 'Metadata_'.
annotated_profiles = annotate(
    profiles=well_profiles,
    platemap=platemap,
    join_on=["Metadata_well_position", "Metadata_Well"],
    add_metadata_id_to_platemap=True,
)

print(
    f"Annotated profiles: {annotated_profiles.shape[0]} rows x {annotated_profiles.shape[1]} columns"
)
print()

# Show the metadata columns that were added
meta_cols = [
    "Metadata_Plate",
    "Metadata_Well",
    "Metadata_treatment",
    "Metadata_cell_line",
    "Metadata_concentration_um",
]
print("Well-to-treatment mapping after annotation:")
annotated_profiles[meta_cols].drop_duplicates().sort_values("Metadata_Well")

Annotated profiles: 12 rows x 16 columns

Well-to-treatment mapping after annotation:

[5]:

	Metadata_Plate	Metadata_Well	Metadata_treatment	Metadata_cell_line	Metadata_concentration_um
0	Plate_1	B02	DMSO	HeLa	0.0
1	Plate_2	B02	DMSO	HeLa	0.0
4	Plate_1	B03	Compound_A	HeLa	10.0
5	Plate_2	B03	Compound_A	HeLa	10.0
8	Plate_1	B04	Compound_B	HeLa	10.0
9	Plate_2	B04	Compound_B	HeLa	10.0
2	Plate_1	C02	DMSO	HeLa	0.0
3	Plate_2	C02	DMSO	HeLa	0.0
6	Plate_1	C03	Compound_A	HeLa	10.0
7	Plate_2	C03	Compound_A	HeLa	10.0
10	Plate_1	C04	Compound_B	HeLa	10.0
11	Plate_2	C04	Compound_B	HeLa	10.0
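For intuition, the annotation step can be approximated with a plain pandas merge. The sketch below is not how Pycytominer implements `annotate()`; it assumes a two-column plate map supplied as CSV text (read here from an in-memory string rather than a file) and uses toy profile values of our own. It also counts wells that found no plate map entry, which is a useful check in practice.

```python
import io

import pandas as pd

# A plate map as it might arrive in a CSV file (toy content)
platemap_csv = io.StringIO(
    "well_position,treatment\n"
    "B02,DMSO\n"
    "B03,Compound_A\n"
)
toy_platemap = pd.read_csv(platemap_csv).add_prefix("Metadata_")

toy_profiles = pd.DataFrame({
    "Metadata_Plate": ["P1", "P1", "P2", "P2"],
    "Metadata_Well": ["B02", "B03", "B02", "B03"],
    "Cells_AreaShape_Area": [485.0, 669.0, 511.0, 672.0],
})

# Many profile rows may map to one plate map row (same layout on every plate)
toy_annotated = toy_profiles.merge(
    toy_platemap,
    left_on="Metadata_Well",
    right_on="Metadata_well_position",
    how="left",
    validate="many_to_one",
).drop(columns="Metadata_well_position")

unmatched = int(toy_annotated["Metadata_treatment"].isna().sum())
print(toy_annotated)
print(f"Wells without a plate map entry: {unmatched}")
```

The `validate="many_to_one"` argument makes pandas raise an error if the plate map lists the same well twice, which would otherwise silently duplicate profile rows.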

Step 3: Normalize — Removing Technical Variation

CellProfiler features differ widely in scale and units. For example:

Cells_AreaShape_Area might range from 200 to 1,000 (pixels²)
Nuclei_Intensity_MeanIntensity_DNA might range from 0.1 to 0.9 (arbitrary fluorescence units)

Without normalization, features with large absolute values would dominate any downstream distance calculation or machine-learning model, regardless of whether they carry biological signal. Normalization also corrects for plate-to-plate technical variation caused by differences in staining efficiency, imaging conditions, or cell density between experimental batches.

normalize() rescales each feature using the distribution of control wells as a reference. The default method ('standardize') subtracts the control mean and divides by the control standard deviation — a standard z-score transformation. After normalization, control wells cluster around zero, and treated wells are expressed in units of standard deviations away from the control.

What is a vehicle control? DMSO (dimethyl sulfoxide) is the standard solvent used to dissolve most small-molecule compounds. Adding DMSO at the same concentration as the compound solvent, but without any active compound, defines the biological baseline — what cells look like when nothing meaningful has been done to them.
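The control-referenced z-score described above can be written out in a few lines of pandas. This is a minimal sketch with toy values of our own, not Pycytominer's code. It assumes the population standard deviation (`ddof=0`); treat that choice as an assumption of this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "Metadata_treatment": ["DMSO", "DMSO", "DMSO", "DMSO", "Compound_A", "Compound_A"],
    "Cells_AreaShape_Area": [480.0, 500.0, 520.0, 500.0, 660.0, 680.0],
    "Nuclei_Intensity_MeanIntensity_DNA": [0.48, 0.50, 0.52, 0.50, 0.50, 0.51],
})
features = ["Cells_AreaShape_Area", "Nuclei_Intensity_MeanIntensity_DNA"]

# Statistics come from the control wells only
controls = toy.query("Metadata_treatment == 'DMSO'")
control_mean = controls[features].mean()
control_std = controls[features].std(ddof=0)

# Every well, treated or not, is rescaled with the control statistics
z_scored = toy.copy()
z_scored[features] = (toy[features] - control_mean) / control_std

print(z_scored.round(2))
```

After rescaling, both features are on the same unitless scale: the area shift of Compound_A stands out as many control standard deviations, while its DNA intensity stays within about one.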

Parameter	Description
`samples`	A pandas query string selecting the control wells used to compute normalization statistics
`method`	`'standardize'` (z-score; default), `'robustize'` (median and interquartile range), or `'mad_robustize'` (median and median absolute deviation)
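To illustrate why a median-based method can be preferable, the sketch below compares a standard z-score with one common robust formulation: centre on the median and scale by the median absolute deviation (MAD) multiplied by 1.4826, the usual consistency constant for normally distributed data. The exact scaling Pycytominer applies may differ, so read this as an illustration of the idea rather than a reimplementation. All values are toy numbers.

```python
import numpy as np

# Control measurements with one corrupted well (5000.0)
control_values = np.array([480.0, 500.0, 520.0, 500.0, 5000.0])
treated_value = 680.0

# Standard z-score: mean and standard deviation are both distorted by the outlier
z_standard = (treated_value - control_values.mean()) / control_values.std()

# Robust z-score: median and MAD ignore the outlier almost entirely
median = np.median(control_values)
mad = np.median(np.abs(control_values - median))
z_robust = (treated_value - median) / (1.4826 * mad)

print(f"standard z-score: {z_standard:.2f}")
print(f"robust z-score:   {z_robust:.2f}")
```

With the outlier present, the standard z-score places the treated well below the controls, while the robust score still reports a clear increase.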

[6]:

normalized_profiles = normalize(
    profiles=annotated_profiles,
    features="infer",
    meta_features="infer",
    samples="Metadata_treatment == 'DMSO'",  # use DMSO wells as the normalization reference
    method="standardize",
)

print(f"Normalized profiles: {normalized_profiles.shape}")

# Sanity check: DMSO wells should be centred near 0 after normalization
feature_cols = [c for c in normalized_profiles.columns if not c.startswith("Metadata_")]
dmso_rows = normalized_profiles["Metadata_treatment"] == "DMSO"

print()
print("Mean feature values in DMSO wells (should be ≈ 0.0 after normalization):")
print(normalized_profiles.loc[dmso_rows, feature_cols].mean().round(3).to_string())

Normalized profiles: (12, 16)

Mean feature values in DMSO wells (should be ≈ 0.0 after normalization):
Cells_AreaShape_Area                     0.0
Cells_AreaShape_BoundingBoxArea         -0.0
Cells_AreaShape_EulerNumber              0.0
Cells_AreaShape_Eccentricity             0.0
Cells_Intensity_MeanIntensity_Mito       0.0
Cells_Texture_Correlation_RNA_3_0_256    0.0
Cytoplasm_AreaShape_Area                -0.0
Cytoplasm_Intensity_MeanIntensity_AGP    0.0
Nuclei_AreaShape_Area                    0.0
Nuclei_AreaShape_Eccentricity           -0.0
Nuclei_Intensity_MeanIntensity_DNA      -0.0

Step 4: Feature Selection — Keeping Only Informative Features

A typical high-content microscopy dataset contains 500–1,500 features per well. Many of these features carry little or no biological information and can actively harm downstream analyses by adding noise:

Constant or near-constant features have the same (or almost the same) value in every well and therefore cannot distinguish one treatment from another.
Highly correlated features (e.g., Cells_AreaShape_Area and Cells_AreaShape_BoundingBoxArea) measure essentially the same property through slightly different calculations. Retaining both adds redundancy without adding biological information.
Blocklisted features are features empirically identified as technically unreliable across many published CellProfiler pipelines.

feature_select() applies these removal criteria in sequence:

Operation	What it removes
`'variance_threshold'`	Features with variance below a minimum threshold (effectively constant)
`'correlation_threshold'`	One feature from every highly correlated pair (Pearson r > 0.9)
`'blocklist'`	Features on the community-curated Pycytominer blocklist

Recall from the data generation step that we deliberately included two uninformative features:

Cells_AreaShape_EulerNumber — constant (= 1) for all cells → removed by variance_threshold
Cells_AreaShape_Area / Cells_AreaShape_BoundingBoxArea — nearly perfectly correlated (r ≈ 0.99) → one of the pair is removed by correlation_threshold

Which of the correlated pair is kept? The algorithm retains the feature with the lower average correlation to all other features.
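The pairwise rule described above can be sketched as follows. This is a simplified illustration, not Pycytominer's implementation: for every pair above the threshold it drops the member whose summed absolute correlation with all other features is higher, which keeps the one with the lower average correlation. The toy data and the seed are our own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 50
area = rng.normal(500, 120, n)
toy = pd.DataFrame({
    "Cells_AreaShape_Area": area,
    "Cells_AreaShape_BoundingBoxArea": area * 1.3 + rng.normal(0, 4, n),
    "Nuclei_AreaShape_Area": rng.normal(195, 55, n),
    "Cells_AreaShape_Eccentricity": rng.normal(0.55, 0.12, n),
})

THRESHOLD = 0.9
abs_corr = toy.corr(method="pearson").abs()
# Summed correlation with all *other* features (subtract the self-correlation of 1)
total_corr = abs_corr.sum() - 1.0

to_drop = set()
columns = list(abs_corr.columns)
for i, first in enumerate(columns):
    for second in columns[i + 1:]:
        if abs_corr.loc[first, second] > THRESHOLD:
            # Drop the member that is more correlated with everything else
            loser = first if total_corr[first] >= total_corr[second] else second
            to_drop.add(loser)

selected = toy.drop(columns=sorted(to_drop))
print(f"Dropped: {sorted(to_drop)}")
print(f"Kept:    {list(selected.columns)}")
```

Because the two area features differ only by a small amount of noise, which of them is dropped depends on the particular random draw; what matters is that exactly one of them survives.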

[7]:

selected_profiles = feature_select(
    profiles=normalized_profiles,
    features="infer",
    operation=["variance_threshold", "correlation_threshold", "blocklist"],
)

feature_cols_before = [
    c for c in normalized_profiles.columns if not c.startswith("Metadata_")
]
feature_cols_after = [
    c for c in selected_profiles.columns if not c.startswith("Metadata_")
]

print(f"Features before selection: {len(feature_cols_before)}")
print(f"Features after  selection: {len(feature_cols_after)}")
print(
    f"Features removed:          {len(feature_cols_before) - len(feature_cols_after)}"
)
print()
print("Retained features:")
for col in sorted(feature_cols_after):
    print(f"  {col}")

Features before selection: 11
Features after  selection: 9
Features removed:          2

Retained features:
  Cells_AreaShape_BoundingBoxArea
  Cells_AreaShape_Eccentricity
  Cells_Intensity_MeanIntensity_Mito
  Cells_Texture_Correlation_RNA_3_0_256
  Cytoplasm_AreaShape_Area
  Cytoplasm_Intensity_MeanIntensity_AGP
  Nuclei_AreaShape_Area
  Nuclei_AreaShape_Eccentricity
  Nuclei_Intensity_MeanIntensity_DNA

Step 5: Consensus — Collapsing Replicates

At this point we have one profile per well. Because our experiment was run across two plates (biological replicates), we have four profiles for each treatment condition — two wells per plate times two plates. Some downstream analyses expect a single, definitive profile per condition.

consensus() collapses replicate profiles into one consensus profile per treatment group by computing the median across all replicates.

Using the consensus profile instead of individual replicates:

Reduces the influence of plate-specific technical artefacts that survived normalization
Produces a lower-variance, higher-confidence representation of the treatment effect
Simplifies downstream analysis by reducing the number of rows to cluster or classify

The replicate_columns parameter specifies which metadata columns define a unique condition. Profiles that share the same values in these columns are treated as replicates and collapsed into a single consensus row.
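In pandas terms, collapsing replicates amounts to grouping by the replicate columns and taking a per-feature median. The sketch below uses toy, already-normalized values of our own and a single replicate column; it is meant as an approximation of the idea, not as Pycytominer's implementation.

```python
import pandas as pd

toy_wells = pd.DataFrame({
    "Metadata_Plate": ["P1", "P1", "P2", "P2", "P1", "P1", "P2", "P2"],
    "Metadata_Well": ["B02", "C02", "B02", "C02", "B03", "C03", "B03", "C03"],
    "Metadata_treatment": ["DMSO"] * 4 + ["Compound_A"] * 4,
    # The last Compound_A well is an outlier replicate
    "Cells_AreaShape_Area": [-1.0, 1.0, 0.5, -0.5, 11.0, 12.0, 11.5, 30.0],
})
replicate_columns = ["Metadata_treatment"]
feature_columns = ["Cells_AreaShape_Area"]

# How many well-level profiles feed into each consensus profile
replicates_per_condition = toy_wells.groupby(replicate_columns).size()

# One row per condition; plate and well columns no longer apply and are dropped
toy_consensus = toy_wells.groupby(replicate_columns, as_index=False)[feature_columns].median()

print(replicates_per_condition)
print(toy_consensus)
```

Each condition has four replicates (two wells on each of two plates), and the outlier well has little effect on the Compound_A consensus value.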

[8]:

consensus_profiles = consensus(
    profiles=selected_profiles,
    replicate_columns=[
        "Metadata_treatment",
        "Metadata_cell_line",
        "Metadata_concentration_um",
    ],
    operation="median",
    features="infer",
)

print(f"Profiles before consensus: {selected_profiles.shape[0]} rows  (one per well)")
print(
    f"Profiles after  consensus: {consensus_profiles.shape[0]} rows  (one per treatment)"
)
print()
consensus_profiles

Profiles before consensus: 12 rows  (one per well)
Profiles after  consensus: 3 rows  (one per treatment)

[8]:

	Metadata_treatment	Metadata_cell_line	Metadata_concentration_um	Cells_AreaShape_BoundingBoxArea	Cells_AreaShape_Eccentricity	Cells_Intensity_MeanIntensity_Mito	Cells_Texture_Correlation_RNA_3_0_256	Cytoplasm_AreaShape_Area	Cytoplasm_Intensity_MeanIntensity_AGP	Nuclei_AreaShape_Area	Nuclei_AreaShape_Eccentricity	Nuclei_Intensity_MeanIntensity_DNA
0	Compound_A	HeLa	10.0	11.558693	-0.724044	0.589681	0.794162	0.610831	0.353012	0.313659	-0.219728	-0.049983
1	Compound_B	HeLa	10.0	-7.189962	-1.316052	0.624929	-0.656629	-0.462245	-0.835377	-0.379430	-0.302806	-0.737681
2	DMSO	HeLa	0.0	0.148573	0.155330	0.160923	0.041498	-0.131285	-0.446743	0.228724	0.140684	0.097928

Summary

You have now processed a simulated high-content microscopy dataset through the complete Pycytominer pipeline. Here is a recap of what each step accomplished:

Step	Function	Rows	Features
Raw single-cell data	—	1,200 cells	11
After `aggregate()`	pool cells per well	12 wells	11
After `annotate()`	add treatment labels	12 wells	11
After `normalize()`	z-score vs DMSO	12 wells	11
After `feature_select()`	remove uninformative features	12 wells	9
After `consensus()`	collapse replicates	3 conditions	9

The final consensus_profiles DataFrame contains one row per biological treatment condition and nine informative morphological features — a compact, analysis-ready representation of how each treatment changed the appearance of cells.

Saving Your Profiles

Pycytominer provides cyto_utils.output() as its canonical function for writing profiles to disk — the same function each pipeline step calls internally when you pass an output_file argument. It handles compression, format selection, and file naming in one call, and supports four output types:

`output_type`	Extension	Best for
`"csv"` (default)	`.csv.gz`	Gzip-compressed, readable by any tool
`"parquet"`	`.parquet`	Faster reads and smaller files for large screens
`"anndata_h5ad"`	`.h5ad`	AnnData / scanpy workflows
`"anndata_zarr"`	`.zarr`	Cloud-native AnnData storage

from pycytominer.cyto_utils import output

# Gzip-compressed CSV (default) — small footprint, readable by any tool
output(
    df=consensus_profiles,
    output_filename="consensus_profiles.csv.gz",
    output_type="csv",
)

# Parquet — fast reads and efficient storage for large screens
output(
    df=consensus_profiles,
    output_filename="consensus_profiles.parquet",
    output_type="parquet",
)

# AnnData HDF5 — ready for scanpy, scverse, and single-cell workflows
output(
    df=consensus_profiles,
    output_filename="consensus_profiles.h5ad",
    output_type="anndata_h5ad",
)
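Whichever format you choose, it is worth confirming that a saved file reads back as expected. The sketch below shows one way to do that for a gzip-compressed CSV, using plain pandas and a temporary directory so nothing is left on disk. The toy table and file name are our own.

```python
import os
import tempfile

import pandas as pd

toy_consensus = pd.DataFrame({
    "Metadata_treatment": ["Compound_A", "DMSO"],
    "Cells_AreaShape_BoundingBoxArea": [11.558693, 0.148573],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_consensus.csv.gz")
    toy_consensus.to_csv(path, index=False, compression="gzip")
    # pandas infers gzip compression from the ".gz" extension when reading
    reloaded = pd.read_csv(path)

# Raises an AssertionError if columns, dtypes, or values differ beyond tolerance
pd.testing.assert_frame_equal(toy_consensus, reloaded)
print("Round trip OK:", reloaded.shape)
```

Writing with `index=False` keeps the row index out of the file, so the reloaded table has the same columns as the original.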

Pro tip: Every pipeline function accepts an output_file argument that writes directly to disk and returns the file path instead of a DataFrame. This avoids storing intermediate results in memory for large datasets:

consensus_profiles = consensus(
    profiles=selected_profiles,
    replicate_columns=["Metadata_treatment", "Metadata_cell_line", "Metadata_concentration_um"],
    operation="median",
    features="infer",
    output_file="consensus_profiles.parquet",
    output_type="parquet",
)
# consensus_profiles is now the file path string, not a DataFrame

What to Do Next

With morphology profiles in hand, common next steps include:

Phenotypic clustering — group treatments by morphological similarity using hierarchical clustering or UMAP
Similarity analysis — identify compounds that produce the same cellular phenotype using correlation or cosine similarity metrics
Classification — train machine-learning models to predict a compound’s mechanism of action from its morphological profile
Dimensionality reduction — visualise the morphological space of an entire compound library in two dimensions using PCA or UMAP
Hit calling — identify which compounds produce a statistically significant morphological change relative to controls. copairs computes mean Average Precision (mAP) to score phenotypic activity and consistency at the well/profile level; Buscar operates directly on single-cell distributions to capture cellular heterogeneity and flag off-target effects
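As a taste of the similarity analysis mentioned above, the sketch below computes pairwise cosine similarity between a handful of profiles with NumPy. The profile values are toy numbers, and `Compound_A_analog` is a hypothetical fourth condition added only to show what a close match looks like.

```python
import numpy as np
import pandas as pd

profiles = pd.DataFrame(
    {
        "feature_1": [11.6, 10.9, -7.2, 0.1],
        "feature_2": [-0.7, -0.5, -1.3, 0.2],
        "feature_3": [0.6, 0.8, 0.6, 0.2],
    },
    index=["Compound_A", "Compound_A_analog", "Compound_B", "DMSO"],
)

# Scale each profile to unit length; the dot product is then the cosine similarity
values = profiles.to_numpy()
unit_vectors = values / np.linalg.norm(values, axis=1, keepdims=True)
cosine_similarity = pd.DataFrame(
    unit_vectors @ unit_vectors.T,
    index=profiles.index,
    columns=profiles.index,
)

print(cosine_similarity.round(3))
```

Profiles pointing in the same direction score close to 1, and profiles with opposite effects score below 0. Cosine similarity ignores magnitude, so a weak and a strong version of the same phenotype still look alike.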

Pycytominer in the Wild

Pycytominer is used across some of the largest and most impactful image-based profiling initiatives in the world. Here are a few to spark your curiosity:

🧬 JUMP-CP — Joint Undertaking for Morphological Profiling

The largest public Cell Painting dataset ever produced, generated by a consortium of 13 pharmaceutical companies and academic institutions (including AstraZeneca, Bayer, Pfizer, Merck KGaA, and the Broad Institute). JUMP-CP profiled over 116,000 compounds and ~15,000 genetic perturbations, with all profiles processed using Pycytominer. The resulting resource is used to predict compound activity, identify drug mechanisms, and match small molecules to disease phenotypes — at industrial scale.

🔬 LINCS Cell Painting — Library of Integrated Network-based Cellular Signatures

An NIH-funded initiative that profiled 1,571 bioactive compounds across six doses and five replicates in A549 lung cancer cells. Pycytominer was adopted as the primary profiling tool for this dataset, producing normalized and feature-selected profiles (Levels 3–5) that are publicly available for download. LINCS demonstrated that image-based profiles could serve as a systematic, reproducible reference map of cellular responses to chemical perturbation.

🌍 EU-OPENSCREEN — European Chemical Biology Research Infrastructure

A distributed pan-European research infrastructure spanning 30 partner sites across eight countries. EU-OPENSCREEN has integrated Cell Painting into its screening platform, enabling European academic and industry researchers to access high-content imaging and morphological profiling as a service. Their contributions to the JUMP-CP consortium extended the reach of image-based profiling into the broader European drug discovery community.

🖼️ Cell Painting Gallery — Broad Institute Open Dataset Collection

A growing public repository of Cell Painting datasets, hosted on AWS as open data and maintained by the Carpenter–Singh and Cimini labs at the Broad Institute. The gallery spans tens of thousands of small-molecule treatments across diverse cell lines and experimental designs — all freely accessible and ready for analysis. It is the canonical reference point for new Cell Painting datasets produced by the community.

These resources process their raw CellProfiler outputs through the same Pycytominer pipeline you just ran — the only difference is scale.