Introduction to image-based profiling with Pycytominer

Welcome! This tutorial introduces Pycytominer, a Python library for processing image-based profiling data from high-content microscopy experiments.

What You Will Learn

By the end of this tutorial, you will know how to:

  1. Aggregate thousands of single-cell measurements into one representative profile per experimental well

  2. Annotate profiles with experimental metadata, such as which compound was applied to each well

  3. Normalize feature values to remove plate-to-plate technical variation

  4. Select features to remove uninformative or redundant measurements

  5. Build consensus profiles that collapse replicate experiments into a single representative vector

Background: What Is High-Content Microscopy?

High-content microscopy measures hundreds to thousands of informative phenotypic features that represent the morphology state of cells under different biological conditions (e.g., healthy vs. disease). High-content microscopy is often paired with high-throughput screening experiments that perturb cells with small-molecule compounds or genetic perturbations.

In a typical experiment:

  1. Cells are grown in multi-well plates and treated with a panel of perturbations.

  2. Cells are optionally stained with fluorescent dyes that label distinct cellular compartments.

  3. Automated microscopes capture hundreds of images per plate.

Cell staining and imaging pipeline
  4. Image analysis software (such as CellProfiler) extracts several thousand numerical features per detected cell, describing each compartment’s and channel’s shape, texture, and fluorescence intensity.

From microscopy images to single-cell feature measurements

A single experiment can generate measurements from millions of individual cells, spanning hundreds to thousands of features. The central challenge is transforming this raw, high-dimensional data into clean, interpretable image-based profiles — compact, comparable vectors that summarise how each condition changed the appearance of cells.

That is exactly what Pycytominer does (Serrano et al., 2025). It builds on image-based profiling methods established over the past decade (Caicedo et al., 2017; Serrano et al., 2026).

Prerequisites

This tutorial assumes you have:

The Pycytominer Pipeline at a Glance

Raw single-cell data travels through five sequential steps:

| Step | Pycytominer function | What changes |
| --- | --- | --- |
| 1. Aggregate | aggregate() | One row per cell → one row per well |
| 2. Annotate | annotate() | Well positions → biological treatment labels |
| 3. Normalize | normalize() | Raw feature values → z-scores relative to controls |
| 4. Feature Select | feature_select() | Hundreds of features → only the informative ones |
| 5. Consensus | consensus() | One row per well → one row per treatment condition |

At the end, you have a compact, analysis-ready table where each row is a unique biological condition and each column is an informative morphological measurement.

flowchart TD
    input["🔬 Single-cell data<br/>1,200 cells, 11 features"]
    agg["🪣 aggregate()<br/>Pool cells per well, 12 profiles"]
    ann["🏷️ annotate()<br/>Join plate map, add treatment labels"]
    nor["⚖️ normalize()<br/>Z-score vs DMSO controls"]
    fea["✂️ feature_select()<br/>Remove redundant, 9 of 11 features kept"]
    con["🤝 consensus()<br/>Median across plates, 3 conditions"]
    output["📊 Morphological profiles<br/>3 conditions, 9 features"]
    input --> agg --> ann --> nor --> fea --> con --> output
    style input fill:#f0d9fa,stroke:#88239A,color:#111
    style output fill:#f0d9fa,stroke:#88239A,color:#111
    style agg fill:#ffffff,stroke:#88239A,color:#111
    style ann fill:#ffffff,stroke:#88239A,color:#111
    style nor fill:#ffffff,stroke:#88239A,color:#111
    style fea fill:#ffffff,stroke:#88239A,color:#111
    style con fill:#ffffff,stroke:#88239A,color:#111

[1]:
import numpy as np
import pandas as pd

from pycytominer import aggregate, annotate, consensus, feature_select, normalize

# Fix the random seed so this tutorial produces identical results every time it is run
np.random.seed(42)

Tutorial Data

In a real workflow, you would start from the Parquet file produced by CytoTable:

from pycytominer.cyto_utils import load_profiles

# Load single-cell measurements exported by CytoTable
single_cells = load_profiles("outputs/examplehuman.parquet")

For this tutorial we generate a small synthetic dataset that mirrors the exact structure of a real high-content microscopy experiment. The column names, data types, and naming conventions are identical to what CellProfiler and CytoTable produce — only the numerical values are simulated.

Experiment design:

| Property | Value |
| --- | --- |
| Plates (biological replicates) | 2 |
| Wells per plate | 6 (2 × DMSO vehicle control, 2 × Compound A, 2 × Compound B) |
| Cells per well | ~100 |
| Total single-cell measurements | ~1,200 |
| Morphological features | 11 (across three compartments) |

Note on the features: Two of the eleven features are intentionally designed to be uninformative — one is constant across all cells, and one is nearly perfectly correlated with another. You will see these removed automatically in Step 4 (Feature Selection).

The simulation function is shown below if you’d like to inspect it; you can skip it and go straight to Step 1.

def simulate_single_cells(plate_id, n_cells_per_well=100):
    """
    Generate synthetic single-cell morphology measurements for one plate.

    Column naming follows the CellProfiler convention:
      Metadata_*  — experimental context (plate, well, object identity)
      Cells_*     — measurements of the whole-cell boundary
      Cytoplasm_* — measurements of the cytoplasmic region
      Nuclei_*    — measurements of the nuclear region

    To keep this tutorial focused on the pipeline rather than biology,
    only the cell-area features respond to treatment. All other features
    are independent noise sampled from realistic distributions.
    In a real experiment every feature may carry some biological signal.
    """
    well_treatments = {
        'B02': 'DMSO',
        'C02': 'DMSO',
        'B03': 'Compound_A',
        'C03': 'Compound_A',
        'B04': 'Compound_B',
        'C04': 'Compound_B',
    }

    rows = []
    for image_number, (well, treatment) in enumerate(well_treatments.items(), start=1):
        is_a = float(treatment == 'Compound_A')
        is_b = float(treatment == 'Compound_B')

        # Only the Area family of features responds to treatment.
        # This ensures only the intentionally correlated pair is removed in Step 4.
        cell_area_base = np.random.normal(500 + 180 * is_a - 90 * is_b, 120, n_cells_per_well)

        for obj_num in range(1, n_cells_per_well + 1):
            cell_area = cell_area_base[obj_num - 1]
            rows.append({
                # ── Metadata columns ──────────────────────────────────────────
                'Metadata_Plate':        plate_id,
                'Metadata_Well':         well,
                'Metadata_ImageNumber':  image_number,
                'Metadata_ObjectNumber': obj_num,
                # ── Cell-level features ───────────────────────────────────────
                # Area and BoundingBoxArea are intentionally correlated (r ≈ 0.99);
                # one of the pair will be removed during feature selection.
                'Cells_AreaShape_Area':            cell_area,
                'Cells_AreaShape_BoundingBoxArea': cell_area * 1.3 + np.random.normal(0, 4),
                # EulerNumber = 1 for virtually all cells (topological invariant);
                # zero variance → removed during feature selection.
                'Cells_AreaShape_EulerNumber':     1,
                # All remaining features: independent noise with realistic distributions
                'Cells_AreaShape_Eccentricity':          float(np.clip(np.random.normal(0.55, 0.12), 0, 1)),
                'Cells_Intensity_MeanIntensity_Mito':    np.random.normal(0.30, 0.06),
                'Cells_Texture_Correlation_RNA_3_0_256': np.random.normal(0.22, 0.06),
                # ── Cytoplasm features ────────────────────────────────────────
                'Cytoplasm_AreaShape_Area':              np.random.normal(310, 80),
                'Cytoplasm_Intensity_MeanIntensity_AGP': np.random.normal(0.25, 0.07),
                # ── Nuclei features ───────────────────────────────────────────
                'Nuclei_AreaShape_Area':                 np.random.normal(195, 55),
                'Nuclei_AreaShape_Eccentricity':         float(np.clip(np.random.normal(0.40, 0.10), 0, 1)),
                'Nuclei_Intensity_MeanIntensity_DNA':    np.random.normal(0.50, 0.08),
            })
    return pd.DataFrame(rows)


# Generate data for two plates to simulate biological replicates
plate1 = simulate_single_cells('Plate_1')
plate2 = simulate_single_cells('Plate_2')
single_cells = pd.concat([plate1, plate2], ignore_index=True)

print(f'Single-cell dataset: {single_cells.shape[0]:,} rows (cells) x {single_cells.shape[1]} columns')
print(f'Plates:  {single_cells["Metadata_Plate"].unique().tolist()}')
print(f'Wells:   {sorted(single_cells["Metadata_Well"].unique().tolist())}')
print()
single_cells.head()
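
Before moving on, it can be worth confirming that the two intentionally uninformative features behave as designed. The sketch below regenerates only those columns with NumPy, following the same formulas as the simulation function. The variable names and the separate random generator are illustrative assumptions, so the numbers will not match the tutorial's data exactly.

```python
import numpy as np

# Hypothetical stand-alone check: uses its own generator, not the tutorial's global seed
rng = np.random.default_rng(0)
n_cells = 1_000

# Same recipe as the DMSO wells in simulate_single_cells(): area ~ Normal(500, 120)
area = rng.normal(500, 120, size=n_cells)
# BoundingBoxArea is a scaled copy of area plus a small amount of noise
bounding_box_area = area * 1.3 + rng.normal(0, 4, size=n_cells)
# EulerNumber is fixed at 1 for every cell
euler_number = np.ones(n_cells)

pearson_r = np.corrcoef(area, bounding_box_area)[0, 1]
euler_variance = euler_number.var()

print(f"Pearson r (Area vs BoundingBoxArea): {pearson_r:.4f}")
print(f"Variance of EulerNumber:             {euler_variance:.1f}")
```

A correlation above 0.99 and a variance of exactly zero are the two properties that feature selection relies on in Step 4.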

Step 1: Aggregate — From Cells to Wells

The single-cell table contains one row for every detected cell — in a real experiment this can easily reach hundreds of thousands of rows. However, biological interpretation happens at the level of the well (which treatment was applied), not the individual cell.

aggregate() summarises all cells within the same well into a single representative profile by computing the median of each feature across all cells in that well.

| Parameter | Description |
| --- | --- |
| population_df | The single-cell DataFrame |
| strata | Columns that identify each well; cells sharing the same strata values are pooled together |
| features='infer' | Automatically detect feature columns (any column whose name starts with a compartment prefix such as Cells_, Cytoplasm_, or Nuclei_) |
| operation | Summary statistic: 'median' (default) or 'mean' |
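
To make the features='infer' behaviour concrete, here is a minimal sketch of prefix-based column detection in plain Python. It illustrates the idea described in the table only; it is not Pycytominer's own inference code, and the column list (including the stray ImageNumber column) is a hypothetical example.

```python
# Hypothetical column names, following the CellProfiler naming convention
columns = [
    "Metadata_Plate",
    "Metadata_Well",
    "Cells_AreaShape_Area",
    "Cytoplasm_AreaShape_Area",
    "Nuclei_AreaShape_Area",
    "ImageNumber",  # neither a feature nor Metadata_-prefixed
]

compartment_prefixes = ("Cells_", "Cytoplasm_", "Nuclei_")

# str.startswith accepts a tuple, so one call tests all prefixes
feature_columns = [c for c in columns if c.startswith(compartment_prefixes)]
metadata_columns = [c for c in columns if c.startswith("Metadata_")]
unclassified = [c for c in columns if c not in feature_columns + metadata_columns]

print("Features:    ", feature_columns)
print("Metadata:    ", metadata_columns)
print("Unclassified:", unclassified)
```

Columns that match neither pattern are worth inspecting before running the pipeline, since prefix-based detection cannot tell what they are.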

[3]:
well_profiles = aggregate(
    population_df=single_cells,
    strata=["Metadata_Plate", "Metadata_Well"],
    features="infer",
    operation="median",
)

print(
    f"Single cells:  {single_cells.shape[0]:,} rows  →  Well profiles: {well_profiles.shape[0]} rows"
)
print(
    f"Columns:       {single_cells.shape[1]}          →               {well_profiles.shape[1]}"
)
print()
well_profiles
Single cells:  1,200 rows  →  Well profiles: 12 rows
Columns:       15          →               13

[3]:
Metadata_Plate Metadata_Well Cells_AreaShape_Area Cells_AreaShape_BoundingBoxArea Cells_AreaShape_EulerNumber Cells_AreaShape_Eccentricity Cells_Intensity_MeanIntensity_Mito Cells_Texture_Correlation_RNA_3_0_256 Cytoplasm_AreaShape_Area Cytoplasm_Intensity_MeanIntensity_AGP Nuclei_AreaShape_Area Nuclei_AreaShape_Eccentricity Nuclei_Intensity_MeanIntensity_DNA
0 Plate_1 B02 484.765245 634.635906 1.0 0.532796 0.300860 0.216580 315.443258 0.263187 193.043658 0.418260 0.517625
1 Plate_1 B03 669.260838 870.994323 1.0 0.536427 0.304651 0.210215 314.055342 0.256479 190.671898 0.405648 0.498834
2 Plate_1 B04 399.970485 523.433308 1.0 0.533963 0.306359 0.222698 304.712780 0.252233 196.243981 0.394866 0.495179
3 Plate_1 C02 523.730623 685.309911 1.0 0.552298 0.307376 0.214065 328.921311 0.247123 197.812467 0.418402 0.489309
4 Plate_1 C03 671.001479 874.309123 1.0 0.557177 0.297985 0.212211 327.148733 0.235335 195.742017 0.406123 0.504755
5 Plate_1 C04 411.397559 535.575562 1.0 0.537410 0.296999 0.211021 307.751566 0.237890 187.333217 0.406370 0.501055
6 Plate_2 B02 510.859835 663.499468 1.0 0.560717 0.279879 0.210651 297.559100 0.245555 186.979153 0.395230 0.509640
7 Plate_2 B03 671.687004 869.272541 1.0 0.539517 0.310271 0.223312 323.749429 0.256434 193.420235 0.392844 0.503184
8 Plate_2 B04 420.512564 542.359150 1.0 0.556893 0.309685 0.215100 297.933806 0.239380 185.984035 0.389872 0.498350
9 Plate_2 C02 515.312157 667.242484 1.0 0.571428 0.292990 0.219456 304.836628 0.249090 195.434170 0.387314 0.501398
10 Plate_2 C03 677.517456 882.379424 1.0 0.548610 0.296525 0.225144 295.410839 0.250994 197.704177 0.397884 0.514447
11 Plate_2 C04 407.760144 528.536944 1.0 0.514893 0.294377 0.209433 333.205131 0.251384 201.713063 0.417950 0.493228
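
Conceptually, the aggregation above is a group-by followed by a median. The following pandas sketch shows that idea on a hypothetical five-cell table; it is an illustration under that assumption, not a reproduction of what aggregate() does internally.

```python
import pandas as pd

# Hypothetical toy table: three cells in well B02, two in well B03
toy_cells = pd.DataFrame({
    "Metadata_Plate": ["P1"] * 5,
    "Metadata_Well": ["B02", "B02", "B02", "B03", "B03"],
    "Cells_AreaShape_Area": [400.0, 500.0, 900.0, 650.0, 700.0],
})

strata = ["Metadata_Plate", "Metadata_Well"]

# One row per (plate, well), holding the median of each numeric feature
toy_median = toy_cells.groupby(strata, as_index=False).median(numeric_only=True)
# The mean, for comparison
toy_mean = toy_cells.groupby(strata, as_index=False).mean(numeric_only=True)

print(toy_median)
print(toy_mean)
```

In well B02 the single large cell (900) pulls the mean up to 600 while the median stays at 500, which is one reason the median is a common choice for summarising single-cell measurements.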

Step 2: Annotate — Adding Experimental Context

After aggregation, each row represents a well, but the DataFrame only records where the measurement came from (plate and well position) — not what biological condition was in that well.

The connection between well positions and experimental conditions is stored in a plate map — a lookup table prepared by the researcher that records which compound, genetic perturbation, concentration, or other variable was assigned to each well.

annotate() merges the plate map onto the well profiles, adding a Metadata_ column for each piece of experimental information.

In real experiments, plate maps are usually supplied as CSV files from a Laboratory Information Management System (LIMS) or prepared manually. Here we create one directly as a DataFrame to show its structure.

First, let us define the plate map:

[4]:
# The plate map records the biological condition in each well position.
# The same layout was used for both plates in this experiment.
platemap = pd.DataFrame({
    # 'well_position' is the standard column name expected by annotate()
    "well_position": ["B02", "C02", "B03", "C03", "B04", "C04"],
    "treatment": [
        "DMSO",
        "DMSO",
        "Compound_A",
        "Compound_A",
        "Compound_B",
        "Compound_B",
    ],
    "cell_line": ["HeLa"] * 6,
    "concentration_um": [0.0, 0.0, 10.0, 10.0, 10.0, 10.0],
})

print("Plate map:")
platemap
Plate map:
[4]:
well_position treatment cell_line concentration_um
0 B02 DMSO HeLa 0.0
1 C02 DMSO HeLa 0.0
2 B03 Compound_A HeLa 10.0
3 C03 Compound_A HeLa 10.0
4 B04 Compound_B HeLa 10.0
5 C04 Compound_B HeLa 10.0
[5]:
# annotate() joins the plate map onto the well profiles.
#
# join_on specifies [platemap_column, profiles_column] used for matching wells.
# add_metadata_id_to_platemap=True prepends 'Metadata_' to all plate map column names,
# following the pycytominer convention that all non-feature columns start with 'Metadata_'.
annotated_profiles = annotate(
    profiles=well_profiles,
    platemap=platemap,
    join_on=["Metadata_well_position", "Metadata_Well"],
    add_metadata_id_to_platemap=True,
)

print(
    f"Annotated profiles: {annotated_profiles.shape[0]} rows x {annotated_profiles.shape[1]} columns"
)
print()

# Show the metadata columns that were added
meta_cols = [
    "Metadata_Plate",
    "Metadata_Well",
    "Metadata_treatment",
    "Metadata_cell_line",
    "Metadata_concentration_um",
]
print("Well-to-treatment mapping after annotation:")
annotated_profiles[meta_cols].drop_duplicates().sort_values("Metadata_Well")
Annotated profiles: 12 rows x 16 columns

Well-to-treatment mapping after annotation:
[5]:
Metadata_Plate Metadata_Well Metadata_treatment Metadata_cell_line Metadata_concentration_um
0 Plate_1 B02 DMSO HeLa 0.0
1 Plate_2 B02 DMSO HeLa 0.0
4 Plate_1 B03 Compound_A HeLa 10.0
5 Plate_2 B03 Compound_A HeLa 10.0
8 Plate_1 B04 Compound_B HeLa 10.0
9 Plate_2 B04 Compound_B HeLa 10.0
2 Plate_1 C02 DMSO HeLa 0.0
3 Plate_2 C02 DMSO HeLa 0.0
6 Plate_1 C03 Compound_A HeLa 10.0
7 Plate_2 C03 Compound_A HeLa 10.0
10 Plate_1 C04 Compound_B HeLa 10.0
11 Plate_2 C04 Compound_B HeLa 10.0
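
The join performed in this step can be pictured as an ordinary pandas merge. The sketch below uses hypothetical toy tables and plain pandas calls, so treat it as a picture of the idea rather than of annotate() itself. It also shows a quick check for wells that have no plate map entry.

```python
import pandas as pd

# Hypothetical well profiles from two plates
toy_profiles = pd.DataFrame({
    "Metadata_Plate": ["P1", "P1", "P2"],
    "Metadata_Well": ["B02", "B03", "B02"],
    "Cells_AreaShape_Area": [500.0, 675.0, 510.0],
})

# Hypothetical plate map, prefixed with 'Metadata_' as in the tutorial
toy_platemap = pd.DataFrame({
    "well_position": ["B02", "B03"],
    "treatment": ["DMSO", "Compound_A"],
}).add_prefix("Metadata_")

# Wells present in the profiles but absent from the plate map
missing_wells = set(toy_profiles["Metadata_Well"]) - set(toy_platemap["Metadata_well_position"])

# validate="one_to_many" raises if a well position appears twice in the plate map
toy_annotated = toy_platemap.merge(
    toy_profiles,
    left_on="Metadata_well_position",
    right_on="Metadata_Well",
    how="inner",
    validate="one_to_many",
).drop(columns="Metadata_well_position")

print(f"Wells missing from plate map: {sorted(missing_wells)}")
print(toy_annotated)
```

Because the same layout is reused on both plates, each plate map row matches one profile per plate.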

Step 3: Normalize — Removing Technical Variation

CellProfiler features differ widely in scale and units. For example:

  • Cells_AreaShape_Area might range from 200 to 1,000 (pixels²)

  • Nuclei_Intensity_MeanIntensity_DNA might range from 0.1 to 0.9 (arbitrary fluorescence units)

Without normalization, features with large absolute values would dominate any downstream distance calculation or machine-learning model, regardless of whether they carry biological signal. Normalization also corrects for plate-to-plate technical variation caused by differences in staining efficiency, imaging conditions, or cell density between experimental batches.

normalize() rescales each feature using the distribution of control wells as a reference. The default method ('standardize') subtracts the control mean and divides by the control standard deviation — a standard z-score transformation. After normalization, control wells cluster around zero, and treated wells are expressed in units of standard deviations away from the control.

What is a vehicle control? DMSO (dimethyl sulfoxide) is the standard solvent used to dissolve most small-molecule compounds. Adding DMSO at the same concentration as the compound solvent, but without any active compound, defines the biological baseline — what cells look like when nothing meaningful has been done to them.

| Parameter | Description |
| --- | --- |
| samples | A pandas query string selecting the control wells used to compute normalization statistics |
| method | 'standardize' (z-score), 'robustize' (median-based), or 'mad_robustize' |

[6]:
normalized_profiles = normalize(
    profiles=annotated_profiles,
    features="infer",
    meta_features="infer",
    samples="Metadata_treatment == 'DMSO'",  # use DMSO wells as the normalization reference
    method="standardize",
)

print(f"Normalized profiles: {normalized_profiles.shape}")

# Sanity check: DMSO wells should be centred near 0 after normalization
feature_cols = [c for c in normalized_profiles.columns if not c.startswith("Metadata_")]
dmso_rows = normalized_profiles["Metadata_treatment"] == "DMSO"

print()
print("Mean feature values in DMSO wells (should be ≈ 0.0 after normalization):")
print(normalized_profiles.loc[dmso_rows, feature_cols].mean().round(3).to_string())
Normalized profiles: (12, 16)

Mean feature values in DMSO wells (should be ≈ 0.0 after normalization):
Cells_AreaShape_Area                     0.0
Cells_AreaShape_BoundingBoxArea         -0.0
Cells_AreaShape_EulerNumber              0.0
Cells_AreaShape_Eccentricity             0.0
Cells_Intensity_MeanIntensity_Mito       0.0
Cells_Texture_Correlation_RNA_3_0_256    0.0
Cytoplasm_AreaShape_Area                -0.0
Cytoplasm_Intensity_MeanIntensity_AGP    0.0
Nuclei_AreaShape_Area                    0.0
Nuclei_AreaShape_Eccentricity           -0.0
Nuclei_Intensity_MeanIntensity_DNA      -0.0
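
The control-referenced z-score described in this step can be written out by hand. The sketch below does so for one feature on hypothetical values. Using the population standard deviation (ddof=0) is an assumption made for this illustration and may not match the estimator normalize() uses internally.

```python
import pandas as pd

# Hypothetical well-level values for a single feature
toy = pd.DataFrame({
    "Metadata_treatment": ["DMSO", "DMSO", "DMSO", "Compound_A"],
    "Cells_AreaShape_Area": [480.0, 500.0, 520.0, 680.0],
})

# Statistics are computed on the control wells only
controls = toy.query("Metadata_treatment == 'DMSO'")
control_mean = controls["Cells_AreaShape_Area"].mean()
control_std = controls["Cells_AreaShape_Area"].std(ddof=0)  # assumption: population std

# Every well, treated or not, is rescaled with the control statistics
toy["zscore"] = (toy["Cells_AreaShape_Area"] - control_mean) / control_std

print(toy)
```

With only a handful of control wells the control standard deviation is small, so a real shift in a treated well can translate into a very large z-score, as the BoundingBoxArea values in Step 5 show.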

Step 4: Feature Selection — Keeping Only Informative Features

A typical high-content microscopy dataset contains 500–1,500 features per well. Many of these features carry little or no biological information and can actively harm downstream analyses by adding noise:

  • Constant or near-constant features have the same value in every well and therefore cannot distinguish one treatment from another.

  • Highly correlated features (e.g., Cells_AreaShape_Area and Cells_AreaShape_BoundingBoxArea) measure essentially the same property through slightly different calculations. Retaining both adds redundancy without adding biological information.

  • Blocklisted features are features empirically identified as technically unreliable across many published CellProfiler pipelines.

feature_select() applies these removal criteria in sequence:

| Operation | What it removes |
| --- | --- |
| 'variance_threshold' | Features with variance below a minimum threshold (effectively constant) |
| 'correlation_threshold' | One feature from every highly correlated pair (Pearson r > 0.9) |
| 'blocklist' | Features on the community-curated Pycytominer blocklist |

Recall from the data generation step that we deliberately included two uninformative features:

  • Cells_AreaShape_EulerNumber — constant (= 1) for all cells → removed by variance_threshold

  • Cells_AreaShape_Area / Cells_AreaShape_BoundingBoxArea — nearly perfectly correlated (r ≈ 0.99) → one of the pair is removed by correlation_threshold

Which of the correlated pair is kept? The algorithm retains the feature with the lower average correlation to all other features.

[7]:
selected_profiles = feature_select(
    profiles=normalized_profiles,
    features="infer",
    operation=["variance_threshold", "correlation_threshold", "blocklist"],
)

feature_cols_before = [
    c for c in normalized_profiles.columns if not c.startswith("Metadata_")
]
feature_cols_after = [
    c for c in selected_profiles.columns if not c.startswith("Metadata_")
]

print(f"Features before selection: {len(feature_cols_before)}")
print(f"Features after  selection: {len(feature_cols_after)}")
print(
    f"Features removed:          {len(feature_cols_before) - len(feature_cols_after)}"
)
print()
print("Retained features:")
for col in sorted(feature_cols_after):
    print(f"  {col}")
Features before selection: 11
Features after  selection: 9
Features removed:          2

Retained features:
  Cells_AreaShape_BoundingBoxArea
  Cells_AreaShape_Eccentricity
  Cells_Intensity_MeanIntensity_Mito
  Cells_Texture_Correlation_RNA_3_0_256
  Cytoplasm_AreaShape_Area
  Cytoplasm_Intensity_MeanIntensity_AGP
  Nuclei_AreaShape_Area
  Nuclei_AreaShape_Eccentricity
  Nuclei_Intensity_MeanIntensity_DNA
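
The first two removal criteria can be sketched in a few lines of pandas. This is a simplified illustration on a hypothetical four-feature table: it drops zero-variance columns and then, for each pair with |r| > 0.9, drops the member with the larger summed absolute correlation, following the rule stated above. It is not the feature_select() implementation.

```python
import pandas as pd

# Hypothetical profiles: 'bbox_area' tracks 'area', 'euler' is constant
toy = pd.DataFrame({
    "area": [1.0, 2.0, 3.0, 4.0, 5.0],
    "bbox_area": [1.3, 2.7, 3.8, 5.3, 6.4],
    "euler": [1.0, 1.0, 1.0, 1.0, 1.0],
    "intensity": [0.3, 0.1, 0.4, 0.1, 0.5],
})

# 1. Variance filter: remove features that never change
variable = toy.loc[:, toy.var() > 0]

# 2. Correlation filter: absolute Pearson correlation between remaining features
corr = variable.corr(method="pearson").abs()
to_drop = set()
cols = list(corr.columns)
for i, first in enumerate(cols):
    for second in cols[i + 1:]:
        if first in to_drop or second in to_drop:
            continue
        if corr.loc[first, second] > 0.9:
            # Drop whichever member is more correlated with everything else
            to_drop.add(first if corr[first].sum() >= corr[second].sum() else second)

selected = variable.drop(columns=sorted(to_drop))
print("Dropped for zero variance:", sorted(set(toy.columns) - set(variable.columns)))
print("Dropped for correlation:  ", sorted(to_drop))
print("Retained:                 ", list(selected.columns))
```

In this toy table 'area' is slightly more correlated with 'intensity' than 'bbox_area' is, so 'area' is the one removed, mirroring the tutorial output where BoundingBoxArea is retained.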

Step 5: Consensus — Collapsing Replicates

At this point we have one profile per well. Because our experiment was run across two plates (biological replicates), we have four profiles for each treatment condition — two wells per plate times two plates. Some downstream analyses expect a single, definitive profile per condition.

consensus() collapses replicate profiles into one consensus profile per treatment group by computing the median across all replicates.

Using the consensus profile instead of individual replicates:

  • Reduces the influence of plate-specific technical artefacts that survived normalization

  • Produces a lower-variance, higher-confidence representation of the treatment effect

  • Simplifies downstream analysis by reducing the number of rows to cluster or classify

The replicate_columns parameter specifies which metadata columns define a unique condition. Profiles that share the same values in these columns are treated as replicates and collapsed into a single consensus row.

[8]:
consensus_profiles = consensus(
    profiles=selected_profiles,
    replicate_columns=[
        "Metadata_treatment",
        "Metadata_cell_line",
        "Metadata_concentration_um",
    ],
    operation="median",
    features="infer",
)

print(f"Profiles before consensus: {selected_profiles.shape[0]} rows  (one per well)")
print(
    f"Profiles after  consensus: {consensus_profiles.shape[0]} rows  (one per treatment)"
)
print()
consensus_profiles
Profiles before consensus: 12 rows  (one per well)
Profiles after  consensus: 3 rows  (one per treatment)

[8]:
Metadata_treatment Metadata_cell_line Metadata_concentration_um Cells_AreaShape_BoundingBoxArea Cells_AreaShape_Eccentricity Cells_Intensity_MeanIntensity_Mito Cells_Texture_Correlation_RNA_3_0_256 Cytoplasm_AreaShape_Area Cytoplasm_Intensity_MeanIntensity_AGP Nuclei_AreaShape_Area Nuclei_AreaShape_Eccentricity Nuclei_Intensity_MeanIntensity_DNA
0 Compound_A HeLa 10.0 11.558693 -0.724044 0.589681 0.794162 0.610831 0.353012 0.313659 -0.219728 -0.049983
1 Compound_B HeLa 10.0 -7.189962 -1.316052 0.624929 -0.656629 -0.462245 -0.835377 -0.379430 -0.302806 -0.737681
2 DMSO HeLa 0.0 0.148573 0.155330 0.160923 0.041498 -0.131285 -0.446743 0.228724 0.140684 0.097928
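
Before collapsing replicates it helps to confirm how many profiles fall into each group. The sketch below counts replicates and then takes the median per group with plain pandas. The table and its values are hypothetical, and the code illustrates the idea rather than the internals of consensus().

```python
import pandas as pd

# Hypothetical normalized well profiles: four replicates per treatment,
# with one outlying Compound_A well (30.0)
toy_wells = pd.DataFrame({
    "Metadata_treatment": ["DMSO"] * 4 + ["Compound_A"] * 4,
    "Metadata_Plate": ["P1", "P1", "P2", "P2"] * 2,
    "Cells_AreaShape_BoundingBoxArea": [-0.2, 0.1, 0.3, 0.0, 11.0, 11.5, 12.0, 30.0],
})

replicate_columns = ["Metadata_treatment"]
feature_columns = ["Cells_AreaShape_BoundingBoxArea"]

# How many profiles will be collapsed into each consensus row?
replicate_counts = toy_wells.groupby(replicate_columns).size()

# Median across replicates, one row per condition
toy_consensus = toy_wells.groupby(replicate_columns, as_index=False)[feature_columns].median()

print(replicate_counts)
print(toy_consensus)
```

The outlying well barely moves the Compound_A median (11.75), whereas the mean of the same four values would be 16.125.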

Summary

You have now processed a synthetic Cell Painting-style dataset through the complete Pycytominer pipeline. Here is a recap of what each step accomplished:

| Step | Function | Rows | Features |
| --- | --- | --- | --- |
| Raw single-cell data | | 1,200 cells | 11 |
| After aggregate() | pool cells per well | 12 wells | 11 |
| After annotate() | add treatment labels | 12 wells | 11 |
| After normalize() | z-score vs DMSO | 12 wells | 11 |
| After feature_select() | remove uninformative features | 12 wells | 9 |
| After consensus() | collapse replicates | 3 conditions | 9 |

The final consensus_profiles DataFrame contains one row per biological treatment condition and nine informative morphological features — a compact, analysis-ready representation of how each treatment changed the appearance of cells.

Saving Your Profiles

Pycytominer provides cyto_utils.output() as its canonical function for writing profiles to disk — the same function each pipeline step calls internally when you pass an output_file argument. It handles compression, format selection, and file naming in one call, and supports four output types:

| output_type | Extension | Best for |
| --- | --- | --- |
| "csv" (default) | .csv.gz | Gzip-compressed, readable by any tool |
| "parquet" | .parquet | Faster reads and smaller files for large screens |
| "anndata_h5ad" | .h5ad | AnnData / scanpy workflows |
| "anndata_zarr" | .zarr | Cloud-native AnnData storage |

from pycytominer.cyto_utils import output

# Gzip-compressed CSV (default) — small footprint, readable by any tool
output(
    df=consensus_profiles,
    output_filename="consensus_profiles.csv.gz",
    output_type="csv",
)

# Parquet — fast reads and efficient storage for large screens
output(
    df=consensus_profiles,
    output_filename="consensus_profiles.parquet",
    output_type="parquet",
)

# AnnData HDF5 — ready for scanpy, scverse, and single-cell workflows
output(
    df=consensus_profiles,
    output_filename="consensus_profiles.h5ad",
    output_type="anndata_h5ad",
)
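
After writing a file it is reasonable to read it back and confirm nothing was lost. The sketch below does a round trip for a gzip-compressed CSV. To stay self-contained it writes with pandas' own to_csv rather than cyto_utils.output(), and the table and file name are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical consensus table
toy_consensus = pd.DataFrame({
    "Metadata_treatment": ["Compound_A", "DMSO"],
    "Cells_AreaShape_BoundingBoxArea": [11.75, 0.05],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "toy_consensus.csv.gz"
    # Stand-in for the write step; index=False avoids an extra unnamed column
    toy_consensus.to_csv(csv_path, index=False, compression="gzip")
    # pandas infers gzip compression from the .gz extension
    reloaded = pd.read_csv(csv_path)

# Raises AssertionError if columns, dtypes, or values differ
pd.testing.assert_frame_equal(toy_consensus, reloaded)
print("Round trip OK:", reloaded.shape)
```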

Pro tip: Every pipeline function accepts an output_file argument that writes directly to disk and returns the file path instead of a DataFrame. This avoids storing intermediate results in memory for large datasets:

consensus_profiles = consensus(
    profiles=selected_profiles,
    replicate_columns=["Metadata_treatment", "Metadata_cell_line", "Metadata_concentration_um"],
    operation="median",
    features="infer",
    output_file="consensus_profiles.parquet",
    output_type="parquet",
)
# consensus_profiles is now the file path string, not a DataFrame

What to Do Next

With morphology profiles in hand, common next steps include:

  • Phenotypic clustering — group treatments by morphological similarity using hierarchical clustering or UMAP

  • Similarity analysis — identify compounds that produce the same cellular phenotype using correlation or cosine similarity metrics

  • Classification — train machine-learning models to predict a compound’s mechanism of action from its morphological profile

  • Dimensionality reduction — visualise the morphological space of an entire compound library in two dimensions using PCA or UMAP

  • Hit calling — identify which compounds produce a statistically significant morphological change relative to controls. copairs computes mean Average Precision (mAP) to score phenotypic activity and consistency at the well/profile level; Buscar operates directly on single-cell distributions to capture cellular heterogeneity and flag off-target effects
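
As a small taste of the similarity analysis mentioned above, the sketch below computes pairwise cosine similarity between three hypothetical profile vectors with NumPy. The values are invented for illustration and are not taken from the tutorial's results.

```python
import numpy as np

# Hypothetical consensus profiles (rows) over three features (columns)
labels = ["Compound_X", "Compound_Y", "Compound_Z"]
profiles = np.array([
    [2.0, -1.0, 0.5],
    [1.8, -0.9, 0.4],   # nearly the same direction as Compound_X
    [-1.5, 1.2, 0.0],   # roughly the opposite direction
])

# Scale each profile to unit length, then take all pairwise dot products
unit_profiles = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
cosine_similarity = unit_profiles @ unit_profiles.T

for label, row in zip(labels, cosine_similarity.round(3)):
    print(label, row)
```

Values near 1 suggest two treatments shift morphology in the same direction, while negative values suggest opposing shifts; cosine similarity ignores the magnitude of the effect.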

Further Reading


Pycytominer in the Wild

Pycytominer is used across some of the largest and most impactful image-based profiling initiatives in the world. Here are a few to spark your curiosity:


🧬 JUMP-CP — Joint Undertaking for Morphological Profiling

The largest public Cell Painting dataset ever produced, generated by a consortium of 13 pharmaceutical companies and academic institutions (including AstraZeneca, Bayer, Pfizer, Merck KGaA, and the Broad Institute). JUMP-CP profiled over 116,000 compounds and ~15,000 genetic perturbations, with all profiles processed using Pycytominer. The resulting resource is used to predict compound activity, identify drug mechanisms, and match small molecules to disease phenotypes — at industrial scale.


🔬 LINCS Cell Painting — Library of Integrated Network-based Cellular Signatures

An NIH-funded initiative that profiled 1,571 bioactive compounds across six doses and five replicates in A549 lung cancer cells. Pycytominer was adopted as the primary profiling tool for this dataset, producing normalized and feature-selected profiles (Levels 3–5) that are publicly available for download. LINCS demonstrated that image-based profiles could serve as a systematic, reproducible reference map of cellular responses to chemical perturbation.


🌍 EU-OPENSCREEN — European Chemical Biology Research Infrastructure

A distributed pan-European research infrastructure spanning 30 partner sites across eight countries. EU-OPENSCREEN has integrated Cell Painting into its screening platform, enabling European academic and industry researchers to access high-content imaging and morphological profiling as a service. Their contributions to the JUMP-CP consortium extended the reach of image-based profiling into the broader European drug discovery community.


🖼️ Cell Painting Gallery — Broad Institute Open Dataset Collection

A growing public repository of Cell Painting datasets, hosted on AWS as open data and maintained by the Carpenter–Singh and Cimini labs at the Broad Institute. The gallery spans tens of thousands of small-molecule treatments across diverse cell lines and experimental designs — all freely accessible and ready for analysis. It is the canonical reference point for new Cell Painting datasets produced by the community.


These resources process their raw CellProfiler outputs through the same Pycytominer pipeline you just ran — the only difference is scale.