Single-cell image-based profiling

A complete single-cell processing pipeline with Pycytominer

High-content microscopy experiments can produce thousands of single-cell measurements per image. Working at single-cell resolution (rather than first aggregating cells into well-level profiles) preserves the full diversity of cellular responses: rare subpopulations, bimodal distributions, and heterogeneous drug effects that vanish in the average.
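To make the point about averages concrete, here is a minimal sketch with made-up numbers. The 80/20 split between non-responding and responding cells and the effect size are assumptions chosen for illustration, not data from this tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical well: 80% of cells do not respond, 20% respond strongly.
non_responders = rng.normal(loc=0.0, scale=0.3, size=800)
responders = rng.normal(loc=8.0, scale=0.3, size=200)
well = np.concatenate([non_responders, responders])

# The single value a well-level (mean) profile would keep for this feature
well_mean = well.mean()

# Fraction of cells that actually sit within one unit of that mean
near_mean = np.mean(np.abs(well - well_mean) < 1.0)

# Fraction of cells in the responding subpopulation
responder_fraction = np.mean(well > 3.0)

print(f"well mean:           {well_mean:.2f}")
print(f"cells near the mean: {near_mean:.1%}")
print(f"responder fraction:  {responder_fraction:.1%}")
```

The well mean lands between the two subpopulations, where almost no individual cell is found, while the single-cell view still shows that one cell in five responded.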

Single-cell profiling introduces a challenge that well-level profiling sidesteps: not every detected object is a real, well-segmented cell. Debris, out-of-focus objects, and fused cells contaminate the feature matrix and distort downstream analyses. A quality-control step is therefore essential before dimensionality reduction, clustering, or hit calling.

This tutorial walks through a complete single-cell processing pipeline that starts from CytoTable output and uses coSMicQC for quality control:

  1. Load: read the joined single-cell Parquet file produced by CytoTable

  2. Annotate: attach experimental metadata and QC flags from coSMicQC

  3. Normalize: drop QC outliers and z-score features against DMSO controls

  4. Feature select: drop redundant and uninformative features

The result is a clean, normalized single-cell feature matrix ready for dimensionality reduction, clustering, or further aggregation.

New to Pycytominer? Read the Introduction to Pycytominer tutorial first. This tutorial assumes familiarity with the core pipeline steps.

flowchart TD
    cytotable["CytoTable output<br/>single_cells.parquet, 1200 cells"]
    qcfile["coSMicQC output<br/>qc.parquet, QC annotations"]
    ann["annotate()<br/>Add platemap + QC flags"]
    nor["normalize()<br/>Drop QC outliers · Z-score vs DMSO"]
    fea["feature_select()<br/>Remove redundant features"]
    output["Single-cell profiles<br/>~1174 cells, 10 features"]
    cytotable --> ann
    qcfile --> ann
    ann --> nor --> fea --> output
    style cytotable fill:#f0d9fa,stroke:#88239A,color:#111
    style qcfile fill:#f0d9fa,stroke:#88239A,color:#111
    style output fill:#f0d9fa,stroke:#88239A,color:#111
    style ann fill:#ffffff,stroke:#88239A,color:#111
    style nor fill:#ffffff,stroke:#88239A,color:#111
    style fea fill:#ffffff,stroke:#88239A,color:#111

Prerequisites

Install the required packages:

pip install pycytominer coSMicQC pyarrow pandas numpy

This tutorial uses simulated data that matches the exact schema produced by CytoTable and coSMicQC. In a real experiment, replace the simulation block with your own single_cells.parquet and qc.parquet files.

[1]:
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

from pycytominer import annotate, feature_select, normalize

# Reproducible random state used throughout the simulation
rng = np.random.default_rng(42)

# Temporary directory — stands in for the output directory on your filesystem
tmp_dir = Path(tempfile.mkdtemp())
print(f"Working directory: {tmp_dir}")
Working directory: /var/folders/02/q30k_4wn2dqbz5pj_vvc8xn40000gp/T/tmp57clvnip

Input: CytoTable Single-Cell Data

CytoTable converts CellProfiler SQLite or CSV output into a single analysis-ready Parquet file. Each row represents one segmented object (a cell), and columns fall into three groups:

| Group | Example columns | Purpose |
|---|---|---|
| Metadata_* | Metadata_Plate, Metadata_Well, Metadata_ImageNumber, Metadata_ObjectNumber | Describe the experiment |
| cytotable_meta_* | cytotable_meta_source_path, cytotable_meta_offset | CytoTable provenance. Pycytominer ignores these automatically |
| Feature columns | Cells_AreaShape_Area, Nuclei_Intensity_MeanIntensity_DNA | Morphology measurements per single cell |

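Metadata_ImageNumber and Metadata_ObjectNumber together identify a cell within a plate. Because image numbers restart on every plate, this tutorial joins the single-cell data to the coSMicQC annotations on a four-part key: Metadata_Plate, Metadata_Well, Metadata_ImageNumber, and Metadata_ObjectNumber.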
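A quick way to check a candidate join key is to count duplicated key combinations. The sketch below uses a tiny hypothetical table in which two plates reuse the same image and object numbers; the data are invented for illustration.

```python
import pandas as pd

# Two hypothetical plates that reuse the same image and object numbers
cells = pd.DataFrame({
    "Metadata_Plate": ["Plate_1", "Plate_1", "Plate_2", "Plate_2"],
    "Metadata_Well": ["B02", "B02", "B02", "B02"],
    "Metadata_ImageNumber": [1, 1, 1, 1],
    "Metadata_ObjectNumber": [1, 2, 1, 2],
})

two_part_key = ["Metadata_ImageNumber", "Metadata_ObjectNumber"]
four_part_key = ["Metadata_Plate", "Metadata_Well", *two_part_key]

# duplicated() marks every repeat of a key combination after its first occurrence
dupes_two = int(cells.duplicated(subset=two_part_key).sum())
dupes_four = int(cells.duplicated(subset=four_part_key).sum())

print(f"duplicates with two-part key:  {dupes_two}")
print(f"duplicates with four-part key: {dupes_four}")
```

A key with zero duplicates can be merged one-to-one; any duplicates would silently multiply rows in a join.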

Note on cytotable_meta_* columns: These provenance columns track source-file offsets for CytoTable’s internal bookkeeping. Pycytominer’s feature inference uses CellProfiler compartment prefixes (Cells_, Cytoplasm_, Nuclei_) and ignores them automatically. They pass through annotate() unchanged and are dropped at the normalize() step.
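The prefix convention can be illustrated with plain Python. This is a simplified sketch of prefix-based column grouping, not Pycytominer's actual inference code; the helper name split_columns is hypothetical.

```python
# Compartment prefixes named in the note above
COMPARTMENT_PREFIXES = ("Cells_", "Cytoplasm_", "Nuclei_")


def split_columns(columns):
    """Group column names by prefix (hypothetical helper for illustration)."""
    groups = {"metadata": [], "features": [], "other": []}
    for col in columns:
        if col.startswith("Metadata_"):
            groups["metadata"].append(col)
        elif col.startswith(COMPARTMENT_PREFIXES):
            groups["features"].append(col)
        else:
            # e.g. cytotable_meta_* provenance columns
            groups["other"].append(col)
    return groups


columns = [
    "Metadata_Plate",
    "Metadata_Well",
    "cytotable_meta_source_path",
    "cytotable_meta_offset",
    "Cells_AreaShape_Area",
    "Nuclei_Intensity_MeanIntensity_DNA",
]
groups = split_columns(columns)
for name, cols in groups.items():
    print(f"{name}: {cols}")
```

Columns that match neither the Metadata_ prefix nor a compartment prefix fall into the "other" group, which is where the cytotable_meta_* columns end up.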

The simulation code is shown below in Steps A to C. Skip it to go straight to loading the data.

In a real experiment these files come from running CytoTable and coSMicQC on your CellProfiler output. The functions below reproduce their output schemas using synthetic data.

Step A — simulate CytoTable single-cell data

WELLS = {
    "B02": "DMSO",       "C02": "DMSO",
    "B03": "Compound_A", "C03": "Compound_A",
    "B04": "Compound_B", "C04": "Compound_B",
}
N_CELLS_PER_WELL = 100

def simulate_cytotable(plate_id: str) -> pd.DataFrame:
    """Generate a synthetic CytoTable-style single-cell DataFrame."""
    rows = []
    for img_num, (well, treatment) in enumerate(WELLS.items(), start=1):
        is_a = float(treatment == "Compound_A")
        is_b = float(treatment == "Compound_B")
        cell_areas   = rng.normal(500 + 180 * is_a - 90 * is_b, 120, N_CELLS_PER_WELL)
        nuclei_areas = rng.normal(195, 55, N_CELLS_PER_WELL)
        for obj_num in range(1, N_CELLS_PER_WELL + 1):
            rows.append({
                # ── CytoTable metadata ──────────────────────────────────
                "Metadata_Plate": plate_id,
                "Metadata_Well": well,
                "Metadata_ImageNumber": img_num,
                "Metadata_ObjectNumber": obj_num,
                # CytoTable provenance columns
                "cytotable_meta_source_path": f"/data/{plate_id}/images/",
                "cytotable_meta_offset": (img_num - 1) * N_CELLS_PER_WELL + obj_num,
                "cytotable_meta_rownum": obj_num,
                # ── Feature columns ─────────────────────────────────────
                "Cells_AreaShape_Area": cell_areas[obj_num - 1],
                # Strongly correlated with Cells_AreaShape_Area by construction
                "Cells_AreaShape_BoundingBoxArea": (
                    cell_areas[obj_num - 1] * 1.3 + rng.normal(0, 4)
                ),
                # Constant feature (zero variance)
                "Cells_AreaShape_EulerNumber": 1,
                "Cells_AreaShape_Eccentricity": float(
                    np.clip(rng.normal(0.55, 0.12), 0, 1)
                ),
                "Cells_Intensity_MeanIntensity_Mito": rng.normal(0.30, 0.06),
                "Cells_Texture_Correlation_RNA_3_0_256": rng.normal(0.22, 0.06),
                "Cytoplasm_AreaShape_Area": rng.normal(310, 80),
                "Cytoplasm_Intensity_MeanIntensity_AGP": rng.normal(0.25, 0.07),
                "Nuclei_AreaShape_Area": nuclei_areas[obj_num - 1],
                "Nuclei_AreaShape_Eccentricity": float(
                    np.clip(rng.normal(0.40, 0.10), 0, 1)
                ),
                "Nuclei_Intensity_MeanIntensity_DNA": rng.normal(0.50, 0.08),
                "Nuclei_Intensity_MassDisplacement_DNA": abs(rng.normal(6, 4)),
            })
    return pd.DataFrame(rows)

Step B — simulate coSMicQC QC annotations

In coSMicQC, label_outliers(..., export_as_annotations=True) writes a compact Parquet file containing only the join-key columns and the boolean Metadata_cqc_* flags. The function below mimics that schema.

def simulate_qc_parquet(sc_df: pd.DataFrame) -> pd.DataFrame:
    """Reproduce the annotation schema produced by coSMicQC label_outliers()."""

    join_keys = [
        "Metadata_Plate",
        "Metadata_Well",
        "Metadata_ImageNumber",
        "Metadata_ObjectNumber",
    ]

    qc = sc_df[join_keys].copy()

    nuc_area = sc_df["Nuclei_AreaShape_Area"]
    nuc_z = (nuc_area - nuc_area.mean()) / nuc_area.std()

    mass_disp = sc_df["Nuclei_Intensity_MassDisplacement_DNA"]
    mass_disp_z = (mass_disp - mass_disp.mean()) / mass_disp.std()

    qc["Metadata_cqc_large_nuclear_size_is_outlier"] = nuc_z > 2.5
    qc["Metadata_cqc_small_nuclear_size_is_outlier"] = nuc_z < -2.5
    qc["Metadata_cqc_poor_segmentation_is_outlier"] = mass_disp_z > 2.5

    return qc

Step C — build two plates and write to disk

plate1 = simulate_cytotable("Plate_1")
plate2 = simulate_cytotable("Plate_2")
single_cells_raw = pd.concat([plate1, plate2], ignore_index=True)

qc_annotations_raw = simulate_qc_parquet(single_cells_raw)

sc_path = tmp_dir / "single_cells.parquet"
qc_path = tmp_dir / "qc.parquet"
single_cells_raw.to_parquet(sc_path, index=False)
qc_annotations_raw.to_parquet(qc_path, index=False)

print(f"single_cells.parquet  {single_cells_raw.shape[0]:,} rows x {single_cells_raw.shape[1]} cols")
print(f"qc.parquet            {qc_annotations_raw.shape[0]:,} rows x {qc_annotations_raw.shape[1]} cols")
print(f"\nqc.parquet columns: {list(qc_annotations_raw.columns)}")
[3]:
# Load the CytoTable parquet from disk
single_cells = pd.read_parquet(sc_path)

print(
    f"Loaded {len(single_cells):,} single cells across "
    f"{single_cells['Metadata_Plate'].nunique()} plates and "
    f"{single_cells['Metadata_Well'].nunique()} unique wells"
)
# Feature columns are everything that is neither metadata nor CytoTable provenance
feature_cols = [
    c for c in single_cells.columns
    if not c.startswith(("Metadata_", "cytotable_"))
]
print(f"\nFeature columns ({len(feature_cols)}): {feature_cols}")
single_cells.head(3)
Loaded 1,200 single cells across 2 plates and 6 unique wells

Feature columns (12): ['Cells_AreaShape_Area', 'Cells_AreaShape_BoundingBoxArea', 'Cells_AreaShape_EulerNumber', 'Cells_AreaShape_Eccentricity', 'Cells_Intensity_MeanIntensity_Mito', 'Cells_Texture_Correlation_RNA_3_0_256', 'Cytoplasm_AreaShape_Area', 'Cytoplasm_Intensity_MeanIntensity_AGP', 'Nuclei_AreaShape_Area', 'Nuclei_AreaShape_Eccentricity', 'Nuclei_Intensity_MeanIntensity_DNA', 'Nuclei_Intensity_MassDisplacement_DNA']
[3]:
Metadata_Plate Metadata_Well Metadata_ImageNumber Metadata_ObjectNumber cytotable_meta_source_path cytotable_meta_offset cytotable_meta_rownum Cells_AreaShape_Area Cells_AreaShape_BoundingBoxArea Cells_AreaShape_EulerNumber Cells_AreaShape_Eccentricity Cells_Intensity_MeanIntensity_Mito Cells_Texture_Correlation_RNA_3_0_256 Cytoplasm_AreaShape_Area Cytoplasm_Intensity_MeanIntensity_AGP Nuclei_AreaShape_Area Nuclei_AreaShape_Eccentricity Nuclei_Intensity_MeanIntensity_DNA Nuclei_Intensity_MassDisplacement_DNA
0 Plate_1 B02 1 1 /data/Plate_1/images/ 1 1 536.566050 698.886163 1 0.718898 0.305435 0.258636 145.986232 0.246590 174.201060 0.315677 0.402495 2.487391
1 Plate_1 B02 1 2 /data/Plate_1/images/ 2 2 375.201907 486.425986 1 0.659908 0.220416 0.221838 271.266445 0.227063 266.457556 0.500276 0.543049 11.349592
2 Plate_1 B02 1 3 /data/Plate_1/images/ 3 3 590.054143 766.452364 1 0.466487 0.286568 0.234550 324.125869 0.174093 175.405482 0.409049 0.518258 16.069896

Background: Single-cell quality control with coSMicQC [Optional]

coSMicQC is a Python package from the Way Lab that systematically identifies segmentation artifacts, for example:

| Artifact | Morphological signature | Biological cause |
|---|---|---|
| Debris / background | Very small nucleus; low DNA intensity | Out-of-focus plane, dust on coverslip |
| Over-segmented nucleus | Nucleus area far above the population mean | One nucleus split into multiple objects |
| Touching / fused cells | Very high mass displacement from multiple objects | Adjacent cells merged into a single object |

How coSMicQC flags outliers

coSMicQC computes a z-score for each quality-relevant feature across the entire experiment. Cells whose z-scores fall outside user-defined thresholds are flagged as outliers. Thresholds are signed:

  • A negative threshold (e.g. −2.5) flags cells where the feature is unusually small (debris, broken nuclei).

  • A positive threshold (e.g. +2.5) flags cells where the feature is unusually large (fused or over-segmented objects).
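The signed-threshold rule described above can be sketched in a few lines of pandas. This is an illustration of the idea only, not coSMicQC's implementation; the helper name flag_by_signed_threshold and the example areas are hypothetical, and the sample standard deviation is an assumption.

```python
import pandas as pd


def flag_by_signed_threshold(values: pd.Series, threshold: float) -> pd.Series:
    """Return True where the z-score lies beyond a signed threshold."""
    # Assumption: z-scores use the sample standard deviation (pandas default)
    z = (values - values.mean()) / values.std()
    # Negative threshold: flag unusually small values; positive: unusually large
    return z < threshold if threshold < 0 else z > threshold


# Hypothetical nuclear areas: 20 typical nuclei, one tiny object, one huge object
nuclei_area = pd.Series([190.0, 195.0, 200.0, 205.0, 210.0] * 4 + [20.0, 380.0])

too_small = flag_by_signed_threshold(nuclei_area, -2.5)
too_large = flag_by_signed_threshold(nuclei_area, 2.5)

print("flagged as too small:", nuclei_area[too_small].tolist())
print("flagged as too large:", nuclei_area[too_large].tolist())
```

Using the sign of the threshold to pick the comparison direction lets the same feature appear in two separate conditions, one for each tail.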

The main entry point is label_outliers(), which accepts a dictionary of named QC conditions. Each condition name becomes part of the output column name, making the reason for each flag explicit and auditable:

import cosmicqc

labeled = cosmicqc.label_outliers(
    df=single_cells,
    feature_thresholds={
        # Flag nuclei that are too small (debris)
        "small_nuclear_size": {
            "Nuclei_AreaShape_Area": -2.5,
        },
        # Flag nuclei that are too large (over-segmented)
        "large_nuclear_size": {
            "Nuclei_AreaShape_Area": 2.5,
        },
        # Flag cells with an abnormally high nuclear mass displacement
        # (a hallmark of touching or merged nuclei in one object)
        "poor_segmentation": {
            "Nuclei_Intensity_MassDisplacement_DNA": 2.5,
        },
    },
    include_threshold_scores=True,   # also write z-score columns for auditing
    export_path="qc.parquet",
    export_as_annotations=True,      # write compact annotation file only
    annotation_metadata_columns=[
        "Metadata_Plate", "Metadata_Well",
        "Metadata_ImageNumber", "Metadata_ObjectNumber",
    ],
)

The qc.parquet annotation file

When export_as_annotations=True, coSMicQC writes a compact annotation file called qc.parquet, which contains only the join-key metadata columns and the Metadata_cqc_* flag columns (not the full feature table). This makes qc.parquet lightweight and easy to share independently of the raw single-cell data.

Each Metadata_cqc_<condition>_is_outlier column is a boolean: True = flagged, False = passes that QC check. A cell must pass all conditions to be included in downstream analysis.
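Because a cell must pass every condition, the per-condition flags combine with a logical OR. The sketch below shows this on a small hypothetical annotation table; the flag values are invented.

```python
import pandas as pd

# Hypothetical annotation table with three QC conditions
qc = pd.DataFrame({
    "Metadata_ObjectNumber": [1, 2, 3, 4],
    "Metadata_cqc_large_nuclear_size_is_outlier": [False, True, False, False],
    "Metadata_cqc_small_nuclear_size_is_outlier": [False, False, False, True],
    "Metadata_cqc_poor_segmentation_is_outlier": [False, True, False, False],
})

flag_cols = [
    c for c in qc.columns
    if c.startswith("Metadata_cqc_") and c.endswith("_is_outlier")
]

# A cell fails QC if ANY condition flags it
fails_any = qc[flag_cols].any(axis=1)
passing = qc.loc[~fails_any]

print(f"failed: {int(fails_any.sum())}, passing: {len(passing)}")
print("passing objects:", passing["Metadata_ObjectNumber"].tolist())
```

Object 2 is flagged by two conditions but is still counted once, which is why the number of failing cells can be smaller than the sum of the per-condition counts.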

[4]:
# Load the coSMicQC annotation file and inspect its contents
qc_annotations = pd.read_parquet(qc_path)

print("coSMicQC annotation columns:")
for col in qc_annotations.columns:
    print(f"  {col}")

outlier_cols = [c for c in qc_annotations.columns if c.endswith("_is_outlier")]
print()
for col in outlier_cols:
    n_flagged = qc_annotations[col].sum()
    print(
        f"  {col}: {n_flagged:,} cells flagged ({100 * n_flagged / len(qc_annotations):.1f}%)"
    )
coSMicQC annotation columns:
  Metadata_Plate
  Metadata_Well
  Metadata_ImageNumber
  Metadata_ObjectNumber
  Metadata_cqc_large_nuclear_size_is_outlier
  Metadata_cqc_small_nuclear_size_is_outlier
  Metadata_cqc_poor_segmentation_is_outlier

  Metadata_cqc_large_nuclear_size_is_outlier: 5 cells flagged (0.4%)
  Metadata_cqc_small_nuclear_size_is_outlier: 9 cells flagged (0.8%)
  Metadata_cqc_poor_segmentation_is_outlier: 12 cells flagged (1.0%)

Step 1: Annotate

annotate() does two jobs in a single call:

  1. Plate-map join: attaches the biological condition (treatment, cell line, concentration) recorded for each well to every cell in that well, using the platemap and join_on parameters.

  2. External metadata merge: merges any additional per-cell metadata DataFrame or file, using the external_metadata and external_join_on parameters. The most common use case is a qc.parquet file from coSMicQC: passing it as external_metadata adds the Metadata_cqc_* flag columns directly to the annotated profiles.

| Parameter | Description |
|---|---|
| platemap | Maps well positions to treatment conditions |
| join_on | Column pair [platemap_col, profiles_col] for the well-position join |
| external_metadata | Path to qc.parquet (or any additional metadata DataFrame) |
| external_join_on | Column(s) shared by profiles and external metadata (here the four-part cell identity key) |

After annotate() runs, the Metadata_cqc_* flag columns are present on every row and flow straight into normalize(), which applies the QC filter internally via drop_cosmicqc_rows=True.
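Conceptually, the two jobs correspond to two left merges. The sketch below reproduces them with plain pandas on a toy table. It is an approximation for intuition, not Pycytominer's implementation, and the toy values are invented.

```python
import pandas as pd

keys = [
    "Metadata_Plate",
    "Metadata_Well",
    "Metadata_ImageNumber",
    "Metadata_ObjectNumber",
]

# Toy single-cell table (hypothetical values)
cells = pd.DataFrame({
    "Metadata_Plate": ["Plate_1"] * 3,
    "Metadata_Well": ["B02", "B02", "B03"],
    "Metadata_ImageNumber": [1, 1, 3],
    "Metadata_ObjectNumber": [1, 2, 1],
    "Cells_AreaShape_Area": [510.0, 470.0, 690.0],
})
platemap = pd.DataFrame({
    "well_position": ["B02", "B03"],
    "treatment": ["DMSO", "Compound_A"],
})
qc = cells[keys].assign(
    Metadata_cqc_poor_segmentation_is_outlier=[False, True, False]
)

# Job 1: per-well join. Platemap columns get a Metadata_ prefix first.
platemap_meta = platemap.add_prefix("Metadata_")
with_treatment = cells.merge(
    platemap_meta,
    left_on="Metadata_Well",
    right_on="Metadata_well_position",
    how="left",
).drop(columns="Metadata_well_position")

# Job 2: per-cell join on the four-part key; validate guards against
# duplicated keys silently multiplying rows.
annotated = with_treatment.merge(qc, on=keys, how="left", validate="one_to_one")

print(annotated[["Metadata_Well", "Metadata_treatment",
                 "Metadata_cqc_poor_segmentation_is_outlier"]])
```

The first merge is many-to-one (many cells per well), while the second is one-to-one (one QC record per cell), which is why the two joins need different keys.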

[5]:
platemap = pd.DataFrame({
    "well_position": ["B02", "C02", "B03", "C03", "B04", "C04"],
    "treatment": [
        "DMSO",
        "DMSO",
        "Compound_A",
        "Compound_A",
        "Compound_B",
        "Compound_B",
    ],
    "cell_line": ["HeLa"] * 6,
    "concentration_um": [0.0, 0.0, 10.0, 10.0, 5.0, 5.0],
})
platemap
[5]:
well_position treatment cell_line concentration_um
0 B02 DMSO HeLa 0.0
1 C02 DMSO HeLa 0.0
2 B03 Compound_A HeLa 10.0
3 C03 Compound_A HeLa 10.0
4 B04 Compound_B HeLa 5.0
5 C04 Compound_B HeLa 5.0
[6]:
join_keys = [
    "Metadata_Plate",
    "Metadata_Well",
    "Metadata_ImageNumber",
    "Metadata_ObjectNumber",
]

# annotate() merges the plate map AND the QC annotation file in a single call.
# The qc.parquet columns already carry the Metadata_ prefix, so they pass through
# prepare_external_metadata_for_annotate() unchanged.
annotated_cells = annotate(
    profiles=single_cells,
    platemap=platemap,
    join_on=["Metadata_well_position", "Metadata_Well"],
    add_metadata_id_to_platemap=True,
    external_metadata=str(qc_path),
    external_join_on=join_keys,
)

new_cols = [c for c in annotated_cells.columns if c not in single_cells.columns]
qc_cols = [c for c in new_cols if "cqc" in c]
print(f"New columns: {new_cols}")
print(f"QC flag columns: {qc_cols}")
print(
    f"\nCells flagged by any QC condition: "
    f"{annotated_cells[qc_cols].any(axis=1).sum():,} "
    f"({100 * annotated_cells[qc_cols].any(axis=1).mean():.1f}%)"
)
print()
annotated_cells[
    [c for c in annotated_cells.columns if c.startswith("Metadata_")]
].drop_duplicates(subset=["Metadata_Well"]).head()
New columns: ['Metadata_treatment', 'Metadata_cell_line', 'Metadata_concentration_um', 'Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']
QC flag columns: ['Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']

Cells flagged by any QC condition: 26 (2.2%)

[6]:
Metadata_treatment Metadata_cell_line Metadata_concentration_um Metadata_Plate Metadata_Well Metadata_ImageNumber Metadata_ObjectNumber Metadata_cqc_large_nuclear_size_is_outlier Metadata_cqc_small_nuclear_size_is_outlier Metadata_cqc_poor_segmentation_is_outlier
0 DMSO HeLa 0.0 Plate_1 B02 1 1 False False False
200 DMSO HeLa 0.0 Plate_1 C02 2 1 False False False
400 Compound_A HeLa 10.0 Plate_1 B03 3 1 False False False
600 Compound_A HeLa 10.0 Plate_1 C03 4 1 False False False
800 Compound_B HeLa 5.0 Plate_1 B04 5 1 False False False

Step 2: Normalize

Raw CellProfiler features vary in scale (cell area in pixels², intensities in 0–1) and are influenced by plate-to-plate technical effects. Normalization places all features on a common scale and limits plate-to-plate variation by z-scoring each feature relative to the DMSO control cells.

Passing drop_cosmicqc_rows=True tells normalize() to drop every row where any Metadata_cqc_* flag is True before computing the z-scores, so QC filtering and normalization happen in a single call.
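The order of operations matters: flagged cells are removed first so that they cannot distort the control statistics. The following pandas sketch illustrates the idea on invented values. It is not Pycytominer's code, and the use of the population standard deviation (ddof=0) is an assumption.

```python
import pandas as pd

# Toy annotated table (hypothetical values); one DMSO cell is a QC outlier
cells = pd.DataFrame({
    "Metadata_treatment": ["DMSO", "DMSO", "DMSO", "DMSO",
                           "Compound_A", "Compound_A"],
    "Metadata_cqc_poor_segmentation_is_outlier": [False, False, False, True,
                                                  False, False],
    "Cells_AreaShape_Area": [480.0, 500.0, 520.0, 2000.0, 700.0, 660.0],
})

flag_cols = [c for c in cells.columns if c.startswith("Metadata_cqc_")]
feature_cols = [
    c for c in cells.columns
    if c.startswith(("Cells_", "Cytoplasm_", "Nuclei_"))
]

# 1. Drop every row flagged by any QC condition
kept = cells.loc[~cells[flag_cols].any(axis=1)].copy()

# 2. Estimate center and scale from the remaining DMSO cells only
controls = kept.query("Metadata_treatment == 'DMSO'")
center = controls[feature_cols].mean()
scale = controls[feature_cols].std(ddof=0)  # assumption: population std

# 3. Apply the control-derived z-score to all remaining cells
kept[feature_cols] = (kept[feature_cols] - center) / scale

print(kept[["Metadata_treatment", "Cells_AreaShape_Area"]])
```

Had the 2000-pixel outlier stayed in the DMSO group, it would have inflated both the control mean and the control spread and shrunk the apparent effect of Compound_A.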

[7]:
# drop_cosmicqc_rows=True removes QC-flagged cells before z-scoring.
normalized_cells = normalize(
    profiles=annotated_cells,
    features="infer",
    meta_features="infer",
    samples="Metadata_treatment == 'DMSO'",
    method="standardize",
    drop_cosmicqc_rows=True,
)

n_removed = len(annotated_cells) - len(normalized_cells)
print(f"{'Total cells':<22} {len(annotated_cells):>6,}")
print(
    f"{'Removed (QC outliers)':<22} {n_removed:>6,}  ({100 * n_removed / len(annotated_cells):.1f}%)"
)
print(f"{'Retained':<22} {len(normalized_cells):>6,}")
print()
print(f"Normalized shape: {normalized_cells.shape}")
normalized_cells.head(3)
Total cells             1,200
Removed (QC outliers)      26  (2.2%)
Retained                1,174

Normalized shape: (1174, 22)
[7]:
Metadata_treatment Metadata_cell_line Metadata_concentration_um Metadata_Plate Metadata_Well Metadata_ImageNumber Metadata_ObjectNumber Metadata_cqc_large_nuclear_size_is_outlier Metadata_cqc_small_nuclear_size_is_outlier Metadata_cqc_poor_segmentation_is_outlier ... Cells_AreaShape_EulerNumber Cells_AreaShape_Eccentricity Cells_Intensity_MeanIntensity_Mito Cells_Texture_Correlation_RNA_3_0_256 Cytoplasm_AreaShape_Area Cytoplasm_Intensity_MeanIntensity_AGP Nuclei_AreaShape_Area Nuclei_AreaShape_Eccentricity Nuclei_Intensity_MeanIntensity_DNA Nuclei_Intensity_MassDisplacement_DNA
0 DMSO HeLa 0.0 Plate_1 B02 1 1 False False False ... 0.0 1.292772 0.100004 0.733196 -2.107301 -0.009458 -0.380694 -0.907252 -1.166556 -1.028890
1 DMSO HeLa 0.0 Plate_1 B02 1 2 False False False ... 0.0 0.819530 -1.325156 0.121467 -0.481165 -0.293417 1.319231 1.025113 0.607217 1.555226
3 DMSO HeLa 0.0 Plate_1 B02 1 4 False False False ... 0.0 -0.883623 -0.280147 -1.368761 -0.591794 0.361401 0.749972 1.237712 -0.672132 -0.767620

3 rows × 22 columns


Step 3: Feature Selection

Even after QC and normalization, some features carry little information:

  • Low-variance features are nearly constant across all cells and cannot distinguish biological conditions.

  • Highly correlated feature pairs are redundant; keeping both double-weights that axis of variation in clustering and embeddings.

  • Blocklisted features are known to capture image artifacts rather than cell biology.

feature_select() applies all three filters, producing a lean feature matrix ready for single-cell analyses such as UMAP or hierarchical clustering.
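The three filters can be approximated with plain pandas to see what each one removes. This is a deliberately simplified sketch: the variance cutoff, the 0.9 correlation cutoff, the blocklist entry, and the rule for which member of a correlated pair is dropped are all assumptions, and Pycytominer's own criteria may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
area = rng.normal(0, 1, 200)

# Toy normalized feature matrix (hypothetical values)
features = pd.DataFrame({
    "Cells_AreaShape_Area": area,
    # Nearly a linear copy of the area feature
    "Cells_AreaShape_BoundingBoxArea": area * 1.3 + rng.normal(0, 0.01, 200),
    # Constant feature
    "Cells_AreaShape_EulerNumber": np.zeros(200),
    "Nuclei_Intensity_MeanIntensity_DNA": rng.normal(0, 1, 200),
})
blocklist = ["Hypothetical_Blocklisted_Feature"]  # placeholder name

# 1. Variance filter: drop (near-)constant features
variances = features.var()
low_variance = sorted(variances[variances < 1e-8].index)
remaining = features.drop(columns=low_variance)

# 2. Blocklist filter: drop known-problematic features if present
remaining = remaining.drop(columns=blocklist, errors="ignore")

# 3. Correlation filter: for each highly correlated pair, drop the later column
corr = remaining.corr().abs()
cols = list(corr.columns)
redundant = set()
for i, first in enumerate(cols):
    for second in cols[i + 1:]:
        if first in redundant or second in redundant:
            continue
        if corr.loc[first, second] > 0.9:
            redundant.add(second)

selected = remaining.drop(columns=sorted(redundant))
print("removed:", low_variance + sorted(redundant))
print("kept:   ", list(selected.columns))
```

Which feature survives from a correlated pair is arbitrary in this sketch, so the surviving column may differ from the one the tutorial's feature_select() call keeps.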

[8]:
selected_cells = feature_select(
    profiles=normalized_cells,
    features="infer",
    operation=["variance_threshold", "correlation_threshold", "blocklist"],
)

feature_cols_before = [
    c for c in normalized_cells.columns if not c.startswith("Metadata_")
]
feature_cols_after = [
    c for c in selected_cells.columns if not c.startswith("Metadata_")
]

print(f"Features before selection: {len(feature_cols_before)}")
print(f"Features after  selection: {len(feature_cols_after)}")
print(f"Features removed: {set(feature_cols_before) - set(feature_cols_after)}")
selected_cells.head(3)
Features before selection: 12
Features after  selection: 10
Features removed: {'Cells_AreaShape_Area', 'Cells_AreaShape_EulerNumber'}
[8]:
Metadata_treatment Metadata_cell_line Metadata_concentration_um Metadata_Plate Metadata_Well Metadata_ImageNumber Metadata_ObjectNumber Metadata_cqc_large_nuclear_size_is_outlier Metadata_cqc_small_nuclear_size_is_outlier Metadata_cqc_poor_segmentation_is_outlier Cells_AreaShape_BoundingBoxArea Cells_AreaShape_Eccentricity Cells_Intensity_MeanIntensity_Mito Cells_Texture_Correlation_RNA_3_0_256 Cytoplasm_AreaShape_Area Cytoplasm_Intensity_MeanIntensity_AGP Nuclei_AreaShape_Area Nuclei_AreaShape_Eccentricity Nuclei_Intensity_MeanIntensity_DNA Nuclei_Intensity_MassDisplacement_DNA
0 DMSO HeLa 0.0 Plate_1 B02 1 1 False False False 0.423873 1.292772 0.100004 0.733196 -2.107301 -0.009458 -0.380694 -0.907252 -1.166556 -1.028890
1 DMSO HeLa 0.0 Plate_1 B02 1 2 False False False -1.020114 0.819530 -1.325156 0.121467 -0.481165 -0.293417 1.319231 1.025113 0.607217 1.555226
3 DMSO HeLa 0.0 Plate_1 B02 1 4 False False False 1.139881 -0.883623 -0.280147 -1.368761 -0.591794 0.361401 0.749972 1.237712 -0.672132 -0.767620

Summary

You have processed a CytoTable single-cell dataset through a complete quality-control and normalization pipeline, preserving single-cell resolution throughout:

| Step | Function | Input | Output |
|---|---|---|---|
| Load | pd.read_parquet | CytoTable Parquet | 1,200 single cells |
| Annotate | annotate() | Single cells + platemap + qc.parquet | Cells with treatment labels and QC flags |
| Normalize | normalize(drop_cosmicqc_rows=True) | Annotated cells | 1,174 passing cells, z-scored |
| Feature select | feature_select() | 12 features | 10 features |

The output is a clean, normalized single-cell feature matrix, selected_cells, where every row is one cell and every column is an informative morphological feature.

Next steps

  • Embed: run UMAP or t-SNE on selected_cells to visualize how treatments separate in morphological space at single-cell resolution.

  • Cluster: apply k-means or Leiden clustering to discover subpopulations within each treatment condition.

  • Aggregate: feed selected_cells into aggregate() if you need well-level profiles (e.g. for the consensus pipeline shown in the Introduction to Image-based Profiling with Pycytominer tutorial).

  • Hit calling: identify which compounds produce a statistically significant morphological change relative to controls. Buscar operates directly on single-cell distributions to capture cellular heterogeneity and flag off-target effects.
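As an illustration of the Aggregate step, well-level profiles can be approximated with a pandas groupby. The sketch uses the per-well median, which is a common choice but an assumption here; it is not a substitute for Pycytominer's aggregate(), and the values are invented.

```python
import pandas as pd

# Toy selected single-cell table (hypothetical normalized values)
cells = pd.DataFrame({
    "Metadata_Plate": ["Plate_1"] * 5,
    "Metadata_Well": ["B02", "B02", "B02", "B03", "B03"],
    "Nuclei_AreaShape_Area": [-0.5, 0.1, 0.4, 1.8, 2.2],
    "Cells_AreaShape_Eccentricity": [0.0, 0.3, -0.3, -1.0, -1.4],
})

feature_cols = [c for c in cells.columns if not c.startswith("Metadata_")]

# One row per plate/well, each feature summarized by its median across cells
well_profiles = cells.groupby(
    ["Metadata_Plate", "Metadata_Well"], as_index=False
)[feature_cols].median()

print(well_profiles)
```

The median is less sensitive than the mean to the few extreme cells that remain after QC, at the cost of discarding the distribution shape that single-cell analyses rely on.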