Single-cell image-based profiling

A complete single-cell processing pipeline with Pycytominer

High-content microscopy experiments can produce thousands of single-cell measurements per image. Working at single-cell resolution (rather than first aggregating cells into well-level profiles) preserves the full diversity of cellular responses: rare subpopulations, bimodal distributions, and heterogeneous drug effects that vanish in the average.

Single-cell profiling introduces a challenge that well-level profiling sidesteps: not every detected object is a real, well-segmented cell. Debris, out-of-focus objects, and fused cells contaminate the feature matrix and distort downstream analyses. A quality-control step is therefore essential before dimensionality reduction, clustering, or hit calling.

This tutorial walks through a complete single-cell processing pipeline that starts from CytoTable output and uses coSMicQC for quality control. The steps are:

Load: read the joined single-cell Parquet file produced by CytoTable
Annotate: attach experimental metadata and QC flags from coSMicQC
Normalize: drop QC outliers and z-score features against DMSO controls
Feature select: drop redundant and uninformative features

The result is a clean, normalized single-cell feature matrix ready for dimensionality reduction, clustering, or further aggregation.

New to pycytominer? Read the Introduction to Pycytominer tutorial first. This tutorial assumes familiarity with the core pipeline steps.

flowchart TD
    cytotable["CytoTable output single_cells.parquet, 1200 cells"]
    qcfile["coSMicQC output qc.parquet, QC annotations"]
    ann["annotate() Add platemap + QC flags"]
    nor["normalize() Drop QC outliers · Z-score vs DMSO"]
    fea["feature_select() Remove redundant features"]
    output["Single-cell profiles ~1174 cells, 10 features"]
    cytotable --> ann
    qcfile --> ann
    ann --> nor --> fea --> output
    style cytotable fill:#f0d9fa,stroke:#88239A,color:#111
    style qcfile fill:#f0d9fa,stroke:#88239A,color:#111
    style output fill:#f0d9fa,stroke:#88239A,color:#111
    style ann fill:#ffffff,stroke:#88239A,color:#111
    style nor fill:#ffffff,stroke:#88239A,color:#111
    style fea fill:#ffffff,stroke:#88239A,color:#111

Prerequisites

Install the required packages:

pip install pycytominer coSMicQC pyarrow pandas numpy

This tutorial uses simulated data that matches the exact schema produced by CytoTable and coSMicQC. In a real experiment, replace the simulation block with your own single_cells.parquet and qc.parquet files.

[1]:

import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

from pycytominer import annotate, feature_select, normalize

# Reproducible random state used throughout the simulation
rng = np.random.default_rng(42)

# Temporary directory — stands in for the output directory on your filesystem
tmp_dir = Path(tempfile.mkdtemp())
print(f"Working directory: {tmp_dir}")

Working directory: /var/folders/02/q30k_4wn2dqbz5pj_vvc8xn40000gp/T/tmp57clvnip

Input: CytoTable Single-Cell Data

CytoTable converts CellProfiler SQLite or CSV output into a single analysis-ready Parquet file. Each row represents one segmented object (a cell), and columns fall into three groups:

Group	Example columns	Purpose
`Metadata_*`	`Metadata_Plate`, `Metadata_Well`, `Metadata_ImageNumber`, `Metadata_ObjectNumber`	Describe the experiment
`cytotable_meta_*`	`cytotable_meta_source_path`, `cytotable_meta_offset`	CytoTable provenance. Pycytominer ignores these automatically
Feature columns	`Cells_AreaShape_Area`, `Nuclei_Intensity_MeanIntensity_DNA`	Morphology measurements per single-cell

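Within a single plate, Metadata_ImageNumber and Metadata_ObjectNumber together uniquely identify every cell. Because image numbers can repeat across plates, this tutorial joins the single-cell data to the coSMicQC annotations on a four-part key: Metadata_Plate, Metadata_Well, Metadata_ImageNumber, and Metadata_ObjectNumber.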
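Before merging, it can be worth checking that the chosen join key really is unique. The following is a minimal sketch on a toy table in which two plates reuse the same image and object numbers; the helper name `count_duplicate_keys` is hypothetical and not part of Pycytominer or CytoTable.

```python
import pandas as pd

# Toy table: two plates that reuse the same image and object numbers.
cells = pd.DataFrame({
    "Metadata_Plate": ["Plate_1", "Plate_1", "Plate_2", "Plate_2"],
    "Metadata_Well": ["B02", "B02", "B02", "B02"],
    "Metadata_ImageNumber": [1, 1, 1, 1],
    "Metadata_ObjectNumber": [1, 2, 1, 2],
})


def count_duplicate_keys(df: pd.DataFrame, keys: list[str]) -> int:
    """Return the number of rows whose key repeats an earlier row."""
    return int(df.duplicated(subset=keys).sum())


two_part = ["Metadata_ImageNumber", "Metadata_ObjectNumber"]
four_part = ["Metadata_Plate", "Metadata_Well", *two_part]

# The two-part key collides across plates; the four-part key does not.
n_dup_two = count_duplicate_keys(cells, two_part)
n_dup_four = count_duplicate_keys(cells, four_part)
print(f"two-part duplicates: {n_dup_two}, four-part duplicates: {n_dup_four}")
```

A non-zero duplicate count for the key you plan to join on means the merge would attach QC flags to the wrong cells or multiply rows.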

Note on `cytotable_meta_*` columns: These provenance columns track source-file offsets for CytoTable's internal bookkeeping. Pycytominer's feature inference selects columns by CellProfiler compartment prefix (Cells_, Cytoplasm_, Nuclei_), so the provenance columns are never treated as features. They pass through annotate() unchanged and are dropped at the normalize() step.
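The prefix rule described in the note can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Pycytominer's implementation; the function name `infer_features_by_prefix` is hypothetical.

```python
# Compartment prefixes named in the note above.
COMPARTMENT_PREFIXES = ("Cells_", "Cytoplasm_", "Nuclei_")


def infer_features_by_prefix(columns):
    """Keep only columns that start with a CellProfiler compartment prefix."""
    return [c for c in columns if c.startswith(COMPARTMENT_PREFIXES)]


columns = [
    "Metadata_Plate",
    "cytotable_meta_offset",
    "Cells_AreaShape_Area",
    "Nuclei_Intensity_MeanIntensity_DNA",
]

# Metadata and provenance columns are skipped because they lack a compartment prefix.
features = infer_features_by_prefix(columns)
print(features)
```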

In a real experiment these files come from running CytoTable and coSMicQC on your CellProfiler output. The functions below reproduce their output schemas using synthetic data. Skip ahead to the load step if you do not need the simulation details.

Step A — simulate CytoTable single-cell data

WELLS = {
    "B02": "DMSO",       "C02": "DMSO",
    "B03": "Compound_A", "C03": "Compound_A",
    "B04": "Compound_B", "C04": "Compound_B",
}
N_CELLS_PER_WELL = 100

def simulate_cytotable(plate_id: str) -> pd.DataFrame:
    """Generate a synthetic CytoTable-style single-cell DataFrame."""
    rows = []
    for img_num, (well, treatment) in enumerate(WELLS.items(), start=1):
        is_a = float(treatment == "Compound_A")
        is_b = float(treatment == "Compound_B")
        cell_areas   = rng.normal(500 + 180 * is_a - 90 * is_b, 120, N_CELLS_PER_WELL)
        nuclei_areas = rng.normal(195, 55, N_CELLS_PER_WELL)
        for obj_num in range(1, N_CELLS_PER_WELL + 1):
            rows.append({
                # ── CytoTable metadata ──────────────────────────────────
                "Metadata_Plate": plate_id,
                "Metadata_Well": well,
                "Metadata_ImageNumber": img_num,
                "Metadata_ObjectNumber": obj_num,
                # CytoTable provenance columns
                "cytotable_meta_source_path": f"/data/{plate_id}/images/",
                "cytotable_meta_offset": (img_num - 1) * N_CELLS_PER_WELL + obj_num,
                "cytotable_meta_rownum": obj_num,
                # ── Feature columns ─────────────────────────────────────
                "Cells_AreaShape_Area": cell_areas[obj_num - 1],
                "Cells_AreaShape_BoundingBoxArea": cell_areas[obj_num - 1] * 1.3
                + rng.normal(0, 4),
                "Cells_AreaShape_EulerNumber": 1,
                "Cells_AreaShape_Eccentricity": float(
                    np.clip(rng.normal(0.55, 0.12), 0, 1)
                ),
                "Cells_Intensity_MeanIntensity_Mito": rng.normal(0.30, 0.06),
                "Cells_Texture_Correlation_RNA_3_0_256": rng.normal(0.22, 0.06),
                "Cytoplasm_AreaShape_Area": rng.normal(310, 80),
                "Cytoplasm_Intensity_MeanIntensity_AGP": rng.normal(0.25, 0.07),
                "Nuclei_AreaShape_Area": nuclei_areas[obj_num - 1],
                "Nuclei_AreaShape_Eccentricity": float(
                    np.clip(rng.normal(0.40, 0.10), 0, 1)
                ),
                "Nuclei_Intensity_MeanIntensity_DNA": rng.normal(0.50, 0.08),
                "Nuclei_Intensity_MassDisplacement_DNA": abs(rng.normal(6, 4)),
            })
    return pd.DataFrame(rows)

Step B — simulate coSMicQC QC annotations

label_outliers(..., export_as_annotations=True) writes a compact Parquet file containing only the join-key columns and the boolean Metadata_cqc_* flags.

def simulate_qc_parquet(sc_df: pd.DataFrame) -> pd.DataFrame:
    """Reproduce the annotation schema produced by coSMicQC label_outliers()."""

    join_keys = [
        "Metadata_Plate",
        "Metadata_Well",
        "Metadata_ImageNumber",
        "Metadata_ObjectNumber",
    ]

    qc = sc_df[join_keys].copy()

    nuc_area = sc_df["Nuclei_AreaShape_Area"]
    nuc_z = (nuc_area - nuc_area.mean()) / nuc_area.std()

    mass_disp = sc_df["Nuclei_Intensity_MassDisplacement_DNA"]
    mass_disp_z = (mass_disp - mass_disp.mean()) / mass_disp.std()

    qc["Metadata_cqc_large_nuclear_size_is_outlier"] = nuc_z > 2.5
    qc["Metadata_cqc_small_nuclear_size_is_outlier"] = nuc_z < -2.5
    qc["Metadata_cqc_poor_segmentation_is_outlier"] = mass_disp_z > 2.5

    return qc

Step C — build two plates and write to disk

plate1 = simulate_cytotable("Plate_1")
plate2 = simulate_cytotable("Plate_2")
single_cells_raw = pd.concat([plate1, plate2], ignore_index=True)

qc_annotations_raw = simulate_qc_parquet(single_cells_raw)

sc_path = tmp_dir / "single_cells.parquet"
qc_path = tmp_dir / "qc.parquet"
single_cells_raw.to_parquet(sc_path, index=False)
qc_annotations_raw.to_parquet(qc_path, index=False)

print(f"single_cells.parquet  {single_cells_raw.shape[0]:,} rows x {single_cells_raw.shape[1]} cols")
print(f"qc.parquet            {qc_annotations_raw.shape[0]:,} rows x {qc_annotations_raw.shape[1]} cols")
print(f"\nqc.parquet columns: {list(qc_annotations_raw.columns)}")

[3]:

# Load the CytoTable parquet from disk
single_cells = pd.read_parquet(sc_path)

print(
    f"Loaded {len(single_cells):,} single cells across "
    f"{single_cells['Metadata_Plate'].nunique()} plates and "
    f"{single_cells['Metadata_Well'].nunique()} unique wells"
)
feature_cols = [
    c
    for c in single_cells.columns
    if not c.startswith(("Metadata_", "cytotable_"))
]
print(f"\nFeature columns ({len(feature_cols)}): {feature_cols}")
single_cells.head(3)

Loaded 1,200 single cells across 2 plates and 6 unique wells

Feature columns (12): ['Cells_AreaShape_Area', 'Cells_AreaShape_BoundingBoxArea', 'Cells_AreaShape_EulerNumber', 'Cells_AreaShape_Eccentricity', 'Cells_Intensity_MeanIntensity_Mito', 'Cells_Texture_Correlation_RNA_3_0_256', 'Cytoplasm_AreaShape_Area', 'Cytoplasm_Intensity_MeanIntensity_AGP', 'Nuclei_AreaShape_Area', 'Nuclei_AreaShape_Eccentricity', 'Nuclei_Intensity_MeanIntensity_DNA', 'Nuclei_Intensity_MassDisplacement_DNA']

[3]:

	Metadata_Plate	Metadata_Well	Metadata_ImageNumber	Metadata_ObjectNumber	cytotable_meta_source_path	cytotable_meta_offset	cytotable_meta_rownum	Cells_AreaShape_Area	Cells_AreaShape_BoundingBoxArea	Cells_AreaShape_EulerNumber	Cells_AreaShape_Eccentricity	Cells_Intensity_MeanIntensity_Mito	Cells_Texture_Correlation_RNA_3_0_256	Cytoplasm_AreaShape_Area	Cytoplasm_Intensity_MeanIntensity_AGP	Nuclei_AreaShape_Area	Nuclei_AreaShape_Eccentricity	Nuclei_Intensity_MeanIntensity_DNA	Nuclei_Intensity_MassDisplacement_DNA
0	Plate_1	B02	1	1	/data/Plate_1/images/	1	1	536.566050	698.886163	1	0.718898	0.305435	0.258636	145.986232	0.246590	174.201060	0.315677	0.402495	2.487391
1	Plate_1	B02	1	2	/data/Plate_1/images/	2	2	375.201907	486.425986	1	0.659908	0.220416	0.221838	271.266445	0.227063	266.457556	0.500276	0.543049	11.349592
2	Plate_1	B02	1	3	/data/Plate_1/images/	3	3	590.054143	766.452364	1	0.466487	0.286568	0.234550	324.125869	0.174093	175.405482	0.409049	0.518258	16.069896

Background: Single-cell quality control with coSMicQC [Optional]

coSMicQC (GitHub | docs | preprint) is a Python package from the Way Lab that systematically identifies segmentation artifacts, for example:

Artifact	Morphological signature	Likely cause
Debris / background	Very small nucleus; low DNA intensity	Out-of-focus plane, dust on coverslip
Under-segmented nuclei	Nucleus area far above the population mean	Several nuclei merged into a single object
Touching / fused cells	Very high mass displacement from multiple objects	Adjacent cells merged into a single object

How coSMicQC flags outliers

coSMicQC computes a z-score for each quality-relevant feature across the entire experiment. Cells whose z-scores fall outside user-defined thresholds are flagged as outliers. Thresholds are signed:

A negative threshold (e.g. −2.5) flags cells where the feature is unusually small (debris, broken nuclei).
A positive threshold (e.g. +2.5) flags cells where the feature is unusually large (fused or under-segmented objects).
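The signed-threshold rule can be illustrated with plain pandas. This is a minimal sketch of the idea, not coSMicQC's implementation (which may differ in details such as how the standard deviation is computed); the function name `flag_by_signed_threshold` is hypothetical.

```python
import pandas as pd


def flag_by_signed_threshold(values: pd.Series, threshold: float) -> pd.Series:
    """Flag values whose z-score lies beyond a signed threshold."""
    z = (values - values.mean()) / values.std()
    # Negative threshold: flag the low tail. Positive threshold: flag the high tail.
    return (z < threshold) if threshold < 0 else (z > threshold)


# 18 typical nuclei plus one very small and one very large object.
areas = pd.Series([200.0] * 18 + [100.0, 300.0])

too_small = flag_by_signed_threshold(areas, -2.5)
too_large = flag_by_signed_threshold(areas, 2.5)
print(f"too small: {int(too_small.sum())}, too large: {int(too_large.sum())}")
```

Because each threshold only looks at one tail, the same feature can appear in two named conditions, one for each direction.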

The main entry point is label_outliers(), which accepts a dictionary of named QC conditions. Each condition name becomes part of the output column name, making the reason for each flag explicit and auditable:

import cosmicqc

labeled = cosmicqc.label_outliers(
    df=single_cells,
    feature_thresholds={
        # Flag nuclei that are too small (debris)
        "small_nuclear_size": {
            "Nuclei_AreaShape_Area": -2.5,
        },
        # Flag nuclei that are too large (under-segmented or merged)
        "large_nuclear_size": {
            "Nuclei_AreaShape_Area": 2.5,
        },
        # Flag cells with an abnormally high nuclear mass displacement
        # (a hallmark of touching or merged nuclei in one object)
        "poor_segmentation": {
            "Nuclei_Intensity_MassDisplacement_DNA": 2.5,
        },
    },
    include_threshold_scores=True,   # also write z-score columns for auditing
    export_path="qc.parquet",
    export_as_annotations=True,      # write compact annotation file only
    annotation_metadata_columns=[
        "Metadata_Plate", "Metadata_Well",
        "Metadata_ImageNumber", "Metadata_ObjectNumber",
    ],
)

The `qc.parquet` annotation file

When export_as_annotations=True, coSMicQC writes a compact annotation file to export_path (here qc.parquet) that contains only the join-key metadata columns and the Metadata_cqc_* flag columns, not the full feature table. This makes qc.parquet lightweight and easy to share independently of the raw single-cell data.

Each Metadata_cqc_<condition>_is_outlier column is a boolean: True = flagged, False = passes that QC check. A cell must pass all conditions to be included in downstream analysis.
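The "must pass all conditions" rule amounts to a row-wise any() over the flag columns. A minimal pandas sketch on a toy annotation table (the values are made up for illustration):

```python
import pandas as pd

qc = pd.DataFrame({
    "Metadata_ObjectNumber": [1, 2, 3, 4],
    "Metadata_cqc_small_nuclear_size_is_outlier": [False, True, False, False],
    "Metadata_cqc_large_nuclear_size_is_outlier": [False, False, False, True],
    "Metadata_cqc_poor_segmentation_is_outlier": [False, True, False, False],
})

flag_cols = [
    c for c in qc.columns
    if c.startswith("Metadata_cqc_") and c.endswith("_is_outlier")
]

# A cell fails QC if any single condition flags it.
fails_any = qc[flag_cols].any(axis=1)
passing = qc.loc[~fails_any]
print(passing["Metadata_ObjectNumber"].tolist())
```

Note that a cell flagged by two conditions is still counted once, so the number of removed cells can be smaller than the sum of the per-condition counts.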

[4]:

# Load the coSMicQC annotation file and inspect its contents
qc_annotations = pd.read_parquet(qc_path)

print("coSMicQC annotation columns:")
for col in qc_annotations.columns:
    print(f"  {col}")

outlier_cols = [c for c in qc_annotations.columns if c.endswith("_is_outlier")]
print()
for col in outlier_cols:
    n_flagged = qc_annotations[col].sum()
    print(
        f"  {col}: {n_flagged:,} cells flagged ({100 * n_flagged / len(qc_annotations):.1f}%)"
    )

coSMicQC annotation columns:
  Metadata_Plate
  Metadata_Well
  Metadata_ImageNumber
  Metadata_ObjectNumber
  Metadata_cqc_large_nuclear_size_is_outlier
  Metadata_cqc_small_nuclear_size_is_outlier
  Metadata_cqc_poor_segmentation_is_outlier

  Metadata_cqc_large_nuclear_size_is_outlier: 5 cells flagged (0.4%)
  Metadata_cqc_small_nuclear_size_is_outlier: 9 cells flagged (0.8%)
  Metadata_cqc_poor_segmentation_is_outlier: 12 cells flagged (1.0%)

Step 1: Annotate

annotate() does two jobs in a single call:

Plate-map join: attaches the biological condition (treatment, cell line, concentration) recorded for each well to every cell in that well.
External metadata merge: merges any additional per-cell metadata DataFrame or file. The most common use case is a qc.parquet file from coSMicQC: passing it as external_metadata adds the Metadata_cqc_* flag columns directly to the annotated profiles.

Parameter	Description
`platemap`	Maps well positions to treatment conditions
`join_on`	Column pair `[platemap_col, profiles_col]` for the well-position join
`external_metadata`	Path to `qc.parquet` (or any additional metadata file or DataFrame)
`external_join_on`	Column(s) shared by profiles and external metadata (here the four-part cell identity key)

After annotate() runs, the Metadata_cqc_* flag columns are present on every row and flow straight into normalize(), which applies the QC filter internally via drop_cosmicqc_rows=True.
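Conceptually, the two jobs correspond to two merges: one per well and one per cell. The sketch below reproduces that idea with plain pandas on toy data. It is an approximation for intuition only; annotate() itself may differ in join type, column order, and prefix handling.

```python
import pandas as pd

cells = pd.DataFrame({
    "Metadata_Well": ["B02", "B02", "B03"],
    "Metadata_ObjectNumber": [1, 2, 1],
    "Cells_AreaShape_Area": [510.0, 480.0, 690.0],
})
platemap = pd.DataFrame({
    "well_position": ["B02", "B03"],
    "treatment": ["DMSO", "Compound_A"],
})
qc = pd.DataFrame({
    "Metadata_Well": ["B02", "B02", "B03"],
    "Metadata_ObjectNumber": [1, 2, 1],
    "Metadata_cqc_poor_segmentation_is_outlier": [False, True, False],
})

# Job 1: per-well join. Platemap columns gain a Metadata_ prefix first.
platemap_prefixed = platemap.add_prefix("Metadata_")
annotated = cells.merge(
    platemap_prefixed,
    left_on="Metadata_Well",
    right_on="Metadata_well_position",
    how="left",
).drop(columns="Metadata_well_position")

# Job 2: per-cell join on the shared identity key.
annotated = annotated.merge(
    qc, on=["Metadata_Well", "Metadata_ObjectNumber"], how="left"
)
print(annotated)
```

The first merge broadcasts one platemap row to every cell in the well, while the second is one-to-one, which is why the per-cell key must be unique.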

[5]:

platemap = pd.DataFrame({
    "well_position": ["B02", "C02", "B03", "C03", "B04", "C04"],
    "treatment": [
        "DMSO",
        "DMSO",
        "Compound_A",
        "Compound_A",
        "Compound_B",
        "Compound_B",
    ],
    "cell_line": ["HeLa"] * 6,
    "concentration_um": [0.0, 0.0, 10.0, 10.0, 5.0, 5.0],
})
platemap

[5]:

	well_position	treatment	cell_line	concentration_um
0	B02	DMSO	HeLa	0.0
1	C02	DMSO	HeLa	0.0
2	B03	Compound_A	HeLa	10.0
3	C03	Compound_A	HeLa	10.0
4	B04	Compound_B	HeLa	5.0
5	C04	Compound_B	HeLa	5.0

[6]:

join_keys = [
    "Metadata_Plate",
    "Metadata_Well",
    "Metadata_ImageNumber",
    "Metadata_ObjectNumber",
]

# annotate() merges the plate map AND the QC annotation file in a single call.
# The qc.parquet columns already carry the Metadata_ prefix, so they pass through
# prepare_external_metadata_for_annotate() unchanged.
annotated_cells = annotate(
    profiles=single_cells,
    platemap=platemap,
    join_on=["Metadata_well_position", "Metadata_Well"],
    add_metadata_id_to_platemap=True,
    external_metadata=str(qc_path),
    external_join_on=join_keys,
)

new_cols = [c for c in annotated_cells.columns if c not in single_cells.columns]
qc_cols = [c for c in new_cols if "cqc" in c]
print(f"New columns: {new_cols}")
print(f"QC flag columns: {qc_cols}")
print(
    f"\nCells flagged by any QC condition: "
    f"{annotated_cells[qc_cols].any(axis=1).sum():,} "
    f"({100 * annotated_cells[qc_cols].any(axis=1).mean():.1f}%)"
)
print()
annotated_cells[
    [c for c in annotated_cells.columns if c.startswith("Metadata_")]
].drop_duplicates(subset=["Metadata_Well"]).head()

New columns: ['Metadata_treatment', 'Metadata_cell_line', 'Metadata_concentration_um', 'Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']
QC flag columns: ['Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']

Cells flagged by any QC condition: 26 (2.2%)

[6]:

	Metadata_treatment	Metadata_cell_line	Metadata_concentration_um	Metadata_Plate	Metadata_Well	Metadata_ImageNumber	Metadata_ObjectNumber	Metadata_cqc_large_nuclear_size_is_outlier	Metadata_cqc_small_nuclear_size_is_outlier	Metadata_cqc_poor_segmentation_is_outlier
0	DMSO	HeLa	0.0	Plate_1	B02	1	1	False	False	False
200	DMSO	HeLa	0.0	Plate_1	C02	2	1	False	False	False
400	Compound_A	HeLa	10.0	Plate_1	B03	3	1	False	False	False
600	Compound_A	HeLa	10.0	Plate_1	C03	4	1	False	False	False
800	Compound_B	HeLa	5.0	Plate_1	B04	5	1	False	False	False

Step 2: Normalize

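Raw CellProfiler features vary in scale (cell area in pixels², intensities in 0–1) and are influenced by plate-to-plate technical effects. Normalization places all features on a common scale by z-scoring each feature relative to the DMSO control cells. Note that the call below pools the DMSO cells from both plates into one reference population; to correct plate-specific shifts, normalize each plate separately against its own controls.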

Passing drop_cosmicqc_rows=True tells normalize() to drop every row where any Metadata_cqc_* flag is True before computing the z-scores, so QC filtering and normalization happen in a single call.
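The order of operations matters: flagged cells are removed first so that they cannot distort the control statistics. A minimal pandas sketch of "filter, fit on DMSO, apply to all" follows. It is an illustration under stated assumptions, not Pycytominer's implementation; in particular, the use of the population standard deviation (ddof=0) is an assumption of this sketch.

```python
import pandas as pd

cells = pd.DataFrame({
    "Metadata_treatment": ["DMSO", "DMSO", "DMSO", "Compound_A", "Compound_A"],
    "Metadata_cqc_poor_segmentation_is_outlier": [False, False, True, False, False],
    "Cells_AreaShape_Area": [400.0, 600.0, 5000.0, 700.0, 900.0],
})

flag_cols = [c for c in cells.columns if c.startswith("Metadata_cqc_")]
feature_cols = [c for c in cells.columns if not c.startswith("Metadata_")]

# 1. Drop cells flagged by any QC condition.
kept = cells.loc[~cells[flag_cols].any(axis=1)].copy()

# 2. Fit the mean and standard deviation on DMSO cells only.
dmso = kept.query("Metadata_treatment == 'DMSO'")
center = dmso[feature_cols].mean()
scale = dmso[feature_cols].std(ddof=0)  # population std; an assumption of this sketch

# 3. Apply the DMSO-derived transform to every remaining cell.
kept[feature_cols] = (kept[feature_cols] - center) / scale
print(kept)
```

Had the 5000-pixel artifact stayed in the DMSO reference, it would have inflated both the mean and the standard deviation and compressed every treatment effect toward zero.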

[7]:

# drop_cosmicqc_rows=True removes QC-flagged cells before z-scoring.
normalized_cells = normalize(
    profiles=annotated_cells,
    features="infer",
    meta_features="infer",
    samples="Metadata_treatment == 'DMSO'",
    method="standardize",
    drop_cosmicqc_rows=True,
)

n_removed = len(annotated_cells) - len(normalized_cells)
print(f"{'Total cells':<22} {len(annotated_cells):>6,}")
print(
    f"{'Removed (QC outliers)':<22} {n_removed:>6,}  ({100 * n_removed / len(annotated_cells):.1f}%)"
)
print(f"{'Retained':<22} {len(normalized_cells):>6,}")
print()
print(f"Normalized shape: {normalized_cells.shape}")
normalized_cells.head(3)

Total cells             1,200
Removed (QC outliers)      26  (2.2%)
Retained                1,174

Normalized shape: (1174, 22)

[7]:

	Metadata_treatment	Metadata_cell_line	Metadata_Plate	Metadata_Well	Metadata_ImageNumber	Metadata_ObjectNumber	Metadata_cqc_large_nuclear_size_is_outlier	Metadata_cqc_small_nuclear_size_is_outlier	Metadata_cqc_poor_segmentation_is_outlier	...	Cells_AreaShape_Eccentricity	Cells_Intensity_MeanIntensity_Mito	Cells_Texture_Correlation_RNA_3_0_256	Cytoplasm_AreaShape_Area	Cytoplasm_Intensity_MeanIntensity_AGP	Nuclei_AreaShape_Area	Nuclei_AreaShape_Eccentricity	Nuclei_Intensity_MeanIntensity_DNA	Nuclei_Intensity_MassDisplacement_DNA
0	DMSO	HeLa	Plate_1	B02	1	1	False	False	False	...	1.292772	0.100004	0.733196	-2.107301	-0.009458	-0.380694	-0.907252	-1.166556	-1.028890
1	DMSO	HeLa	Plate_1	B02	1	2	False	False	False	...	0.819530	-1.325156	0.121467	-0.481165	-0.293417	1.319231	1.025113	0.607217	1.555226
3	DMSO	HeLa	Plate_1	B02	1	4	False	False	False	...	-0.883623	-0.280147	-1.368761	-0.591794	0.361401	0.749972	1.237712	-0.672132	-0.767620

3 rows × 22 columns

Step 3: Feature Selection

Even after QC and normalization, some features carry little information:

Low-variance features are nearly constant across all cells and cannot distinguish biological conditions.
Highly correlated feature pairs are redundant; keeping both double-weights that axis of variation in clustering and embeddings.
Blocklisted features are known to capture image artifacts rather than cell biology.

feature_select() applies all three filters, producing a lean feature matrix ready for single-cell analyses such as UMAP or hierarchical clustering.
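To build intuition for the first two filters, the sketch below applies a simplified version of each with pandas and NumPy. It is not Pycytominer's algorithm: the "constant column" test and the 0.9 correlation cutoff are assumptions chosen for this illustration, and the blocklist filter (a lookup of column names) is omitted.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
area = rng.normal(0, 1, 200)
features = pd.DataFrame({
    "Cells_AreaShape_Area": area,
    "Cells_AreaShape_BoundingBoxArea": area * 1.3 + rng.normal(0, 0.01, 200),
    "Cells_AreaShape_EulerNumber": np.ones(200),
    "Nuclei_Intensity_MeanIntensity_DNA": rng.normal(0, 1, 200),
})

# Filter 1: drop features that never vary.
constant = [c for c in features.columns if features[c].nunique() <= 1]
kept = features.drop(columns=constant)

# Filter 2: for each highly correlated pair, drop the later column.
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.9).any()]
kept = kept.drop(columns=redundant)

print(f"constant: {constant}")
print(f"redundant: {redundant}")
print(f"kept: {list(kept.columns)}")
```

Which member of a correlated pair is dropped is a design choice. This sketch keeps the first column it encounters, so its result need not match the pair member that feature_select() removes.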

[8]:

selected_cells = feature_select(
    profiles=normalized_cells,
    features="infer",
    operation=["variance_threshold", "correlation_threshold", "blocklist"],
)

feature_cols_before = [
    c for c in normalized_cells.columns if not c.startswith("Metadata_")
]
feature_cols_after = [
    c for c in selected_cells.columns if not c.startswith("Metadata_")
]

print(f"Features before selection: {len(feature_cols_before)}")
print(f"Features after  selection: {len(feature_cols_after)}")
print(f"Features removed: {set(feature_cols_before) - set(feature_cols_after)}")
selected_cells.head(3)

Features before selection: 12
Features after  selection: 10
Features removed: {'Cells_AreaShape_Area', 'Cells_AreaShape_EulerNumber'}

[8]:

	Metadata_treatment	Metadata_cell_line	Metadata_Plate	Metadata_Well	Metadata_ImageNumber	Metadata_ObjectNumber	Metadata_cqc_large_nuclear_size_is_outlier	Metadata_cqc_small_nuclear_size_is_outlier	Metadata_cqc_poor_segmentation_is_outlier	Cells_AreaShape_BoundingBoxArea	Cells_AreaShape_Eccentricity	Cells_Intensity_MeanIntensity_Mito	Cells_Texture_Correlation_RNA_3_0_256	Cytoplasm_AreaShape_Area	Cytoplasm_Intensity_MeanIntensity_AGP	Nuclei_AreaShape_Area	Nuclei_AreaShape_Eccentricity	Nuclei_Intensity_MeanIntensity_DNA	Nuclei_Intensity_MassDisplacement_DNA
0	DMSO	HeLa	Plate_1	B02	1	1	False	False	False	0.423873	1.292772	0.100004	0.733196	-2.107301	-0.009458	-0.380694	-0.907252	-1.166556	-1.028890
1	DMSO	HeLa	Plate_1	B02	1	2	False	False	False	-1.020114	0.819530	-1.325156	0.121467	-0.481165	-0.293417	1.319231	1.025113	0.607217	1.555226
3	DMSO	HeLa	Plate_1	B02	1	4	False	False	False	1.139881	-0.883623	-0.280147	-1.368761	-0.591794	0.361401	0.749972	1.237712	-0.672132	-0.767620

Summary

You have processed a CytoTable single-cell dataset through a complete quality-control and normalization pipeline, preserving single-cell resolution throughout:

Step	Function	Input	Output
Load	`pd.read_parquet`	CytoTable Parquet	1,200 single cells
Annotate	`annotate()`	Single cells + platemap + `qc.parquet`	Cells with treatment labels and QC flags
Normalize	`normalize(drop_cosmicqc_rows=True)`	Annotated cells	1,174 passing cells, z-scored
Feature select	`feature_select()`	12 features	10 features

The output is a clean, normalized single-cell feature matrix, selected_cells, where every row is one cell and every column is an informative morphological feature.

Next steps

Embed: run UMAP or t-SNE on selected_cells to visualize how treatments separate in morphological space at single-cell resolution.
Cluster: apply k-means or Leiden clustering to discover subpopulations within each treatment condition.
Aggregate: feed selected_cells into aggregate() if you need well-level profiles (e.g. for the consensus pipeline shown in the Introduction to Image-based Profiling with Pycytominer tutorial).
Hit calling: identify which compounds produce a statistically significant morphological change relative to controls. Buscar operates directly on single-cell distributions to capture cellular heterogeneity and flag off-target effects.
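For the Aggregate step listed above, the underlying operation is a per-well summary of each feature. The sketch below uses a plain pandas groupby with the median as a stand-in on toy data; it only illustrates the idea, and aggregate() has its own parameters and defaults that are not shown here.

```python
import pandas as pd

selected = pd.DataFrame({
    "Metadata_Plate": ["Plate_1"] * 5,
    "Metadata_Well": ["B02", "B02", "B02", "B03", "B03"],
    "Cells_AreaShape_Eccentricity": [0.1, 0.3, 0.8, 1.0, 2.0],
})

feature_cols = [c for c in selected.columns if not c.startswith("Metadata_")]

# Collapse single cells to one median profile per plate and well.
well_profiles = (
    selected.groupby(["Metadata_Plate", "Metadata_Well"], as_index=False)[feature_cols]
    .median()
)
print(well_profiles)
```

The median is less sensitive than the mean to the few extreme cells that survive QC, which is one reason it is a common choice for well-level profiles.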