Single-cell image-based profiling
A complete single-cell processing pipeline with Pycytominer
High-content microscopy experiments can produce thousands of single-cell measurements per image. Working at single-cell resolution (rather than first aggregating cells into well-level profiles) preserves the full diversity of cellular responses: rare subpopulations, bimodal distributions, and heterogeneous drug effects that vanish in the average.
Single-cell profiling introduces a challenge that well-level profiling sidesteps: not every detected object is a real, well-segmented cell. Debris, out-of-focus objects, and fused cells contaminate the feature matrix and distort downstream analyses. A quality-control step is therefore essential before dimensionality reduction, clustering, or hit calling.
This tutorial walks through a complete single-cell processing pipeline that starts from CytoTable output and uses coSMicQC for quality control. It has four stages:
Load: read the joined single-cell Parquet file produced by CytoTable
Annotate: attach experimental metadata and QC flags from coSMicQC
Normalize: drop QC outliers and z-score features against DMSO controls
Feature select: drop redundant and uninformative features
The result is a clean, normalized single-cell feature matrix ready for dimensionality reduction, clustering, or further aggregation.
New to Pycytominer? Read the Introduction to Pycytominer tutorial first. This tutorial assumes familiarity with the core pipeline steps.
Prerequisites
Install the required packages:
pip install pycytominer coSMicQC pyarrow pandas numpy
This tutorial uses simulated data that matches the exact schema produced by CytoTable and coSMicQC. In a real experiment, replace the simulation block with your own single_cells.parquet and qc.parquet files.
[1]:
import tempfile
from pathlib import Path
import numpy as np
import pandas as pd
from pycytominer import annotate, feature_select, normalize
# Reproducible random state used throughout the simulation
rng = np.random.default_rng(42)
# Temporary directory — stands in for the output directory on your filesystem
tmp_dir = Path(tempfile.mkdtemp())
print(f"Working directory: {tmp_dir}")
Working directory: /var/folders/02/q30k_4wn2dqbz5pj_vvc8xn40000gp/T/tmp57clvnip
Input: CytoTable Single-Cell Data
CytoTable converts CellProfiler SQLite or CSV output into a single analysis-ready Parquet file. Each row represents one segmented object (a cell), and columns fall into three groups:
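| Group | Example columns | Purpose |
|---|---|---|
| Metadata_* columns | Metadata_Plate, Metadata_Well, Metadata_ImageNumber, Metadata_ObjectNumber | Describe the experiment |
| cytotable_meta_* columns | cytotable_meta_source_path, cytotable_meta_offset, cytotable_meta_rownum | CytoTable provenance. Pycytominer ignores these automatically |
| Feature columns | Cells_*, Cytoplasm_*, Nuclei_* | Morphology measurements per single cell |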
Within a single plate, Metadata_ImageNumber and Metadata_ObjectNumber together uniquely identify every cell. Combined with Metadata_Plate and Metadata_Well, they form the join key between the single-cell data and the coSMicQC annotations.
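Before merging, it can be worth confirming that the chosen key really is unique, because duplicate keys silently multiply rows in a join. The following is a minimal pandas sketch on a toy table; the `cells` frame and its values are made up for illustration and only the column names come from this tutorial.

```python
import pandas as pd

# Toy stand-in for a CytoTable table: two plates reuse the same image/object numbers
cells = pd.DataFrame({
    "Metadata_Plate": ["Plate_1", "Plate_1", "Plate_2", "Plate_2"],
    "Metadata_Well": ["B02", "B02", "B02", "B02"],
    "Metadata_ImageNumber": [1, 1, 1, 1],
    "Metadata_ObjectNumber": [1, 2, 1, 2],
})

two_part_key = ["Metadata_ImageNumber", "Metadata_ObjectNumber"]
four_part_key = ["Metadata_Plate", "Metadata_Well", *two_part_key]

# duplicated(keep=False) marks every row whose key occurs more than once
n_dup_two = int(cells.duplicated(subset=two_part_key, keep=False).sum())
n_dup_four = int(cells.duplicated(subset=four_part_key, keep=False).sum())

print(f"Rows with a repeated two-part key:  {n_dup_two}")
print(f"Rows with a repeated four-part key: {n_dup_four}")
```

In this toy example the image and object numbers repeat across plates, so only the four-part key is collision-free.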
Note on cytotable_meta_* columns: These provenance columns track source-file offsets for CytoTable's internal bookkeeping. Pycytominer's feature inference recognizes CellProfiler compartment prefixes (Cells_, Cytoplasm_, Nuclei_), so the provenance columns are never treated as features. They pass through annotate() unchanged and are dropped at the normalize() step.
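The prefix rule can be approximated in a few lines of plain Python. This is an illustrative sketch, not Pycytominer's implementation; the helper name `split_columns` is hypothetical.

```python
COMPARTMENT_PREFIXES = ("Cells_", "Cytoplasm_", "Nuclei_")


def split_columns(columns):
    """Partition column names into features, metadata, and everything else."""
    features = [c for c in columns if c.startswith(COMPARTMENT_PREFIXES)]
    metadata = [c for c in columns if c.startswith("Metadata_")]
    other = [c for c in columns if c not in features and c not in metadata]
    return features, metadata, other


columns = [
    "Metadata_Plate",
    "cytotable_meta_offset",
    "Cells_AreaShape_Area",
    "Nuclei_AreaShape_Area",
]
features, metadata, other = split_columns(columns)
print(f"Features: {features}")
print(f"Metadata: {metadata}")
print(f"Other:    {other}")  # provenance columns match neither rule
```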
The simulation code is available in the expandable block below. Skip it to go straight to the next step.
In a real experiment these files come from running CytoTable and coSMicQC on your CellProfiler output. The functions below reproduce their output schemas using synthetic data.
Step A — simulate CytoTable single-cell data
WELLS = {
    "B02": "DMSO", "C02": "DMSO",
    "B03": "Compound_A", "C03": "Compound_A",
    "B04": "Compound_B", "C04": "Compound_B",
}
N_CELLS_PER_WELL = 100
def simulate_cytotable(plate_id: str) -> pd.DataFrame:
    """Generate a synthetic CytoTable-style single-cell DataFrame."""
    rows = []
    for img_num, (well, treatment) in enumerate(WELLS.items(), start=1):
        is_a = float(treatment == "Compound_A")
        is_b = float(treatment == "Compound_B")
        cell_areas = rng.normal(500 + 180 * is_a - 90 * is_b, 120, N_CELLS_PER_WELL)
        nuclei_areas = rng.normal(195, 55, N_CELLS_PER_WELL)
        for obj_num in range(1, N_CELLS_PER_WELL + 1):
            rows.append({
                # ── CytoTable metadata ──────────────────────────────────
                "Metadata_Plate": plate_id,
                "Metadata_Well": well,
                "Metadata_ImageNumber": img_num,
                "Metadata_ObjectNumber": obj_num,
                # CytoTable provenance columns
                "cytotable_meta_source_path": f"/data/{plate_id}/images/",
                "cytotable_meta_offset": (img_num - 1) * N_CELLS_PER_WELL + obj_num,
                "cytotable_meta_rownum": obj_num,
                # ── Feature columns ─────────────────────────────────────
                "Cells_AreaShape_Area": cell_areas[obj_num - 1],
                "Cells_AreaShape_BoundingBoxArea": cell_areas[obj_num - 1] * 1.3
                + rng.normal(0, 4),
                "Cells_AreaShape_EulerNumber": 1,
                "Cells_AreaShape_Eccentricity": float(
                    np.clip(rng.normal(0.55, 0.12), 0, 1)
                ),
                "Cells_Intensity_MeanIntensity_Mito": rng.normal(0.30, 0.06),
                "Cells_Texture_Correlation_RNA_3_0_256": rng.normal(0.22, 0.06),
                "Cytoplasm_AreaShape_Area": rng.normal(310, 80),
                "Cytoplasm_Intensity_MeanIntensity_AGP": rng.normal(0.25, 0.07),
                "Nuclei_AreaShape_Area": nuclei_areas[obj_num - 1],
                "Nuclei_AreaShape_Eccentricity": float(
                    np.clip(rng.normal(0.40, 0.10), 0, 1)
                ),
                "Nuclei_Intensity_MeanIntensity_DNA": rng.normal(0.50, 0.08),
                "Nuclei_Intensity_MassDisplacement_DNA": abs(rng.normal(6, 4)),
            })
    return pd.DataFrame(rows)
Step B — simulate coSMicQC QC annotations
label_outliers(..., export_as_annotations=True) writes a compact Parquet file containing only the join-key columns and the boolean Metadata_cqc_* flags.
def simulate_qc_parquet(sc_df: pd.DataFrame) -> pd.DataFrame:
    """Reproduce the annotation schema produced by coSMicQC label_outliers()."""
    join_keys = [
        "Metadata_Plate",
        "Metadata_Well",
        "Metadata_ImageNumber",
        "Metadata_ObjectNumber",
    ]
    qc = sc_df[join_keys].copy()
    # Experiment-wide z-scores for the two QC-relevant features
    nuc_area = sc_df["Nuclei_AreaShape_Area"]
    nuc_z = (nuc_area - nuc_area.mean()) / nuc_area.std()
    mass_disp = sc_df["Nuclei_Intensity_MassDisplacement_DNA"]
    mass_disp_z = (mass_disp - mass_disp.mean()) / mass_disp.std()
    qc["Metadata_cqc_large_nuclear_size_is_outlier"] = nuc_z > 2.5
    qc["Metadata_cqc_small_nuclear_size_is_outlier"] = nuc_z < -2.5
    qc["Metadata_cqc_poor_segmentation_is_outlier"] = mass_disp_z > 2.5
    return qc
Step C — build two plates and write to disk
plate1 = simulate_cytotable("Plate_1")
plate2 = simulate_cytotable("Plate_2")
single_cells_raw = pd.concat([plate1, plate2], ignore_index=True)
qc_annotations_raw = simulate_qc_parquet(single_cells_raw)
sc_path = tmp_dir / "single_cells.parquet"
qc_path = tmp_dir / "qc.parquet"
single_cells_raw.to_parquet(sc_path, index=False)
qc_annotations_raw.to_parquet(qc_path, index=False)
print(f"single_cells.parquet {single_cells_raw.shape[0]:,} rows x {single_cells_raw.shape[1]} cols")
print(f"qc.parquet {qc_annotations_raw.shape[0]:,} rows x {qc_annotations_raw.shape[1]} cols")
print(f"\nqc.parquet columns: {list(qc_annotations_raw.columns)}")
[3]:
# Load the CytoTable parquet from disk
single_cells = pd.read_parquet(sc_path)
print(
    f"Loaded {len(single_cells):,} single cells across "
    f"{single_cells['Metadata_Plate'].nunique()} plates and "
    f"{single_cells['Metadata_Well'].nunique()} unique wells"
)
# Feature columns are everything that is neither metadata nor CytoTable provenance
feature_cols = [
    c for c in single_cells.columns
    if not c.startswith(("Metadata_", "cytotable_"))
]
print(f"\nFeature columns ({len(feature_cols)}): {feature_cols}")
single_cells.head(3)
Loaded 1,200 single cells across 2 plates and 6 unique wells
Feature columns (12): ['Cells_AreaShape_Area', 'Cells_AreaShape_BoundingBoxArea', 'Cells_AreaShape_EulerNumber', 'Cells_AreaShape_Eccentricity', 'Cells_Intensity_MeanIntensity_Mito', 'Cells_Texture_Correlation_RNA_3_0_256', 'Cytoplasm_AreaShape_Area', 'Cytoplasm_Intensity_MeanIntensity_AGP', 'Nuclei_AreaShape_Area', 'Nuclei_AreaShape_Eccentricity', 'Nuclei_Intensity_MeanIntensity_DNA', 'Nuclei_Intensity_MassDisplacement_DNA']
[3]:
| Metadata_Plate | Metadata_Well | Metadata_ImageNumber | Metadata_ObjectNumber | cytotable_meta_source_path | cytotable_meta_offset | cytotable_meta_rownum | Cells_AreaShape_Area | Cells_AreaShape_BoundingBoxArea | Cells_AreaShape_EulerNumber | Cells_AreaShape_Eccentricity | Cells_Intensity_MeanIntensity_Mito | Cells_Texture_Correlation_RNA_3_0_256 | Cytoplasm_AreaShape_Area | Cytoplasm_Intensity_MeanIntensity_AGP | Nuclei_AreaShape_Area | Nuclei_AreaShape_Eccentricity | Nuclei_Intensity_MeanIntensity_DNA | Nuclei_Intensity_MassDisplacement_DNA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Plate_1 | B02 | 1 | 1 | /data/Plate_1/images/ | 1 | 1 | 536.566050 | 698.886163 | 1 | 0.718898 | 0.305435 | 0.258636 | 145.986232 | 0.246590 | 174.201060 | 0.315677 | 0.402495 | 2.487391 |
| 1 | Plate_1 | B02 | 1 | 2 | /data/Plate_1/images/ | 2 | 2 | 375.201907 | 486.425986 | 1 | 0.659908 | 0.220416 | 0.221838 | 271.266445 | 0.227063 | 266.457556 | 0.500276 | 0.543049 | 11.349592 |
| 2 | Plate_1 | B02 | 1 | 3 | /data/Plate_1/images/ | 3 | 3 | 590.054143 | 766.452364 | 1 | 0.466487 | 0.286568 | 0.234550 | 324.125869 | 0.174093 | 175.405482 | 0.409049 | 0.518258 | 16.069896 |
Background: Single-cell quality control with coSMicQC [Optional]
coSMicQC (GitHub | docs | preprint) is a Python package from the Way Lab that systematically identifies segmentation artifacts, for example:
| Artifact | Morphological signature | Biological cause |
|---|---|---|
| Debris / background | Very small nucleus; low DNA intensity | Out-of-focus plane, dust on coverslip |
| Over-segmented nucleus | Nucleus area far above the population mean | One nucleus split into multiple objects |
| Touching / fused cells | Very high mass displacement from multiple objects | Adjacent cells merged into a single object |
How coSMicQC flags outliers
coSMicQC computes a z-score for each quality-relevant feature across the entire experiment. Cells whose z-scores fall outside user-defined thresholds are flagged as outliers. Thresholds are signed:
A negative threshold (e.g. −2.5) flags cells where the feature is unusually small (debris, broken nuclei).
A positive threshold (e.g. +2.5) flags cells where the feature is unusually large (fused or over-segmented objects).
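The signed-threshold idea can be illustrated with a few lines of pandas. This is a simplified sketch for intuition, not coSMicQC's implementation; the helper name `flag_by_signed_threshold` and the toy areas are hypothetical.

```python
import pandas as pd


def flag_by_signed_threshold(values: pd.Series, threshold: float) -> pd.Series:
    """Flag values whose z-score lies beyond a signed threshold."""
    z = (values - values.mean()) / values.std()
    # Negative threshold: flag the low tail. Positive threshold: flag the high tail.
    return (z < threshold) if threshold < 0 else (z > threshold)


# Nine ordinary nuclei and one very large one (made-up areas)
nuclei_area = pd.Series([10.0] * 9 + [100.0])

too_large = flag_by_signed_threshold(nuclei_area, 2.5)
too_small = flag_by_signed_threshold(nuclei_area, -2.5)
print(f"Flagged as too large: {int(too_large.sum())}")
print(f"Flagged as too small: {int(too_small.sum())}")
```

The same feature can therefore appear under two conditions, one per tail, which is how the small and large nuclear size checks coexist.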
The main entry point is label_outliers(), which accepts a dictionary of named QC conditions. Each condition name becomes part of the output column name, making the reason for each flag explicit and auditable:
import cosmicqc
labeled = cosmicqc.label_outliers(
    df=single_cells,
    feature_thresholds={
        # Flag nuclei that are too small (debris)
        "small_nuclear_size": {
            "Nuclei_AreaShape_Area": -2.5,
        },
        # Flag nuclei that are too large (over-segmented)
        "large_nuclear_size": {
            "Nuclei_AreaShape_Area": 2.5,
        },
        # Flag cells with an abnormally high nuclear mass displacement
        # (a hallmark of touching or merged nuclei in one object)
        "poor_segmentation": {
            "Nuclei_Intensity_MassDisplacement_DNA": 2.5,
        },
    },
    include_threshold_scores=True,  # also write z-score columns for auditing
    export_path="qc.parquet",
    export_as_annotations=True,  # write compact annotation file only
    annotation_metadata_columns=[
        "Metadata_Plate", "Metadata_Well",
        "Metadata_ImageNumber", "Metadata_ObjectNumber",
    ],
)
The qc.parquet annotation file
When export_as_annotations=True, coSMicQC writes a compact annotation file called qc.parquet, which contains only the join-key metadata columns and the Metadata_cqc_* flag columns (not the full feature table). This makes qc.parquet lightweight and easy to share independently of the raw single-cell data.
Each Metadata_cqc_<condition>_is_outlier column is a boolean: True = flagged, False = passes that QC check. A cell must pass all conditions to be included in downstream analysis.
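The "must pass all conditions" rule amounts to a row-wise OR over the flag columns, followed by a negation. A minimal pandas sketch on a toy annotation table (the rows are made up; the column names follow the naming pattern described above):

```python
import pandas as pd

# Toy annotation table with two QC conditions
qc = pd.DataFrame({
    "Metadata_ObjectNumber": [1, 2, 3, 4],
    "Metadata_cqc_small_nuclear_size_is_outlier": [False, True, False, False],
    "Metadata_cqc_poor_segmentation_is_outlier": [False, False, False, True],
})

flag_cols = [
    c for c in qc.columns
    if c.startswith("Metadata_cqc_") and c.endswith("_is_outlier")
]

# A cell passes QC only if none of its flags is True
passes_qc = ~qc[flag_cols].any(axis=1)
kept = qc.loc[passes_qc, "Metadata_ObjectNumber"].tolist()
print(f"Objects passing every QC check: {kept}")
```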
[4]:
# Load the coSMicQC annotation file and inspect its contents
qc_annotations = pd.read_parquet(qc_path)
print("coSMicQC annotation columns:")
for col in qc_annotations.columns:
    print(f" {col}")
outlier_cols = [c for c in qc_annotations.columns if c.endswith("_is_outlier")]
print()
for col in outlier_cols:
    n_flagged = qc_annotations[col].sum()
    print(
        f" {col}: {n_flagged:,} cells flagged ({100 * n_flagged / len(qc_annotations):.1f}%)"
    )
coSMicQC annotation columns:
Metadata_Plate
Metadata_Well
Metadata_ImageNumber
Metadata_ObjectNumber
Metadata_cqc_large_nuclear_size_is_outlier
Metadata_cqc_small_nuclear_size_is_outlier
Metadata_cqc_poor_segmentation_is_outlier
Metadata_cqc_large_nuclear_size_is_outlier: 5 cells flagged (0.4%)
Metadata_cqc_small_nuclear_size_is_outlier: 9 cells flagged (0.8%)
Metadata_cqc_poor_segmentation_is_outlier: 12 cells flagged (1.0%)
Step 1: Annotate
annotate() does two jobs in a single call:
Plate-map join: attaches the biological condition (treatment, cell line, concentration) recorded for each well to every cell in that well.
External metadata merge: merges any additional per-cell metadata DataFrame or file, supplied through the external_metadata parameter. The most common use case is a qc.parquet file from coSMicQC: passing it as external_metadata adds the Metadata_cqc_* flag columns directly to the annotated profiles.
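| Parameter | Description |
|---|---|
| platemap | Maps well positions to treatment conditions |
| join_on | Column pair [platemap column, profiles column] used to join the plate map to the profiles |
| external_metadata | Path to qc.parquet |
| external_join_on | Column(s) shared by profiles and external metadata (here the four-part cell identity key) |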
After annotate() runs, the Metadata_cqc_* flag columns are present on every row and flow straight into normalize(), which applies the QC filter internally via drop_cosmicqc_rows=True.
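Conceptually, the two jobs correspond to two left joins. The pandas sketch below mimics that behaviour on toy tables. It is an approximation for intuition only, not the code annotate() runs internally, and the toy key omits Metadata_ImageNumber for brevity.

```python
import pandas as pd

profiles = pd.DataFrame({
    "Metadata_Plate": ["Plate_1"] * 3,
    "Metadata_Well": ["B02", "B02", "B03"],
    "Metadata_ObjectNumber": [1, 2, 1],
    "Cells_AreaShape_Area": [510.0, 480.0, 690.0],
})
platemap = pd.DataFrame({
    "Metadata_well_position": ["B02", "B03"],
    "Metadata_treatment": ["DMSO", "Compound_A"],
})
qc = pd.DataFrame({
    "Metadata_Plate": ["Plate_1"] * 3,
    "Metadata_Well": ["B02", "B02", "B03"],
    "Metadata_ObjectNumber": [1, 2, 1],
    "Metadata_cqc_poor_segmentation_is_outlier": [False, True, False],
})

# Job 1: per-well join (one plate-map row fans out to every cell in that well)
annotated = profiles.merge(
    platemap,
    left_on="Metadata_Well",
    right_on="Metadata_well_position",
    how="left",
).drop(columns="Metadata_well_position")

# Job 2: per-cell merge of the QC flags on the shared identity key;
# validate= raises if either side has duplicate keys
key = ["Metadata_Plate", "Metadata_Well", "Metadata_ObjectNumber"]
annotated = annotated.merge(qc, on=key, how="left", validate="one_to_one")

print(annotated[["Metadata_Well", "Metadata_treatment",
                 "Metadata_cqc_poor_segmentation_is_outlier"]])
```

The row count stays at three after both joins, which is the property to check on real data: annotation should add columns, never rows.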
[5]:
platemap = pd.DataFrame({
    "well_position": ["B02", "C02", "B03", "C03", "B04", "C04"],
    "treatment": [
        "DMSO",
        "DMSO",
        "Compound_A",
        "Compound_A",
        "Compound_B",
        "Compound_B",
    ],
    "cell_line": ["HeLa"] * 6,
    "concentration_um": [0.0, 0.0, 10.0, 10.0, 5.0, 5.0],
})
platemap
[5]:
| well_position | treatment | cell_line | concentration_um | |
|---|---|---|---|---|
| 0 | B02 | DMSO | HeLa | 0.0 |
| 1 | C02 | DMSO | HeLa | 0.0 |
| 2 | B03 | Compound_A | HeLa | 10.0 |
| 3 | C03 | Compound_A | HeLa | 10.0 |
| 4 | B04 | Compound_B | HeLa | 5.0 |
| 5 | C04 | Compound_B | HeLa | 5.0 |
[6]:
join_keys = [
    "Metadata_Plate",
    "Metadata_Well",
    "Metadata_ImageNumber",
    "Metadata_ObjectNumber",
]
# annotate() merges the plate map AND the QC annotation file in a single call.
# The qc.parquet columns already carry the Metadata_ prefix, so they pass through
# prepare_external_metadata_for_annotate() unchanged.
annotated_cells = annotate(
    profiles=single_cells,
    platemap=platemap,
    join_on=["Metadata_well_position", "Metadata_Well"],
    add_metadata_id_to_platemap=True,
    external_metadata=str(qc_path),
    external_join_on=join_keys,
)
new_cols = [c for c in annotated_cells.columns if c not in single_cells.columns]
qc_cols = [c for c in new_cols if "cqc" in c]
print(f"New columns: {new_cols}")
print(f"QC flag columns: {qc_cols}")
print(
    f"\nCells flagged by any QC condition: "
    f"{annotated_cells[qc_cols].any(axis=1).sum():,} "
    f"({100 * annotated_cells[qc_cols].any(axis=1).mean():.1f}%)"
)
print()
annotated_cells[
    [c for c in annotated_cells.columns if c.startswith("Metadata_")]
].drop_duplicates(subset=["Metadata_Well"]).head()
New columns: ['Metadata_treatment', 'Metadata_cell_line', 'Metadata_concentration_um', 'Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']
QC flag columns: ['Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']
Cells flagged by any QC condition: 26 (2.2%)
[6]:
| Metadata_treatment | Metadata_cell_line | Metadata_concentration_um | Metadata_Plate | Metadata_Well | Metadata_ImageNumber | Metadata_ObjectNumber | Metadata_cqc_large_nuclear_size_is_outlier | Metadata_cqc_small_nuclear_size_is_outlier | Metadata_cqc_poor_segmentation_is_outlier | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DMSO | HeLa | 0.0 | Plate_1 | B02 | 1 | 1 | False | False | False |
| 200 | DMSO | HeLa | 0.0 | Plate_1 | C02 | 2 | 1 | False | False | False |
| 400 | Compound_A | HeLa | 10.0 | Plate_1 | B03 | 3 | 1 | False | False | False |
| 600 | Compound_A | HeLa | 10.0 | Plate_1 | C03 | 4 | 1 | False | False | False |
| 800 | Compound_B | HeLa | 5.0 | Plate_1 | B04 | 5 | 1 | False | False | False |
Step 2: Normalize
Raw CellProfiler features vary in scale (cell area in pixels², intensities in 0–1) and are influenced by plate-to-plate technical effects. Normalization places all features on a common scale and limits plate-to-plate variation by z-scoring each feature relative to the DMSO control cells.
Passing drop_cosmicqc_rows=True tells normalize() to drop every row where any Metadata_cqc_* flag is True before computing the z-scores, so QC filtering and normalization happen in a single call.
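The order of operations matters: outliers are removed first so they cannot inflate the control statistics, and the control-derived mean and standard deviation are then applied to every remaining cell. The sketch below reproduces that logic in plain pandas on toy data. It is not Pycytominer's implementation, and the use of the population standard deviation (ddof=0) is an assumption of this sketch.

```python
import pandas as pd

cells = pd.DataFrame({
    "Metadata_treatment": ["DMSO", "DMSO", "Compound_A", "DMSO"],
    "Metadata_cqc_poor_segmentation_is_outlier": [False, False, False, True],
    "Cells_AreaShape_Area": [1.0, 3.0, 5.0, 40.0],
})
feature = "Cells_AreaShape_Area"

# 1. Drop every cell flagged by any Metadata_cqc_* column
flag_cols = [c for c in cells.columns if c.startswith("Metadata_cqc_")]
clean = cells.loc[~cells[flag_cols].any(axis=1)].copy()

# 2. Estimate mean and standard deviation from the DMSO controls only
controls = clean.loc[clean["Metadata_treatment"] == "DMSO", feature]
mu, sigma = controls.mean(), controls.std(ddof=0)

# 3. Apply the control-derived transform to every remaining cell
clean[feature] = (clean[feature] - mu) / sigma
print(clean[["Metadata_treatment", feature]])
```

Had the flagged DMSO cell (area 40) been kept, it would have dominated both the control mean and the control spread, compressing the z-score of the treated cell.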
[7]:
# drop_cosmicqc_rows=True removes QC-flagged cells before z-scoring.
normalized_cells = normalize(
    profiles=annotated_cells,
    features="infer",
    meta_features="infer",
    samples="Metadata_treatment == 'DMSO'",
    method="standardize",
    drop_cosmicqc_rows=True,
)
n_removed = len(annotated_cells) - len(normalized_cells)
print(f"{'Total cells':<22} {len(annotated_cells):>6,}")
print(
    f"{'Removed (QC outliers)':<22} {n_removed:>6,} ({100 * n_removed / len(annotated_cells):.1f}%)"
)
print(f"{'Retained':<22} {len(normalized_cells):>6,}")
print()
print(f"Normalized shape: {normalized_cells.shape}")
normalized_cells.head(3)
Total cells 1,200
Removed (QC outliers) 26 (2.2%)
Retained 1,174
Normalized shape: (1174, 22)
[7]:
| Metadata_treatment | Metadata_cell_line | Metadata_concentration_um | Metadata_Plate | Metadata_Well | Metadata_ImageNumber | Metadata_ObjectNumber | Metadata_cqc_large_nuclear_size_is_outlier | Metadata_cqc_small_nuclear_size_is_outlier | Metadata_cqc_poor_segmentation_is_outlier | ... | Cells_AreaShape_EulerNumber | Cells_AreaShape_Eccentricity | Cells_Intensity_MeanIntensity_Mito | Cells_Texture_Correlation_RNA_3_0_256 | Cytoplasm_AreaShape_Area | Cytoplasm_Intensity_MeanIntensity_AGP | Nuclei_AreaShape_Area | Nuclei_AreaShape_Eccentricity | Nuclei_Intensity_MeanIntensity_DNA | Nuclei_Intensity_MassDisplacement_DNA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DMSO | HeLa | 0.0 | Plate_1 | B02 | 1 | 1 | False | False | False | ... | 0.0 | 1.292772 | 0.100004 | 0.733196 | -2.107301 | -0.009458 | -0.380694 | -0.907252 | -1.166556 | -1.028890 |
| 1 | DMSO | HeLa | 0.0 | Plate_1 | B02 | 1 | 2 | False | False | False | ... | 0.0 | 0.819530 | -1.325156 | 0.121467 | -0.481165 | -0.293417 | 1.319231 | 1.025113 | 0.607217 | 1.555226 |
| 3 | DMSO | HeLa | 0.0 | Plate_1 | B02 | 1 | 4 | False | False | False | ... | 0.0 | -0.883623 | -0.280147 | -1.368761 | -0.591794 | 0.361401 | 0.749972 | 1.237712 | -0.672132 | -0.767620 |
3 rows × 22 columns
Step 3: Feature Selection
Even after QC and normalization, some features carry little information:
Low-variance features are nearly constant across all cells and cannot distinguish biological conditions.
Highly correlated feature pairs are redundant; keeping both double-weights that axis of variation in clustering and embeddings.
Blocklisted features are known to capture image artifacts rather than cell biology.
feature_select() applies all three filters, producing a lean feature matrix ready for single-cell analyses such as UMAP or hierarchical clustering.
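To make the first two filters concrete, here is a deliberately simplified pandas sketch on toy data. It is not the algorithm feature_select() uses: the cutoffs (1e-8 and 0.9) are assumptions of this sketch, the choice of which member of a correlated pair to drop is arbitrary here and may differ from Pycytominer's, and the blocklist filter is omitted.

```python
import pandas as pd

features = pd.DataFrame({
    "Cells_AreaShape_Area": [1.0, 2.0, 3.0, 4.0],
    "Cells_AreaShape_BoundingBoxArea": [1.3, 2.6, 3.9, 5.3],  # tracks Area closely
    "Cells_AreaShape_EulerNumber": [1.0, 1.0, 1.0, 1.0],      # constant
    "Nuclei_AreaShape_Eccentricity": [0.4, 0.1, 0.5, 0.2],
})

# Filter 1: drop features with (near-)zero variance
low_variance = [c for c in features.columns if features[c].var() < 1e-8]
kept = features.drop(columns=low_variance)

# Filter 2: for each highly correlated pair, drop the later column
corr = kept.corr().abs()
cols = list(kept.columns)
redundant = set()
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if a not in redundant and b not in redundant and corr.loc[a, b] > 0.9:
            redundant.add(b)
kept = kept.drop(columns=sorted(redundant))

print(f"Low variance: {low_variance}")
print(f"Redundant:    {sorted(redundant)}")
print(f"Kept:         {list(kept.columns)}")
```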
[8]:
selected_cells = feature_select(
    profiles=normalized_cells,
    features="infer",
    operation=["variance_threshold", "correlation_threshold", "blocklist"],
)
feature_cols_before = [
    c for c in normalized_cells.columns if not c.startswith("Metadata_")
]
feature_cols_after = [
    c for c in selected_cells.columns if not c.startswith("Metadata_")
]
print(f"Features before selection: {len(feature_cols_before)}")
print(f"Features after selection: {len(feature_cols_after)}")
print(f"Features removed: {set(feature_cols_before) - set(feature_cols_after)}")
selected_cells.head(3)
Features before selection: 12
Features after selection: 10
Features removed: {'Cells_AreaShape_Area', 'Cells_AreaShape_EulerNumber'}
[8]:
| Metadata_treatment | Metadata_cell_line | Metadata_concentration_um | Metadata_Plate | Metadata_Well | Metadata_ImageNumber | Metadata_ObjectNumber | Metadata_cqc_large_nuclear_size_is_outlier | Metadata_cqc_small_nuclear_size_is_outlier | Metadata_cqc_poor_segmentation_is_outlier | Cells_AreaShape_BoundingBoxArea | Cells_AreaShape_Eccentricity | Cells_Intensity_MeanIntensity_Mito | Cells_Texture_Correlation_RNA_3_0_256 | Cytoplasm_AreaShape_Area | Cytoplasm_Intensity_MeanIntensity_AGP | Nuclei_AreaShape_Area | Nuclei_AreaShape_Eccentricity | Nuclei_Intensity_MeanIntensity_DNA | Nuclei_Intensity_MassDisplacement_DNA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DMSO | HeLa | 0.0 | Plate_1 | B02 | 1 | 1 | False | False | False | 0.423873 | 1.292772 | 0.100004 | 0.733196 | -2.107301 | -0.009458 | -0.380694 | -0.907252 | -1.166556 | -1.028890 |
| 1 | DMSO | HeLa | 0.0 | Plate_1 | B02 | 1 | 2 | False | False | False | -1.020114 | 0.819530 | -1.325156 | 0.121467 | -0.481165 | -0.293417 | 1.319231 | 1.025113 | 0.607217 | 1.555226 |
| 3 | DMSO | HeLa | 0.0 | Plate_1 | B02 | 1 | 4 | False | False | False | 1.139881 | -0.883623 | -0.280147 | -1.368761 | -0.591794 | 0.361401 | 0.749972 | 1.237712 | -0.672132 | -0.767620 |
Summary
You have processed a CytoTable single-cell dataset through a complete quality-control and normalization pipeline, preserving single-cell resolution throughout:
| Step | Function | Input | Output |
|---|---|---|---|
| Load | pd.read_parquet() | CytoTable Parquet | 1,200 single cells |
| Annotate | annotate() | Single cells + platemap + qc.parquet | Cells with treatment labels and QC flags |
| Normalize | normalize() | Annotated cells | 1,174 passing cells, z-scored |
| Feature select | feature_select() | 12 features | 10 features |
The output is a clean, normalized single-cell feature matrix, selected_cells, where every row is one cell and every column is an informative morphological feature.
Next steps
Embed: run UMAP or t-SNE on selected_cells to visualize how treatments separate in morphological space at single-cell resolution.
Cluster: apply k-means or Leiden clustering to discover subpopulations within each treatment condition.
Aggregate: feed selected_cells into aggregate() if you need well-level profiles (e.g. for the consensus pipeline shown in the Introduction to Image-based Profiling with Pycytominer tutorial).
Hit calling: identify which compounds produce a statistically significant morphological change relative to controls. Buscar operates directly on single-cell distributions to capture cellular heterogeneity and flag off-target effects.
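For the aggregation route, the core operation is a per-well summary of each feature. The following is a minimal pandas sketch assuming a median summary. The toy values are made up, and aggregate() has its own options that are not reproduced here.

```python
import pandas as pd

cells = pd.DataFrame({
    "Metadata_Plate": ["Plate_1"] * 5,
    "Metadata_Well": ["B02", "B02", "B02", "B03", "B03"],
    "Nuclei_AreaShape_Area": [-0.5, 0.0, 2.0, 1.0, 3.0],
})

# One row per well; the median is robust to a few extreme single cells
well_profiles = (
    cells.groupby(["Metadata_Plate", "Metadata_Well"])["Nuclei_AreaShape_Area"]
    .median()
    .reset_index()
)
print(well_profiles)
```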