{
"cells": [
{
"cell_type": "markdown",
"id": "cell-title",
"metadata": {},
"source": [
"# Single-cell image-based profiling\n",
"\n",
"## A complete single-cell processing pipeline with Pycytominer\n",
"\n",
"High-content microscopy experiments can produce thousands of single-cell\n",
"measurements per image. Working at single-cell resolution (rather than first\n",
"aggregating cells into well-level profiles) preserves the full diversity of\n",
"cellular responses: rare subpopulations, bimodal distributions, and heterogeneous\n",
"drug effects that vanish in the average.\n",
"\n",
"Single-cell profiling introduces a challenge that well-level profiling sidesteps:\n",
"**not every detected object is a real, well-segmented cell.** Debris, out-of-focus\n",
"objects, and fused cells contaminate the feature matrix and distort downstream\n",
"analyses. A quality-control step is therefore essential before dimensionality\n",
"reduction, clustering, or hit calling.\n",
"\n",
"This tutorial walks through a complete single-cell processing pipeline starting\n",
"from [CytoTable](https://cytomining.github.io/CytoTable/) output.\n",
"[coSMicQC](https://cytomining.github.io/coSMicQC/) is used here for QC:\n",
"\n",
"1. **Load**: read the joined single-cell Parquet file produced by CytoTable\n",
"2. **Annotate**: attach experimental metadata and QC flags from coSMicQC\n",
"3. **Normalize**: drop QC outliers and z-score features against DMSO controls\n",
"4. **Feature select**: drop redundant and uninformative features\n",
"\n",
"The result is a clean, normalized single-cell feature matrix ready for\n",
"dimensionality reduction, clustering, or further aggregation.\n",
"\n",
"> **New to pycytominer?** Read the\n",
"> [Introduction to Pycytominer](introduction_to_pycytominer.ipynb) tutorial first.\n",
"> This tutorial assumes familiarity with the core pipeline steps."
]
},
{
"cell_type": "raw",
"id": "cell-pipeline-diagram",
"metadata": {
"raw_mimetype": "text/restructuredtext"
},
"source": [
".. mermaid::\n",
" :align: center\n",
"\n",
" flowchart TD\n",
" cytotable[\"CytoTable output
single_cells.parquet, 1200 cells\"]\n",
" qcfile[\"coSMicQC output
qc.parquet, QC annotations\"]\n",
" ann[\"annotate()
Add platemap + QC flags\"]\n",
" nor[\"normalize()
Drop QC outliers · Z-score vs DMSO\"]\n",
" fea[\"feature_select()
Remove redundant features\"]\n",
" output[\"Single-cell profiles
~1174 cells, 10 features\"]\n",
"\n",
" cytotable --> ann\n",
" qcfile --> ann\n",
" ann --> nor --> fea --> output\n",
"\n",
" style cytotable fill:#f0d9fa,stroke:#88239A,color:#111\n",
" style qcfile fill:#f0d9fa,stroke:#88239A,color:#111\n",
" style output fill:#f0d9fa,stroke:#88239A,color:#111\n",
" style ann fill:#ffffff,stroke:#88239A,color:#111\n",
" style nor fill:#ffffff,stroke:#88239A,color:#111\n",
" style fea fill:#ffffff,stroke:#88239A,color:#111"
]
},
{
"cell_type": "markdown",
"id": "cell-prereqs",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"Install the required packages:\n",
"\n",
"```bash\n",
"pip install pycytominer coSMicQC pyarrow pandas numpy\n",
"```\n",
"\n",
"This tutorial uses **simulated data** that matches the exact schema produced by [CytoTable](https://cytomining.github.io/CytoTable/) and [coSMicQC](https://cytomining.github.io/coSMicQC/). In a real experiment, replace the simulation block with your own `single_cells.parquet` and `qc.parquet` files."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "cell-imports",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-01T19:34:58.190044Z",
"iopub.status.busy": "2026-06-01T19:34:58.189829Z",
"iopub.status.idle": "2026-06-01T19:34:59.724211Z",
"shell.execute_reply": "2026-06-01T19:34:59.723901Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Working directory: /var/folders/02/q30k_4wn2dqbz5pj_vvc8xn40000gp/T/tmp57clvnip\n"
]
}
],
"source": [
"import tempfile\n",
"from pathlib import Path\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from pycytominer import annotate, feature_select, normalize\n",
"\n",
"# Reproducible random state used throughout the simulation\n",
"rng = np.random.default_rng(42)\n",
"\n",
"# Temporary directory — stands in for the output directory on your filesystem\n",
"tmp_dir = Path(tempfile.mkdtemp())\n",
"print(f\"Working directory: {tmp_dir}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-cytotable-intro",
"metadata": {},
"source": [
"## Input: CytoTable Single-Cell Data\n",
"\n",
"[CytoTable](https://cytomining.github.io/CytoTable/) converts CellProfiler SQLite or CSV output into a single analysis-ready Parquet file. Each row represents one segmented object (a cell), and columns fall into three groups:\n",
"\n",
"| Group | Example columns | Purpose |\n",
"|---|---|---|\n",
"| `Metadata_*` | `Metadata_Plate`, `Metadata_Well`, `Metadata_ImageNumber`, `Metadata_ObjectNumber` | Describe the experiment |\n",
"| `cytotable_meta_*` | `cytotable_meta_source_path`, `cytotable_meta_offset` | CytoTable provenance. Pycytominer ignores these automatically |\n",
"| Feature columns | `Cells_AreaShape_Area`, `Nuclei_Intensity_MeanIntensity_DNA` | Morphology measurements per single-cell |\n",
"\n",
"`Metadata_ImageNumber` and `Metadata_ObjectNumber` together uniquely identify every cell and serve as the **join key** between the single-cell data and the coSMicQC annotations.\n",
"\n",
"> **Note on `cytotable_meta_*` columns:** These provenance columns track source-file offsets for CytoTable's internal bookkeeping. Pycytominer's feature inference uses CellProfiler compartment prefixes (`Cells_`, `Cytoplasm_`, `Nuclei_`) and ignores them automatically. They pass through `annotate()` unchanged and are dropped at the `normalize()` step.\n",
"\n",
"The simulation code is available in the expandable block below. Skip it to go straight to the next step."
]
},
{
"cell_type": "raw",
"id": "cell-simulate-toggle",
"metadata": {
"raw_mimetype": "text/restructuredtext",
"vscode": {
"languageId": "raw"
}
},
"source": [
".. toggle::\n",
"\n",
" In a real experiment these files come from running\n",
" `CytoTable `__ and\n",
" `coSMicQC `__ on your CellProfiler\n",
" output. The functions below reproduce their output schemas using synthetic data.\n",
"\n",
" **Step A — simulate CytoTable single-cell data**\n",
"\n",
" .. code-block:: python\n",
"\n",
" WELLS = {\n",
" \"B02\": \"DMSO\", \"C02\": \"DMSO\",\n",
" \"B03\": \"Compound_A\", \"C03\": \"Compound_A\",\n",
" \"B04\": \"Compound_B\", \"C04\": \"Compound_B\",\n",
" }\n",
" N_CELLS_PER_WELL = 100\n",
"\n",
" def simulate_cytotable(plate_id: str) -> pd.DataFrame:\n",
" \"\"\"Generate a synthetic CytoTable-style single-cell DataFrame.\"\"\"\n",
" rows = []\n",
" for img_num, (well, treatment) in enumerate(WELLS.items(), start=1):\n",
" is_a = float(treatment == \"Compound_A\")\n",
" is_b = float(treatment == \"Compound_B\")\n",
" cell_areas = rng.normal(500 + 180 * is_a - 90 * is_b, 120, N_CELLS_PER_WELL)\n",
" nuclei_areas = rng.normal(195, 55, N_CELLS_PER_WELL)\n",
" for obj_num in range(1, N_CELLS_PER_WELL + 1):\n",
" rows.append( {\n",
" # ── CytoTable metadata ──────────────────────────────────\n",
" \"Metadata_Plate\": plate_id,\n",
" \"Metadata_Well\": well,\n",
" \"Metadata_ImageNumber\": img_num,\n",
" \"Metadata_ObjectNumber\": obj_num,\n",
" # CytoTable provenance columns\n",
" \"cytotable_meta_source_path\": f\"/data/{plate_id}/images/\",\n",
" \"cytotable_meta_offset\": (img_num - 1) * N_CELLS_PER_WELL + obj_num,\n",
" \"cytotable_meta_rownum\": obj_num,\n",
" # ── Feature columns ─────────────────────────────────────\n",
" \"Cells_AreaShape_Area\": cell_areas[obj_num - 1],\n",
" \"Cells_AreaShape_BoundingBoxArea\": cell_areas[obj_num - 1] * 1.3\n",
" + rng.normal(0, 4),\n",
" \"Cells_AreaShape_EulerNumber\": 1,\n",
" \"Cells_AreaShape_Eccentricity\": float(\n",
" np.clip(rng.normal(0.55, 0.12), 0, 1)\n",
" ),\n",
" \"Cells_Intensity_MeanIntensity_Mito\": rng.normal(0.30, 0.06),\n",
" \"Cells_Texture_Correlation_RNA_3_0_256\": rng.normal(0.22, 0.06),\n",
" \"Cytoplasm_AreaShape_Area\": rng.normal(310, 80),\n",
" \"Cytoplasm_Intensity_MeanIntensity_AGP\": rng.normal(0.25, 0.07),\n",
" \"Nuclei_AreaShape_Area\": nuclei_areas[obj_num - 1],\n",
" \"Nuclei_AreaShape_Eccentricity\": float(\n",
" np.clip(rng.normal(0.40, 0.10), 0, 1)\n",
" ),\n",
" \"Nuclei_Intensity_MeanIntensity_DNA\": rng.normal(0.50, 0.08),\n",
" \"Nuclei_Intensity_MassDisplacement_DNA\": abs(rng.normal(6, 4)),\n",
" })\n",
" return pd.DataFrame(rows)\n",
"\n",
" **Step B — simulate coSMicQC QC annotations**\n",
"\n",
" ``label_outliers(..., export_as_annotations=True)`` writes a compact Parquet\n",
" with only join-key columns and boolean ``Metadata_cqc_*`` flags.\n",
"\n",
" .. code-block:: python\n",
"\n",
" def simulate_qc_parquet(sc_df: pd.DataFrame) -> pd.DataFrame:\n",
" \"\"\"Reproduce the annotation schema produced by coSMicQC label_outliers().\"\"\"\n",
"\n",
" join_keys = [\n",
" \"Metadata_Plate\",\n",
" \"Metadata_Well\",\n",
" \"Metadata_ImageNumber\",\n",
" \"Metadata_ObjectNumber\",\n",
" ]\n",
"\n",
" qc = sc_df[join_keys].copy()\n",
"\n",
" nuc_area = sc_df[\"Nuclei_AreaShape_Area\"]\n",
" nuc_z = (nuc_area - nuc_area.mean()) / nuc_area.std()\n",
"\n",
" mass_disp = sc_df[\"Nuclei_Intensity_MassDisplacement_DNA\"]\n",
" mass_disp_z = (mass_disp - mass_disp.mean()) / mass_disp.std()\n",
"\n",
" qc[\"Metadata_cqc_large_nuclear_size_is_outlier\"] = nuc_z > 2.5\n",
" qc[\"Metadata_cqc_small_nuclear_size_is_outlier\"] = nuc_z < -2.5\n",
" qc[\"Metadata_cqc_poor_segmentation_is_outlier\"] = mass_disp_z > 2.5\n",
"\n",
" return qc\n",
"\n",
" **Step C — build two plates and write to disk**\n",
"\n",
" .. code-block:: python\n",
"\n",
" plate1 = simulate_cytotable(\"Plate_1\")\n",
" plate2 = simulate_cytotable(\"Plate_2\")\n",
" single_cells_raw = pd.concat([plate1, plate2], ignore_index=True)\n",
"\n",
" qc_annotations_raw = simulate_qc_parquet(single_cells_raw)\n",
"\n",
" sc_path = tmp_dir / \"single_cells.parquet\"\n",
" qc_path = tmp_dir / \"qc.parquet\"\n",
" single_cells_raw.to_parquet(sc_path, index=False)\n",
" qc_annotations_raw.to_parquet(qc_path, index=False)\n",
"\n",
" print(f\"single_cells.parquet {single_cells_raw.shape[0]:,} rows x {single_cells_raw.shape[1]} cols\")\n",
" print(f\"qc.parquet {qc_annotations_raw.shape[0]:,} rows x {qc_annotations_raw.shape[1]} cols\")\n",
" print(f\"\\nqc.parquet columns: {list(qc_annotations_raw.columns)}\")\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cell-simulate",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-01T19:34:59.726742Z",
"iopub.status.busy": "2026-06-01T19:34:59.726457Z",
"iopub.status.idle": "2026-06-01T19:34:59.852969Z",
"shell.execute_reply": "2026-06-01T19:34:59.852647Z"
},
"nbsphinx": "hidden"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"single_cells.parquet 1,200 rows x 19 cols\n",
"qc.parquet 1,200 rows x 7 cols\n",
"\n",
"qc.parquet columns: ['Metadata_Plate', 'Metadata_Well', 'Metadata_ImageNumber', 'Metadata_ObjectNumber', 'Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']\n"
]
}
],
"source": [
"# ── Simulate CytoTable single-cell output ─────────────────────────────────\n",
"#\n",
"# In a real experiment this file is produced by:\n",
"# import cytotable\n",
"# cytotable.convert(source_path=\"...\", dest_path=\"single_cells.parquet\", ...)\n",
"#\n",
"# Here we generate synthetic data with the same column schema.\n",
"\n",
"WELLS = {\n",
" \"B02\": \"DMSO\",\n",
" \"C02\": \"DMSO\",\n",
" \"B03\": \"Compound_A\",\n",
" \"C03\": \"Compound_A\",\n",
" \"B04\": \"Compound_B\",\n",
" \"C04\": \"Compound_B\",\n",
"}\n",
"N_CELLS_PER_WELL = 100\n",
"\n",
"\n",
"def simulate_cytotable(plate_id: str) -> pd.DataFrame:\n",
" \"\"\"Generate a synthetic CytoTable-style single-cell DataFrame.\"\"\"\n",
" rows = []\n",
" for img_num, (well, treatment) in enumerate(WELLS.items(), start=1):\n",
" is_a = float(treatment == \"Compound_A\")\n",
" is_b = float(treatment == \"Compound_B\")\n",
" cell_areas = rng.normal(500 + 180 * is_a - 90 * is_b, 120, N_CELLS_PER_WELL)\n",
" nuclei_areas = rng.normal(195, 55, N_CELLS_PER_WELL)\n",
" for obj_num in range(1, N_CELLS_PER_WELL + 1):\n",
" rows.append({\n",
" # ── CytoTable metadata ──────────────────────────────────\n",
" \"Metadata_Plate\": plate_id,\n",
" \"Metadata_Well\": well,\n",
" \"Metadata_ImageNumber\": img_num,\n",
" \"Metadata_ObjectNumber\": obj_num,\n",
" # CytoTable provenance columns\n",
" \"cytotable_meta_source_path\": f\"/data/{plate_id}/images/\",\n",
" \"cytotable_meta_offset\": (img_num - 1) * N_CELLS_PER_WELL + obj_num,\n",
" \"cytotable_meta_rownum\": obj_num,\n",
" # ── Feature columns ─────────────────────────────────────\n",
" \"Cells_AreaShape_Area\": cell_areas[obj_num - 1],\n",
" \"Cells_AreaShape_BoundingBoxArea\": cell_areas[obj_num - 1] * 1.3\n",
" + rng.normal(0, 4),\n",
" \"Cells_AreaShape_EulerNumber\": 1,\n",
" \"Cells_AreaShape_Eccentricity\": float(\n",
" np.clip(rng.normal(0.55, 0.12), 0, 1)\n",
" ),\n",
" \"Cells_Intensity_MeanIntensity_Mito\": rng.normal(0.30, 0.06),\n",
" \"Cells_Texture_Correlation_RNA_3_0_256\": rng.normal(0.22, 0.06),\n",
" \"Cytoplasm_AreaShape_Area\": rng.normal(310, 80),\n",
" \"Cytoplasm_Intensity_MeanIntensity_AGP\": rng.normal(0.25, 0.07),\n",
" \"Nuclei_AreaShape_Area\": nuclei_areas[obj_num - 1],\n",
" \"Nuclei_AreaShape_Eccentricity\": float(\n",
" np.clip(rng.normal(0.40, 0.10), 0, 1)\n",
" ),\n",
" \"Nuclei_Intensity_MeanIntensity_DNA\": rng.normal(0.50, 0.08),\n",
" \"Nuclei_Intensity_MassDisplacement_DNA\": abs(rng.normal(6, 4)),\n",
" })\n",
" return pd.DataFrame(rows)\n",
"\n",
"\n",
"# ── Simulate coSMicQC annotation output (qc.parquet) ──────────────────────\n",
"#\n",
"# coSMicQC label_outliers(...) flags outliers using signed z-score thresholds\n",
"# and writes a compact annotation file with Metadata_cqc_* boolean columns.\n",
"# Here we reproduce that schema directly.\n",
"\n",
"\n",
"def simulate_qc_parquet(sc_df: pd.DataFrame) -> pd.DataFrame:\n",
" \"\"\"Reproduce the annotation schema produced by coSMicQC label_outliers().\"\"\"\n",
"\n",
" join_keys = [\n",
" \"Metadata_Plate\",\n",
" \"Metadata_Well\",\n",
" \"Metadata_ImageNumber\",\n",
" \"Metadata_ObjectNumber\",\n",
" ]\n",
"\n",
" qc = sc_df[join_keys].copy()\n",
"\n",
" nuc_area = sc_df[\"Nuclei_AreaShape_Area\"]\n",
" nuc_z = (nuc_area - nuc_area.mean()) / nuc_area.std()\n",
"\n",
" mass_disp = sc_df[\"Nuclei_Intensity_MassDisplacement_DNA\"]\n",
" mass_disp_z = (mass_disp - mass_disp.mean()) / mass_disp.std()\n",
"\n",
" qc[\"Metadata_cqc_large_nuclear_size_is_outlier\"] = nuc_z > 2.5\n",
" qc[\"Metadata_cqc_small_nuclear_size_is_outlier\"] = nuc_z < -2.5\n",
" qc[\"Metadata_cqc_poor_segmentation_is_outlier\"] = mass_disp_z > 2.5\n",
"\n",
" return qc\n",
"\n",
"\n",
"# Build two plates, concatenate, then write both files to disk\n",
"plate1 = simulate_cytotable(\"Plate_1\")\n",
"plate2 = simulate_cytotable(\"Plate_2\")\n",
"single_cells_raw = pd.concat([plate1, plate2], ignore_index=True)\n",
"\n",
"qc_annotations_raw = simulate_qc_parquet(single_cells_raw)\n",
"\n",
"sc_path = tmp_dir / \"single_cells.parquet\"\n",
"qc_path = tmp_dir / \"qc.parquet\"\n",
"single_cells_raw.to_parquet(sc_path, index=False)\n",
"qc_annotations_raw.to_parquet(qc_path, index=False)\n",
"\n",
"print(\n",
" f\"single_cells.parquet {single_cells_raw.shape[0]:,} rows x {single_cells_raw.shape[1]} cols\"\n",
")\n",
"print(\n",
" f\"qc.parquet {qc_annotations_raw.shape[0]:,} rows x {qc_annotations_raw.shape[1]} cols\"\n",
")\n",
"print(f\"\\nqc.parquet columns: {list(qc_annotations_raw.columns)}\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "cell-load-inspect",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-01T19:34:59.854384Z",
"iopub.status.busy": "2026-06-01T19:34:59.854273Z",
"iopub.status.idle": "2026-06-01T19:34:59.988267Z",
"shell.execute_reply": "2026-06-01T19:34:59.987891Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded 1,200 single cells across 2 plates and 6 unique wells\n",
"\n",
"Feature columns (12): ['Cells_AreaShape_Area', 'Cells_AreaShape_BoundingBoxArea', 'Cells_AreaShape_EulerNumber', 'Cells_AreaShape_Eccentricity', 'Cells_Intensity_MeanIntensity_Mito', 'Cells_Texture_Correlation_RNA_3_0_256', 'Cytoplasm_AreaShape_Area', 'Cytoplasm_Intensity_MeanIntensity_AGP', 'Nuclei_AreaShape_Area', 'Nuclei_AreaShape_Eccentricity', 'Nuclei_Intensity_MeanIntensity_DNA', 'Nuclei_Intensity_MassDisplacement_DNA']\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Metadata_Plate | \n",
" Metadata_Well | \n",
" Metadata_ImageNumber | \n",
" Metadata_ObjectNumber | \n",
" cytotable_meta_source_path | \n",
" cytotable_meta_offset | \n",
" cytotable_meta_rownum | \n",
" Cells_AreaShape_Area | \n",
" Cells_AreaShape_BoundingBoxArea | \n",
" Cells_AreaShape_EulerNumber | \n",
" Cells_AreaShape_Eccentricity | \n",
" Cells_Intensity_MeanIntensity_Mito | \n",
" Cells_Texture_Correlation_RNA_3_0_256 | \n",
" Cytoplasm_AreaShape_Area | \n",
" Cytoplasm_Intensity_MeanIntensity_AGP | \n",
" Nuclei_AreaShape_Area | \n",
" Nuclei_AreaShape_Eccentricity | \n",
" Nuclei_Intensity_MeanIntensity_DNA | \n",
" Nuclei_Intensity_MassDisplacement_DNA | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Plate_1 | \n",
" B02 | \n",
" 1 | \n",
" 1 | \n",
" /data/Plate_1/images/ | \n",
" 1 | \n",
" 1 | \n",
" 536.566050 | \n",
" 698.886163 | \n",
" 1 | \n",
" 0.718898 | \n",
" 0.305435 | \n",
" 0.258636 | \n",
" 145.986232 | \n",
" 0.246590 | \n",
" 174.201060 | \n",
" 0.315677 | \n",
" 0.402495 | \n",
" 2.487391 | \n",
"
\n",
" \n",
" | 1 | \n",
" Plate_1 | \n",
" B02 | \n",
" 1 | \n",
" 2 | \n",
" /data/Plate_1/images/ | \n",
" 2 | \n",
" 2 | \n",
" 375.201907 | \n",
" 486.425986 | \n",
" 1 | \n",
" 0.659908 | \n",
" 0.220416 | \n",
" 0.221838 | \n",
" 271.266445 | \n",
" 0.227063 | \n",
" 266.457556 | \n",
" 0.500276 | \n",
" 0.543049 | \n",
" 11.349592 | \n",
"
\n",
" \n",
" | 2 | \n",
" Plate_1 | \n",
" B02 | \n",
" 1 | \n",
" 3 | \n",
" /data/Plate_1/images/ | \n",
" 3 | \n",
" 3 | \n",
" 590.054143 | \n",
" 766.452364 | \n",
" 1 | \n",
" 0.466487 | \n",
" 0.286568 | \n",
" 0.234550 | \n",
" 324.125869 | \n",
" 0.174093 | \n",
" 175.405482 | \n",
" 0.409049 | \n",
" 0.518258 | \n",
" 16.069896 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Metadata_Plate Metadata_Well Metadata_ImageNumber Metadata_ObjectNumber \\\n",
"0 Plate_1 B02 1 1 \n",
"1 Plate_1 B02 1 2 \n",
"2 Plate_1 B02 1 3 \n",
"\n",
" cytotable_meta_source_path cytotable_meta_offset cytotable_meta_rownum \\\n",
"0 /data/Plate_1/images/ 1 1 \n",
"1 /data/Plate_1/images/ 2 2 \n",
"2 /data/Plate_1/images/ 3 3 \n",
"\n",
" Cells_AreaShape_Area Cells_AreaShape_BoundingBoxArea \\\n",
"0 536.566050 698.886163 \n",
"1 375.201907 486.425986 \n",
"2 590.054143 766.452364 \n",
"\n",
" Cells_AreaShape_EulerNumber Cells_AreaShape_Eccentricity \\\n",
"0 1 0.718898 \n",
"1 1 0.659908 \n",
"2 1 0.466487 \n",
"\n",
" Cells_Intensity_MeanIntensity_Mito Cells_Texture_Correlation_RNA_3_0_256 \\\n",
"0 0.305435 0.258636 \n",
"1 0.220416 0.221838 \n",
"2 0.286568 0.234550 \n",
"\n",
" Cytoplasm_AreaShape_Area Cytoplasm_Intensity_MeanIntensity_AGP \\\n",
"0 145.986232 0.246590 \n",
"1 271.266445 0.227063 \n",
"2 324.125869 0.174093 \n",
"\n",
" Nuclei_AreaShape_Area Nuclei_AreaShape_Eccentricity \\\n",
"0 174.201060 0.315677 \n",
"1 266.457556 0.500276 \n",
"2 175.405482 0.409049 \n",
"\n",
" Nuclei_Intensity_MeanIntensity_DNA Nuclei_Intensity_MassDisplacement_DNA \n",
"0 0.402495 2.487391 \n",
"1 0.543049 11.349592 \n",
"2 0.518258 16.069896 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load the CytoTable parquet from disk\n",
"single_cells = pd.read_parquet(sc_path)\n",
"\n",
"print(\n",
" f\"Loaded {len(single_cells):,} single cells across \"\n",
" f\"{single_cells['Metadata_Plate'].nunique()} plates and \"\n",
" f\"{single_cells['Metadata_Well'].nunique()} unique wells\"\n",
")\n",
"print(\n",
" f\"\\nFeature columns ({len([c for c in single_cells.columns if not c.startswith('Metadata_') and not c.startswith('cytotable_')])}): \"\n",
" f\"{[c for c in single_cells.columns if not c.startswith('Metadata_') and not c.startswith('cytotable_')]}\"\n",
")\n",
"single_cells.head(3)"
]
},
{
"cell_type": "markdown",
"id": "cell-cosmicqc-intro",
"metadata": {},
"source": [
"## Background: Single-cell quality control with coSMicQC [Optional]\n",
"\n",
"[coSMicQC](https://github.com/WayScience/coSMicQC) ([GitHub](https://github.com/WayScience/coSMicQC) | [docs](https://cytomining.github.io/coSMicQC/) | [preprint](https://www.biorxiv.org/content/10.1101/2025.10.14.682427v1)) is a Python package from the Way Lab that systematically identifies segmentation artifacts, for example:\n",
"\n",
"| Artifact | Morphological signature | Biological cause |\n",
"|---|---|---|\n",
"| **Debris / background** | Very small nucleus; low DNA intensity | Out-of-focus plane, dust on coverslip |\n",
"| **Over-segmented nucleus** | Nucleus area far above the population mean | One nucleus split into multiple objects |\n",
"| **Touching / fused cells** | Very high mass displacement from multiple objects | Adjacent cells merged into a single object |\n",
"\n",
"### How coSMicQC flags outliers\n",
"\n",
"coSMicQC computes a **z-score** for each quality-relevant feature across the entire experiment. Cells whose z-scores fall outside user-defined thresholds are flagged as outliers. Thresholds are **signed**:\n",
"\n",
"- A **negative threshold** (e.g. `−2.5`) flags cells where the feature is *unusually small* (debris, broken nuclei).\n",
"- A **positive threshold** (e.g. `+2.5`) flags cells where the feature is *unusually large* (fused or over-segmented objects).\n",
"\n",
"The main entry point is `label_outliers()`, which accepts a dictionary of **named QC conditions**. Each condition name becomes part of the output column name, making the reason for each flag explicit and auditable:\n",
"\n",
"```python\n",
"import cosmicqc\n",
"\n",
"labeled = cosmicqc.label_outliers(\n",
" df=single_cells,\n",
" feature_thresholds={\n",
" # Flag nuclei that are too small (debris)\n",
" \"small_nuclear_size\": {\n",
" \"Nuclei_AreaShape_Area\": -2.5,\n",
" },\n",
" # Flag nuclei that are too large (over-segmented)\n",
" \"large_nuclear_size\": {\n",
" \"Nuclei_AreaShape_Area\": 2.5,\n",
" },\n",
" # Flag cells with an abnormally high nuclear mass displacement\n",
" # (a hallmark of touching or merged nuclei in one object)\n",
" \"poor_segmentation\": {\n",
" \"Nuclei_Intensity_MassDisplacement_DNA\": 2.5,\n",
" },\n",
" },\n",
" include_threshold_scores=True, # also write z-score columns for auditing\n",
" export_path=\"qc.parquet\",\n",
" export_as_annotations=True, # write compact annotation file only\n",
" annotation_metadata_columns=[\n",
" \"Metadata_Plate\", \"Metadata_Well\",\n",
" \"Metadata_ImageNumber\", \"Metadata_ObjectNumber\",\n",
" ],\n",
")\n",
"```\n",
"\n",
"### The `qc.parquet` annotation file\n",
"\n",
"When `export_as_annotations=True`, coSMicQC writes a **compact annotation file** called `qc.parquet`, which contains only the join-key metadata columns and the `Metadata_cqc_*` flag columns (not the full feature table). This makes `qc.parquet` lightweight and easy to share independently of the raw single-cell data.\n",
"\n",
"Each `Metadata_cqc__is_outlier` column is a boolean: `True` = flagged, `False` = passes that QC check. A cell must pass **all** conditions to be included in downstream analysis."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "cell-apply-qc",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-01T19:34:59.989678Z",
"iopub.status.busy": "2026-06-01T19:34:59.989573Z",
"iopub.status.idle": "2026-06-01T19:34:59.994761Z",
"shell.execute_reply": "2026-06-01T19:34:59.994475Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"coSMicQC annotation columns:\n",
" Metadata_Plate\n",
" Metadata_Well\n",
" Metadata_ImageNumber\n",
" Metadata_ObjectNumber\n",
" Metadata_cqc_large_nuclear_size_is_outlier\n",
" Metadata_cqc_small_nuclear_size_is_outlier\n",
" Metadata_cqc_poor_segmentation_is_outlier\n",
"\n",
" Metadata_cqc_large_nuclear_size_is_outlier: 5 cells flagged (0.4%)\n",
" Metadata_cqc_small_nuclear_size_is_outlier: 9 cells flagged (0.8%)\n",
" Metadata_cqc_poor_segmentation_is_outlier: 12 cells flagged (1.0%)\n"
]
}
],
"source": [
"# Load the coSMicQC annotation file and inspect its contents\n",
"qc_annotations = pd.read_parquet(qc_path)\n",
"\n",
"print(\"coSMicQC annotation columns:\")\n",
"for col in qc_annotations.columns:\n",
" print(f\" {col}\")\n",
"\n",
"outlier_cols = [c for c in qc_annotations.columns if c.endswith(\"_is_outlier\")]\n",
"print()\n",
"for col in outlier_cols:\n",
" n_flagged = qc_annotations[col].sum()\n",
" print(\n",
" f\" {col}: {n_flagged:,} cells flagged ({100 * n_flagged / len(qc_annotations):.1f}%)\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "cell-annotate-intro",
"metadata": {},
"source": [
"---\n",
"\n",
"## Step 1: Annotate\n",
"\n",
"`annotate()` does two jobs at once via its `external_metadata` parameter:\n",
"\n",
"1. **Plate-map join** attaches the biological condition (treatment, cell line, concentration) recorded for each well to every cell in that well.\n",
"2. **External metadata merge** merges any additional per-cell metadata DataFrame or file. The most common use case is a `qc.parquet` file from coSMicQC: passing it as `external_metadata` adds the `Metadata_cqc_*` flag columns directly to the annotated profiles.\n",
"\n",
"| Parameter | Description |\n",
"|---|---|\n",
"| `platemap` | Maps well positions to treatment conditions |\n",
"| `join_on` | Column pair `[platemap_col, profiles_col]` for the well-position join |\n",
"| `external_metadata` | Path to `qc.parquet` (or any additional metadata DataFrame) |\n",
"| `external_join_on` | Column(s) shared by profiles and external metadata (here the four-part cell identity key) |\n",
"\n",
"After `annotate()` runs, the `Metadata_cqc_*` flag columns are present on every row and flow straight into `normalize()`, which applies the QC filter internally via `drop_cosmicqc_rows=True`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "cell-platemap",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-01T19:34:59.996077Z",
"iopub.status.busy": "2026-06-01T19:34:59.995987Z",
"iopub.status.idle": "2026-06-01T19:35:00.000086Z",
"shell.execute_reply": "2026-06-01T19:34:59.999817Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" well_position | \n",
" treatment | \n",
" cell_line | \n",
" concentration_um | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" B02 | \n",
" DMSO | \n",
" HeLa | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 1 | \n",
" C02 | \n",
" DMSO | \n",
" HeLa | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 2 | \n",
" B03 | \n",
" Compound_A | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
" | 3 | \n",
" C03 | \n",
" Compound_A | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
" | 4 | \n",
" B04 | \n",
" Compound_B | \n",
" HeLa | \n",
" 5.0 | \n",
"
\n",
" \n",
" | 5 | \n",
" C04 | \n",
" Compound_B | \n",
" HeLa | \n",
" 5.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" well_position treatment cell_line concentration_um\n",
"0 B02 DMSO HeLa 0.0\n",
"1 C02 DMSO HeLa 0.0\n",
"2 B03 Compound_A HeLa 10.0\n",
"3 C03 Compound_A HeLa 10.0\n",
"4 B04 Compound_B HeLa 5.0\n",
"5 C04 Compound_B HeLa 5.0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"platemap = pd.DataFrame({\n",
" \"well_position\": [\"B02\", \"C02\", \"B03\", \"C03\", \"B04\", \"C04\"],\n",
" \"treatment\": [\n",
" \"DMSO\",\n",
" \"DMSO\",\n",
" \"Compound_A\",\n",
" \"Compound_A\",\n",
" \"Compound_B\",\n",
" \"Compound_B\",\n",
" ],\n",
" \"cell_line\": [\"HeLa\"] * 6,\n",
" \"concentration_um\": [0.0, 0.0, 10.0, 10.0, 5.0, 5.0],\n",
"})\n",
"platemap"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cell-annotate",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-01T19:35:00.001369Z",
"iopub.status.busy": "2026-06-01T19:35:00.001281Z",
"iopub.status.idle": "2026-06-01T19:35:00.018843Z",
"shell.execute_reply": "2026-06-01T19:35:00.018552Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"New columns: ['Metadata_treatment', 'Metadata_cell_line', 'Metadata_concentration_um', 'Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']\n",
"QC flag columns: ['Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']\n",
"\n",
"Cells flagged by any QC condition: 26 (2.2%)\n",
"\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Metadata_treatment | \n",
" Metadata_cell_line | \n",
" Metadata_concentration_um | \n",
" Metadata_Plate | \n",
" Metadata_Well | \n",
" Metadata_ImageNumber | \n",
" Metadata_ObjectNumber | \n",
" Metadata_cqc_large_nuclear_size_is_outlier | \n",
" Metadata_cqc_small_nuclear_size_is_outlier | \n",
" Metadata_cqc_poor_segmentation_is_outlier | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" DMSO | \n",
" HeLa | \n",
" 0.0 | \n",
" Plate_1 | \n",
" B02 | \n",
" 1 | \n",
" 1 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" | 200 | \n",
" DMSO | \n",
" HeLa | \n",
" 0.0 | \n",
" Plate_1 | \n",
" C02 | \n",
" 2 | \n",
" 1 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" | 400 | \n",
" Compound_A | \n",
" HeLa | \n",
" 10.0 | \n",
" Plate_1 | \n",
" B03 | \n",
" 3 | \n",
" 1 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" | 600 | \n",
" Compound_A | \n",
" HeLa | \n",
" 10.0 | \n",
" Plate_1 | \n",
" C03 | \n",
" 4 | \n",
" 1 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" | 800 | \n",
" Compound_B | \n",
" HeLa | \n",
" 5.0 | \n",
" Plate_1 | \n",
" B04 | \n",
" 5 | \n",
" 1 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Metadata_treatment Metadata_cell_line Metadata_concentration_um \\\n",
"0 DMSO HeLa 0.0 \n",
"200 DMSO HeLa 0.0 \n",
"400 Compound_A HeLa 10.0 \n",
"600 Compound_A HeLa 10.0 \n",
"800 Compound_B HeLa 5.0 \n",
"\n",
" Metadata_Plate Metadata_Well Metadata_ImageNumber Metadata_ObjectNumber \\\n",
"0 Plate_1 B02 1 1 \n",
"200 Plate_1 C02 2 1 \n",
"400 Plate_1 B03 3 1 \n",
"600 Plate_1 C03 4 1 \n",
"800 Plate_1 B04 5 1 \n",
"\n",
" Metadata_cqc_large_nuclear_size_is_outlier \\\n",
"0 False \n",
"200 False \n",
"400 False \n",
"600 False \n",
"800 False \n",
"\n",
" Metadata_cqc_small_nuclear_size_is_outlier \\\n",
"0 False \n",
"200 False \n",
"400 False \n",
"600 False \n",
"800 False \n",
"\n",
" Metadata_cqc_poor_segmentation_is_outlier \n",
"0 False \n",
"200 False \n",
"400 False \n",
"600 False \n",
"800 False "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"join_keys = [\n",
" \"Metadata_Plate\",\n",
" \"Metadata_Well\",\n",
" \"Metadata_ImageNumber\",\n",
" \"Metadata_ObjectNumber\",\n",
"]\n",
"\n",
"# annotate() merges the plate map AND the QC annotation file in a single call.\n",
"# The qc.parquet columns already carry the Metadata_ prefix, so they pass through\n",
"# prepare_external_metadata_for_annotate() unchanged.\n",
"annotated_cells = annotate(\n",
" profiles=single_cells,\n",
" platemap=platemap,\n",
" join_on=[\"Metadata_well_position\", \"Metadata_Well\"],\n",
" add_metadata_id_to_platemap=True,\n",
" external_metadata=str(qc_path),\n",
" external_join_on=join_keys,\n",
")\n",
"\n",
"new_cols = [c for c in annotated_cells.columns if c not in single_cells.columns]\n",
"qc_cols = [c for c in new_cols if \"cqc\" in c]\n",
"print(f\"New columns: {new_cols}\")\n",
"print(f\"QC flag columns: {qc_cols}\")\n",
"print(\n",
" f\"\\nCells flagged by any QC condition: \"\n",
" f\"{annotated_cells[qc_cols].any(axis=1).sum():,} \"\n",
" f\"({100 * annotated_cells[qc_cols].any(axis=1).mean():.1f}%)\"\n",
")\n",
"print()\n",
"annotated_cells[\n",
" [c for c in annotated_cells.columns if c.startswith(\"Metadata_\")]\n",
"].drop_duplicates(subset=[\"Metadata_Well\"]).head()"
]
},
{
"cell_type": "markdown",
"id": "cell-normalize-intro",
"metadata": {},
"source": [
"---\n",
"\n",
"## Step 2: Normalize\n",
"\n",
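"Raw CellProfiler features span very different scales (cell area is measured in pixels², intensities fall between 0 and 1) and are influenced by plate-to-plate technical effects. Normalization puts every feature on a common scale by z-scoring it against the **DMSO control cells**: the control mean and standard deviation are estimated from the DMSO cells only and then applied to every cell, so a value of 0 matches the average control cell. Run plate by plate, this also limits plate-to-plate variation.\n",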
"\n",
"Passing `drop_cosmicqc_rows=True` tells `normalize()` to drop every row where any `Metadata_cqc_*` flag is `True` before computing the z-scores, so QC filtering and normalization happen in a single call."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "cell-normalize",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-01T19:35:00.020223Z",
"iopub.status.busy": "2026-06-01T19:35:00.020114Z",
"iopub.status.idle": "2026-06-01T19:35:00.034092Z",
"shell.execute_reply": "2026-06-01T19:35:00.033754Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total cells 1,200\n",
"Removed (QC outliers) 26 (2.2%)\n",
"Retained 1,174\n",
"\n",
"Normalized shape: (1174, 22)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Metadata_treatment</th>\n",
"      <th>Metadata_cell_line</th>\n",
"      <th>Metadata_concentration_um</th>\n",
"      <th>Metadata_Plate</th>\n",
"      <th>Metadata_Well</th>\n",
"      <th>Metadata_ImageNumber</th>\n",
"      <th>Metadata_ObjectNumber</th>\n",
"      <th>Metadata_cqc_large_nuclear_size_is_outlier</th>\n",
"      <th>Metadata_cqc_small_nuclear_size_is_outlier</th>\n",
"      <th>Metadata_cqc_poor_segmentation_is_outlier</th>\n",
"      <th>...</th>\n",
"      <th>Cells_AreaShape_EulerNumber</th>\n",
"      <th>Cells_AreaShape_Eccentricity</th>\n",
"      <th>Cells_Intensity_MeanIntensity_Mito</th>\n",
"      <th>Cells_Texture_Correlation_RNA_3_0_256</th>\n",
"      <th>Cytoplasm_AreaShape_Area</th>\n",
"      <th>Cytoplasm_Intensity_MeanIntensity_AGP</th>\n",
"      <th>Nuclei_AreaShape_Area</th>\n",
"      <th>Nuclei_AreaShape_Eccentricity</th>\n",
"      <th>Nuclei_Intensity_MeanIntensity_DNA</th>\n",
"      <th>Nuclei_Intensity_MassDisplacement_DNA</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>DMSO</td>\n",
"      <td>HeLa</td>\n",
"      <td>0.0</td>\n",
"      <td>Plate_1</td>\n",
"      <td>B02</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>...</td>\n",
"      <td>0.0</td>\n",
"      <td>1.292772</td>\n",
"      <td>0.100004</td>\n",
"      <td>0.733196</td>\n",
"      <td>-2.107301</td>\n",
"      <td>-0.009458</td>\n",
"      <td>-0.380694</td>\n",
"      <td>-0.907252</td>\n",
"      <td>-1.166556</td>\n",
"      <td>-1.028890</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>DMSO</td>\n",
"      <td>HeLa</td>\n",
"      <td>0.0</td>\n",
"      <td>Plate_1</td>\n",
"      <td>B02</td>\n",
"      <td>1</td>\n",
"      <td>2</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>...</td>\n",
"      <td>0.0</td>\n",
"      <td>0.819530</td>\n",
"      <td>-1.325156</td>\n",
"      <td>0.121467</td>\n",
"      <td>-0.481165</td>\n",
"      <td>-0.293417</td>\n",
"      <td>1.319231</td>\n",
"      <td>1.025113</td>\n",
"      <td>0.607217</td>\n",
"      <td>1.555226</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>DMSO</td>\n",
"      <td>HeLa</td>\n",
"      <td>0.0</td>\n",
"      <td>Plate_1</td>\n",
"      <td>B02</td>\n",
"      <td>1</td>\n",
"      <td>4</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>...</td>\n",
"      <td>0.0</td>\n",
"      <td>-0.883623</td>\n",
"      <td>-0.280147</td>\n",
"      <td>-1.368761</td>\n",
"      <td>-0.591794</td>\n",
"      <td>0.361401</td>\n",
"      <td>0.749972</td>\n",
"      <td>1.237712</td>\n",
"      <td>-0.672132</td>\n",
"      <td>-0.767620</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>3 rows × 22 columns</p>\n",
"</div>"
],
"text/plain": [
" Metadata_treatment Metadata_cell_line Metadata_concentration_um \\\n",
"0 DMSO HeLa 0.0 \n",
"1 DMSO HeLa 0.0 \n",
"3 DMSO HeLa 0.0 \n",
"\n",
" Metadata_Plate Metadata_Well Metadata_ImageNumber Metadata_ObjectNumber \\\n",
"0 Plate_1 B02 1 1 \n",
"1 Plate_1 B02 1 2 \n",
"3 Plate_1 B02 1 4 \n",
"\n",
" Metadata_cqc_large_nuclear_size_is_outlier \\\n",
"0 False \n",
"1 False \n",
"3 False \n",
"\n",
" Metadata_cqc_small_nuclear_size_is_outlier \\\n",
"0 False \n",
"1 False \n",
"3 False \n",
"\n",
" Metadata_cqc_poor_segmentation_is_outlier ... \\\n",
"0 False ... \n",
"1 False ... \n",
"3 False ... \n",
"\n",
" Cells_AreaShape_EulerNumber Cells_AreaShape_Eccentricity \\\n",
"0 0.0 1.292772 \n",
"1 0.0 0.819530 \n",
"3 0.0 -0.883623 \n",
"\n",
" Cells_Intensity_MeanIntensity_Mito Cells_Texture_Correlation_RNA_3_0_256 \\\n",
"0 0.100004 0.733196 \n",
"1 -1.325156 0.121467 \n",
"3 -0.280147 -1.368761 \n",
"\n",
" Cytoplasm_AreaShape_Area Cytoplasm_Intensity_MeanIntensity_AGP \\\n",
"0 -2.107301 -0.009458 \n",
"1 -0.481165 -0.293417 \n",
"3 -0.591794 0.361401 \n",
"\n",
" Nuclei_AreaShape_Area Nuclei_AreaShape_Eccentricity \\\n",
"0 -0.380694 -0.907252 \n",
"1 1.319231 1.025113 \n",
"3 0.749972 1.237712 \n",
"\n",
" Nuclei_Intensity_MeanIntensity_DNA Nuclei_Intensity_MassDisplacement_DNA \n",
"0 -1.166556 -1.028890 \n",
"1 0.607217 1.555226 \n",
"3 -0.672132 -0.767620 \n",
"\n",
"[3 rows x 22 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# drop_cosmicqc_rows=True removes QC-flagged cells before z-scoring.\n",
"normalized_cells = normalize(\n",
" profiles=annotated_cells,\n",
" features=\"infer\",\n",
" meta_features=\"infer\",\n",
" samples=\"Metadata_treatment == 'DMSO'\",\n",
" method=\"standardize\",\n",
" drop_cosmicqc_rows=True,\n",
")\n",
"\n",
"n_removed = len(annotated_cells) - len(normalized_cells)\n",
"print(f\"{'Total cells':<22} {len(annotated_cells):>6,}\")\n",
"print(\n",
" f\"{'Removed (QC outliers)':<22} {n_removed:>6,} ({100 * n_removed / len(annotated_cells):.1f}%)\"\n",
")\n",
"print(f\"{'Retained':<22} {len(normalized_cells):>6,}\")\n",
"print()\n",
"print(f\"Normalized shape: {normalized_cells.shape}\")\n",
"normalized_cells.head(3)"
]
},
{
"cell_type": "markdown",
"id": "cell-featsel-intro",
"metadata": {},
"source": [
"---\n",
"\n",
"## Step 3: Feature Selection\n",
"\n",
"Even after QC and normalization, some features carry little information:\n",
"\n",
"- **Low-variance features** are nearly constant across all cells and cannot distinguish biological conditions.\n",
"- **Highly correlated feature pairs** are redundant; keeping both double-weights that axis of variation in clustering and embeddings.\n",
"- **Blocklisted features** are known to capture image artifacts rather than cell biology.\n",
"\n",
"`feature_select()` applies all three filters, producing a lean feature matrix ready for single-cell analyses such as UMAP or hierarchical clustering."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "cell-featsel",
"metadata": {
"execution": {
"iopub.execute_input": "2026-06-01T19:35:00.035826Z",
"iopub.status.busy": "2026-06-01T19:35:00.035694Z",
"iopub.status.idle": "2026-06-01T19:35:00.055266Z",
"shell.execute_reply": "2026-06-01T19:35:00.054966Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Features before selection: 12\n",
"Features after selection: 10\n",
"Features removed: {'Cells_AreaShape_Area', 'Cells_AreaShape_EulerNumber'}\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Metadata_treatment</th>\n",
"      <th>Metadata_cell_line</th>\n",
"      <th>Metadata_concentration_um</th>\n",
"      <th>Metadata_Plate</th>\n",
"      <th>Metadata_Well</th>\n",
"      <th>Metadata_ImageNumber</th>\n",
"      <th>Metadata_ObjectNumber</th>\n",
"      <th>Metadata_cqc_large_nuclear_size_is_outlier</th>\n",
"      <th>Metadata_cqc_small_nuclear_size_is_outlier</th>\n",
"      <th>Metadata_cqc_poor_segmentation_is_outlier</th>\n",
"      <th>Cells_AreaShape_BoundingBoxArea</th>\n",
"      <th>Cells_AreaShape_Eccentricity</th>\n",
"      <th>Cells_Intensity_MeanIntensity_Mito</th>\n",
"      <th>Cells_Texture_Correlation_RNA_3_0_256</th>\n",
"      <th>Cytoplasm_AreaShape_Area</th>\n",
"      <th>Cytoplasm_Intensity_MeanIntensity_AGP</th>\n",
"      <th>Nuclei_AreaShape_Area</th>\n",
"      <th>Nuclei_AreaShape_Eccentricity</th>\n",
"      <th>Nuclei_Intensity_MeanIntensity_DNA</th>\n",
"      <th>Nuclei_Intensity_MassDisplacement_DNA</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>DMSO</td>\n",
"      <td>HeLa</td>\n",
"      <td>0.0</td>\n",
"      <td>Plate_1</td>\n",
"      <td>B02</td>\n",
"      <td>1</td>\n",
"      <td>1</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>0.423873</td>\n",
"      <td>1.292772</td>\n",
"      <td>0.100004</td>\n",
"      <td>0.733196</td>\n",
"      <td>-2.107301</td>\n",
"      <td>-0.009458</td>\n",
"      <td>-0.380694</td>\n",
"      <td>-0.907252</td>\n",
"      <td>-1.166556</td>\n",
"      <td>-1.028890</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>DMSO</td>\n",
"      <td>HeLa</td>\n",
"      <td>0.0</td>\n",
"      <td>Plate_1</td>\n",
"      <td>B02</td>\n",
"      <td>1</td>\n",
"      <td>2</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>-1.020114</td>\n",
"      <td>0.819530</td>\n",
"      <td>-1.325156</td>\n",
"      <td>0.121467</td>\n",
"      <td>-0.481165</td>\n",
"      <td>-0.293417</td>\n",
"      <td>1.319231</td>\n",
"      <td>1.025113</td>\n",
"      <td>0.607217</td>\n",
"      <td>1.555226</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>DMSO</td>\n",
"      <td>HeLa</td>\n",
"      <td>0.0</td>\n",
"      <td>Plate_1</td>\n",
"      <td>B02</td>\n",
"      <td>1</td>\n",
"      <td>4</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>1.139881</td>\n",
"      <td>-0.883623</td>\n",
"      <td>-0.280147</td>\n",
"      <td>-1.368761</td>\n",
"      <td>-0.591794</td>\n",
"      <td>0.361401</td>\n",
"      <td>0.749972</td>\n",
"      <td>1.237712</td>\n",
"      <td>-0.672132</td>\n",
"      <td>-0.767620</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Metadata_treatment Metadata_cell_line Metadata_concentration_um \\\n",
"0 DMSO HeLa 0.0 \n",
"1 DMSO HeLa 0.0 \n",
"3 DMSO HeLa 0.0 \n",
"\n",
" Metadata_Plate Metadata_Well Metadata_ImageNumber Metadata_ObjectNumber \\\n",
"0 Plate_1 B02 1 1 \n",
"1 Plate_1 B02 1 2 \n",
"3 Plate_1 B02 1 4 \n",
"\n",
" Metadata_cqc_large_nuclear_size_is_outlier \\\n",
"0 False \n",
"1 False \n",
"3 False \n",
"\n",
" Metadata_cqc_small_nuclear_size_is_outlier \\\n",
"0 False \n",
"1 False \n",
"3 False \n",
"\n",
" Metadata_cqc_poor_segmentation_is_outlier Cells_AreaShape_BoundingBoxArea \\\n",
"0 False 0.423873 \n",
"1 False -1.020114 \n",
"3 False 1.139881 \n",
"\n",
" Cells_AreaShape_Eccentricity Cells_Intensity_MeanIntensity_Mito \\\n",
"0 1.292772 0.100004 \n",
"1 0.819530 -1.325156 \n",
"3 -0.883623 -0.280147 \n",
"\n",
" Cells_Texture_Correlation_RNA_3_0_256 Cytoplasm_AreaShape_Area \\\n",
"0 0.733196 -2.107301 \n",
"1 0.121467 -0.481165 \n",
"3 -1.368761 -0.591794 \n",
"\n",
" Cytoplasm_Intensity_MeanIntensity_AGP Nuclei_AreaShape_Area \\\n",
"0 -0.009458 -0.380694 \n",
"1 -0.293417 1.319231 \n",
"3 0.361401 0.749972 \n",
"\n",
" Nuclei_AreaShape_Eccentricity Nuclei_Intensity_MeanIntensity_DNA \\\n",
"0 -0.907252 -1.166556 \n",
"1 1.025113 0.607217 \n",
"3 1.237712 -0.672132 \n",
"\n",
" Nuclei_Intensity_MassDisplacement_DNA \n",
"0 -1.028890 \n",
"1 1.555226 \n",
"3 -0.767620 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"selected_cells = feature_select(\n",
" profiles=normalized_cells,\n",
" features=\"infer\",\n",
" operation=[\"variance_threshold\", \"correlation_threshold\", \"blocklist\"],\n",
")\n",
"\n",
"feature_cols_before = [\n",
" c for c in normalized_cells.columns if not c.startswith(\"Metadata_\")\n",
"]\n",
"feature_cols_after = [\n",
" c for c in selected_cells.columns if not c.startswith(\"Metadata_\")\n",
"]\n",
"\n",
"print(f\"Features before selection: {len(feature_cols_before)}\")\n",
"print(f\"Features after selection: {len(feature_cols_after)}\")\n",
"print(f\"Features removed: {set(feature_cols_before) - set(feature_cols_after)}\")\n",
"selected_cells.head(3)"
]
},
{
"cell_type": "markdown",
"id": "cell-summary",
"metadata": {},
"source": [
"---\n",
"\n",
"## Summary\n",
"\n",
"You have processed a CytoTable single-cell dataset through a complete quality-control and normalization pipeline, preserving single-cell resolution throughout:\n",
"\n",
"| Step | Function | Input | Output |\n",
"|---|---|---|---|\n",
"| Load | `pd.read_parquet` | CytoTable Parquet | 1,200 single cells |\n",
"| Annotate | `annotate()` | Single cells + platemap + `qc.parquet` | Cells with treatment labels and QC flags |\n",
"| Normalize | `normalize(drop_cosmicqc_rows=True)` | Annotated cells | 1,174 passing cells, z-scored |\n",
"| Feature select | `feature_select()` | 12 features | 10 features |\n",
"\n",
"The output is a **clean, normalized single-cell feature matrix**, `selected_cells`, where every row is one cell and every non-metadata column is an informative morphological feature.\n",
"\n",
"### Next steps\n",
"\n",
"- **Embed**: run UMAP or t-SNE on `selected_cells` to visualize how treatments separate in morphological space at single-cell resolution.\n",
"- **Cluster**: apply k-means or Leiden clustering to discover subpopulations within each treatment condition.\n",
"- **Aggregate**: feed `selected_cells` into `aggregate()` if you need well-level profiles (e.g. for the consensus pipeline shown in the [Introduction to Image-based Profiling with Pycytominer](introduction_to_pycytominer.ipynb) tutorial).\n",
"- **Hit calling**: identify which compounds produce a statistically significant morphological change relative to controls. [Buscar](https://github.com/WayScience/Buscar) operates directly on single-cell distributions to capture cellular heterogeneity and flag off-target effects."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "qc_env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}