{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-title",
   "metadata": {},
   "source": [
    "# Single-cell image-based profiling\n",
    "\n",
    "## A complete single-cell processing pipeline with Pycytominer\n",
    "\n",
    "High-content microscopy experiments can produce thousands of single-cell\n",
    "measurements per image. Working at single-cell resolution (rather than first\n",
    "aggregating cells into well-level profiles) preserves the full diversity of\n",
    "cellular responses: rare subpopulations, bimodal distributions, and heterogeneous\n",
    "drug effects that vanish in the average.\n",
    "\n",
    "Single-cell profiling introduces a challenge that well-level profiling sidesteps:\n",
    "**not every detected object is a real, well-segmented cell.** Debris, out-of-focus\n",
    "objects, and fused cells contaminate the feature matrix and distort downstream\n",
    "analyses. A quality-control step is therefore essential before dimensionality\n",
    "reduction, clustering, or hit calling.\n",
    "\n",
    "This tutorial walks through a complete single-cell processing pipeline starting\n",
    "from [CytoTable](https://cytomining.github.io/CytoTable/) output.\n",
    "[coSMicQC](https://cytomining.github.io/coSMicQC/) is used here for QC:\n",
    "\n",
    "1. **Load**: read the joined single-cell Parquet file produced by CytoTable\n",
    "2. **Annotate**: attach experimental metadata and QC flags from coSMicQC\n",
    "3. **Normalize**: drop QC outliers and z-score features against DMSO controls\n",
    "4. **Feature select**: drop redundant and uninformative features\n",
    "\n",
    "The result is a clean, normalized single-cell feature matrix ready for\n",
    "dimensionality reduction, clustering, or further aggregation.\n",
    "\n",
 **New to pycytominer?**">
    "> **New to Pycytominer?** Read the\n",
    "> [Introduction to Pycytominer](introduction_to_pycytominer.ipynb) tutorial first.\n",
    "> This tutorial assumes familiarity with the core pipeline steps."
   ]
  },
  {
   "cell_type": "raw",
   "id": "cell-pipeline-diagram",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. mermaid::\n",
    "   :align: center\n",
    "\n",
    "   flowchart TD\n",
    "       cytotable[\"CytoTable output<br/>single_cells.parquet, 1200 cells\"]\n",
    "       qcfile[\"coSMicQC output<br/>qc.parquet, QC annotations\"]\n",
    "       ann[\"annotate()<br/>Add platemap + QC flags\"]\n",
    "       nor[\"normalize()<br/>Drop QC outliers · Z-score vs DMSO\"]\n",
    "       fea[\"feature_select()<br/>Remove redundant features\"]\n",
    "       output[\"Single-cell profiles<br/>~1174 cells, 10 features\"]\n",
    "\n",
    "       cytotable --> ann\n",
    "       qcfile    --> ann\n",
    "       ann --> nor --> fea --> output\n",
    "\n",
    "       style cytotable fill:#f0d9fa,stroke:#88239A,color:#111\n",
    "       style qcfile    fill:#f0d9fa,stroke:#88239A,color:#111\n",
    "       style output    fill:#f0d9fa,stroke:#88239A,color:#111\n",
    "       style ann fill:#ffffff,stroke:#88239A,color:#111\n",
    "       style nor fill:#ffffff,stroke:#88239A,color:#111\n",
    "       style fea fill:#ffffff,stroke:#88239A,color:#111"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-prereqs",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "Install the required packages:\n",
    "\n",
    "```bash\n",
    "pip install pycytominer coSMicQC pyarrow pandas numpy\n",
    "```\n",
    "\n",
    "This tutorial uses **simulated data** that matches the exact schema produced by [CytoTable](https://cytomining.github.io/CytoTable/) and [coSMicQC](https://cytomining.github.io/coSMicQC/). In a real experiment, replace the simulation block with your own `single_cells.parquet` and `qc.parquet` files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "cell-imports",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-01T19:34:58.190044Z",
     "iopub.status.busy": "2026-06-01T19:34:58.189829Z",
     "iopub.status.idle": "2026-06-01T19:34:59.724211Z",
     "shell.execute_reply": "2026-06-01T19:34:59.723901Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Working directory: /var/folders/02/q30k_4wn2dqbz5pj_vvc8xn40000gp/T/tmp57clvnip\n"
     ]
    }
   ],
   "source": [
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from pycytominer import annotate, feature_select, normalize\n",
    "\n",
    "# Reproducible random state used throughout the simulation\n",
    "rng = np.random.default_rng(42)\n",
    "\n",
    "# Temporary directory — stands in for the output directory on your filesystem\n",
    "tmp_dir = Path(tempfile.mkdtemp())\n",
    "print(f\"Working directory: {tmp_dir}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-cytotable-intro",
   "metadata": {},
   "source": [
    "## Input: CytoTable Single-Cell Data\n",
    "\n",
    "[CytoTable](https://cytomining.github.io/CytoTable/) converts CellProfiler SQLite or CSV output into a single analysis-ready Parquet file. Each row represents one segmented object (a cell), and columns fall into three groups:\n",
    "\n",
    "| Group | Example columns | Purpose |\n",
    "|---|---|---|\n",
    "| `Metadata_*` | `Metadata_Plate`, `Metadata_Well`, `Metadata_ImageNumber`, `Metadata_ObjectNumber` | Describe the experiment |\n",
    "| `cytotable_meta_*` | `cytotable_meta_source_path`, `cytotable_meta_offset` | CytoTable provenance. Pycytominer ignores these automatically |\n",
    "| Feature columns | `Cells_AreaShape_Area`, `Nuclei_Intensity_MeanIntensity_DNA` | Morphology measurements per single-cell |\n",
    "\n",
    "`Metadata_ImageNumber` and `Metadata_ObjectNumber` together uniquely identify every cell and serve as the **join key** between the single-cell data and the coSMicQC annotations.\n",
    "\n",
 **Note on `cytotable_meta_*`">
    "> **Note on `cytotable_meta_*` columns:** These provenance columns track source-file offsets for CytoTable's internal bookkeeping. Pycytominer infers features from the CellProfiler compartment prefixes (`Cells_`, `Cytoplasm_`, `Nuclei_`), so it never treats `cytotable_meta_*` columns as features. They pass through `annotate()` unchanged and are dropped at the `normalize()` step.\n",
    "\n",
    "The simulation code is available in the expandable block below. Skip it to go straight to the next step."
   ]
  },
  {
   "cell_type": "raw",
   "id": "cell-simulate-toggle",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    ".. toggle::\n",
    "\n",
    "   In a real experiment these files come from running\n",
    "   `CytoTable <https://cytomining.github.io/CytoTable/>`__ and\n",
    "   `coSMicQC <https://cytomining.github.io/coSMicQC/>`__ on your CellProfiler\n",
    "   output. The functions below reproduce their output schemas using synthetic data.\n",
    "\n",
    "   **Step A — simulate CytoTable single-cell data**\n",
    "\n",
    "   .. code-block:: python\n",
    "\n",
    "      WELLS = {\n",
    "          \"B02\": \"DMSO\",       \"C02\": \"DMSO\",\n",
    "          \"B03\": \"Compound_A\", \"C03\": \"Compound_A\",\n",
    "          \"B04\": \"Compound_B\", \"C04\": \"Compound_B\",\n",
    "      }\n",
    "      N_CELLS_PER_WELL = 100\n",
    "\n",
    "      def simulate_cytotable(plate_id: str) -> pd.DataFrame:\n",
    "          \"\"\"Generate a synthetic CytoTable-style single-cell DataFrame.\"\"\"\n",
    "          rows = []\n",
    "          for img_num, (well, treatment) in enumerate(WELLS.items(), start=1):\n",
    "              is_a = float(treatment == \"Compound_A\")\n",
    "              is_b = float(treatment == \"Compound_B\")\n",
    "              cell_areas   = rng.normal(500 + 180 * is_a - 90 * is_b, 120, N_CELLS_PER_WELL)\n",
    "              nuclei_areas = rng.normal(195, 55, N_CELLS_PER_WELL)\n",
    "              for obj_num in range(1, N_CELLS_PER_WELL + 1):\n",
    "                  rows.append(                {\n",
    "                    # ── CytoTable metadata ──────────────────────────────────\n",
    "                    \"Metadata_Plate\": plate_id,\n",
    "                    \"Metadata_Well\": well,\n",
    "                    \"Metadata_ImageNumber\": img_num,\n",
    "                    \"Metadata_ObjectNumber\": obj_num,\n",
    "                    # CytoTable provenance columns\n",
    "                    \"cytotable_meta_source_path\": f\"/data/{plate_id}/images/\",\n",
    "                    \"cytotable_meta_offset\": (img_num - 1) * N_CELLS_PER_WELL + obj_num,\n",
    "                    \"cytotable_meta_rownum\": obj_num,\n",
    "                    # ── Feature columns ─────────────────────────────────────\n",
    "                    \"Cells_AreaShape_Area\": cell_areas[obj_num - 1],\n",
    "                    \"Cells_AreaShape_BoundingBoxArea\": cell_areas[obj_num - 1] * 1.3\n",
    "                    + rng.normal(0, 4),\n",
    "                    \"Cells_AreaShape_EulerNumber\": 1,\n",
    "                    \"Cells_AreaShape_Eccentricity\": float(\n",
    "                        np.clip(rng.normal(0.55, 0.12), 0, 1)\n",
    "                    ),\n",
    "                    \"Cells_Intensity_MeanIntensity_Mito\": rng.normal(0.30, 0.06),\n",
    "                    \"Cells_Texture_Correlation_RNA_3_0_256\": rng.normal(0.22, 0.06),\n",
    "                    \"Cytoplasm_AreaShape_Area\": rng.normal(310, 80),\n",
    "                    \"Cytoplasm_Intensity_MeanIntensity_AGP\": rng.normal(0.25, 0.07),\n",
    "                    \"Nuclei_AreaShape_Area\": nuclei_areas[obj_num - 1],\n",
    "                    \"Nuclei_AreaShape_Eccentricity\": float(\n",
    "                        np.clip(rng.normal(0.40, 0.10), 0, 1)\n",
    "                    ),\n",
    "                    \"Nuclei_Intensity_MeanIntensity_DNA\": rng.normal(0.50, 0.08),\n",
    "                    \"Nuclei_Intensity_MassDisplacement_DNA\": abs(rng.normal(6, 4)),\n",
    "                })\n",
    "          return pd.DataFrame(rows)\n",
    "\n",
    "   **Step B — simulate coSMicQC QC annotations**\n",
    "\n",
    "   ``label_outliers(..., export_as_annotations=True)`` writes a compact Parquet\n",
    "   with only join-key columns and boolean ``Metadata_cqc_*`` flags.\n",
    "\n",
    "   .. code-block:: python\n",
    "\n",
    "    def simulate_qc_parquet(sc_df: pd.DataFrame) -> pd.DataFrame:\n",
    "        \"\"\"Reproduce the annotation schema produced by coSMicQC label_outliers().\"\"\"\n",
    "\n",
    "        join_keys = [\n",
    "            \"Metadata_Plate\",\n",
    "            \"Metadata_Well\",\n",
    "            \"Metadata_ImageNumber\",\n",
    "            \"Metadata_ObjectNumber\",\n",
    "        ]\n",
    "\n",
    "        qc = sc_df[join_keys].copy()\n",
    "\n",
    "        nuc_area = sc_df[\"Nuclei_AreaShape_Area\"]\n",
    "        nuc_z = (nuc_area - nuc_area.mean()) / nuc_area.std()\n",
    "\n",
    "        mass_disp = sc_df[\"Nuclei_Intensity_MassDisplacement_DNA\"]\n",
    "        mass_disp_z = (mass_disp - mass_disp.mean()) / mass_disp.std()\n",
    "\n",
    "        qc[\"Metadata_cqc_large_nuclear_size_is_outlier\"] = nuc_z > 2.5\n",
    "        qc[\"Metadata_cqc_small_nuclear_size_is_outlier\"] = nuc_z < -2.5\n",
    "        qc[\"Metadata_cqc_poor_segmentation_is_outlier\"] = mass_disp_z > 2.5\n",
    "\n",
    "        return qc\n",
    "\n",
    "   **Step C — build two plates and write to disk**\n",
    "\n",
    "   .. code-block:: python\n",
    "\n",
    "      plate1 = simulate_cytotable(\"Plate_1\")\n",
    "      plate2 = simulate_cytotable(\"Plate_2\")\n",
    "      single_cells_raw = pd.concat([plate1, plate2], ignore_index=True)\n",
    "\n",
    "      qc_annotations_raw = simulate_qc_parquet(single_cells_raw)\n",
    "\n",
    "      sc_path = tmp_dir / \"single_cells.parquet\"\n",
    "      qc_path = tmp_dir / \"qc.parquet\"\n",
    "      single_cells_raw.to_parquet(sc_path, index=False)\n",
    "      qc_annotations_raw.to_parquet(qc_path, index=False)\n",
    "\n",
    "      print(f\"single_cells.parquet  {single_cells_raw.shape[0]:,} rows x {single_cells_raw.shape[1]} cols\")\n",
    "      print(f\"qc.parquet            {qc_annotations_raw.shape[0]:,} rows x {qc_annotations_raw.shape[1]} cols\")\n",
    "      print(f\"\\nqc.parquet columns: {list(qc_annotations_raw.columns)}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "cell-simulate",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-01T19:34:59.726742Z",
     "iopub.status.busy": "2026-06-01T19:34:59.726457Z",
     "iopub.status.idle": "2026-06-01T19:34:59.852969Z",
     "shell.execute_reply": "2026-06-01T19:34:59.852647Z"
    },
    "nbsphinx": "hidden"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "single_cells.parquet  1,200 rows x 19 cols\n",
      "qc.parquet            1,200 rows x 7 cols\n",
      "\n",
      "qc.parquet columns: ['Metadata_Plate', 'Metadata_Well', 'Metadata_ImageNumber', 'Metadata_ObjectNumber', 'Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']\n"
     ]
    }
   ],
   "source": [
    "# ── Simulate CytoTable single-cell output ─────────────────────────────────\n",
    "#\n",
    "# In a real experiment this file is produced by:\n",
    "#   import cytotable\n",
    "#   cytotable.convert(source_path=\"...\", dest_path=\"single_cells.parquet\", ...)\n",
    "#\n",
    "# Here we generate synthetic data with the same column schema.\n",
    "\n",
    "WELLS = {\n",
    "    \"B02\": \"DMSO\",\n",
    "    \"C02\": \"DMSO\",\n",
    "    \"B03\": \"Compound_A\",\n",
    "    \"C03\": \"Compound_A\",\n",
    "    \"B04\": \"Compound_B\",\n",
    "    \"C04\": \"Compound_B\",\n",
    "}\n",
    "N_CELLS_PER_WELL = 100\n",
    "\n",
    "\n",
    "def simulate_cytotable(plate_id: str) -> pd.DataFrame:\n",
    "    \"\"\"Generate a synthetic CytoTable-style single-cell DataFrame.\"\"\"\n",
    "    rows = []\n",
    "    for img_num, (well, treatment) in enumerate(WELLS.items(), start=1):\n",
    "        is_a = float(treatment == \"Compound_A\")\n",
    "        is_b = float(treatment == \"Compound_B\")\n",
    "        cell_areas = rng.normal(500 + 180 * is_a - 90 * is_b, 120, N_CELLS_PER_WELL)\n",
    "        nuclei_areas = rng.normal(195, 55, N_CELLS_PER_WELL)\n",
    "        for obj_num in range(1, N_CELLS_PER_WELL + 1):\n",
    "            rows.append({\n",
    "                # ── CytoTable metadata ──────────────────────────────────\n",
    "                \"Metadata_Plate\": plate_id,\n",
    "                \"Metadata_Well\": well,\n",
    "                \"Metadata_ImageNumber\": img_num,\n",
    "                \"Metadata_ObjectNumber\": obj_num,\n",
    "                # CytoTable provenance columns\n",
    "                \"cytotable_meta_source_path\": f\"/data/{plate_id}/images/\",\n",
    "                \"cytotable_meta_offset\": (img_num - 1) * N_CELLS_PER_WELL + obj_num,\n",
    "                \"cytotable_meta_rownum\": obj_num,\n",
    "                # ── Feature columns ─────────────────────────────────────\n",
    "                \"Cells_AreaShape_Area\": cell_areas[obj_num - 1],\n",
    "                \"Cells_AreaShape_BoundingBoxArea\": cell_areas[obj_num - 1] * 1.3\n",
    "                + rng.normal(0, 4),\n",
    "                \"Cells_AreaShape_EulerNumber\": 1,\n",
    "                \"Cells_AreaShape_Eccentricity\": float(\n",
    "                    np.clip(rng.normal(0.55, 0.12), 0, 1)\n",
    "                ),\n",
    "                \"Cells_Intensity_MeanIntensity_Mito\": rng.normal(0.30, 0.06),\n",
    "                \"Cells_Texture_Correlation_RNA_3_0_256\": rng.normal(0.22, 0.06),\n",
    "                \"Cytoplasm_AreaShape_Area\": rng.normal(310, 80),\n",
    "                \"Cytoplasm_Intensity_MeanIntensity_AGP\": rng.normal(0.25, 0.07),\n",
    "                \"Nuclei_AreaShape_Area\": nuclei_areas[obj_num - 1],\n",
    "                \"Nuclei_AreaShape_Eccentricity\": float(\n",
    "                    np.clip(rng.normal(0.40, 0.10), 0, 1)\n",
    "                ),\n",
    "                \"Nuclei_Intensity_MeanIntensity_DNA\": rng.normal(0.50, 0.08),\n",
    "                \"Nuclei_Intensity_MassDisplacement_DNA\": abs(rng.normal(6, 4)),\n",
    "            })\n",
    "    return pd.DataFrame(rows)\n",
    "\n",
    "\n",
    "# ── Simulate coSMicQC annotation output (qc.parquet) ──────────────────────\n",
    "#\n",
    "# coSMicQC label_outliers(...) flags outliers using signed z-score thresholds\n",
    "# and writes a compact annotation file with Metadata_cqc_* boolean columns.\n",
    "# Here we reproduce that schema directly.\n",
    "\n",
    "\n",
    "def simulate_qc_parquet(sc_df: pd.DataFrame) -> pd.DataFrame:\n",
    "    \"\"\"Reproduce the annotation schema produced by coSMicQC label_outliers().\"\"\"\n",
    "\n",
    "    join_keys = [\n",
    "        \"Metadata_Plate\",\n",
    "        \"Metadata_Well\",\n",
    "        \"Metadata_ImageNumber\",\n",
    "        \"Metadata_ObjectNumber\",\n",
    "    ]\n",
    "\n",
    "    qc = sc_df[join_keys].copy()\n",
    "\n",
    "    nuc_area = sc_df[\"Nuclei_AreaShape_Area\"]\n",
    "    nuc_z = (nuc_area - nuc_area.mean()) / nuc_area.std()\n",
    "\n",
    "    mass_disp = sc_df[\"Nuclei_Intensity_MassDisplacement_DNA\"]\n",
    "    mass_disp_z = (mass_disp - mass_disp.mean()) / mass_disp.std()\n",
    "\n",
    "    qc[\"Metadata_cqc_large_nuclear_size_is_outlier\"] = nuc_z > 2.5\n",
    "    qc[\"Metadata_cqc_small_nuclear_size_is_outlier\"] = nuc_z < -2.5\n",
    "    qc[\"Metadata_cqc_poor_segmentation_is_outlier\"] = mass_disp_z > 2.5\n",
    "\n",
    "    return qc\n",
    "\n",
    "\n",
    "# Build two plates, concatenate, then write both files to disk\n",
    "plate1 = simulate_cytotable(\"Plate_1\")\n",
    "plate2 = simulate_cytotable(\"Plate_2\")\n",
    "single_cells_raw = pd.concat([plate1, plate2], ignore_index=True)\n",
    "\n",
    "qc_annotations_raw = simulate_qc_parquet(single_cells_raw)\n",
    "\n",
    "sc_path = tmp_dir / \"single_cells.parquet\"\n",
    "qc_path = tmp_dir / \"qc.parquet\"\n",
    "single_cells_raw.to_parquet(sc_path, index=False)\n",
    "qc_annotations_raw.to_parquet(qc_path, index=False)\n",
    "\n",
    "print(\n",
    "    f\"single_cells.parquet  {single_cells_raw.shape[0]:,} rows x {single_cells_raw.shape[1]} cols\"\n",
    ")\n",
    "print(\n",
    "    f\"qc.parquet            {qc_annotations_raw.shape[0]:,} rows x {qc_annotations_raw.shape[1]} cols\"\n",
    ")\n",
    "print(f\"\\nqc.parquet columns: {list(qc_annotations_raw.columns)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "cell-load-inspect",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-01T19:34:59.854384Z",
     "iopub.status.busy": "2026-06-01T19:34:59.854273Z",
     "iopub.status.idle": "2026-06-01T19:34:59.988267Z",
     "shell.execute_reply": "2026-06-01T19:34:59.987891Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 1,200 single cells across 2 plates and 6 unique wells\n",
      "\n",
      "Feature columns (12): ['Cells_AreaShape_Area', 'Cells_AreaShape_BoundingBoxArea', 'Cells_AreaShape_EulerNumber', 'Cells_AreaShape_Eccentricity', 'Cells_Intensity_MeanIntensity_Mito', 'Cells_Texture_Correlation_RNA_3_0_256', 'Cytoplasm_AreaShape_Area', 'Cytoplasm_Intensity_MeanIntensity_AGP', 'Nuclei_AreaShape_Area', 'Nuclei_AreaShape_Eccentricity', 'Nuclei_Intensity_MeanIntensity_DNA', 'Nuclei_Intensity_MassDisplacement_DNA']\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metadata_Plate</th>\n",
       "      <th>Metadata_Well</th>\n",
       "      <th>Metadata_ImageNumber</th>\n",
       "      <th>Metadata_ObjectNumber</th>\n",
       "      <th>cytotable_meta_source_path</th>\n",
       "      <th>cytotable_meta_offset</th>\n",
       "      <th>cytotable_meta_rownum</th>\n",
       "      <th>Cells_AreaShape_Area</th>\n",
       "      <th>Cells_AreaShape_BoundingBoxArea</th>\n",
       "      <th>Cells_AreaShape_EulerNumber</th>\n",
       "      <th>Cells_AreaShape_Eccentricity</th>\n",
       "      <th>Cells_Intensity_MeanIntensity_Mito</th>\n",
       "      <th>Cells_Texture_Correlation_RNA_3_0_256</th>\n",
       "      <th>Cytoplasm_AreaShape_Area</th>\n",
       "      <th>Cytoplasm_Intensity_MeanIntensity_AGP</th>\n",
       "      <th>Nuclei_AreaShape_Area</th>\n",
       "      <th>Nuclei_AreaShape_Eccentricity</th>\n",
       "      <th>Nuclei_Intensity_MeanIntensity_DNA</th>\n",
       "      <th>Nuclei_Intensity_MassDisplacement_DNA</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>/data/Plate_1/images/</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>536.566050</td>\n",
       "      <td>698.886163</td>\n",
       "      <td>1</td>\n",
       "      <td>0.718898</td>\n",
       "      <td>0.305435</td>\n",
       "      <td>0.258636</td>\n",
       "      <td>145.986232</td>\n",
       "      <td>0.246590</td>\n",
       "      <td>174.201060</td>\n",
       "      <td>0.315677</td>\n",
       "      <td>0.402495</td>\n",
       "      <td>2.487391</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>/data/Plate_1/images/</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>375.201907</td>\n",
       "      <td>486.425986</td>\n",
       "      <td>1</td>\n",
       "      <td>0.659908</td>\n",
       "      <td>0.220416</td>\n",
       "      <td>0.221838</td>\n",
       "      <td>271.266445</td>\n",
       "      <td>0.227063</td>\n",
       "      <td>266.457556</td>\n",
       "      <td>0.500276</td>\n",
       "      <td>0.543049</td>\n",
       "      <td>11.349592</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>/data/Plate_1/images/</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>590.054143</td>\n",
       "      <td>766.452364</td>\n",
       "      <td>1</td>\n",
       "      <td>0.466487</td>\n",
       "      <td>0.286568</td>\n",
       "      <td>0.234550</td>\n",
       "      <td>324.125869</td>\n",
       "      <td>0.174093</td>\n",
       "      <td>175.405482</td>\n",
       "      <td>0.409049</td>\n",
       "      <td>0.518258</td>\n",
       "      <td>16.069896</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Metadata_Plate Metadata_Well  Metadata_ImageNumber  Metadata_ObjectNumber  \\\n",
       "0        Plate_1           B02                     1                      1   \n",
       "1        Plate_1           B02                     1                      2   \n",
       "2        Plate_1           B02                     1                      3   \n",
       "\n",
       "  cytotable_meta_source_path  cytotable_meta_offset  cytotable_meta_rownum  \\\n",
       "0      /data/Plate_1/images/                      1                      1   \n",
       "1      /data/Plate_1/images/                      2                      2   \n",
       "2      /data/Plate_1/images/                      3                      3   \n",
       "\n",
       "   Cells_AreaShape_Area  Cells_AreaShape_BoundingBoxArea  \\\n",
       "0            536.566050                       698.886163   \n",
       "1            375.201907                       486.425986   \n",
       "2            590.054143                       766.452364   \n",
       "\n",
       "   Cells_AreaShape_EulerNumber  Cells_AreaShape_Eccentricity  \\\n",
       "0                            1                      0.718898   \n",
       "1                            1                      0.659908   \n",
       "2                            1                      0.466487   \n",
       "\n",
       "   Cells_Intensity_MeanIntensity_Mito  Cells_Texture_Correlation_RNA_3_0_256  \\\n",
       "0                            0.305435                               0.258636   \n",
       "1                            0.220416                               0.221838   \n",
       "2                            0.286568                               0.234550   \n",
       "\n",
       "   Cytoplasm_AreaShape_Area  Cytoplasm_Intensity_MeanIntensity_AGP  \\\n",
       "0                145.986232                               0.246590   \n",
       "1                271.266445                               0.227063   \n",
       "2                324.125869                               0.174093   \n",
       "\n",
       "   Nuclei_AreaShape_Area  Nuclei_AreaShape_Eccentricity  \\\n",
       "0             174.201060                       0.315677   \n",
       "1             266.457556                       0.500276   \n",
       "2             175.405482                       0.409049   \n",
       "\n",
       "   Nuclei_Intensity_MeanIntensity_DNA  Nuclei_Intensity_MassDisplacement_DNA  \n",
       "0                            0.402495                               2.487391  \n",
       "1                            0.543049                              11.349592  \n",
       "2                            0.518258                              16.069896  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the CytoTable parquet from disk\n",
    "single_cells = pd.read_parquet(sc_path)\n",
    "\n",
    "print(\n",
    "    f\"Loaded {len(single_cells):,} single cells across \"\n",
    "    f\"{single_cells['Metadata_Plate'].nunique()} plates and \"\n",
    "    f\"{single_cells['Metadata_Well'].nunique()} unique wells\"\n",
    ")\n",
    "print(\n",
    "    f\"\\nFeature columns ({len([c for c in single_cells.columns if not c.startswith('Metadata_') and not c.startswith('cytotable_')])}): \"\n",
    "    f\"{[c for c in single_cells.columns if not c.startswith('Metadata_') and not c.startswith('cytotable_')]}\"\n",
    ")\n",
    "single_cells.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-cosmicqc-intro",
   "metadata": {},
   "source": [
    "## Background: Single-cell quality control with coSMicQC [Optional]\n",
    "\n",
    "[coSMicQC](https://github.com/WayScience/coSMicQC) ([GitHub](https://github.com/WayScience/coSMicQC) | [docs](https://cytomining.github.io/coSMicQC/) | [preprint](https://www.biorxiv.org/content/10.1101/2025.10.14.682427v1)) is a Python package from the Way Lab that systematically identifies segmentation artifacts, for example:\n",
    "\n",
    "| Artifact | Morphological signature | Biological cause |\n",
    "|---|---|---|\n",
    "| **Debris / background** | Very small nucleus; low DNA intensity | Out-of-focus plane, dust on coverslip |\n",
    "| **Over-segmented nucleus** | Nucleus area far above the population mean | One nucleus split into multiple objects |\n",
    "| **Touching / fused cells** | Very high mass displacement from multiple objects | Adjacent cells merged into a single object |\n",
    "\n",
    "### How coSMicQC flags outliers\n",
    "\n",
    "coSMicQC computes a **z-score** for each quality-relevant feature across the entire experiment. Cells whose z-scores fall outside user-defined thresholds are flagged as outliers. Thresholds are **signed**:\n",
    "\n",
    "- A **negative threshold** (e.g. `−2.5`) flags cells where the feature is *unusually small*   (debris, broken nuclei).\n",
    "- A **positive threshold** (e.g. `+2.5`) flags cells where the feature is *unusually large*   (fused or over-segmented objects).\n",
    "\n",
    "The main entry point is `label_outliers()`, which accepts a dictionary of **named QC conditions**. Each condition name becomes part of the output column name, making the reason for each flag explicit and auditable:\n",
    "\n",
    "```python\n",
    "import cosmicqc\n",
    "\n",
    "labeled = cosmicqc.label_outliers(\n",
    "    df=single_cells,\n",
    "    feature_thresholds={\n",
    "        # Flag nuclei that are too small (debris)\n",
    "        \"small_nuclear_size\": {\n",
    "            \"Nuclei_AreaShape_Area\": -2.5,\n",
    "        },\n",
    "        # Flag nuclei that are too large (over-segmented)\n",
    "        \"large_nuclear_size\": {\n",
    "            \"Nuclei_AreaShape_Area\": 2.5,\n",
    "        },\n",
    "        # Flag cells with an abnormally high nuclear mass displacement\n",
    "        # (a hallmark of touching or merged nuclei in one object)\n",
    "        \"poor_segmentation\": {\n",
    "            \"Nuclei_Intensity_MassDisplacement_DNA\": 2.5,\n",
    "        },\n",
    "    },\n",
    "    include_threshold_scores=True,   # also write z-score columns for auditing\n",
    "    export_path=\"qc.parquet\",\n",
    "    export_as_annotations=True,      # write compact annotation file only\n",
    "    annotation_metadata_columns=[\n",
    "        \"Metadata_Plate\", \"Metadata_Well\",\n",
    "        \"Metadata_ImageNumber\", \"Metadata_ObjectNumber\",\n",
    "    ],\n",
    ")\n",
    "```\n",
    "\n",
    "### The `qc.parquet` annotation file\n",
    "\n",
    "When `export_as_annotations=True`, coSMicQC writes a **compact annotation file** called `qc.parquet`, which contains only the join-key metadata columns and the `Metadata_cqc_*` flag columns (not the full feature table). This makes `qc.parquet` lightweight and easy to share independently of the raw single-cell data.\n",
    "\n",
    "Each `Metadata_cqc_<condition>_is_outlier` column is a boolean: `True` = flagged, `False` = passes that QC check. A cell must pass **all** conditions to be included in downstream analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "cell-apply-qc",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-01T19:34:59.989678Z",
     "iopub.status.busy": "2026-06-01T19:34:59.989573Z",
     "iopub.status.idle": "2026-06-01T19:34:59.994761Z",
     "shell.execute_reply": "2026-06-01T19:34:59.994475Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "coSMicQC annotation columns:\n",
      "  Metadata_Plate\n",
      "  Metadata_Well\n",
      "  Metadata_ImageNumber\n",
      "  Metadata_ObjectNumber\n",
      "  Metadata_cqc_large_nuclear_size_is_outlier\n",
      "  Metadata_cqc_small_nuclear_size_is_outlier\n",
      "  Metadata_cqc_poor_segmentation_is_outlier\n",
      "\n",
      "  Metadata_cqc_large_nuclear_size_is_outlier: 5 cells flagged (0.4%)\n",
      "  Metadata_cqc_small_nuclear_size_is_outlier: 9 cells flagged (0.8%)\n",
      "  Metadata_cqc_poor_segmentation_is_outlier: 12 cells flagged (1.0%)\n"
     ]
    }
   ],
   "source": [
    "# Load the coSMicQC annotation file and inspect its contents\n",
    "qc_annotations = pd.read_parquet(qc_path)\n",
    "\n",
    "print(\"coSMicQC annotation columns:\")\n",
    "for col in qc_annotations.columns:\n",
    "    print(f\"  {col}\")\n",
    "\n",
    "outlier_cols = [c for c in qc_annotations.columns if c.endswith(\"_is_outlier\")]\n",
    "print()\n",
    "for col in outlier_cols:\n",
    "    n_flagged = qc_annotations[col].sum()\n",
    "    print(\n",
    "        f\"  {col}: {n_flagged:,} cells flagged ({100 * n_flagged / len(qc_annotations):.1f}%)\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-annotate-intro",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 1: Annotate\n",
    "\n",
    "`annotate()` does two jobs at once via its `external_metadata` parameter:\n",
    "\n",
    "1. **Plate-map join** attaches the biological condition (treatment, cell line, concentration) recorded for each well to every cell in that well.\n",
    "2. **External metadata merge** merges any additional per-cell metadata DataFrame or file. The most common use case is a `qc.parquet` file from coSMicQC: passing it as `external_metadata` adds the `Metadata_cqc_*` flag columns directly to the annotated profiles.\n",
    "\n",
    "| Parameter | Description |\n",
    "|---|---|\n",
    "| `platemap` | Maps well positions to treatment conditions |\n",
    "| `join_on` | Column pair `[platemap_col, profiles_col]` for the well-position join |\n",
    "| `external_metadata` | Path to `qc.parquet` (or any additional metadata DataFrame) |\n",
    "| `external_join_on` | Column(s) shared by profiles and external metadata (here the four-part cell identity key) |\n",
    "\n",
    "After `annotate()` runs, the `Metadata_cqc_*` flag columns are present on every row and flow straight into `normalize()`, which applies the QC filter internally via `drop_cosmicqc_rows=True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "cell-platemap",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-01T19:34:59.996077Z",
     "iopub.status.busy": "2026-06-01T19:34:59.995987Z",
     "iopub.status.idle": "2026-06-01T19:35:00.000086Z",
     "shell.execute_reply": "2026-06-01T19:34:59.999817Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>well_position</th>\n",
       "      <th>treatment</th>\n",
       "      <th>cell_line</th>\n",
       "      <th>concentration_um</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>B02</td>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>C02</td>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>B03</td>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>C03</td>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B04</td>\n",
       "      <td>Compound_B</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>C04</td>\n",
       "      <td>Compound_B</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  well_position   treatment cell_line  concentration_um\n",
       "0           B02        DMSO      HeLa               0.0\n",
       "1           C02        DMSO      HeLa               0.0\n",
       "2           B03  Compound_A      HeLa              10.0\n",
       "3           C03  Compound_A      HeLa              10.0\n",
       "4           B04  Compound_B      HeLa               5.0\n",
       "5           C04  Compound_B      HeLa               5.0"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "platemap = pd.DataFrame({\n",
    "    \"well_position\": [\"B02\", \"C02\", \"B03\", \"C03\", \"B04\", \"C04\"],\n",
    "    \"treatment\": [\n",
    "        \"DMSO\",\n",
    "        \"DMSO\",\n",
    "        \"Compound_A\",\n",
    "        \"Compound_A\",\n",
    "        \"Compound_B\",\n",
    "        \"Compound_B\",\n",
    "    ],\n",
    "    \"cell_line\": [\"HeLa\"] * 6,\n",
    "    \"concentration_um\": [0.0, 0.0, 10.0, 10.0, 5.0, 5.0],\n",
    "})\n",
    "platemap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "cell-annotate",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-01T19:35:00.001369Z",
     "iopub.status.busy": "2026-06-01T19:35:00.001281Z",
     "iopub.status.idle": "2026-06-01T19:35:00.018843Z",
     "shell.execute_reply": "2026-06-01T19:35:00.018552Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "New columns: ['Metadata_treatment', 'Metadata_cell_line', 'Metadata_concentration_um', 'Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']\n",
      "QC flag columns: ['Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']\n",
      "\n",
      "Cells flagged by any QC condition: 26 (2.2%)\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metadata_treatment</th>\n",
       "      <th>Metadata_cell_line</th>\n",
       "      <th>Metadata_concentration_um</th>\n",
       "      <th>Metadata_Plate</th>\n",
       "      <th>Metadata_Well</th>\n",
       "      <th>Metadata_ImageNumber</th>\n",
       "      <th>Metadata_ObjectNumber</th>\n",
       "      <th>Metadata_cqc_large_nuclear_size_is_outlier</th>\n",
       "      <th>Metadata_cqc_small_nuclear_size_is_outlier</th>\n",
       "      <th>Metadata_cqc_poor_segmentation_is_outlier</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>200</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>C02</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>400</th>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B03</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>600</th>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>C03</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>800</th>\n",
       "      <td>Compound_B</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>5.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B04</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Metadata_treatment Metadata_cell_line  Metadata_concentration_um  \\\n",
       "0                 DMSO               HeLa                        0.0   \n",
       "200               DMSO               HeLa                        0.0   \n",
       "400         Compound_A               HeLa                       10.0   \n",
       "600         Compound_A               HeLa                       10.0   \n",
       "800         Compound_B               HeLa                        5.0   \n",
       "\n",
       "    Metadata_Plate Metadata_Well  Metadata_ImageNumber  Metadata_ObjectNumber  \\\n",
       "0          Plate_1           B02                     1                      1   \n",
       "200        Plate_1           C02                     2                      1   \n",
       "400        Plate_1           B03                     3                      1   \n",
       "600        Plate_1           C03                     4                      1   \n",
       "800        Plate_1           B04                     5                      1   \n",
       "\n",
       "     Metadata_cqc_large_nuclear_size_is_outlier  \\\n",
       "0                                         False   \n",
       "200                                       False   \n",
       "400                                       False   \n",
       "600                                       False   \n",
       "800                                       False   \n",
       "\n",
       "     Metadata_cqc_small_nuclear_size_is_outlier  \\\n",
       "0                                         False   \n",
       "200                                       False   \n",
       "400                                       False   \n",
       "600                                       False   \n",
       "800                                       False   \n",
       "\n",
       "     Metadata_cqc_poor_segmentation_is_outlier  \n",
       "0                                        False  \n",
       "200                                      False  \n",
       "400                                      False  \n",
       "600                                      False  \n",
       "800                                      False  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "join_keys = [\n",
    "    \"Metadata_Plate\",\n",
    "    \"Metadata_Well\",\n",
    "    \"Metadata_ImageNumber\",\n",
    "    \"Metadata_ObjectNumber\",\n",
    "]\n",
    "\n",
    "# annotate() merges the plate map AND the QC annotation file in a single call.\n",
    "# The qc.parquet columns already carry the Metadata_ prefix, so they pass through\n",
    "# prepare_external_metadata_for_annotate() unchanged.\n",
    "annotated_cells = annotate(\n",
    "    profiles=single_cells,\n",
    "    platemap=platemap,\n",
    "    join_on=[\"Metadata_well_position\", \"Metadata_Well\"],\n",
    "    add_metadata_id_to_platemap=True,\n",
    "    external_metadata=str(qc_path),\n",
    "    external_join_on=join_keys,\n",
    ")\n",
    "\n",
    "new_cols = [c for c in annotated_cells.columns if c not in single_cells.columns]\n",
    "qc_cols = [c for c in new_cols if \"cqc\" in c]\n",
    "print(f\"New columns: {new_cols}\")\n",
    "print(f\"QC flag columns: {qc_cols}\")\n",
    "print(\n",
    "    f\"\\nCells flagged by any QC condition: \"\n",
    "    f\"{annotated_cells[qc_cols].any(axis=1).sum():,} \"\n",
    "    f\"({100 * annotated_cells[qc_cols].any(axis=1).mean():.1f}%)\"\n",
    ")\n",
    "print()\n",
    "annotated_cells[\n",
    "    [c for c in annotated_cells.columns if c.startswith(\"Metadata_\")]\n",
    "].drop_duplicates(subset=[\"Metadata_Well\"]).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-normalize-intro",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 2: Normalize\n",
    "\n",
    "Raw CellProfiler features vary in scale (cell area in pixels², intensities in 0–1) and are influenced by plate-to-plate technical effects. Normalization places all features on a common scale and limits plate-to-plate variation by z-scoring each feature relative to the **DMSO control cells**.\n",
    "\n",
    "Passing `drop_cosmicqc_rows=True` tells `normalize()` to drop every row where any `Metadata_cqc_*` flag is `True` before computing the z-scores, so QC filtering and normalization happen in a single call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "cell-normalize",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-01T19:35:00.020223Z",
     "iopub.status.busy": "2026-06-01T19:35:00.020114Z",
     "iopub.status.idle": "2026-06-01T19:35:00.034092Z",
     "shell.execute_reply": "2026-06-01T19:35:00.033754Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total cells             1,200\n",
      "Removed (QC outliers)      26  (2.2%)\n",
      "Retained                1,174\n",
      "\n",
      "Normalized shape: (1174, 22)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metadata_treatment</th>\n",
       "      <th>Metadata_cell_line</th>\n",
       "      <th>Metadata_concentration_um</th>\n",
       "      <th>Metadata_Plate</th>\n",
       "      <th>Metadata_Well</th>\n",
       "      <th>Metadata_ImageNumber</th>\n",
       "      <th>Metadata_ObjectNumber</th>\n",
       "      <th>Metadata_cqc_large_nuclear_size_is_outlier</th>\n",
       "      <th>Metadata_cqc_small_nuclear_size_is_outlier</th>\n",
       "      <th>Metadata_cqc_poor_segmentation_is_outlier</th>\n",
       "      <th>...</th>\n",
       "      <th>Cells_AreaShape_EulerNumber</th>\n",
       "      <th>Cells_AreaShape_Eccentricity</th>\n",
       "      <th>Cells_Intensity_MeanIntensity_Mito</th>\n",
       "      <th>Cells_Texture_Correlation_RNA_3_0_256</th>\n",
       "      <th>Cytoplasm_AreaShape_Area</th>\n",
       "      <th>Cytoplasm_Intensity_MeanIntensity_AGP</th>\n",
       "      <th>Nuclei_AreaShape_Area</th>\n",
       "      <th>Nuclei_AreaShape_Eccentricity</th>\n",
       "      <th>Nuclei_Intensity_MeanIntensity_DNA</th>\n",
       "      <th>Nuclei_Intensity_MassDisplacement_DNA</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.292772</td>\n",
       "      <td>0.100004</td>\n",
       "      <td>0.733196</td>\n",
       "      <td>-2.107301</td>\n",
       "      <td>-0.009458</td>\n",
       "      <td>-0.380694</td>\n",
       "      <td>-0.907252</td>\n",
       "      <td>-1.166556</td>\n",
       "      <td>-1.028890</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.819530</td>\n",
       "      <td>-1.325156</td>\n",
       "      <td>0.121467</td>\n",
       "      <td>-0.481165</td>\n",
       "      <td>-0.293417</td>\n",
       "      <td>1.319231</td>\n",
       "      <td>1.025113</td>\n",
       "      <td>0.607217</td>\n",
       "      <td>1.555226</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-0.883623</td>\n",
       "      <td>-0.280147</td>\n",
       "      <td>-1.368761</td>\n",
       "      <td>-0.591794</td>\n",
       "      <td>0.361401</td>\n",
       "      <td>0.749972</td>\n",
       "      <td>1.237712</td>\n",
       "      <td>-0.672132</td>\n",
       "      <td>-0.767620</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3 rows × 22 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  Metadata_treatment Metadata_cell_line  Metadata_concentration_um  \\\n",
       "0               DMSO               HeLa                        0.0   \n",
       "1               DMSO               HeLa                        0.0   \n",
       "3               DMSO               HeLa                        0.0   \n",
       "\n",
       "  Metadata_Plate Metadata_Well  Metadata_ImageNumber  Metadata_ObjectNumber  \\\n",
       "0        Plate_1           B02                     1                      1   \n",
       "1        Plate_1           B02                     1                      2   \n",
       "3        Plate_1           B02                     1                      4   \n",
       "\n",
       "   Metadata_cqc_large_nuclear_size_is_outlier  \\\n",
       "0                                       False   \n",
       "1                                       False   \n",
       "3                                       False   \n",
       "\n",
       "   Metadata_cqc_small_nuclear_size_is_outlier  \\\n",
       "0                                       False   \n",
       "1                                       False   \n",
       "3                                       False   \n",
       "\n",
       "   Metadata_cqc_poor_segmentation_is_outlier  ...  \\\n",
       "0                                      False  ...   \n",
       "1                                      False  ...   \n",
       "3                                      False  ...   \n",
       "\n",
       "   Cells_AreaShape_EulerNumber  Cells_AreaShape_Eccentricity  \\\n",
       "0                          0.0                      1.292772   \n",
       "1                          0.0                      0.819530   \n",
       "3                          0.0                     -0.883623   \n",
       "\n",
       "   Cells_Intensity_MeanIntensity_Mito  Cells_Texture_Correlation_RNA_3_0_256  \\\n",
       "0                            0.100004                               0.733196   \n",
       "1                           -1.325156                               0.121467   \n",
       "3                           -0.280147                              -1.368761   \n",
       "\n",
       "   Cytoplasm_AreaShape_Area  Cytoplasm_Intensity_MeanIntensity_AGP  \\\n",
       "0                 -2.107301                              -0.009458   \n",
       "1                 -0.481165                              -0.293417   \n",
       "3                 -0.591794                               0.361401   \n",
       "\n",
       "   Nuclei_AreaShape_Area  Nuclei_AreaShape_Eccentricity  \\\n",
       "0              -0.380694                      -0.907252   \n",
       "1               1.319231                       1.025113   \n",
       "3               0.749972                       1.237712   \n",
       "\n",
       "   Nuclei_Intensity_MeanIntensity_DNA  Nuclei_Intensity_MassDisplacement_DNA  \n",
       "0                           -1.166556                              -1.028890  \n",
       "1                            0.607217                               1.555226  \n",
       "3                           -0.672132                              -0.767620  \n",
       "\n",
       "[3 rows x 22 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# drop_cosmicqc_rows=True removes QC-flagged cells before z-scoring.\n",
    "normalized_cells = normalize(\n",
    "    profiles=annotated_cells,\n",
    "    features=\"infer\",\n",
    "    meta_features=\"infer\",\n",
    "    samples=\"Metadata_treatment == 'DMSO'\",\n",
    "    method=\"standardize\",\n",
    "    drop_cosmicqc_rows=True,\n",
    ")\n",
    "\n",
    "n_removed = len(annotated_cells) - len(normalized_cells)\n",
    "print(f\"{'Total cells':<22} {len(annotated_cells):>6,}\")\n",
    "print(\n",
    "    f\"{'Removed (QC outliers)':<22} {n_removed:>6,}  ({100 * n_removed / len(annotated_cells):.1f}%)\"\n",
    ")\n",
    "print(f\"{'Retained':<22} {len(normalized_cells):>6,}\")\n",
    "print()\n",
    "print(f\"Normalized shape: {normalized_cells.shape}\")\n",
    "normalized_cells.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-featsel-intro",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 3: Feature Selection\n",
    "\n",
    "Even after QC and normalization, some features carry little information:\n",
    "\n",
    "- **Low-variance features** are nearly constant across all cells and cannot distinguish biological conditions.\n",
    "- **Highly correlated feature pairs** are redundant; keeping both double-weights that axis of variation in clustering and embeddings.\n",
    "- **Blocklisted features** are known to capture image artifacts rather than cell biology.\n",
    "\n",
    "`feature_select()` applies all three filters, producing a lean feature matrix ready for single-cell analyses such as UMAP or hierarchical clustering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "cell-featsel",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-01T19:35:00.035826Z",
     "iopub.status.busy": "2026-06-01T19:35:00.035694Z",
     "iopub.status.idle": "2026-06-01T19:35:00.055266Z",
     "shell.execute_reply": "2026-06-01T19:35:00.054966Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Features before selection: 12\n",
      "Features after  selection: 10\n",
      "Features removed: {'Cells_AreaShape_Area', 'Cells_AreaShape_EulerNumber'}\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metadata_treatment</th>\n",
       "      <th>Metadata_cell_line</th>\n",
       "      <th>Metadata_concentration_um</th>\n",
       "      <th>Metadata_Plate</th>\n",
       "      <th>Metadata_Well</th>\n",
       "      <th>Metadata_ImageNumber</th>\n",
       "      <th>Metadata_ObjectNumber</th>\n",
       "      <th>Metadata_cqc_large_nuclear_size_is_outlier</th>\n",
       "      <th>Metadata_cqc_small_nuclear_size_is_outlier</th>\n",
       "      <th>Metadata_cqc_poor_segmentation_is_outlier</th>\n",
       "      <th>Cells_AreaShape_BoundingBoxArea</th>\n",
       "      <th>Cells_AreaShape_Eccentricity</th>\n",
       "      <th>Cells_Intensity_MeanIntensity_Mito</th>\n",
       "      <th>Cells_Texture_Correlation_RNA_3_0_256</th>\n",
       "      <th>Cytoplasm_AreaShape_Area</th>\n",
       "      <th>Cytoplasm_Intensity_MeanIntensity_AGP</th>\n",
       "      <th>Nuclei_AreaShape_Area</th>\n",
       "      <th>Nuclei_AreaShape_Eccentricity</th>\n",
       "      <th>Nuclei_Intensity_MeanIntensity_DNA</th>\n",
       "      <th>Nuclei_Intensity_MassDisplacement_DNA</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>0.423873</td>\n",
       "      <td>1.292772</td>\n",
       "      <td>0.100004</td>\n",
       "      <td>0.733196</td>\n",
       "      <td>-2.107301</td>\n",
       "      <td>-0.009458</td>\n",
       "      <td>-0.380694</td>\n",
       "      <td>-0.907252</td>\n",
       "      <td>-1.166556</td>\n",
       "      <td>-1.028890</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>-1.020114</td>\n",
       "      <td>0.819530</td>\n",
       "      <td>-1.325156</td>\n",
       "      <td>0.121467</td>\n",
       "      <td>-0.481165</td>\n",
       "      <td>-0.293417</td>\n",
       "      <td>1.319231</td>\n",
       "      <td>1.025113</td>\n",
       "      <td>0.607217</td>\n",
       "      <td>1.555226</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>1.139881</td>\n",
       "      <td>-0.883623</td>\n",
       "      <td>-0.280147</td>\n",
       "      <td>-1.368761</td>\n",
       "      <td>-0.591794</td>\n",
       "      <td>0.361401</td>\n",
       "      <td>0.749972</td>\n",
       "      <td>1.237712</td>\n",
       "      <td>-0.672132</td>\n",
       "      <td>-0.767620</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Metadata_treatment Metadata_cell_line  Metadata_concentration_um  \\\n",
       "0               DMSO               HeLa                        0.0   \n",
       "1               DMSO               HeLa                        0.0   \n",
       "3               DMSO               HeLa                        0.0   \n",
       "\n",
       "  Metadata_Plate Metadata_Well  Metadata_ImageNumber  Metadata_ObjectNumber  \\\n",
       "0        Plate_1           B02                     1                      1   \n",
       "1        Plate_1           B02                     1                      2   \n",
       "3        Plate_1           B02                     1                      4   \n",
       "\n",
       "   Metadata_cqc_large_nuclear_size_is_outlier  \\\n",
       "0                                       False   \n",
       "1                                       False   \n",
       "3                                       False   \n",
       "\n",
       "   Metadata_cqc_small_nuclear_size_is_outlier  \\\n",
       "0                                       False   \n",
       "1                                       False   \n",
       "3                                       False   \n",
       "\n",
       "   Metadata_cqc_poor_segmentation_is_outlier  Cells_AreaShape_BoundingBoxArea  \\\n",
       "0                                      False                         0.423873   \n",
       "1                                      False                        -1.020114   \n",
       "3                                      False                         1.139881   \n",
       "\n",
       "   Cells_AreaShape_Eccentricity  Cells_Intensity_MeanIntensity_Mito  \\\n",
       "0                      1.292772                            0.100004   \n",
       "1                      0.819530                           -1.325156   \n",
       "3                     -0.883623                           -0.280147   \n",
       "\n",
       "   Cells_Texture_Correlation_RNA_3_0_256  Cytoplasm_AreaShape_Area  \\\n",
       "0                               0.733196                 -2.107301   \n",
       "1                               0.121467                 -0.481165   \n",
       "3                              -1.368761                 -0.591794   \n",
       "\n",
       "   Cytoplasm_Intensity_MeanIntensity_AGP  Nuclei_AreaShape_Area  \\\n",
       "0                              -0.009458              -0.380694   \n",
       "1                              -0.293417               1.319231   \n",
       "3                               0.361401               0.749972   \n",
       "\n",
       "   Nuclei_AreaShape_Eccentricity  Nuclei_Intensity_MeanIntensity_DNA  \\\n",
       "0                      -0.907252                           -1.166556   \n",
       "1                       1.025113                            0.607217   \n",
       "3                       1.237712                           -0.672132   \n",
       "\n",
       "   Nuclei_Intensity_MassDisplacement_DNA  \n",
       "0                              -1.028890  \n",
       "1                               1.555226  \n",
       "3                              -0.767620  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selected_cells = feature_select(\n",
    "    profiles=normalized_cells,\n",
    "    features=\"infer\",\n",
    "    operation=[\"variance_threshold\", \"correlation_threshold\", \"blocklist\"],\n",
    ")\n",
    "\n",
    "feature_cols_before = [\n",
    "    c for c in normalized_cells.columns if not c.startswith(\"Metadata_\")\n",
    "]\n",
    "feature_cols_after = [\n",
    "    c for c in selected_cells.columns if not c.startswith(\"Metadata_\")\n",
    "]\n",
    "\n",
    "print(f\"Features before selection: {len(feature_cols_before)}\")\n",
    "print(f\"Features after  selection: {len(feature_cols_after)}\")\n",
    "print(f\"Features removed: {set(feature_cols_before) - set(feature_cols_after)}\")\n",
    "selected_cells.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-summary",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Summary\n",
    "\n",
    "You have processed a CytoTable single-cell dataset through a complete quality-control and normalization pipeline, preserving single-cell resolution throughout:\n",
    "\n",
    "| Step | Function | Input | Output |\n",
    "|---|---|---|---|\n",
    "| Load | `pd.read_parquet` | CytoTable Parquet | 1,200 single cells |\n",
    "| Annotate | `annotate()` | Single cells + platemap + `qc.parquet` | Cells with treatment labels and QC flags |\n",
    "| Normalize | `normalize(drop_cosmicqc_rows=True)` | Annotated cells | ~1,176 passing cells, Z-scored |\n",
    "| Feature select | `feature_select()` | 11 features | 9 features |\n",
    "\n",
    "The output is a **clean, normalized single-cell feature matrix**, `selected_cells`, where every row is one cell and every column is an informative morphological feature.\n",
    "\n",
    "### Next steps\n",
    "\n",
    "- **Embed**: run UMAP or t-SNE on `selected_cells` to visualize how treatments separate in morphological space at single-cell resolution.\n",
    "- **Cluster**: apply k-means or Leiden clustering to discover subpopulations within each treatment condition.\n",
    "- **Aggregate**: feed `selected_cells` into `aggregate()` if you need well-level profiles (e.g. for the consensus pipeline shown in the [Introduction to Image-based Profiling with Pycytominer](introduction_to_pycytominer.ipynb) tutorial).\n",
    "- **Hit calling**: identify which compounds produce a statistically significant morphological change relative to controls. [Buscar](https://github.com/WayScience/Buscar) operates directly on single-cell distributions to capture cellular heterogeneity and flag off-target effects."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "qc_env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}