{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-title",
   "metadata": {},
   "source": [
    "# Introduction to image-based profiling with Pycytominer\n",
    "\n",
    "**Welcome!** This tutorial introduces [Pycytominer](https://pycytominer.readthedocs.io), a Python library\n",
    "for processing image-based profiling data from high-content microscopy experiments.\n",
    "\n",
    "## What You Will Learn\n",
    "\n",
    "By the end of this tutorial, you will know how to:\n",
    "\n",
    "1. **Aggregate** thousands of single-cell measurements into one representative profile per experimental well\n",
    "2. **Annotate** profiles with experimental metadata, such as which compound was applied to each well\n",
    "3. **Normalize** feature values to remove plate-to-plate technical variation\n",
    "4. **Select features** to remove uninformative or redundant measurements\n",
    "5. **Build consensus profiles** that collapse replicate experiments into a single representative vector\n",
    "\n",
    "## Background: What Is High-Content Microscopy?\n",
    "\n",
    "High-content microscopy measures hundreds to thousands of informative phenotypic features that\n",
    "represent the morphology state of cells under different biological conditions (e.g., healthy vs. disease). High-content microscopy\n",
    "is often paired with high-throughput screening experiments that perturb cells with small-molecule compounds or genetic perturbations.\n",
    "\n",
    "In a typical experiment:\n",
    "\n",
    "1. Cells are grown in multi-well plates and treated with a panel of perturbations.\n",
    "2. Optionally apply fluorescence dyes to stain distinct cellular compartments.\n",
    "3. Automated microscopes capture hundreds of images per plate.\n",
    "\n",
    "![Cell staining and imaging pipeline](images/cell_to_image.png)\n",
    "\n",
    "4. Image analysis software (such as [CellProfiler](https://cellprofiler.org/)) extracts several\n",
    "   thousand numerical features per detected cell, describing each compartment's and channel's shape, texture,\n",
    "   and fluorescence intensity.\n",
    "\n",
    "![From microscopy images to single-cell feature measurements](images/image_to_features.png)\n",
    "\n",
    "A single experiment can generate measurements from **millions of individual cells**, spanning\n",
    "hundreds to thousands of features. The central challenge is transforming this raw, high-dimensional\n",
    "data into clean, interpretable **image-based profiles** — compact, comparable vectors that\n",
    "summarise how each condition changed the appearance of cells.\n",
    "\n",
    "That is exactly what Pycytominer does ([Serrano et al., 2025](https://doi.org/10.1038/s41592-025-02611-8)), which has been grounded in image-based profiling methods established over the past decade ([Caicedo et al., 2017](https://doi.org/10.1038/nmeth.4397), [Serrano et al. 2026](https://doi.org/10.1038/s44320-026-00197-7))."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-prereqs",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "This tutorial assumes you have:\n",
    "\n",
    "- [Installed Pycytominer](https://pycytominer.readthedocs.io)\n",
    "- Familiarity with [pandas DataFrames](https://pandas.pydata.org/)\n",
    "- *(Optional)* Completed the\n",
    "  [CytoTable tutorial](https://cytomining.github.io/CytoTable/tutorials/cellprofiler_to_parquet.html),\n",
    "  which shows how to convert raw CellProfiler output into the Parquet format that\n",
    "  Pycytominer reads as input.\n",
    "\n",
    "## The Pycytominer Pipeline at a Glance\n",
    "\n",
    "Raw single-cell data travels through five sequential steps:\n",
    "\n",
    "| Step | Pycytominer function | What changes |\n",
    "|------|----------------------|--------------|\n",
    "| 1. Aggregate     | `aggregate()`     | One row per cell → one row per well |\n",
    "| 2. Annotate      | `annotate()`      | Well positions → biological treatment labels |\n",
    "| 3. Normalize     | `normalize()`     | Raw feature values → z-scores relative to controls |\n",
    "| 4. Feature Select| `feature_select()`| Hundreds of features → only the informative ones |\n",
    "| 5. Consensus     | `consensus()`     | One row per well → one row per treatment condition |\n",
    "\n",
    "At the end, you have a compact, analysis-ready table where each row is a unique biological\n",
    "condition and each column is an informative morphological measurement."
   ]
  },
  {
   "cell_type": "raw",
   "id": "cell-pipeline-diagram",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. mermaid::\n",
    "   :align: center\n",
    "\n",
    "   flowchart TD\n",
    "       input[\"🔬 Single-cell data<br/>1,200 cells, 11 features\"]\n",
    "       agg[\"🪣 aggregate()<br/>Pool cells per well, 12 profiles\"]\n",
    "       ann[\"🏷️ annotate()<br/>Join plate map, add treatment labels\"]\n",
    "       nor[\"⚖️ normalize()<br/>Z-score vs DMSO controls\"]\n",
    "       fea[\"✂️ feature_select()<br/>Remove redundant, 9 of 11 features kept\"]\n",
    "       con[\"🤝 consensus()<br/>Median across plates, 3 conditions\"]\n",
    "       output[\"📊 Morphological profiles<br/>3 conditions, 9 features\"]\n",
    "\n",
    "       input --> agg --> ann --> nor --> fea --> con --> output\n",
    "\n",
    "       style input  fill:#f0d9fa,stroke:#88239A,color:#111\n",
    "       style output fill:#f0d9fa,stroke:#88239A,color:#111\n",
    "       style agg fill:#ffffff,stroke:#88239A,color:#111\n",
    "       style ann fill:#ffffff,stroke:#88239A,color:#111\n",
    "       style nor fill:#ffffff,stroke:#88239A,color:#111\n",
    "       style fea fill:#ffffff,stroke:#88239A,color:#111\n",
    "       style con fill:#ffffff,stroke:#88239A,color:#111"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "cell-imports",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-26T20:08:25.462328Z",
     "iopub.status.busy": "2026-05-26T20:08:25.462230Z",
     "iopub.status.idle": "2026-05-26T20:08:27.029270Z",
     "shell.execute_reply": "2026-05-26T20:08:27.028920Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from pycytominer import aggregate, annotate, consensus, feature_select, normalize\n",
    "\n",
    "# Fix the random seed so this tutorial produces identical results every time it is run\n",
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-data-intro",
   "metadata": {},
   "source": [
    "## Tutorial Data\n",
    "\n",
    "In a real workflow, you would start from the Parquet file produced by\n",
    "[CytoTable](https://cytomining.github.io/CytoTable/tutorials/cellprofiler_to_parquet.html):\n",
    "\n",
    "```python\n",
    "from pycytominer.cyto_utils import load_profiles\n",
    "\n",
    "# Load single-cell measurements exported by CytoTable\n",
    "single_cells = load_profiles(\"outputs/examplehuman.parquet\")\n",
    "```\n",
    "\n",
    "For this tutorial we generate a small **synthetic dataset** that mirrors the exact structure of\n",
    "a real high-content microscopy experiment. The column names, data types, and naming conventions are\n",
    "identical to what CellProfiler and CytoTable produce — only the numerical values are simulated.\n",
    "\n",
    "**Experiment design:**\n",
    "\n",
    "| Property | Value |\n",
    "|----------|-------|\n",
    "| Plates (biological replicates) | 2 |\n",
    "| Wells per plate | 6 (2 × DMSO vehicle control, 2 × Compound A, 2 × Compound B) |\n",
    "| Cells per well | ~100 |\n",
    "| Total single-cell measurements | ~1,200 |\n",
    "| Morphological features | 11 (across three compartments) |\n",
    "\n",
    "> **Note on the features:** Two of the eleven features are intentionally designed to be\n",
    "> uninformative — one is constant across all cells, and one is nearly perfectly correlated\n",
    "> with another. You will see these removed automatically in Step 4 (Feature Selection).\n",
    "\n",
    "The simulation function is available in the expandable block below if you'd like to inspect it — you can skip it and go straight to Step 1."
   ]
  },
  {
   "cell_type": "raw",
   "id": "cell-simulate-toggle",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. toggle::\n",
    "\n",
    "   .. code-block:: python\n",
    "\n",
    "      def simulate_single_cells(plate_id, n_cells_per_well=100):\n",
    "          \"\"\"\n",
    "          Generate synthetic single-cell morphology measurements for one plate.\n",
    "\n",
    "          Column naming follows the CellProfiler convention:\n",
    "            Metadata_*  — experimental context (plate, well, object identity)\n",
    "            Cells_*     — measurements of the whole-cell boundary\n",
    "            Cytoplasm_* — measurements of the cytoplasmic region\n",
    "            Nuclei_*    — measurements of the nuclear region\n",
    "\n",
    "          To keep this tutorial focused on the pipeline rather than biology,\n",
    "          only the cell-area features respond to treatment. All other features\n",
    "          are independent noise sampled from realistic distributions.\n",
    "          In a real experiment every feature may carry some biological signal.\n",
    "          \"\"\"\n",
    "          well_treatments = {\n",
    "              'B02': 'DMSO',\n",
    "              'C02': 'DMSO',\n",
    "              'B03': 'Compound_A',\n",
    "              'C03': 'Compound_A',\n",
    "              'B04': 'Compound_B',\n",
    "              'C04': 'Compound_B',\n",
    "          }\n",
    "\n",
    "          rows = []\n",
    "          for image_number, (well, treatment) in enumerate(well_treatments.items(), start=1):\n",
    "              is_a = float(treatment == 'Compound_A')\n",
    "              is_b = float(treatment == 'Compound_B')\n",
    "\n",
    "              # Only the Area family of features responds to treatment.\n",
    "              # This ensures only the intentionally correlated pair is removed in Step 4.\n",
    "              cell_area_base = np.random.normal(500 + 180 * is_a - 90 * is_b, 120, n_cells_per_well)\n",
    "\n",
    "              for obj_num in range(1, n_cells_per_well + 1):\n",
    "                  cell_area = cell_area_base[obj_num - 1]\n",
    "                  rows.append({\n",
    "                      # ── Metadata columns ──────────────────────────────────────────\n",
    "                      'Metadata_Plate':        plate_id,\n",
    "                      'Metadata_Well':         well,\n",
    "                      'Metadata_ImageNumber':  image_number,\n",
    "                      'Metadata_ObjectNumber': obj_num,\n",
    "                      # ── Cell-level features ───────────────────────────────────────\n",
    "                      # Area and BoundingBoxArea are intentionally correlated (r ≈ 0.99);\n",
    "                      # one of the pair will be removed during feature selection.\n",
    "                      'Cells_AreaShape_Area':            cell_area,\n",
    "                      'Cells_AreaShape_BoundingBoxArea': cell_area * 1.3 + np.random.normal(0, 4),\n",
    "                      # EulerNumber = 1 for virtually all cells (topological invariant);\n",
    "                      # zero variance → removed during feature selection.\n",
    "                      'Cells_AreaShape_EulerNumber':     1,\n",
    "                      # All remaining features: independent noise with realistic distributions\n",
    "                      'Cells_AreaShape_Eccentricity':          float(np.clip(np.random.normal(0.55, 0.12), 0, 1)),\n",
    "                      'Cells_Intensity_MeanIntensity_Mito':    np.random.normal(0.30, 0.06),\n",
    "                      'Cells_Texture_Correlation_RNA_3_0_256': np.random.normal(0.22, 0.06),\n",
    "                      # ── Cytoplasm features ────────────────────────────────────────\n",
    "                      'Cytoplasm_AreaShape_Area':              np.random.normal(310, 80),\n",
    "                      'Cytoplasm_Intensity_MeanIntensity_AGP': np.random.normal(0.25, 0.07),\n",
    "                      # ── Nuclei features ───────────────────────────────────────────\n",
    "                      'Nuclei_AreaShape_Area':                 np.random.normal(195, 55),\n",
    "                      'Nuclei_AreaShape_Eccentricity':         float(np.clip(np.random.normal(0.40, 0.10), 0, 1)),\n",
    "                      'Nuclei_Intensity_MeanIntensity_DNA':    np.random.normal(0.50, 0.08),\n",
    "                  })\n",
    "          return pd.DataFrame(rows)\n",
    "\n",
    "\n",
    "      # Generate data for two plates to simulate biological replicates\n",
    "      plate1 = simulate_single_cells('Plate_1')\n",
    "      plate2 = simulate_single_cells('Plate_2')\n",
    "      single_cells = pd.concat([plate1, plate2], ignore_index=True)\n",
    "\n",
    "      print(f'Single-cell dataset: {single_cells.shape[0]:,} rows (cells) x {single_cells.shape[1]} columns')\n",
    "      print(f'Plates:  {single_cells[\"Metadata_Plate\"].unique().tolist()}')\n",
    "      print(f'Wells:   {sorted(single_cells[\"Metadata_Well\"].unique().tolist())}')\n",
    "      print()\n",
    "      single_cells.head()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "cell-generate-data",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-26T20:08:27.031060Z",
     "iopub.status.busy": "2026-05-26T20:08:27.030897Z",
     "iopub.status.idle": "2026-05-26T20:08:27.065745Z",
     "shell.execute_reply": "2026-05-26T20:08:27.065459Z"
    },
    "nbsphinx": "hidden",
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Single-cell dataset: 1,200 rows (cells) x 15 columns\n",
      "Plates:  ['Plate_1', 'Plate_2']\n",
      "Wells:   ['B02', 'B03', 'B04', 'C02', 'C03', 'C04']\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metadata_Plate</th>\n",
       "      <th>Metadata_Well</th>\n",
       "      <th>Metadata_ImageNumber</th>\n",
       "      <th>Metadata_ObjectNumber</th>\n",
       "      <th>Cells_AreaShape_Area</th>\n",
       "      <th>Cells_AreaShape_BoundingBoxArea</th>\n",
       "      <th>Cells_AreaShape_EulerNumber</th>\n",
       "      <th>Cells_AreaShape_Eccentricity</th>\n",
       "      <th>Cells_Intensity_MeanIntensity_Mito</th>\n",
       "      <th>Cells_Texture_Correlation_RNA_3_0_256</th>\n",
       "      <th>Cytoplasm_AreaShape_Area</th>\n",
       "      <th>Cytoplasm_Intensity_MeanIntensity_AGP</th>\n",
       "      <th>Nuclei_AreaShape_Area</th>\n",
       "      <th>Nuclei_AreaShape_Eccentricity</th>\n",
       "      <th>Nuclei_Intensity_MeanIntensity_DNA</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>559.605698</td>\n",
       "      <td>721.825925</td>\n",
       "      <td>1</td>\n",
       "      <td>0.499523</td>\n",
       "      <td>0.279437</td>\n",
       "      <td>0.171863</td>\n",
       "      <td>297.097143</td>\n",
       "      <td>0.278284</td>\n",
       "      <td>298.740225</td>\n",
       "      <td>0.417458</td>\n",
       "      <td>0.520604</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>483.408284</td>\n",
       "      <td>628.132985</td>\n",
       "      <td>1</td>\n",
       "      <td>0.319747</td>\n",
       "      <td>0.298409</td>\n",
       "      <td>0.223614</td>\n",
       "      <td>507.059369</td>\n",
       "      <td>0.236535</td>\n",
       "      <td>211.585104</td>\n",
       "      <td>0.396529</td>\n",
       "      <td>0.406506</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>577.722625</td>\n",
       "      <td>755.610703</td>\n",
       "      <td>1</td>\n",
       "      <td>0.640232</td>\n",
       "      <td>0.347462</td>\n",
       "      <td>0.165437</td>\n",
       "      <td>422.223545</td>\n",
       "      <td>0.151870</td>\n",
       "      <td>227.277140</td>\n",
       "      <td>0.619046</td>\n",
       "      <td>0.420757</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>682.763583</td>\n",
       "      <td>885.327467</td>\n",
       "      <td>1</td>\n",
       "      <td>0.561958</td>\n",
       "      <td>0.269791</td>\n",
       "      <td>0.126960</td>\n",
       "      <td>315.485038</td>\n",
       "      <td>0.175639</td>\n",
       "      <td>221.047584</td>\n",
       "      <td>0.308058</td>\n",
       "      <td>0.623995</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>471.901595</td>\n",
       "      <td>610.339060</td>\n",
       "      <td>1</td>\n",
       "      <td>0.511353</td>\n",
       "      <td>0.348811</td>\n",
       "      <td>0.146148</td>\n",
       "      <td>328.196795</td>\n",
       "      <td>0.341500</td>\n",
       "      <td>106.588422</td>\n",
       "      <td>0.418463</td>\n",
       "      <td>0.520791</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Metadata_Plate Metadata_Well  Metadata_ImageNumber  Metadata_ObjectNumber  \\\n",
       "0        Plate_1           B02                     1                      1   \n",
       "1        Plate_1           B02                     1                      2   \n",
       "2        Plate_1           B02                     1                      3   \n",
       "3        Plate_1           B02                     1                      4   \n",
       "4        Plate_1           B02                     1                      5   \n",
       "\n",
       "   Cells_AreaShape_Area  Cells_AreaShape_BoundingBoxArea  \\\n",
       "0            559.605698                       721.825925   \n",
       "1            483.408284                       628.132985   \n",
       "2            577.722625                       755.610703   \n",
       "3            682.763583                       885.327467   \n",
       "4            471.901595                       610.339060   \n",
       "\n",
       "   Cells_AreaShape_EulerNumber  Cells_AreaShape_Eccentricity  \\\n",
       "0                            1                      0.499523   \n",
       "1                            1                      0.319747   \n",
       "2                            1                      0.640232   \n",
       "3                            1                      0.561958   \n",
       "4                            1                      0.511353   \n",
       "\n",
       "   Cells_Intensity_MeanIntensity_Mito  Cells_Texture_Correlation_RNA_3_0_256  \\\n",
       "0                            0.279437                               0.171863   \n",
       "1                            0.298409                               0.223614   \n",
       "2                            0.347462                               0.165437   \n",
       "3                            0.269791                               0.126960   \n",
       "4                            0.348811                               0.146148   \n",
       "\n",
       "   Cytoplasm_AreaShape_Area  Cytoplasm_Intensity_MeanIntensity_AGP  \\\n",
       "0                297.097143                               0.278284   \n",
       "1                507.059369                               0.236535   \n",
       "2                422.223545                               0.151870   \n",
       "3                315.485038                               0.175639   \n",
       "4                328.196795                               0.341500   \n",
       "\n",
       "   Nuclei_AreaShape_Area  Nuclei_AreaShape_Eccentricity  \\\n",
       "0             298.740225                       0.417458   \n",
       "1             211.585104                       0.396529   \n",
       "2             227.277140                       0.619046   \n",
       "3             221.047584                       0.308058   \n",
       "4             106.588422                       0.418463   \n",
       "\n",
       "   Nuclei_Intensity_MeanIntensity_DNA  \n",
       "0                            0.520604  \n",
       "1                            0.406506  \n",
       "2                            0.420757  \n",
       "3                            0.623995  \n",
       "4                            0.520791  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def simulate_single_cells(plate_id, n_cells_per_well=100):\n",
    "    \"\"\"\n",
    "    Generate synthetic single-cell morphology measurements for one plate.\n",
    "\n",
    "    Column naming follows the CellProfiler convention:\n",
    "      Metadata_*  — experimental context (plate, well, object identity)\n",
    "      Cells_*     — measurements of the whole-cell boundary\n",
    "      Cytoplasm_* — measurements of the cytoplasmic region\n",
    "      Nuclei_*    — measurements of the nuclear region\n",
    "\n",
    "    To keep this tutorial focused on the pipeline rather than biology,\n",
    "    only the cell-area features respond to treatment. All other features\n",
    "    are independent noise sampled from realistic distributions.\n",
    "    In a real experiment every feature may carry some biological signal.\n",
    "    \"\"\"\n",
    "    well_treatments = {\n",
    "        \"B02\": \"DMSO\",\n",
    "        \"C02\": \"DMSO\",\n",
    "        \"B03\": \"Compound_A\",\n",
    "        \"C03\": \"Compound_A\",\n",
    "        \"B04\": \"Compound_B\",\n",
    "        \"C04\": \"Compound_B\",\n",
    "    }\n",
    "\n",
    "    rows = []\n",
    "    for image_number, (well, treatment) in enumerate(well_treatments.items(), start=1):\n",
    "        is_a = float(treatment == \"Compound_A\")\n",
    "        is_b = float(treatment == \"Compound_B\")\n",
    "\n",
    "        # Only the Area family of features responds to treatment.\n",
    "        # This ensures only the intentionally correlated pair is removed in Step 4.\n",
    "        cell_area_base = np.random.normal(\n",
    "            500 + 180 * is_a - 90 * is_b, 120, n_cells_per_well\n",
    "        )\n",
    "\n",
    "        for obj_num in range(1, n_cells_per_well + 1):\n",
    "            cell_area = cell_area_base[obj_num - 1]\n",
    "            rows.append({\n",
    "                # ── Metadata columns ──────────────────────────────────────────\n",
    "                \"Metadata_Plate\": plate_id,\n",
    "                \"Metadata_Well\": well,\n",
    "                \"Metadata_ImageNumber\": image_number,\n",
    "                \"Metadata_ObjectNumber\": obj_num,\n",
    "                # ── Cell-level features ───────────────────────────────────────\n",
    "                # Area and BoundingBoxArea are intentionally correlated (r ≈ 0.99);\n",
    "                # one of the pair will be removed during feature selection.\n",
    "                \"Cells_AreaShape_Area\": cell_area,\n",
    "                \"Cells_AreaShape_BoundingBoxArea\": cell_area * 1.3\n",
    "                + np.random.normal(0, 4),\n",
    "                # EulerNumber = 1 for virtually all cells (topological invariant);\n",
    "                # zero variance → removed during feature selection.\n",
    "                \"Cells_AreaShape_EulerNumber\": 1,\n",
    "                # All remaining features: independent noise with realistic distributions\n",
    "                \"Cells_AreaShape_Eccentricity\": float(\n",
    "                    np.clip(np.random.normal(0.55, 0.12), 0, 1)\n",
    "                ),\n",
    "                \"Cells_Intensity_MeanIntensity_Mito\": np.random.normal(0.30, 0.06),\n",
    "                \"Cells_Texture_Correlation_RNA_3_0_256\": np.random.normal(0.22, 0.06),\n",
    "                # ── Cytoplasm features ────────────────────────────────────────\n",
    "                \"Cytoplasm_AreaShape_Area\": np.random.normal(310, 80),\n",
    "                \"Cytoplasm_Intensity_MeanIntensity_AGP\": np.random.normal(0.25, 0.07),\n",
    "                # ── Nuclei features ───────────────────────────────────────────\n",
    "                \"Nuclei_AreaShape_Area\": np.random.normal(195, 55),\n",
    "                \"Nuclei_AreaShape_Eccentricity\": float(\n",
    "                    np.clip(np.random.normal(0.40, 0.10), 0, 1)\n",
    "                ),\n",
    "                \"Nuclei_Intensity_MeanIntensity_DNA\": np.random.normal(0.50, 0.08),\n",
    "            })\n",
    "    return pd.DataFrame(rows)\n",
    "\n",
    "\n",
    "# Generate data for two plates to simulate biological replicates\n",
    "plate1 = simulate_single_cells(\"Plate_1\")\n",
    "plate2 = simulate_single_cells(\"Plate_2\")\n",
    "single_cells = pd.concat([plate1, plate2], ignore_index=True)\n",
    "\n",
    "print(\n",
    "    f\"Single-cell dataset: {single_cells.shape[0]:,} rows (cells) x {single_cells.shape[1]} columns\"\n",
    ")\n",
    "print(f\"Plates:  {single_cells['Metadata_Plate'].unique().tolist()}\")\n",
    "print(f\"Wells:   {sorted(single_cells['Metadata_Well'].unique().tolist())}\")\n",
    "print()\n",
    "single_cells.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-aggregate-intro",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 1: Aggregate — From Cells to Wells\n",
    "\n",
    "The single-cell table contains one row for every detected cell — in a real experiment this can\n",
    "easily reach hundreds of thousands of rows. However, biological interpretation happens at\n",
    "the level of the *well* (which treatment was applied), not the individual cell.\n",
    "\n",
    "`aggregate()` summarises all cells within the same well into a **single representative profile**\n",
    "by computing the median of each feature across all cells in that well.\n",
    "\n",
    "| Parameter | Description |\n",
    "|-----------|-------------|\n",
    "| `population_df` | The single-cell DataFrame |\n",
    "| `strata` | Columns that identify each well — cells sharing the same strata values are pooled together |\n",
    "| `features='infer'` | Automatically detect feature columns (any column whose name starts with a compartment prefix such as `Cells_`, `Cytoplasm_`, or `Nuclei_`) |\n",
    "| `operation` | Summary statistic: `'median'` (default) or `'mean'` |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "cell-aggregate",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-26T20:08:27.067683Z",
     "iopub.status.busy": "2026-05-26T20:08:27.067540Z",
     "iopub.status.idle": "2026-05-26T20:08:27.078438Z",
     "shell.execute_reply": "2026-05-26T20:08:27.078153Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Single cells:  1,200 rows  →  Well profiles: 12 rows\n",
      "Columns:       15          →               13\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metadata_Plate</th>\n",
       "      <th>Metadata_Well</th>\n",
       "      <th>Cells_AreaShape_Area</th>\n",
       "      <th>Cells_AreaShape_BoundingBoxArea</th>\n",
       "      <th>Cells_AreaShape_EulerNumber</th>\n",
       "      <th>Cells_AreaShape_Eccentricity</th>\n",
       "      <th>Cells_Intensity_MeanIntensity_Mito</th>\n",
       "      <th>Cells_Texture_Correlation_RNA_3_0_256</th>\n",
       "      <th>Cytoplasm_AreaShape_Area</th>\n",
       "      <th>Cytoplasm_Intensity_MeanIntensity_AGP</th>\n",
       "      <th>Nuclei_AreaShape_Area</th>\n",
       "      <th>Nuclei_AreaShape_Eccentricity</th>\n",
       "      <th>Nuclei_Intensity_MeanIntensity_DNA</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>484.765245</td>\n",
       "      <td>634.635906</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.532796</td>\n",
       "      <td>0.300860</td>\n",
       "      <td>0.216580</td>\n",
       "      <td>315.443258</td>\n",
       "      <td>0.263187</td>\n",
       "      <td>193.043658</td>\n",
       "      <td>0.418260</td>\n",
       "      <td>0.517625</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B03</td>\n",
       "      <td>669.260838</td>\n",
       "      <td>870.994323</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.536427</td>\n",
       "      <td>0.304651</td>\n",
       "      <td>0.210215</td>\n",
       "      <td>314.055342</td>\n",
       "      <td>0.256479</td>\n",
       "      <td>190.671898</td>\n",
       "      <td>0.405648</td>\n",
       "      <td>0.498834</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B04</td>\n",
       "      <td>399.970485</td>\n",
       "      <td>523.433308</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.533963</td>\n",
       "      <td>0.306359</td>\n",
       "      <td>0.222698</td>\n",
       "      <td>304.712780</td>\n",
       "      <td>0.252233</td>\n",
       "      <td>196.243981</td>\n",
       "      <td>0.394866</td>\n",
       "      <td>0.495179</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>C02</td>\n",
       "      <td>523.730623</td>\n",
       "      <td>685.309911</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.552298</td>\n",
       "      <td>0.307376</td>\n",
       "      <td>0.214065</td>\n",
       "      <td>328.921311</td>\n",
       "      <td>0.247123</td>\n",
       "      <td>197.812467</td>\n",
       "      <td>0.418402</td>\n",
       "      <td>0.489309</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>C03</td>\n",
       "      <td>671.001479</td>\n",
       "      <td>874.309123</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.557177</td>\n",
       "      <td>0.297985</td>\n",
       "      <td>0.212211</td>\n",
       "      <td>327.148733</td>\n",
       "      <td>0.235335</td>\n",
       "      <td>195.742017</td>\n",
       "      <td>0.406123</td>\n",
       "      <td>0.504755</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>C04</td>\n",
       "      <td>411.397559</td>\n",
       "      <td>535.575562</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.537410</td>\n",
       "      <td>0.296999</td>\n",
       "      <td>0.211021</td>\n",
       "      <td>307.751566</td>\n",
       "      <td>0.237890</td>\n",
       "      <td>187.333217</td>\n",
       "      <td>0.406370</td>\n",
       "      <td>0.501055</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Plate_2</td>\n",
       "      <td>B02</td>\n",
       "      <td>510.859835</td>\n",
       "      <td>663.499468</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.560717</td>\n",
       "      <td>0.279879</td>\n",
       "      <td>0.210651</td>\n",
       "      <td>297.559100</td>\n",
       "      <td>0.245555</td>\n",
       "      <td>186.979153</td>\n",
       "      <td>0.395230</td>\n",
       "      <td>0.509640</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Plate_2</td>\n",
       "      <td>B03</td>\n",
       "      <td>671.687004</td>\n",
       "      <td>869.272541</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.539517</td>\n",
       "      <td>0.310271</td>\n",
       "      <td>0.223312</td>\n",
       "      <td>323.749429</td>\n",
       "      <td>0.256434</td>\n",
       "      <td>193.420235</td>\n",
       "      <td>0.392844</td>\n",
       "      <td>0.503184</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Plate_2</td>\n",
       "      <td>B04</td>\n",
       "      <td>420.512564</td>\n",
       "      <td>542.359150</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.556893</td>\n",
       "      <td>0.309685</td>\n",
       "      <td>0.215100</td>\n",
       "      <td>297.933806</td>\n",
       "      <td>0.239380</td>\n",
       "      <td>185.984035</td>\n",
       "      <td>0.389872</td>\n",
       "      <td>0.498350</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Plate_2</td>\n",
       "      <td>C02</td>\n",
       "      <td>515.312157</td>\n",
       "      <td>667.242484</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.571428</td>\n",
       "      <td>0.292990</td>\n",
       "      <td>0.219456</td>\n",
       "      <td>304.836628</td>\n",
       "      <td>0.249090</td>\n",
       "      <td>195.434170</td>\n",
       "      <td>0.387314</td>\n",
       "      <td>0.501398</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Plate_2</td>\n",
       "      <td>C03</td>\n",
       "      <td>677.517456</td>\n",
       "      <td>882.379424</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.548610</td>\n",
       "      <td>0.296525</td>\n",
       "      <td>0.225144</td>\n",
       "      <td>295.410839</td>\n",
       "      <td>0.250994</td>\n",
       "      <td>197.704177</td>\n",
       "      <td>0.397884</td>\n",
       "      <td>0.514447</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Plate_2</td>\n",
       "      <td>C04</td>\n",
       "      <td>407.760144</td>\n",
       "      <td>528.536944</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.514893</td>\n",
       "      <td>0.294377</td>\n",
       "      <td>0.209433</td>\n",
       "      <td>333.205131</td>\n",
       "      <td>0.251384</td>\n",
       "      <td>201.713063</td>\n",
       "      <td>0.417950</td>\n",
       "      <td>0.493228</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Metadata_Plate Metadata_Well  Cells_AreaShape_Area  \\\n",
       "0         Plate_1           B02            484.765245   \n",
       "1         Plate_1           B03            669.260838   \n",
       "2         Plate_1           B04            399.970485   \n",
       "3         Plate_1           C02            523.730623   \n",
       "4         Plate_1           C03            671.001479   \n",
       "5         Plate_1           C04            411.397559   \n",
       "6         Plate_2           B02            510.859835   \n",
       "7         Plate_2           B03            671.687004   \n",
       "8         Plate_2           B04            420.512564   \n",
       "9         Plate_2           C02            515.312157   \n",
       "10        Plate_2           C03            677.517456   \n",
       "11        Plate_2           C04            407.760144   \n",
       "\n",
       "    Cells_AreaShape_BoundingBoxArea  Cells_AreaShape_EulerNumber  \\\n",
       "0                        634.635906                          1.0   \n",
       "1                        870.994323                          1.0   \n",
       "2                        523.433308                          1.0   \n",
       "3                        685.309911                          1.0   \n",
       "4                        874.309123                          1.0   \n",
       "5                        535.575562                          1.0   \n",
       "6                        663.499468                          1.0   \n",
       "7                        869.272541                          1.0   \n",
       "8                        542.359150                          1.0   \n",
       "9                        667.242484                          1.0   \n",
       "10                       882.379424                          1.0   \n",
       "11                       528.536944                          1.0   \n",
       "\n",
       "    Cells_AreaShape_Eccentricity  Cells_Intensity_MeanIntensity_Mito  \\\n",
       "0                       0.532796                            0.300860   \n",
       "1                       0.536427                            0.304651   \n",
       "2                       0.533963                            0.306359   \n",
       "3                       0.552298                            0.307376   \n",
       "4                       0.557177                            0.297985   \n",
       "5                       0.537410                            0.296999   \n",
       "6                       0.560717                            0.279879   \n",
       "7                       0.539517                            0.310271   \n",
       "8                       0.556893                            0.309685   \n",
       "9                       0.571428                            0.292990   \n",
       "10                      0.548610                            0.296525   \n",
       "11                      0.514893                            0.294377   \n",
       "\n",
       "    Cells_Texture_Correlation_RNA_3_0_256  Cytoplasm_AreaShape_Area  \\\n",
       "0                                0.216580                315.443258   \n",
       "1                                0.210215                314.055342   \n",
       "2                                0.222698                304.712780   \n",
       "3                                0.214065                328.921311   \n",
       "4                                0.212211                327.148733   \n",
       "5                                0.211021                307.751566   \n",
       "6                                0.210651                297.559100   \n",
       "7                                0.223312                323.749429   \n",
       "8                                0.215100                297.933806   \n",
       "9                                0.219456                304.836628   \n",
       "10                               0.225144                295.410839   \n",
       "11                               0.209433                333.205131   \n",
       "\n",
       "    Cytoplasm_Intensity_MeanIntensity_AGP  Nuclei_AreaShape_Area  \\\n",
       "0                                0.263187             193.043658   \n",
       "1                                0.256479             190.671898   \n",
       "2                                0.252233             196.243981   \n",
       "3                                0.247123             197.812467   \n",
       "4                                0.235335             195.742017   \n",
       "5                                0.237890             187.333217   \n",
       "6                                0.245555             186.979153   \n",
       "7                                0.256434             193.420235   \n",
       "8                                0.239380             185.984035   \n",
       "9                                0.249090             195.434170   \n",
       "10                               0.250994             197.704177   \n",
       "11                               0.251384             201.713063   \n",
       "\n",
       "    Nuclei_AreaShape_Eccentricity  Nuclei_Intensity_MeanIntensity_DNA  \n",
       "0                        0.418260                            0.517625  \n",
       "1                        0.405648                            0.498834  \n",
       "2                        0.394866                            0.495179  \n",
       "3                        0.418402                            0.489309  \n",
       "4                        0.406123                            0.504755  \n",
       "5                        0.406370                            0.501055  \n",
       "6                        0.395230                            0.509640  \n",
       "7                        0.392844                            0.503184  \n",
       "8                        0.389872                            0.498350  \n",
       "9                        0.387314                            0.501398  \n",
       "10                       0.397884                            0.514447  \n",
       "11                       0.417950                            0.493228  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "well_profiles = aggregate(\n",
    "    population_df=single_cells,\n",
    "    strata=[\"Metadata_Plate\", \"Metadata_Well\"],\n",
    "    features=\"infer\",\n",
    "    operation=\"median\",\n",
    ")\n",
    "\n",
    "print(\n",
    "    f\"Single cells:  {single_cells.shape[0]:,} rows  →  Well profiles: {well_profiles.shape[0]} rows\"\n",
    ")\n",
    "print(\n",
    "    f\"Columns:       {single_cells.shape[1]}          →               {well_profiles.shape[1]}\"\n",
    ")\n",
    "print()\n",
    "well_profiles"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-annotate-intro",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 2: Annotate — Adding Experimental Context\n",
    "\n",
    "After aggregation, each row represents a well, but the DataFrame only records *where* the\n",
    "measurement came from (plate and well position) — not *what* biological condition was in\n",
    "that well.\n",
    "\n",
    "The connection between well positions and experimental conditions is stored in a\n",
    "**plate map** — a lookup table prepared by the researcher that records which compound,\n",
    "genetic perturbation, concentration, or other variable was assigned to each well.\n",
    "\n",
    "`annotate()` merges the plate map onto the well profiles, adding a `Metadata_` column\n",
    "for each piece of experimental information.\n",
    "\n",
    "> In real experiments, plate maps are usually supplied as CSV files from a Laboratory\n",
    "> Information Management System (LIMS) or prepared manually. Here we create one directly\n",
    "> as a DataFrame to show its structure.\n",
    "\n",
    "First, let us define the plate map:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "cell-platemap",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-26T20:08:27.080288Z",
     "iopub.status.busy": "2026-05-26T20:08:27.079979Z",
     "iopub.status.idle": "2026-05-26T20:08:27.084991Z",
     "shell.execute_reply": "2026-05-26T20:08:27.084683Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Plate map:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>well_position</th>\n",
       "      <th>treatment</th>\n",
       "      <th>cell_line</th>\n",
       "      <th>concentration_um</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>B02</td>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>C02</td>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>B03</td>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>C03</td>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B04</td>\n",
       "      <td>Compound_B</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>C04</td>\n",
       "      <td>Compound_B</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  well_position   treatment cell_line  concentration_um\n",
       "0           B02        DMSO      HeLa               0.0\n",
       "1           C02        DMSO      HeLa               0.0\n",
       "2           B03  Compound_A      HeLa              10.0\n",
       "3           C03  Compound_A      HeLa              10.0\n",
       "4           B04  Compound_B      HeLa              10.0\n",
       "5           C04  Compound_B      HeLa              10.0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The plate map records the biological condition in each well position.\n",
    "# The same layout was used for both plates in this experiment.\n",
    "platemap = pd.DataFrame({\n",
    "    # 'well_position' is the standard column name expected by annotate()\n",
    "    \"well_position\": [\"B02\", \"C02\", \"B03\", \"C03\", \"B04\", \"C04\"],\n",
    "    \"treatment\": [\n",
    "        \"DMSO\",\n",
    "        \"DMSO\",\n",
    "        \"Compound_A\",\n",
    "        \"Compound_A\",\n",
    "        \"Compound_B\",\n",
    "        \"Compound_B\",\n",
    "    ],\n",
    "    \"cell_line\": [\"HeLa\"] * 6,\n",
    "    \"concentration_um\": [0.0, 0.0, 10.0, 10.0, 10.0, 10.0],\n",
    "})\n",
    "\n",
    "print(\"Plate map:\")\n",
    "platemap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "cell-annotate",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-26T20:08:27.086330Z",
     "iopub.status.busy": "2026-05-26T20:08:27.086227Z",
     "iopub.status.idle": "2026-05-26T20:08:27.096375Z",
     "shell.execute_reply": "2026-05-26T20:08:27.096081Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Annotated profiles: 12 rows x 16 columns\n",
      "\n",
      "Well-to-treatment mapping after annotation:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metadata_Plate</th>\n",
       "      <th>Metadata_Well</th>\n",
       "      <th>Metadata_treatment</th>\n",
       "      <th>Metadata_cell_line</th>\n",
       "      <th>Metadata_concentration_um</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Plate_2</td>\n",
       "      <td>B02</td>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B03</td>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Plate_2</td>\n",
       "      <td>B03</td>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B04</td>\n",
       "      <td>Compound_B</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Plate_2</td>\n",
       "      <td>B04</td>\n",
       "      <td>Compound_B</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>C02</td>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Plate_2</td>\n",
       "      <td>C02</td>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>C03</td>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Plate_2</td>\n",
       "      <td>C03</td>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>C04</td>\n",
       "      <td>Compound_B</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Plate_2</td>\n",
       "      <td>C04</td>\n",
       "      <td>Compound_B</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Metadata_Plate Metadata_Well Metadata_treatment Metadata_cell_line  \\\n",
       "0         Plate_1           B02               DMSO               HeLa   \n",
       "1         Plate_2           B02               DMSO               HeLa   \n",
       "4         Plate_1           B03         Compound_A               HeLa   \n",
       "5         Plate_2           B03         Compound_A               HeLa   \n",
       "8         Plate_1           B04         Compound_B               HeLa   \n",
       "9         Plate_2           B04         Compound_B               HeLa   \n",
       "2         Plate_1           C02               DMSO               HeLa   \n",
       "3         Plate_2           C02               DMSO               HeLa   \n",
       "6         Plate_1           C03         Compound_A               HeLa   \n",
       "7         Plate_2           C03         Compound_A               HeLa   \n",
       "10        Plate_1           C04         Compound_B               HeLa   \n",
       "11        Plate_2           C04         Compound_B               HeLa   \n",
       "\n",
       "    Metadata_concentration_um  \n",
       "0                         0.0  \n",
       "1                         0.0  \n",
       "4                        10.0  \n",
       "5                        10.0  \n",
       "8                        10.0  \n",
       "9                        10.0  \n",
       "2                         0.0  \n",
       "3                         0.0  \n",
       "6                        10.0  \n",
       "7                        10.0  \n",
       "10                       10.0  \n",
       "11                       10.0  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# annotate() joins the plate map onto the well profiles.\n",
    "#\n",
    "# join_on specifies [platemap_column, profiles_column] used for matching wells.\n",
    "# add_metadata_id_to_platemap=True prepends 'Metadata_' to all plate map column names,\n",
    "# following the pycytominer convention that all non-feature columns start with 'Metadata_'.\n",
    "annotated_profiles = annotate(\n",
    "    profiles=well_profiles,\n",
    "    platemap=platemap,\n",
    "    join_on=[\"Metadata_well_position\", \"Metadata_Well\"],\n",
    "    add_metadata_id_to_platemap=True,\n",
    ")\n",
    "\n",
    "print(\n",
    "    f\"Annotated profiles: {annotated_profiles.shape[0]} rows x {annotated_profiles.shape[1]} columns\"\n",
    ")\n",
    "print()\n",
    "\n",
    "# Show the metadata columns that were added\n",
    "meta_cols = [\n",
    "    \"Metadata_Plate\",\n",
    "    \"Metadata_Well\",\n",
    "    \"Metadata_treatment\",\n",
    "    \"Metadata_cell_line\",\n",
    "    \"Metadata_concentration_um\",\n",
    "]\n",
    "print(\"Well-to-treatment mapping after annotation:\")\n",
    "annotated_profiles[meta_cols].drop_duplicates().sort_values(\"Metadata_Well\")"
   ]
  },
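  {
   "cell_type": "markdown",
   "id": "cell-annotate-sketch",
   "metadata": {},
   "source": [
    "To build intuition for the annotation step, the join can be pictured as an ordinary pandas\n",
    "merge. The sketch below is an illustration under stated assumptions, not pycytominer's\n",
    "implementation: the `toy_` DataFrames and `feature_x` are hypothetical, and the inner join\n",
    "is an assumption chosen for this example.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical well profiles from two plates that share one layout\n",
    "toy_profiles = pd.DataFrame({\n",
    "    'Metadata_Plate': ['Plate_1', 'Plate_1', 'Plate_2', 'Plate_2'],\n",
    "    'Metadata_Well': ['B02', 'B03', 'B02', 'B03'],\n",
    "    'feature_x': [0.1, 0.9, 0.2, 1.1],\n",
    "})\n",
    "\n",
    "# Hypothetical plate map with unprefixed column names\n",
    "toy_platemap = pd.DataFrame({\n",
    "    'well_position': ['B02', 'B03'],\n",
    "    'treatment': ['DMSO', 'Compound_A'],\n",
    "})\n",
    "\n",
    "# Prefix the plate map columns so they are recognised as metadata\n",
    "toy_platemap_prefixed = toy_platemap.add_prefix('Metadata_')\n",
    "\n",
    "# Match plate map rows to profile rows on well position, then drop the duplicate key\n",
    "toy_merged = toy_platemap_prefixed.merge(\n",
    "    toy_profiles,\n",
    "    left_on='Metadata_well_position',\n",
    "    right_on='Metadata_Well',\n",
    "    how='inner',\n",
    ").drop(columns='Metadata_well_position')\n",
    "\n",
    "print(toy_merged)\n",
    "```\n",
    "\n",
    "Each plate map row is matched to every profile row with the same well position, which is why\n",
    "one plate map can annotate several plates that share a layout."
   ]
  },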
  {
   "cell_type": "markdown",
   "id": "cell-normalize-intro",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 3: Normalize — Removing Technical Variation\n",
    "\n",
    "CellProfiler features differ widely in scale and units. For example:\n",
    "- `Cells_AreaShape_Area` might range from 200 to 1,000 (pixels²)\n",
    "- `Nuclei_Intensity_MeanIntensity_DNA` might range from 0.1 to 0.9 (arbitrary fluorescence units)\n",
    "\n",
    "Without normalization, features with large absolute values would dominate any downstream\n",
    "distance calculation or machine-learning model, regardless of whether they carry biological signal.\n",
    "Normalization also corrects for **plate-to-plate technical variation** caused by differences\n",
    "in staining efficiency, imaging conditions, or cell density between experimental batches.\n",
    "\n",
    "`normalize()` rescales each feature using the **distribution of control wells** as a reference.\n",
    "The default method (`'standardize'`) subtracts the control mean and divides by the control\n",
    "standard deviation — a standard z-score transformation. After normalization, control wells\n",
    "cluster around zero, and treated wells are expressed in units of\n",
    "*standard deviations away from the control*.\n",
    "\n",
    "> **What is a vehicle control?**\n",
 DMSO (dimethyl">
    "> DMSO (dimethyl sulfoxide) is a widely used solvent for small-molecule compounds.\n",
    "> Wells that receive DMSO at the same concentration as the treated wells, but without any\n",
    "> active compound, define the biological baseline: what cells look like when only the\n",
    "> solvent has been applied.\n",
    "\n",
    "| Parameter | Description |\n",
    "|-----------|-------------|\n",
    "| `samples` | A pandas query string selecting the control wells used to compute normalization statistics |\n",
    "| `method` | `'standardize'` (z-score), `'robustize'` (median-based), or `'mad_robustize'` |"
   ]
  },
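  {
   "cell_type": "markdown",
   "id": "cell-normalize-sketch",
   "metadata": {},
   "source": [
    "The control-based z-score described above can be written out by hand for a single feature.\n",
    "This is a minimal sketch, not pycytominer's code: the `toy` DataFrame and its values are\n",
    "hypothetical, and the population standard deviation (`ddof=0`) is an assumption made here\n",
    "to match the usual scikit-learn convention.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data: two control wells and two treated wells\n",
    "toy = pd.DataFrame({\n",
    "    'Metadata_treatment': ['DMSO', 'DMSO', 'Compound_A', 'Compound_A'],\n",
    "    'Cells_AreaShape_Area': [480.0, 520.0, 670.0, 680.0],\n",
    "})\n",
    "\n",
    "# Reference statistics come from the control wells only\n",
    "controls = toy[toy['Metadata_treatment'] == 'DMSO']\n",
    "ctrl_mean = controls['Cells_AreaShape_Area'].mean()\n",
    "ctrl_std = controls['Cells_AreaShape_Area'].std(ddof=0)  # assumption: population std\n",
    "\n",
    "# Every well, control or treated, is rescaled with the control statistics\n",
    "toy['zscore'] = (toy['Cells_AreaShape_Area'] - ctrl_mean) / ctrl_std\n",
    "\n",
    "print(toy)\n",
    "```\n",
    "\n",
    "The control wells land symmetrically around zero, while the treated wells are reported as a\n",
    "number of control standard deviations away from the control mean."
   ]
  },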
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "cell-normalize",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-26T20:08:27.097819Z",
     "iopub.status.busy": "2026-05-26T20:08:27.097709Z",
     "iopub.status.idle": "2026-05-26T20:08:27.105943Z",
     "shell.execute_reply": "2026-05-26T20:08:27.105680Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Normalized profiles: (12, 16)\n",
      "\n",
      "Mean feature values in DMSO wells (should be ≈ 0.0 after normalization):\n",
      "Cells_AreaShape_Area                     0.0\n",
      "Cells_AreaShape_BoundingBoxArea         -0.0\n",
      "Cells_AreaShape_EulerNumber              0.0\n",
      "Cells_AreaShape_Eccentricity             0.0\n",
      "Cells_Intensity_MeanIntensity_Mito       0.0\n",
      "Cells_Texture_Correlation_RNA_3_0_256    0.0\n",
      "Cytoplasm_AreaShape_Area                -0.0\n",
      "Cytoplasm_Intensity_MeanIntensity_AGP    0.0\n",
      "Nuclei_AreaShape_Area                    0.0\n",
      "Nuclei_AreaShape_Eccentricity           -0.0\n",
      "Nuclei_Intensity_MeanIntensity_DNA      -0.0\n"
     ]
    }
   ],
   "source": [
    "normalized_profiles = normalize(\n",
    "    profiles=annotated_profiles,\n",
    "    features=\"infer\",\n",
    "    meta_features=\"infer\",\n",
    "    samples=\"Metadata_treatment == 'DMSO'\",  # use DMSO wells as the normalization reference\n",
    "    method=\"standardize\",\n",
    ")\n",
    "\n",
    "print(f\"Normalized profiles: {normalized_profiles.shape}\")\n",
    "\n",
    "# Sanity check: DMSO wells should be centred near 0 after normalization\n",
    "feature_cols = [c for c in normalized_profiles.columns if not c.startswith(\"Metadata_\")]\n",
    "dmso_rows = normalized_profiles[\"Metadata_treatment\"] == \"DMSO\"\n",
    "\n",
    "print()\n",
    "print(\"Mean feature values in DMSO wells (should be ≈ 0.0 after normalization):\")\n",
    "print(normalized_profiles.loc[dmso_rows, feature_cols].mean().round(3).to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-featsel-intro",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 4: Feature Selection — Keeping Only Informative Features\n",
    "\n",
    "A typical high-content microscopy dataset contains 500–1,500 features per well. Many of these features\n",
    "carry little or no biological information and can actively harm downstream analyses by\n",
    "adding noise:\n",
    "\n",
    "- **Constant or near-constant features** have the same value in every well and therefore\n",
    "  cannot distinguish one treatment from another.\n",
    "- **Highly correlated features** (e.g., `Cells_AreaShape_Area` and\n",
    "  `Cells_AreaShape_BoundingBoxArea`) measure essentially the same property through slightly\n",
    "  different calculations. Retaining both adds redundancy without adding biological information.\n",
    "- **Blocklisted features** are features empirically identified as technically unreliable across\n",
    "  many published CellProfiler pipelines.\n",
    "\n",
    "`feature_select()` applies these removal criteria in sequence:\n",
    "\n",
    "| Operation | What it removes |\n",
    "|-----------|----------------|\n",
    "| `'variance_threshold'` | Features with variance below a minimum threshold (effectively constant) |\n",
    "| `'correlation_threshold'` | One feature from every highly correlated pair (Pearson *r* > 0.9) |\n",
    "| `'blocklist'` | Features on the community-curated Pycytominer blocklist |\n",
    "\n",
    "Recall from the data generation step that we deliberately included two uninformative features:\n",
    "\n",
    "- `Cells_AreaShape_EulerNumber` — constant (= 1) for all cells → removed by `variance_threshold`\n",
    "- `Cells_AreaShape_Area` / `Cells_AreaShape_BoundingBoxArea` — nearly perfectly correlated\n",
    "  (r ≈ 0.99) → one of the pair is removed by `correlation_threshold`\n",
    "\n",
    "> **Which of the correlated pair is kept?**\n",
    "> The algorithm retains the feature with the lower average correlation to all other features."
   ]
  },
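  {
   "cell_type": "markdown",
   "id": "cell-featsel-sketch",
   "metadata": {},
   "source": [
    "The rule for correlated pairs can be illustrated with plain pandas. The sketch below is a\n",
    "simplified stand-in, not pycytominer's implementation, which may differ in detail: the `toy`\n",
    "DataFrame, its column names, and the 0.9 cutoff written inline are all assumptions made for\n",
    "illustration.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy profiles: 'area' and 'bbox_area' are nearly collinear\n",
    "toy = pd.DataFrame({\n",
    "    'area': [400.0, 500.0, 600.0, 700.0],\n",
    "    'bbox_area': [521.0, 649.0, 781.0, 910.0],\n",
    "    'eccentricity': [0.52, 0.58, 0.51, 0.56],\n",
    "})\n",
    "\n",
    "abs_corr = toy.corr(method='pearson').abs()\n",
    "\n",
    "# Total absolute correlation with every other feature (subtract the diagonal 1.0)\n",
    "corr_sum = abs_corr.sum() - 1.0\n",
    "\n",
    "to_drop = set()\n",
    "cols = list(abs_corr.columns)\n",
    "for i, a in enumerate(cols):\n",
    "    for b in cols[i + 1:]:\n",
    "        if abs_corr.loc[a, b] > 0.9:\n",
    "            # Drop the member of the pair that is more correlated with everything else\n",
    "            to_drop.add(a if corr_sum[a] > corr_sum[b] else b)\n",
    "\n",
    "toy_selected = toy.drop(columns=sorted(to_drop))\n",
    "print(list(toy_selected.columns))\n",
    "```\n",
    "\n",
    "Only one of the two area measurements survives, and `eccentricity` is kept because it is not\n",
    "strongly correlated with either of them."
   ]
  },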
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "cell-featsel",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-26T20:08:27.108121Z",
     "iopub.status.busy": "2026-05-26T20:08:27.107949Z",
     "iopub.status.idle": "2026-05-26T20:08:27.118868Z",
     "shell.execute_reply": "2026-05-26T20:08:27.118522Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Features before selection: 11\n",
      "Features after  selection: 9\n",
      "Features removed:          2\n",
      "\n",
      "Retained features:\n",
      "  Cells_AreaShape_BoundingBoxArea\n",
      "  Cells_AreaShape_Eccentricity\n",
      "  Cells_Intensity_MeanIntensity_Mito\n",
      "  Cells_Texture_Correlation_RNA_3_0_256\n",
      "  Cytoplasm_AreaShape_Area\n",
      "  Cytoplasm_Intensity_MeanIntensity_AGP\n",
      "  Nuclei_AreaShape_Area\n",
      "  Nuclei_AreaShape_Eccentricity\n",
      "  Nuclei_Intensity_MeanIntensity_DNA\n"
     ]
    }
   ],
   "source": [
    "selected_profiles = feature_select(\n",
    "    profiles=normalized_profiles,\n",
    "    features=\"infer\",\n",
    "    operation=[\"variance_threshold\", \"correlation_threshold\", \"blocklist\"],\n",
    ")\n",
    "\n",
    "feature_cols_before = [\n",
    "    c for c in normalized_profiles.columns if not c.startswith(\"Metadata_\")\n",
    "]\n",
    "feature_cols_after = [\n",
    "    c for c in selected_profiles.columns if not c.startswith(\"Metadata_\")\n",
    "]\n",
    "\n",
    "print(f\"Features before selection: {len(feature_cols_before)}\")\n",
    "print(f\"Features after  selection: {len(feature_cols_after)}\")\n",
    "print(\n",
    "    f\"Features removed:          {len(feature_cols_before) - len(feature_cols_after)}\"\n",
    ")\n",
    "print()\n",
    "print(\"Retained features:\")\n",
    "for col in sorted(feature_cols_after):\n",
    "    print(f\"  {col}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-consensus-intro",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 5: Consensus — Collapsing Replicates\n",
    "\n",
    "At this point we have one profile per well. Because our experiment was run across **two plates**\n",
    "(biological replicates), we have four profiles for each treatment condition — two wells per plate\n",
    "times two plates. Some downstream analyses expect a single, definitive profile per condition.\n",
    "\n",
    "`consensus()` collapses replicate profiles into one **consensus profile** per treatment group\n",
    "by computing the median across all replicates.\n",
    "\n",
    "Using the consensus profile instead of individual replicates:\n",
    "\n",
    "- Reduces the influence of plate-specific technical artefacts that survived normalization\n",
    "- Produces a lower-variance, higher-confidence representation of the treatment effect\n",
    "- Simplifies downstream analysis by reducing the number of rows to cluster or classify\n",
    "\n",
    "The `replicate_columns` parameter specifies which metadata columns **define a unique condition**.\n",
    "Profiles that share the same values in these columns are treated as replicates and collapsed\n",
    "into a single consensus row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "cell-consensus",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-26T20:08:27.121466Z",
     "iopub.status.busy": "2026-05-26T20:08:27.121301Z",
     "iopub.status.idle": "2026-05-26T20:08:27.128233Z",
     "shell.execute_reply": "2026-05-26T20:08:27.127943Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Profiles before consensus: 12 rows  (one per well)\n",
      "Profiles after  consensus: 3 rows  (one per treatment)\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metadata_treatment</th>\n",
       "      <th>Metadata_cell_line</th>\n",
       "      <th>Metadata_concentration_um</th>\n",
       "      <th>Cells_AreaShape_BoundingBoxArea</th>\n",
       "      <th>Cells_AreaShape_Eccentricity</th>\n",
       "      <th>Cells_Intensity_MeanIntensity_Mito</th>\n",
       "      <th>Cells_Texture_Correlation_RNA_3_0_256</th>\n",
       "      <th>Cytoplasm_AreaShape_Area</th>\n",
       "      <th>Cytoplasm_Intensity_MeanIntensity_AGP</th>\n",
       "      <th>Nuclei_AreaShape_Area</th>\n",
       "      <th>Nuclei_AreaShape_Eccentricity</th>\n",
       "      <th>Nuclei_Intensity_MeanIntensity_DNA</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "      <td>11.558693</td>\n",
       "      <td>-0.724044</td>\n",
       "      <td>0.589681</td>\n",
       "      <td>0.794162</td>\n",
       "      <td>0.610831</td>\n",
       "      <td>0.353012</td>\n",
       "      <td>0.313659</td>\n",
       "      <td>-0.219728</td>\n",
       "      <td>-0.049983</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Compound_B</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "      <td>-7.189962</td>\n",
       "      <td>-1.316052</td>\n",
       "      <td>0.624929</td>\n",
       "      <td>-0.656629</td>\n",
       "      <td>-0.462245</td>\n",
       "      <td>-0.835377</td>\n",
       "      <td>-0.379430</td>\n",
       "      <td>-0.302806</td>\n",
       "      <td>-0.737681</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.148573</td>\n",
       "      <td>0.155330</td>\n",
       "      <td>0.160923</td>\n",
       "      <td>0.041498</td>\n",
       "      <td>-0.131285</td>\n",
       "      <td>-0.446743</td>\n",
       "      <td>0.228724</td>\n",
       "      <td>0.140684</td>\n",
       "      <td>0.097928</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Metadata_treatment Metadata_cell_line  Metadata_concentration_um  \\\n",
       "0         Compound_A               HeLa                       10.0   \n",
       "1         Compound_B               HeLa                       10.0   \n",
       "2               DMSO               HeLa                        0.0   \n",
       "\n",
       "   Cells_AreaShape_BoundingBoxArea  Cells_AreaShape_Eccentricity  \\\n",
       "0                        11.558693                     -0.724044   \n",
       "1                        -7.189962                     -1.316052   \n",
       "2                         0.148573                      0.155330   \n",
       "\n",
       "   Cells_Intensity_MeanIntensity_Mito  Cells_Texture_Correlation_RNA_3_0_256  \\\n",
       "0                            0.589681                               0.794162   \n",
       "1                            0.624929                              -0.656629   \n",
       "2                            0.160923                               0.041498   \n",
       "\n",
       "   Cytoplasm_AreaShape_Area  Cytoplasm_Intensity_MeanIntensity_AGP  \\\n",
       "0                  0.610831                               0.353012   \n",
       "1                 -0.462245                              -0.835377   \n",
       "2                 -0.131285                              -0.446743   \n",
       "\n",
       "   Nuclei_AreaShape_Area  Nuclei_AreaShape_Eccentricity  \\\n",
       "0               0.313659                      -0.219728   \n",
       "1              -0.379430                      -0.302806   \n",
       "2               0.228724                       0.140684   \n",
       "\n",
       "   Nuclei_Intensity_MeanIntensity_DNA  \n",
       "0                           -0.049983  \n",
       "1                           -0.737681  \n",
       "2                            0.097928  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "consensus_profiles = consensus(\n",
    "    profiles=selected_profiles,\n",
    "    replicate_columns=[\n",
    "        \"Metadata_treatment\",\n",
    "        \"Metadata_cell_line\",\n",
    "        \"Metadata_concentration_um\",\n",
    "    ],\n",
    "    operation=\"median\",\n",
    "    features=\"infer\",\n",
    ")\n",
    "\n",
    "print(f\"Profiles before consensus: {selected_profiles.shape[0]} rows  (one per well)\")\n",
    "print(\n",
    "    f\"Profiles after  consensus: {consensus_profiles.shape[0]} rows  (one per treatment)\"\n",
    ")\n",
    "print()\n",
    "consensus_profiles"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-summary",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Summary\n",
    "\n",
    "You have now processed a Cell Painting dataset through the complete Pycytominer pipeline.\n",
    "Here is a recap of what each step accomplished:\n",
    "\n",
    "| Step | Function | Rows | Features |\n",
    "|------|----------|------|----------|\n",
    "| Raw single-cell data | — | 1,200 cells | 11 |\n",
    "| After `aggregate()` | pool cells per well | 12 wells | 11 |\n",
    "| After `annotate()` | add treatment labels | 12 wells | 11 |\n",
    "| After `normalize()` | z-score vs DMSO | 12 wells | 11 |\n",
    "| After `feature_select()` | remove uninformative features | 12 wells | 9 |\n",
    "| After `consensus()` | collapse replicates | **3 conditions** | **9** |\n",
    "\n",
    "The final `consensus_profiles` DataFrame contains one row per biological treatment condition\n",
    "and nine informative morphological features — a compact, analysis-ready representation of\n",
    "how each treatment changed the appearance of cells.\n",
    "\n",
    "### Saving Your Profiles\n",
    "\n",
    "Pycytominer provides `cyto_utils.output()` as its canonical function for writing profiles to disk — the same function each pipeline step calls internally when you pass an `output_file` argument. It handles compression, format selection, and file naming in one call, and supports four output types:\n",
    "\n",
    "| `output_type` | Extension | Best for |\n",
    "|---|---|---|\n",
    "| `\"csv\"` (default) | `.csv.gz` | Gzip-compressed, readable by any tool |\n",
    "| `\"parquet\"` | `.parquet` | Faster reads and smaller files for large screens |\n",
    "| `\"anndata_h5ad\"` | `.h5ad` | [AnnData](https://anndata.readthedocs.io/) / [scanpy](https://scanpy.readthedocs.io/) workflows |\n",
    "| `\"anndata_zarr\"` | `.zarr` | Cloud-native AnnData storage |\n",
    "\n",
    "```python\n",
    "from pycytominer.cyto_utils import output\n",
    "\n",
    "# Gzip-compressed CSV (default) — small footprint, readable by any tool\n",
    "output(\n",
    "    df=consensus_profiles,\n",
    "    output_filename=\"consensus_profiles.csv.gz\",\n",
    "    output_type=\"csv\",\n",
    ")\n",
    "\n",
    "# Parquet — fast reads and efficient storage for large screens\n",
    "output(\n",
    "    df=consensus_profiles,\n",
    "    output_filename=\"consensus_profiles.parquet\",\n",
    "    output_type=\"parquet\",\n",
    ")\n",
    "\n",
    "# AnnData HDF5 — ready for scanpy, scverse, and single-cell workflows\n",
    "output(\n",
    "    df=consensus_profiles,\n",
    "    output_filename=\"consensus_profiles.h5ad\",\n",
    "    output_type=\"anndata_h5ad\",\n",
    ")\n",
    "```\n",
    "\n",
 **Pro tip:** Every pipeline function">
    "> **Pro tip:** Every pipeline function accepts an `output_file` argument that writes the result directly to disk and returns the file path instead of a DataFrame. This saves a separate `output()` call and avoids keeping large intermediate DataFrames around in your session:\n",
    ">\n",
    "> ```python\n",
    "> consensus_profiles = consensus(\n",
    ">     profiles=selected_profiles,\n",
    ">     replicate_columns=[\"Metadata_treatment\", \"Metadata_cell_line\", \"Metadata_concentration_um\"],\n",
    ">     operation=\"median\",\n",
    ">     features=\"infer\",\n",
    ">     output_file=\"consensus_profiles.parquet\",\n",
    ">     output_type=\"parquet\",\n",
    "> )\n",
    "> # consensus_profiles is now the file path string, not a DataFrame\n",
    "> ```\n",
    "\n",
    "### What to Do Next\n",
    "\n",
    "With morphology profiles in hand, common next steps include:\n",
    "\n",
    "- **Phenotypic clustering** — group treatments by morphological similarity using\n",
    "  hierarchical clustering or UMAP\n",
    "- **Similarity analysis** — identify compounds that produce the same cellular phenotype\n",
    "  using correlation or cosine similarity metrics\n",
    "- **Classification** — train machine-learning models to predict a compound's mechanism\n",
    "  of action from its morphological profile\n",
    "- **Dimensionality reduction** — visualise the morphological space of an entire compound\n",
    "  library in two dimensions using PCA or UMAP\n",
    "- **Hit calling** — identify which compounds produce a statistically significant\n",
    "  morphological change relative to controls.\n",
    "  [copairs](https://github.com/cytomining/copairs) computes mean Average Precision (mAP)\n",
    "  to score phenotypic activity and consistency at the well/profile level;\n",
    "  [Buscar](https://github.com/WayScience/Buscar) operates directly on single-cell\n",
    "  distributions to capture cellular heterogeneity and flag off-target effects\n",
    "\n",
    "### Further Reading\n",
    "\n",
    "- [Pycytominer API Reference](https://pycytominer.readthedocs.io/en/latest/modules.html) —\n",
    "  full documentation for every function used in this tutorial\n",
    "- [CytoTable tutorial](https://cytomining.github.io/CytoTable/tutorials/cellprofiler_to_parquet.html) —\n",
    "  how to convert raw CellProfiler output into the Parquet format that Pycytominer reads\n",
    "- [Cell Painting Gallery](https://cellpaintinggallery.org/) —\n",
    "  a public repository of Cell Painting datasets ready for analysis\n",
    "\n",
    "---\n",
    "\n",
    "### Pycytominer in the Wild\n",
    "\n",
    "Pycytominer is used across some of the largest and most impactful image-based profiling initiatives in the world.\n",
    "Here are a few to spark your curiosity:\n",
    "\n",
    "---\n",
    "\n",
    "**🧬 [JUMP-CP](https://jump-cellpainting.broadinstitute.org/) — Joint Undertaking for Morphological Profiling**\n",
    "\n",
    "The largest public Cell Painting dataset ever produced, generated by a consortium of 13 pharmaceutical companies and academic institutions (including AstraZeneca, Bayer, Pfizer, Merck KGaA, and the Broad Institute). JUMP-CP profiled over 116,000 compounds and ~15,000 genetic perturbations, with all profiles processed using Pycytominer. The resulting resource is used to predict compound activity, identify drug mechanisms, and match small molecules to disease phenotypes — at industrial scale.\n",
    "\n",
    "---\n",
    "\n",
    "**🔬 [LINCS Cell Painting](https://github.com/broadinstitute/lincs-cell-painting) — Library of Integrated Network-based Cellular Signatures**\n",
    "\n",
    "An NIH-funded initiative that profiled 1,571 bioactive compounds across six doses and five replicates in A549 lung cancer cells. Pycytominer was adopted as the **primary profiling tool** for this dataset, producing normalized and feature-selected profiles (Levels 3–5) that are publicly available for download. LINCS demonstrated that image-based profiles could serve as a systematic, reproducible reference map of cellular responses to chemical perturbation.\n",
    "\n",
    "---\n",
    "\n",
    "**🌍 [EU-OPENSCREEN](https://www.eu-openscreen.eu/) — European Chemical Biology Research Infrastructure**\n",
    "\n",
    "A distributed pan-European research infrastructure spanning 30 partner sites across eight countries. EU-OPENSCREEN has integrated Cell Painting into its screening platform, enabling European academic and industry researchers to access high-content imaging and morphological profiling as a service. Their contributions to the JUMP-CP consortium extended the reach of image-based profiling into the broader European drug discovery community.\n",
    "\n",
    "---\n",
    "\n",
    "**🖼️ [Cell Painting Gallery](https://registry.opendata.aws/cellpainting-gallery/) — Broad Institute Open Dataset Collection**\n",
    "\n",
    "A growing public repository of Cell Painting datasets, hosted on AWS as open data and maintained by the Carpenter–Singh and Cimini labs at the Broad Institute. The gallery spans tens of thousands of small-molecule treatments across diverse cell lines and experimental designs — all freely accessible and ready for analysis. It is the canonical reference point for new Cell Painting datasets produced by the community.\n",
    "\n",
    "---\n",
    "\n",
    "> These resources process their raw CellProfiler outputs through the same Pycytominer pipeline\n",
    "> you just ran — the only difference is scale.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}