{
"cells": [
{
"cell_type": "markdown",
"id": "cell-title",
"metadata": {},
"source": [
"# Introduction to image-based profiling with Pycytominer\n",
"\n",
"**Welcome!** This tutorial introduces [Pycytominer](https://pycytominer.readthedocs.io), a Python library\n",
"for processing image-based profiling data from high-content microscopy experiments.\n",
"\n",
"## What You Will Learn\n",
"\n",
"By the end of this tutorial, you will know how to:\n",
"\n",
"1. **Aggregate** thousands of single-cell measurements into one representative profile per experimental well\n",
"2. **Annotate** profiles with experimental metadata, such as which compound was applied to each well\n",
"3. **Normalize** feature values to remove plate-to-plate technical variation\n",
"4. **Select features** to remove uninformative or redundant measurements\n",
"5. **Build consensus profiles** that collapse replicate experiments into a single representative vector\n",
"\n",
"## Background: What Is High-Content Microscopy?\n",
"\n",
"High-content microscopy measures hundreds to thousands of informative phenotypic features that\n",
"represent the morphology state of cells under different biological conditions (e.g., healthy vs. disease). High-content microscopy\n",
"is often paired with high-throughput screening experiments that perturb cells with small-molecule compounds or genetic perturbations.\n",
"\n",
"In a typical experiment:\n",
"\n",
"1. Cells are grown in multi-well plates and treated with a panel of perturbations.\n",
"2. Optionally apply fluorescence dyes to stain distinct cellular compartments.\n",
"3. Automated microscopes capture hundreds of images per plate.\n",
"\n",
"\n",
"\n",
"4. Image analysis software (such as [CellProfiler](https://cellprofiler.org/)) extracts several\n",
" thousand numerical features per detected cell, describing each compartment's and channel's shape, texture,\n",
" and fluorescence intensity.\n",
"\n",
"\n",
"\n",
"A single experiment can generate measurements from **millions of individual cells**, spanning\n",
"hundreds to thousands of features. The central challenge is transforming this raw, high-dimensional\n",
"data into clean, interpretable **image-based profiles** — compact, comparable vectors that\n",
"summarise how each condition changed the appearance of cells.\n",
"\n",
"That is exactly what Pycytominer does ([Serrano et al., 2025](https://doi.org/10.1038/s41592-025-02611-8)), which has been grounded in image-based profiling methods established over the past decade ([Caicedo et al., 2017](https://doi.org/10.1038/nmeth.4397), [Serrano et al. 2026](https://doi.org/10.1038/s44320-026-00197-7))."
]
},
{
"cell_type": "markdown",
"id": "cell-prereqs",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"This tutorial assumes you have:\n",
"\n",
"- [Installed Pycytominer](https://pycytominer.readthedocs.io)\n",
"- Familiarity with [pandas DataFrames](https://pandas.pydata.org/)\n",
"- *(Optional)* Completed the\n",
" [CytoTable tutorial](https://cytomining.github.io/CytoTable/tutorials/cellprofiler_to_parquet.html),\n",
" which shows how to convert raw CellProfiler output into the Parquet format that\n",
" Pycytominer reads as input.\n",
"\n",
"## The Pycytominer Pipeline at a Glance\n",
"\n",
"Raw single-cell data travels through five sequential steps:\n",
"\n",
"| Step | Pycytominer function | What changes |\n",
"|------|----------------------|--------------|\n",
"| 1. Aggregate | `aggregate()` | One row per cell → one row per well |\n",
"| 2. Annotate | `annotate()` | Well positions → biological treatment labels |\n",
"| 3. Normalize | `normalize()` | Raw feature values → z-scores relative to controls |\n",
"| 4. Feature Select| `feature_select()`| Hundreds of features → only the informative ones |\n",
"| 5. Consensus | `consensus()` | One row per well → one row per treatment condition |\n",
"\n",
"At the end, you have a compact, analysis-ready table where each row is a unique biological\n",
"condition and each column is an informative morphological measurement."
]
},
{
"cell_type": "raw",
"id": "cell-pipeline-diagram",
"metadata": {
"raw_mimetype": "text/restructuredtext"
},
"source": [
".. mermaid::\n",
" :align: center\n",
"\n",
" flowchart TD\n",
" input[\"🔬 Single-cell data
1,200 cells, 11 features\"]\n",
" agg[\"🪣 aggregate()
Pool cells per well, 12 profiles\"]\n",
" ann[\"🏷️ annotate()
Join plate map, add treatment labels\"]\n",
" nor[\"⚖️ normalize()
Z-score vs DMSO controls\"]\n",
" fea[\"✂️ feature_select()
Remove redundant, 9 of 11 features kept\"]\n",
" con[\"🤝 consensus()
Median across plates, 3 conditions\"]\n",
" output[\"📊 Morphological profiles
3 conditions, 9 features\"]\n",
"\n",
" input --> agg --> ann --> nor --> fea --> con --> output\n",
"\n",
" style input fill:#f0d9fa,stroke:#88239A,color:#111\n",
" style output fill:#f0d9fa,stroke:#88239A,color:#111\n",
" style agg fill:#ffffff,stroke:#88239A,color:#111\n",
" style ann fill:#ffffff,stroke:#88239A,color:#111\n",
" style nor fill:#ffffff,stroke:#88239A,color:#111\n",
" style fea fill:#ffffff,stroke:#88239A,color:#111\n",
" style con fill:#ffffff,stroke:#88239A,color:#111"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "cell-imports",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T20:08:25.462328Z",
"iopub.status.busy": "2026-05-26T20:08:25.462230Z",
"iopub.status.idle": "2026-05-26T20:08:27.029270Z",
"shell.execute_reply": "2026-05-26T20:08:27.028920Z"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from pycytominer import aggregate, annotate, consensus, feature_select, normalize\n",
"\n",
"# Fix the random seed so this tutorial produces identical results every time it is run\n",
"np.random.seed(42)"
]
},
{
"cell_type": "markdown",
"id": "cell-data-intro",
"metadata": {},
"source": [
"## Tutorial Data\n",
"\n",
"In a real workflow, you would start from the Parquet file produced by\n",
"[CytoTable](https://cytomining.github.io/CytoTable/tutorials/cellprofiler_to_parquet.html):\n",
"\n",
"```python\n",
"from pycytominer.cyto_utils import load_profiles\n",
"\n",
"# Load single-cell measurements exported by CytoTable\n",
"single_cells = load_profiles(\"outputs/examplehuman.parquet\")\n",
"```\n",
"\n",
"For this tutorial we generate a small **synthetic dataset** that mirrors the exact structure of\n",
"a real high-content microscopy experiment. The column names, data types, and naming conventions are\n",
"identical to what CellProfiler and CytoTable produce — only the numerical values are simulated.\n",
"\n",
"**Experiment design:**\n",
"\n",
"| Property | Value |\n",
"|----------|-------|\n",
"| Plates (biological replicates) | 2 |\n",
"| Wells per plate | 6 (2 × DMSO vehicle control, 2 × Compound A, 2 × Compound B) |\n",
"| Cells per well | ~100 |\n",
"| Total single-cell measurements | ~1,200 |\n",
"| Morphological features | 11 (across three compartments) |\n",
"\n",
"> **Note on the features:** Two of the eleven features are intentionally designed to be\n",
"> uninformative — one is constant across all cells, and one is nearly perfectly correlated\n",
"> with another. You will see these removed automatically in Step 4 (Feature Selection).\n",
"\n",
"The simulation function is available in the expandable block below if you'd like to inspect it — you can skip it and go straight to Step 1."
]
},
{
"cell_type": "raw",
"id": "cell-simulate-toggle",
"metadata": {
"raw_mimetype": "text/restructuredtext"
},
"source": [
".. toggle::\n",
"\n",
" .. code-block:: python\n",
"\n",
" def simulate_single_cells(plate_id, n_cells_per_well=100):\n",
" \"\"\"\n",
" Generate synthetic single-cell morphology measurements for one plate.\n",
"\n",
" Column naming follows the CellProfiler convention:\n",
" Metadata_* — experimental context (plate, well, object identity)\n",
" Cells_* — measurements of the whole-cell boundary\n",
" Cytoplasm_* — measurements of the cytoplasmic region\n",
" Nuclei_* — measurements of the nuclear region\n",
"\n",
" To keep this tutorial focused on the pipeline rather than biology,\n",
" only the cell-area features respond to treatment. All other features\n",
" are independent noise sampled from realistic distributions.\n",
" In a real experiment every feature may carry some biological signal.\n",
" \"\"\"\n",
" well_treatments = {\n",
" 'B02': 'DMSO',\n",
" 'C02': 'DMSO',\n",
" 'B03': 'Compound_A',\n",
" 'C03': 'Compound_A',\n",
" 'B04': 'Compound_B',\n",
" 'C04': 'Compound_B',\n",
" }\n",
"\n",
" rows = []\n",
" for image_number, (well, treatment) in enumerate(well_treatments.items(), start=1):\n",
" is_a = float(treatment == 'Compound_A')\n",
" is_b = float(treatment == 'Compound_B')\n",
"\n",
" # Only the Area family of features responds to treatment.\n",
" # This ensures only the intentionally correlated pair is removed in Step 4.\n",
" cell_area_base = np.random.normal(500 + 180 * is_a - 90 * is_b, 120, n_cells_per_well)\n",
"\n",
" for obj_num in range(1, n_cells_per_well + 1):\n",
" cell_area = cell_area_base[obj_num - 1]\n",
" rows.append({\n",
" # ── Metadata columns ──────────────────────────────────────────\n",
" 'Metadata_Plate': plate_id,\n",
" 'Metadata_Well': well,\n",
" 'Metadata_ImageNumber': image_number,\n",
" 'Metadata_ObjectNumber': obj_num,\n",
" # ── Cell-level features ───────────────────────────────────────\n",
" # Area and BoundingBoxArea are intentionally correlated (r ≈ 0.99);\n",
" # one of the pair will be removed during feature selection.\n",
" 'Cells_AreaShape_Area': cell_area,\n",
" 'Cells_AreaShape_BoundingBoxArea': cell_area * 1.3 + np.random.normal(0, 4),\n",
" # EulerNumber = 1 for virtually all cells (topological invariant);\n",
" # zero variance → removed during feature selection.\n",
" 'Cells_AreaShape_EulerNumber': 1,\n",
" # All remaining features: independent noise with realistic distributions\n",
" 'Cells_AreaShape_Eccentricity': float(np.clip(np.random.normal(0.55, 0.12), 0, 1)),\n",
" 'Cells_Intensity_MeanIntensity_Mito': np.random.normal(0.30, 0.06),\n",
" 'Cells_Texture_Correlation_RNA_3_0_256': np.random.normal(0.22, 0.06),\n",
" # ── Cytoplasm features ────────────────────────────────────────\n",
" 'Cytoplasm_AreaShape_Area': np.random.normal(310, 80),\n",
" 'Cytoplasm_Intensity_MeanIntensity_AGP': np.random.normal(0.25, 0.07),\n",
" # ── Nuclei features ───────────────────────────────────────────\n",
" 'Nuclei_AreaShape_Area': np.random.normal(195, 55),\n",
" 'Nuclei_AreaShape_Eccentricity': float(np.clip(np.random.normal(0.40, 0.10), 0, 1)),\n",
" 'Nuclei_Intensity_MeanIntensity_DNA': np.random.normal(0.50, 0.08),\n",
" })\n",
" return pd.DataFrame(rows)\n",
"\n",
"\n",
" # Generate data for two plates to simulate biological replicates\n",
" plate1 = simulate_single_cells('Plate_1')\n",
" plate2 = simulate_single_cells('Plate_2')\n",
" single_cells = pd.concat([plate1, plate2], ignore_index=True)\n",
"\n",
" print(f'Single-cell dataset: {single_cells.shape[0]:,} rows (cells) x {single_cells.shape[1]} columns')\n",
" print(f'Plates: {single_cells[\"Metadata_Plate\"].unique().tolist()}')\n",
" print(f'Wells: {sorted(single_cells[\"Metadata_Well\"].unique().tolist())}')\n",
" print()\n",
" single_cells.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cell-generate-data",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T20:08:27.031060Z",
"iopub.status.busy": "2026-05-26T20:08:27.030897Z",
"iopub.status.idle": "2026-05-26T20:08:27.065745Z",
"shell.execute_reply": "2026-05-26T20:08:27.065459Z"
},
"nbsphinx": "hidden",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Single-cell dataset: 1,200 rows (cells) x 15 columns\n",
"Plates: ['Plate_1', 'Plate_2']\n",
"Wells: ['B02', 'B03', 'B04', 'C02', 'C03', 'C04']\n",
"\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Metadata_Plate | \n",
" Metadata_Well | \n",
" Metadata_ImageNumber | \n",
" Metadata_ObjectNumber | \n",
" Cells_AreaShape_Area | \n",
" Cells_AreaShape_BoundingBoxArea | \n",
" Cells_AreaShape_EulerNumber | \n",
" Cells_AreaShape_Eccentricity | \n",
" Cells_Intensity_MeanIntensity_Mito | \n",
" Cells_Texture_Correlation_RNA_3_0_256 | \n",
" Cytoplasm_AreaShape_Area | \n",
" Cytoplasm_Intensity_MeanIntensity_AGP | \n",
" Nuclei_AreaShape_Area | \n",
" Nuclei_AreaShape_Eccentricity | \n",
" Nuclei_Intensity_MeanIntensity_DNA | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Plate_1 | \n",
" B02 | \n",
" 1 | \n",
" 1 | \n",
" 559.605698 | \n",
" 721.825925 | \n",
" 1 | \n",
" 0.499523 | \n",
" 0.279437 | \n",
" 0.171863 | \n",
" 297.097143 | \n",
" 0.278284 | \n",
" 298.740225 | \n",
" 0.417458 | \n",
" 0.520604 | \n",
"
\n",
" \n",
" | 1 | \n",
" Plate_1 | \n",
" B02 | \n",
" 1 | \n",
" 2 | \n",
" 483.408284 | \n",
" 628.132985 | \n",
" 1 | \n",
" 0.319747 | \n",
" 0.298409 | \n",
" 0.223614 | \n",
" 507.059369 | \n",
" 0.236535 | \n",
" 211.585104 | \n",
" 0.396529 | \n",
" 0.406506 | \n",
"
\n",
" \n",
" | 2 | \n",
" Plate_1 | \n",
" B02 | \n",
" 1 | \n",
" 3 | \n",
" 577.722625 | \n",
" 755.610703 | \n",
" 1 | \n",
" 0.640232 | \n",
" 0.347462 | \n",
" 0.165437 | \n",
" 422.223545 | \n",
" 0.151870 | \n",
" 227.277140 | \n",
" 0.619046 | \n",
" 0.420757 | \n",
"
\n",
" \n",
" | 3 | \n",
" Plate_1 | \n",
" B02 | \n",
" 1 | \n",
" 4 | \n",
" 682.763583 | \n",
" 885.327467 | \n",
" 1 | \n",
" 0.561958 | \n",
" 0.269791 | \n",
" 0.126960 | \n",
" 315.485038 | \n",
" 0.175639 | \n",
" 221.047584 | \n",
" 0.308058 | \n",
" 0.623995 | \n",
"
\n",
" \n",
" | 4 | \n",
" Plate_1 | \n",
" B02 | \n",
" 1 | \n",
" 5 | \n",
" 471.901595 | \n",
" 610.339060 | \n",
" 1 | \n",
" 0.511353 | \n",
" 0.348811 | \n",
" 0.146148 | \n",
" 328.196795 | \n",
" 0.341500 | \n",
" 106.588422 | \n",
" 0.418463 | \n",
" 0.520791 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Metadata_Plate Metadata_Well Metadata_ImageNumber Metadata_ObjectNumber \\\n",
"0 Plate_1 B02 1 1 \n",
"1 Plate_1 B02 1 2 \n",
"2 Plate_1 B02 1 3 \n",
"3 Plate_1 B02 1 4 \n",
"4 Plate_1 B02 1 5 \n",
"\n",
" Cells_AreaShape_Area Cells_AreaShape_BoundingBoxArea \\\n",
"0 559.605698 721.825925 \n",
"1 483.408284 628.132985 \n",
"2 577.722625 755.610703 \n",
"3 682.763583 885.327467 \n",
"4 471.901595 610.339060 \n",
"\n",
" Cells_AreaShape_EulerNumber Cells_AreaShape_Eccentricity \\\n",
"0 1 0.499523 \n",
"1 1 0.319747 \n",
"2 1 0.640232 \n",
"3 1 0.561958 \n",
"4 1 0.511353 \n",
"\n",
" Cells_Intensity_MeanIntensity_Mito Cells_Texture_Correlation_RNA_3_0_256 \\\n",
"0 0.279437 0.171863 \n",
"1 0.298409 0.223614 \n",
"2 0.347462 0.165437 \n",
"3 0.269791 0.126960 \n",
"4 0.348811 0.146148 \n",
"\n",
" Cytoplasm_AreaShape_Area Cytoplasm_Intensity_MeanIntensity_AGP \\\n",
"0 297.097143 0.278284 \n",
"1 507.059369 0.236535 \n",
"2 422.223545 0.151870 \n",
"3 315.485038 0.175639 \n",
"4 328.196795 0.341500 \n",
"\n",
" Nuclei_AreaShape_Area Nuclei_AreaShape_Eccentricity \\\n",
"0 298.740225 0.417458 \n",
"1 211.585104 0.396529 \n",
"2 227.277140 0.619046 \n",
"3 221.047584 0.308058 \n",
"4 106.588422 0.418463 \n",
"\n",
" Nuclei_Intensity_MeanIntensity_DNA \n",
"0 0.520604 \n",
"1 0.406506 \n",
"2 0.420757 \n",
"3 0.623995 \n",
"4 0.520791 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def simulate_single_cells(plate_id, n_cells_per_well=100):\n",
" \"\"\"\n",
" Generate synthetic single-cell morphology measurements for one plate.\n",
"\n",
" Column naming follows the CellProfiler convention:\n",
" Metadata_* — experimental context (plate, well, object identity)\n",
" Cells_* — measurements of the whole-cell boundary\n",
" Cytoplasm_* — measurements of the cytoplasmic region\n",
" Nuclei_* — measurements of the nuclear region\n",
"\n",
" To keep this tutorial focused on the pipeline rather than biology,\n",
" only the cell-area features respond to treatment. All other features\n",
" are independent noise sampled from realistic distributions.\n",
" In a real experiment every feature may carry some biological signal.\n",
" \"\"\"\n",
" well_treatments = {\n",
" \"B02\": \"DMSO\",\n",
" \"C02\": \"DMSO\",\n",
" \"B03\": \"Compound_A\",\n",
" \"C03\": \"Compound_A\",\n",
" \"B04\": \"Compound_B\",\n",
" \"C04\": \"Compound_B\",\n",
" }\n",
"\n",
" rows = []\n",
" for image_number, (well, treatment) in enumerate(well_treatments.items(), start=1):\n",
" is_a = float(treatment == \"Compound_A\")\n",
" is_b = float(treatment == \"Compound_B\")\n",
"\n",
" # Only the Area family of features responds to treatment.\n",
" # This ensures only the intentionally correlated pair is removed in Step 4.\n",
" cell_area_base = np.random.normal(\n",
" 500 + 180 * is_a - 90 * is_b, 120, n_cells_per_well\n",
" )\n",
"\n",
" for obj_num in range(1, n_cells_per_well + 1):\n",
" cell_area = cell_area_base[obj_num - 1]\n",
" rows.append({\n",
" # ── Metadata columns ──────────────────────────────────────────\n",
" \"Metadata_Plate\": plate_id,\n",
" \"Metadata_Well\": well,\n",
" \"Metadata_ImageNumber\": image_number,\n",
" \"Metadata_ObjectNumber\": obj_num,\n",
" # ── Cell-level features ───────────────────────────────────────\n",
" # Area and BoundingBoxArea are intentionally correlated (r ≈ 0.99);\n",
" # one of the pair will be removed during feature selection.\n",
" \"Cells_AreaShape_Area\": cell_area,\n",
" \"Cells_AreaShape_BoundingBoxArea\": cell_area * 1.3\n",
" + np.random.normal(0, 4),\n",
" # EulerNumber = 1 for virtually all cells (topological invariant);\n",
" # zero variance → removed during feature selection.\n",
" \"Cells_AreaShape_EulerNumber\": 1,\n",
" # All remaining features: independent noise with realistic distributions\n",
" \"Cells_AreaShape_Eccentricity\": float(\n",
" np.clip(np.random.normal(0.55, 0.12), 0, 1)\n",
" ),\n",
" \"Cells_Intensity_MeanIntensity_Mito\": np.random.normal(0.30, 0.06),\n",
" \"Cells_Texture_Correlation_RNA_3_0_256\": np.random.normal(0.22, 0.06),\n",
" # ── Cytoplasm features ────────────────────────────────────────\n",
" \"Cytoplasm_AreaShape_Area\": np.random.normal(310, 80),\n",
" \"Cytoplasm_Intensity_MeanIntensity_AGP\": np.random.normal(0.25, 0.07),\n",
" # ── Nuclei features ───────────────────────────────────────────\n",
" \"Nuclei_AreaShape_Area\": np.random.normal(195, 55),\n",
" \"Nuclei_AreaShape_Eccentricity\": float(\n",
" np.clip(np.random.normal(0.40, 0.10), 0, 1)\n",
" ),\n",
" \"Nuclei_Intensity_MeanIntensity_DNA\": np.random.normal(0.50, 0.08),\n",
" })\n",
" return pd.DataFrame(rows)\n",
"\n",
"\n",
"# Generate data for two plates to simulate biological replicates\n",
"plate1 = simulate_single_cells(\"Plate_1\")\n",
"plate2 = simulate_single_cells(\"Plate_2\")\n",
"single_cells = pd.concat([plate1, plate2], ignore_index=True)\n",
"\n",
"print(\n",
" f\"Single-cell dataset: {single_cells.shape[0]:,} rows (cells) x {single_cells.shape[1]} columns\"\n",
")\n",
"print(f\"Plates: {single_cells['Metadata_Plate'].unique().tolist()}\")\n",
"print(f\"Wells: {sorted(single_cells['Metadata_Well'].unique().tolist())}\")\n",
"print()\n",
"single_cells.head()"
]
},
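{
"cell_type": "markdown",
"id": "cell-feature-design-sketch",
"metadata": {},
"source": [
"The two intentionally uninformative features can be sanity-checked with a few lines of NumPy. The sketch below is\n",
"not part of the tutorial pipeline: it draws its own sample using the same parameters as the DMSO wells in the\n",
"simulation, and the variable names and the separate random generator are assumptions made for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Separate generator so the tutorial's global seed is left untouched (illustrative choice)\n",
"rng = np.random.default_rng(0)\n",
"\n",
"# Same parameters as the DMSO wells in the simulation: mean 500, standard deviation 120\n",
"area = rng.normal(500, 120, size=1_000)\n",
"# BoundingBoxArea is a scaled copy of Area plus a small amount of noise (standard deviation 4)\n",
"bounding_box_area = area * 1.3 + rng.normal(0, 4, size=1_000)\n",
"# EulerNumber takes the same value for every cell\n",
"euler_number = np.ones(1_000)\n",
"\n",
"r = np.corrcoef(area, bounding_box_area)[0, 1]\n",
"print(f\"Pearson r (Area vs BoundingBoxArea): {r:.4f}\")\n",
"print(f\"Variance of EulerNumber: {euler_number.var():.1f}\")\n",
"```\n",
"\n",
"A correlation this close to 1 and a variance of exactly 0 are the two properties that Step 4 is designed to detect."
]
},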
{
"cell_type": "markdown",
"id": "cell-aggregate-intro",
"metadata": {},
"source": [
"---\n",
"\n",
"## Step 1: Aggregate — From Cells to Wells\n",
"\n",
"The single-cell table contains one row for every detected cell — in a real experiment this can\n",
"easily reach hundreds of thousands of rows. However, biological interpretation happens at\n",
"the level of the *well* (which treatment was applied), not the individual cell.\n",
"\n",
"`aggregate()` summarises all cells within the same well into a **single representative profile**\n",
"by computing the median of each feature across all cells in that well.\n",
"\n",
"| Parameter | Description |\n",
"|-----------|-------------|\n",
"| `population_df` | The single-cell DataFrame |\n",
"| `strata` | Columns that identify each well — cells sharing the same strata values are pooled together |\n",
"| `features='infer'` | Automatically detect feature columns (any column whose name starts with a compartment prefix such as `Cells_`, `Cytoplasm_`, or `Nuclei_`) |\n",
"| `operation` | Summary statistic: `'median'` (default) or `'mean'` |"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "cell-aggregate",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T20:08:27.067683Z",
"iopub.status.busy": "2026-05-26T20:08:27.067540Z",
"iopub.status.idle": "2026-05-26T20:08:27.078438Z",
"shell.execute_reply": "2026-05-26T20:08:27.078153Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Single cells: 1,200 rows → Well profiles: 12 rows\n",
"Columns: 15 → 13\n",
"\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Metadata_Plate | \n",
" Metadata_Well | \n",
" Cells_AreaShape_Area | \n",
" Cells_AreaShape_BoundingBoxArea | \n",
" Cells_AreaShape_EulerNumber | \n",
" Cells_AreaShape_Eccentricity | \n",
" Cells_Intensity_MeanIntensity_Mito | \n",
" Cells_Texture_Correlation_RNA_3_0_256 | \n",
" Cytoplasm_AreaShape_Area | \n",
" Cytoplasm_Intensity_MeanIntensity_AGP | \n",
" Nuclei_AreaShape_Area | \n",
" Nuclei_AreaShape_Eccentricity | \n",
" Nuclei_Intensity_MeanIntensity_DNA | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Plate_1 | \n",
" B02 | \n",
" 484.765245 | \n",
" 634.635906 | \n",
" 1.0 | \n",
" 0.532796 | \n",
" 0.300860 | \n",
" 0.216580 | \n",
" 315.443258 | \n",
" 0.263187 | \n",
" 193.043658 | \n",
" 0.418260 | \n",
" 0.517625 | \n",
"
\n",
" \n",
" | 1 | \n",
" Plate_1 | \n",
" B03 | \n",
" 669.260838 | \n",
" 870.994323 | \n",
" 1.0 | \n",
" 0.536427 | \n",
" 0.304651 | \n",
" 0.210215 | \n",
" 314.055342 | \n",
" 0.256479 | \n",
" 190.671898 | \n",
" 0.405648 | \n",
" 0.498834 | \n",
"
\n",
" \n",
" | 2 | \n",
" Plate_1 | \n",
" B04 | \n",
" 399.970485 | \n",
" 523.433308 | \n",
" 1.0 | \n",
" 0.533963 | \n",
" 0.306359 | \n",
" 0.222698 | \n",
" 304.712780 | \n",
" 0.252233 | \n",
" 196.243981 | \n",
" 0.394866 | \n",
" 0.495179 | \n",
"
\n",
" \n",
" | 3 | \n",
" Plate_1 | \n",
" C02 | \n",
" 523.730623 | \n",
" 685.309911 | \n",
" 1.0 | \n",
" 0.552298 | \n",
" 0.307376 | \n",
" 0.214065 | \n",
" 328.921311 | \n",
" 0.247123 | \n",
" 197.812467 | \n",
" 0.418402 | \n",
" 0.489309 | \n",
"
\n",
" \n",
" | 4 | \n",
" Plate_1 | \n",
" C03 | \n",
" 671.001479 | \n",
" 874.309123 | \n",
" 1.0 | \n",
" 0.557177 | \n",
" 0.297985 | \n",
" 0.212211 | \n",
" 327.148733 | \n",
" 0.235335 | \n",
" 195.742017 | \n",
" 0.406123 | \n",
" 0.504755 | \n",
"
\n",
" \n",
" | 5 | \n",
" Plate_1 | \n",
" C04 | \n",
" 411.397559 | \n",
" 535.575562 | \n",
" 1.0 | \n",
" 0.537410 | \n",
" 0.296999 | \n",
" 0.211021 | \n",
" 307.751566 | \n",
" 0.237890 | \n",
" 187.333217 | \n",
" 0.406370 | \n",
" 0.501055 | \n",
"
\n",
" \n",
" | 6 | \n",
" Plate_2 | \n",
" B02 | \n",
" 510.859835 | \n",
" 663.499468 | \n",
" 1.0 | \n",
" 0.560717 | \n",
" 0.279879 | \n",
" 0.210651 | \n",
" 297.559100 | \n",
" 0.245555 | \n",
" 186.979153 | \n",
" 0.395230 | \n",
" 0.509640 | \n",
"
\n",
" \n",
" | 7 | \n",
" Plate_2 | \n",
" B03 | \n",
" 671.687004 | \n",
" 869.272541 | \n",
" 1.0 | \n",
" 0.539517 | \n",
" 0.310271 | \n",
" 0.223312 | \n",
" 323.749429 | \n",
" 0.256434 | \n",
" 193.420235 | \n",
" 0.392844 | \n",
" 0.503184 | \n",
"
\n",
" \n",
" | 8 | \n",
" Plate_2 | \n",
" B04 | \n",
" 420.512564 | \n",
" 542.359150 | \n",
" 1.0 | \n",
" 0.556893 | \n",
" 0.309685 | \n",
" 0.215100 | \n",
" 297.933806 | \n",
" 0.239380 | \n",
" 185.984035 | \n",
" 0.389872 | \n",
" 0.498350 | \n",
"
\n",
" \n",
" | 9 | \n",
" Plate_2 | \n",
" C02 | \n",
" 515.312157 | \n",
" 667.242484 | \n",
" 1.0 | \n",
" 0.571428 | \n",
" 0.292990 | \n",
" 0.219456 | \n",
" 304.836628 | \n",
" 0.249090 | \n",
" 195.434170 | \n",
" 0.387314 | \n",
" 0.501398 | \n",
"
\n",
" \n",
" | 10 | \n",
" Plate_2 | \n",
" C03 | \n",
" 677.517456 | \n",
" 882.379424 | \n",
" 1.0 | \n",
" 0.548610 | \n",
" 0.296525 | \n",
" 0.225144 | \n",
" 295.410839 | \n",
" 0.250994 | \n",
" 197.704177 | \n",
" 0.397884 | \n",
" 0.514447 | \n",
"
\n",
" \n",
" | 11 | \n",
" Plate_2 | \n",
" C04 | \n",
" 407.760144 | \n",
" 528.536944 | \n",
" 1.0 | \n",
" 0.514893 | \n",
" 0.294377 | \n",
" 0.209433 | \n",
" 333.205131 | \n",
" 0.251384 | \n",
" 201.713063 | \n",
" 0.417950 | \n",
" 0.493228 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Metadata_Plate Metadata_Well Cells_AreaShape_Area \\\n",
"0 Plate_1 B02 484.765245 \n",
"1 Plate_1 B03 669.260838 \n",
"2 Plate_1 B04 399.970485 \n",
"3 Plate_1 C02 523.730623 \n",
"4 Plate_1 C03 671.001479 \n",
"5 Plate_1 C04 411.397559 \n",
"6 Plate_2 B02 510.859835 \n",
"7 Plate_2 B03 671.687004 \n",
"8 Plate_2 B04 420.512564 \n",
"9 Plate_2 C02 515.312157 \n",
"10 Plate_2 C03 677.517456 \n",
"11 Plate_2 C04 407.760144 \n",
"\n",
" Cells_AreaShape_BoundingBoxArea Cells_AreaShape_EulerNumber \\\n",
"0 634.635906 1.0 \n",
"1 870.994323 1.0 \n",
"2 523.433308 1.0 \n",
"3 685.309911 1.0 \n",
"4 874.309123 1.0 \n",
"5 535.575562 1.0 \n",
"6 663.499468 1.0 \n",
"7 869.272541 1.0 \n",
"8 542.359150 1.0 \n",
"9 667.242484 1.0 \n",
"10 882.379424 1.0 \n",
"11 528.536944 1.0 \n",
"\n",
" Cells_AreaShape_Eccentricity Cells_Intensity_MeanIntensity_Mito \\\n",
"0 0.532796 0.300860 \n",
"1 0.536427 0.304651 \n",
"2 0.533963 0.306359 \n",
"3 0.552298 0.307376 \n",
"4 0.557177 0.297985 \n",
"5 0.537410 0.296999 \n",
"6 0.560717 0.279879 \n",
"7 0.539517 0.310271 \n",
"8 0.556893 0.309685 \n",
"9 0.571428 0.292990 \n",
"10 0.548610 0.296525 \n",
"11 0.514893 0.294377 \n",
"\n",
" Cells_Texture_Correlation_RNA_3_0_256 Cytoplasm_AreaShape_Area \\\n",
"0 0.216580 315.443258 \n",
"1 0.210215 314.055342 \n",
"2 0.222698 304.712780 \n",
"3 0.214065 328.921311 \n",
"4 0.212211 327.148733 \n",
"5 0.211021 307.751566 \n",
"6 0.210651 297.559100 \n",
"7 0.223312 323.749429 \n",
"8 0.215100 297.933806 \n",
"9 0.219456 304.836628 \n",
"10 0.225144 295.410839 \n",
"11 0.209433 333.205131 \n",
"\n",
" Cytoplasm_Intensity_MeanIntensity_AGP Nuclei_AreaShape_Area \\\n",
"0 0.263187 193.043658 \n",
"1 0.256479 190.671898 \n",
"2 0.252233 196.243981 \n",
"3 0.247123 197.812467 \n",
"4 0.235335 195.742017 \n",
"5 0.237890 187.333217 \n",
"6 0.245555 186.979153 \n",
"7 0.256434 193.420235 \n",
"8 0.239380 185.984035 \n",
"9 0.249090 195.434170 \n",
"10 0.250994 197.704177 \n",
"11 0.251384 201.713063 \n",
"\n",
" Nuclei_AreaShape_Eccentricity Nuclei_Intensity_MeanIntensity_DNA \n",
"0 0.418260 0.517625 \n",
"1 0.405648 0.498834 \n",
"2 0.394866 0.495179 \n",
"3 0.418402 0.489309 \n",
"4 0.406123 0.504755 \n",
"5 0.406370 0.501055 \n",
"6 0.395230 0.509640 \n",
"7 0.392844 0.503184 \n",
"8 0.389872 0.498350 \n",
"9 0.387314 0.501398 \n",
"10 0.397884 0.514447 \n",
"11 0.417950 0.493228 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"well_profiles = aggregate(\n",
" population_df=single_cells,\n",
" strata=[\"Metadata_Plate\", \"Metadata_Well\"],\n",
" features=\"infer\",\n",
" operation=\"median\",\n",
")\n",
"\n",
"print(\n",
" f\"Single cells: {single_cells.shape[0]:,} rows → Well profiles: {well_profiles.shape[0]} rows\"\n",
")\n",
"print(\n",
" f\"Columns: {single_cells.shape[1]} → {well_profiles.shape[1]}\"\n",
")\n",
"print()\n",
"well_profiles"
]
},
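{
"cell_type": "markdown",
"id": "cell-aggregate-sketch",
"metadata": {},
"source": [
"Conceptually, median aggregation resembles a pandas group-by followed by a per-column median. The sketch below is\n",
"an approximation written in plain pandas, not a description of how `aggregate()` is implemented; the\n",
"`toy_cells` table and its values are hypothetical and exist only for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical single-cell table: three cells in well B02, two cells in well B03\n",
"toy_cells = pd.DataFrame({\n",
"    \"Metadata_Plate\": [\"Plate_1\"] * 5,\n",
"    \"Metadata_Well\": [\"B02\", \"B02\", \"B02\", \"B03\", \"B03\"],\n",
"    \"Cells_AreaShape_Area\": [480.0, 500.0, 530.0, 650.0, 690.0],\n",
"})\n",
"\n",
"# Pool cells that share the same plate and well, then take the median of each feature\n",
"toy_profiles = (\n",
"    toy_cells\n",
"    .groupby([\"Metadata_Plate\", \"Metadata_Well\"], as_index=False)\n",
"    .median(numeric_only=True)\n",
")\n",
"print(toy_profiles)\n",
"```\n",
"\n",
"Each well collapses to one row holding the per-feature median, which is less sensitive to outlier cells than the mean."
]
},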
{
"cell_type": "markdown",
"id": "cell-annotate-intro",
"metadata": {},
"source": [
"---\n",
"\n",
"## Step 2: Annotate — Adding Experimental Context\n",
"\n",
"After aggregation, each row represents a well, but the DataFrame only records *where* the\n",
"measurement came from (plate and well position) — not *what* biological condition was in\n",
"that well.\n",
"\n",
"The connection between well positions and experimental conditions is stored in a\n",
"**plate map** — a lookup table prepared by the researcher that records which compound,\n",
"genetic perturbation, concentration, or other variable was assigned to each well.\n",
"\n",
"`annotate()` merges the plate map onto the well profiles, adding a `Metadata_` column\n",
"for each piece of experimental information.\n",
"\n",
"> In real experiments, plate maps are usually supplied as CSV files from a Laboratory\n",
"> Information Management System (LIMS) or prepared manually. Here we create one directly\n",
"> as a DataFrame to show its structure.\n",
"\n",
"First, let us define the plate map:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "cell-platemap",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T20:08:27.080288Z",
"iopub.status.busy": "2026-05-26T20:08:27.079979Z",
"iopub.status.idle": "2026-05-26T20:08:27.084991Z",
"shell.execute_reply": "2026-05-26T20:08:27.084683Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Plate map:\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" well_position | \n",
" treatment | \n",
" cell_line | \n",
" concentration_um | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" B02 | \n",
" DMSO | \n",
" HeLa | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 1 | \n",
" C02 | \n",
" DMSO | \n",
" HeLa | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 2 | \n",
" B03 | \n",
" Compound_A | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
" | 3 | \n",
" C03 | \n",
" Compound_A | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
" | 4 | \n",
" B04 | \n",
" Compound_B | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
" | 5 | \n",
" C04 | \n",
" Compound_B | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" well_position treatment cell_line concentration_um\n",
"0 B02 DMSO HeLa 0.0\n",
"1 C02 DMSO HeLa 0.0\n",
"2 B03 Compound_A HeLa 10.0\n",
"3 C03 Compound_A HeLa 10.0\n",
"4 B04 Compound_B HeLa 10.0\n",
"5 C04 Compound_B HeLa 10.0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The plate map records the biological condition in each well position.\n",
"# The same layout was used for both plates in this experiment.\n",
"platemap = pd.DataFrame({\n",
" # 'well_position' is the standard column name expected by annotate()\n",
" \"well_position\": [\"B02\", \"C02\", \"B03\", \"C03\", \"B04\", \"C04\"],\n",
" \"treatment\": [\n",
" \"DMSO\",\n",
" \"DMSO\",\n",
" \"Compound_A\",\n",
" \"Compound_A\",\n",
" \"Compound_B\",\n",
" \"Compound_B\",\n",
" ],\n",
" \"cell_line\": [\"HeLa\"] * 6,\n",
" \"concentration_um\": [0.0, 0.0, 10.0, 10.0, 10.0, 10.0],\n",
"})\n",
"\n",
"print(\"Plate map:\")\n",
"platemap"
]
},
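{
"cell_type": "markdown",
"id": "cell-annotate-sketch",
"metadata": {},
"source": [
"The annotation step can be thought of as a left join between the profiles and the plate map. The sketch below\n",
"illustrates that idea with plain pandas; it is not how `annotate()` is implemented, and the `toy_profiles` and\n",
"`toy_platemap` tables are hypothetical examples.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical well profiles from two plates that share the same layout\n",
"toy_profiles = pd.DataFrame({\n",
"    \"Metadata_Plate\": [\"Plate_1\", \"Plate_1\", \"Plate_2\"],\n",
"    \"Metadata_Well\": [\"B02\", \"B03\", \"B02\"],\n",
"    \"Cells_AreaShape_Area\": [500.0, 670.0, 510.0],\n",
"})\n",
"\n",
"# Hypothetical plate map: one row per well position\n",
"toy_platemap = pd.DataFrame({\n",
"    \"well_position\": [\"B02\", \"B03\"],\n",
"    \"treatment\": [\"DMSO\", \"Compound_A\"],\n",
"})\n",
"\n",
"# Prefix plate map columns with 'Metadata_' so they are not mistaken for features\n",
"toy_platemap = toy_platemap.add_prefix(\"Metadata_\")\n",
"\n",
"# Left join keeps every profile row and attaches the matching treatment label\n",
"toy_annotated = toy_profiles.merge(\n",
"    toy_platemap,\n",
"    left_on=\"Metadata_Well\",\n",
"    right_on=\"Metadata_well_position\",\n",
"    how=\"left\",\n",
").drop(columns=\"Metadata_well_position\")\n",
"print(toy_annotated)\n",
"```\n",
"\n",
"Because the plate map is keyed on well position only, the same layout is applied to every plate in the profiles table."
]
},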
{
"cell_type": "code",
"execution_count": 5,
"id": "cell-annotate",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T20:08:27.086330Z",
"iopub.status.busy": "2026-05-26T20:08:27.086227Z",
"iopub.status.idle": "2026-05-26T20:08:27.096375Z",
"shell.execute_reply": "2026-05-26T20:08:27.096081Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Annotated profiles: 12 rows x 16 columns\n",
"\n",
"Well-to-treatment mapping after annotation:\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Metadata_Plate | \n",
" Metadata_Well | \n",
" Metadata_treatment | \n",
" Metadata_cell_line | \n",
" Metadata_concentration_um | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Plate_1 | \n",
" B02 | \n",
" DMSO | \n",
" HeLa | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 1 | \n",
" Plate_2 | \n",
" B02 | \n",
" DMSO | \n",
" HeLa | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 4 | \n",
" Plate_1 | \n",
" B03 | \n",
" Compound_A | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
" | 5 | \n",
" Plate_2 | \n",
" B03 | \n",
" Compound_A | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
" | 8 | \n",
" Plate_1 | \n",
" B04 | \n",
" Compound_B | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
" | 9 | \n",
" Plate_2 | \n",
" B04 | \n",
" Compound_B | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
" | 2 | \n",
" Plate_1 | \n",
" C02 | \n",
" DMSO | \n",
" HeLa | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 3 | \n",
" Plate_2 | \n",
" C02 | \n",
" DMSO | \n",
" HeLa | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 6 | \n",
" Plate_1 | \n",
" C03 | \n",
" Compound_A | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
" | 7 | \n",
" Plate_2 | \n",
" C03 | \n",
" Compound_A | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
" | 10 | \n",
" Plate_1 | \n",
" C04 | \n",
" Compound_B | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
" | 11 | \n",
" Plate_2 | \n",
" C04 | \n",
" Compound_B | \n",
" HeLa | \n",
" 10.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Metadata_Plate Metadata_Well Metadata_treatment Metadata_cell_line \\\n",
"0 Plate_1 B02 DMSO HeLa \n",
"1 Plate_2 B02 DMSO HeLa \n",
"4 Plate_1 B03 Compound_A HeLa \n",
"5 Plate_2 B03 Compound_A HeLa \n",
"8 Plate_1 B04 Compound_B HeLa \n",
"9 Plate_2 B04 Compound_B HeLa \n",
"2 Plate_1 C02 DMSO HeLa \n",
"3 Plate_2 C02 DMSO HeLa \n",
"6 Plate_1 C03 Compound_A HeLa \n",
"7 Plate_2 C03 Compound_A HeLa \n",
"10 Plate_1 C04 Compound_B HeLa \n",
"11 Plate_2 C04 Compound_B HeLa \n",
"\n",
" Metadata_concentration_um \n",
"0 0.0 \n",
"1 0.0 \n",
"4 10.0 \n",
"5 10.0 \n",
"8 10.0 \n",
"9 10.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"6 10.0 \n",
"7 10.0 \n",
"10 10.0 \n",
"11 10.0 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# annotate() joins the plate map onto the well profiles.\n",
"#\n",
"# join_on specifies [platemap_column, profiles_column] used for matching wells.\n",
"# add_metadata_id_to_platemap=True prepends 'Metadata_' to all plate map column names,\n",
"# following the pycytominer convention that all non-feature columns start with 'Metadata_'.\n",
"annotated_profiles = annotate(\n",
" profiles=well_profiles,\n",
" platemap=platemap,\n",
" join_on=[\"Metadata_well_position\", \"Metadata_Well\"],\n",
" add_metadata_id_to_platemap=True,\n",
")\n",
"\n",
"print(\n",
" f\"Annotated profiles: {annotated_profiles.shape[0]} rows x {annotated_profiles.shape[1]} columns\"\n",
")\n",
"print()\n",
"\n",
"# Show the metadata columns that were added\n",
"meta_cols = [\n",
" \"Metadata_Plate\",\n",
" \"Metadata_Well\",\n",
" \"Metadata_treatment\",\n",
" \"Metadata_cell_line\",\n",
" \"Metadata_concentration_um\",\n",
"]\n",
"print(\"Well-to-treatment mapping after annotation:\")\n",
"annotated_profiles[meta_cols].drop_duplicates().sort_values(\"Metadata_Well\")"
]
},
{
"cell_type": "markdown",
"id": "cell-normalize-intro",
"metadata": {},
"source": [
"---\n",
"\n",
"## Step 3: Normalize — Removing Technical Variation\n",
"\n",
"CellProfiler features differ widely in scale and units. For example:\n",
"- `Cells_AreaShape_Area` might range from 200 to 1,000 (pixels²)\n",
"- `Nuclei_Intensity_MeanIntensity_DNA` might range from 0.1 to 0.9 (arbitrary fluorescence units)\n",
"\n",
"Without normalization, features with large absolute values would dominate any downstream\n",
"distance calculation or machine-learning model, regardless of whether they carry biological signal.\n",
"Normalization also corrects for **plate-to-plate technical variation** caused by differences\n",
"in staining efficiency, imaging conditions, or cell density between experimental batches.\n",
"\n",
"`normalize()` rescales each feature using the **distribution of control wells** as a reference.\n",
"The default method (`'standardize'`) subtracts the control mean and divides by the control\n",
"standard deviation — a standard z-score transformation. After normalization, control wells\n",
"cluster around zero, and treated wells are expressed in units of\n",
"*standard deviations away from the control*.\n",
"\n",
"> **What is a vehicle control?**\n",
 DMSO (dimethyl">
"> DMSO (dimethyl sulfoxide) is the standard solvent used to dissolve most small-molecule\n",
"> compounds. Wells that receive DMSO at the same concentration as the treated wells, but without\n",
"> any active compound, define the biological baseline: what cells look like when exposed to the\n",
"> solvent alone.\n",
"\n",
"| Parameter | Description |\n",
"|-----------|-------------|\n",
"| `samples` | A pandas query string selecting the control wells used to compute normalization statistics |\n",
"| `method` | `'standardize'` (z-score), `'robustize'` (median-based), or `'mad_robustize'` |"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cell-normalize",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T20:08:27.097819Z",
"iopub.status.busy": "2026-05-26T20:08:27.097709Z",
"iopub.status.idle": "2026-05-26T20:08:27.105943Z",
"shell.execute_reply": "2026-05-26T20:08:27.105680Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Normalized profiles: (12, 16)\n",
"\n",
"Mean feature values in DMSO wells (should be ≈ 0.0 after normalization):\n",
"Cells_AreaShape_Area 0.0\n",
"Cells_AreaShape_BoundingBoxArea -0.0\n",
"Cells_AreaShape_EulerNumber 0.0\n",
"Cells_AreaShape_Eccentricity 0.0\n",
"Cells_Intensity_MeanIntensity_Mito 0.0\n",
"Cells_Texture_Correlation_RNA_3_0_256 0.0\n",
"Cytoplasm_AreaShape_Area -0.0\n",
"Cytoplasm_Intensity_MeanIntensity_AGP 0.0\n",
"Nuclei_AreaShape_Area 0.0\n",
"Nuclei_AreaShape_Eccentricity -0.0\n",
"Nuclei_Intensity_MeanIntensity_DNA -0.0\n"
]
}
],
"source": [
"normalized_profiles = normalize(\n",
" profiles=annotated_profiles,\n",
" features=\"infer\",\n",
" meta_features=\"infer\",\n",
" samples=\"Metadata_treatment == 'DMSO'\", # use DMSO wells as the normalization reference\n",
" method=\"standardize\",\n",
")\n",
"\n",
"print(f\"Normalized profiles: {normalized_profiles.shape}\")\n",
"\n",
"# Sanity check: DMSO wells should be centred near 0 after normalization\n",
"feature_cols = [c for c in normalized_profiles.columns if not c.startswith(\"Metadata_\")]\n",
"dmso_rows = normalized_profiles[\"Metadata_treatment\"] == \"DMSO\"\n",
"\n",
"print()\n",
"print(\"Mean feature values in DMSO wells (should be ≈ 0.0 after normalization):\")\n",
"print(normalized_profiles.loc[dmso_rows, feature_cols].mean().round(3).to_string())"
]
},
{
"cell_type": "markdown",
"id": "cell-featsel-intro",
"metadata": {},
"source": [
"---\n",
"\n",
"## Step 4: Feature Selection — Keeping Only Informative Features\n",
"\n",
"A typical high-content microscopy dataset contains 500–1,500 features per well. Many of these features\n",
"carry little or no biological information and can actively harm downstream analyses by\n",
"adding noise:\n",
"\n",
"- **Constant or near-constant features** have the same value in every well and therefore\n",
" cannot distinguish one treatment from another.\n",
"- **Highly correlated features** (e.g., `Cells_AreaShape_Area` and\n",
" `Cells_AreaShape_BoundingBoxArea`) measure essentially the same property through slightly\n",
" different calculations. Retaining both adds redundancy without adding biological information.\n",
"- **Blocklisted features** are features empirically identified as technically unreliable across\n",
" many published CellProfiler pipelines.\n",
"\n",
"`feature_select()` applies these removal criteria in sequence:\n",
"\n",
"| Operation | What it removes |\n",
"|-----------|----------------|\n",
"| `'variance_threshold'` | Features with variance below a minimum threshold (effectively constant) |\n",
"| `'correlation_threshold'` | One feature from every highly correlated pair (Pearson *r* > 0.9) |\n",
"| `'blocklist'` | Features on the community-curated Pycytominer blocklist |\n",
"\n",
"Recall from the data generation step that we deliberately included two uninformative features:\n",
"\n",
"- `Cells_AreaShape_EulerNumber` — constant (= 1) for all cells → removed by `variance_threshold`\n",
"- `Cells_AreaShape_Area` / `Cells_AreaShape_BoundingBoxArea` — nearly perfectly correlated\n",
" (r ≈ 0.99) → one of the pair is removed by `correlation_threshold`\n",
"\n",
"> **Which of the correlated pair is kept?**\n",
"> The algorithm retains the feature with the lower average correlation to all other features."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "cell-featsel",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T20:08:27.108121Z",
"iopub.status.busy": "2026-05-26T20:08:27.107949Z",
"iopub.status.idle": "2026-05-26T20:08:27.118868Z",
"shell.execute_reply": "2026-05-26T20:08:27.118522Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Features before selection: 11\n",
"Features after selection: 9\n",
"Features removed: 2\n",
"\n",
"Retained features:\n",
" Cells_AreaShape_BoundingBoxArea\n",
" Cells_AreaShape_Eccentricity\n",
" Cells_Intensity_MeanIntensity_Mito\n",
" Cells_Texture_Correlation_RNA_3_0_256\n",
" Cytoplasm_AreaShape_Area\n",
" Cytoplasm_Intensity_MeanIntensity_AGP\n",
" Nuclei_AreaShape_Area\n",
" Nuclei_AreaShape_Eccentricity\n",
" Nuclei_Intensity_MeanIntensity_DNA\n"
]
}
],
"source": [
"selected_profiles = feature_select(\n",
" profiles=normalized_profiles,\n",
" features=\"infer\",\n",
" operation=[\"variance_threshold\", \"correlation_threshold\", \"blocklist\"],\n",
")\n",
"\n",
"feature_cols_before = [\n",
" c for c in normalized_profiles.columns if not c.startswith(\"Metadata_\")\n",
"]\n",
"feature_cols_after = [\n",
" c for c in selected_profiles.columns if not c.startswith(\"Metadata_\")\n",
"]\n",
"\n",
"print(f\"Features before selection: {len(feature_cols_before)}\")\n",
"print(f\"Features after selection: {len(feature_cols_after)}\")\n",
"print(\n",
" f\"Features removed: {len(feature_cols_before) - len(feature_cols_after)}\"\n",
")\n",
"print()\n",
"print(\"Retained features:\")\n",
"for col in sorted(feature_cols_after):\n",
" print(f\" {col}\")"
]
},
{
"cell_type": "markdown",
"id": "cell-consensus-intro",
"metadata": {},
"source": [
"---\n",
"\n",
"## Step 5: Consensus — Collapsing Replicates\n",
"\n",
"At this point we have one profile per well. Because our experiment was run across **two plates**\n",
"(biological replicates), we have four profiles for each treatment condition — two wells per plate\n",
"times two plates. Some downstream analyses expect a single, definitive profile per condition.\n",
"\n",
"`consensus()` collapses replicate profiles into one **consensus profile** per treatment group\n",
"by computing the median across all replicates.\n",
"\n",
"Using the consensus profile instead of individual replicates:\n",
"\n",
"- Reduces the influence of plate-specific technical artefacts that survived normalization\n",
"- Produces a lower-variance, higher-confidence representation of the treatment effect\n",
"- Simplifies downstream analysis by reducing the number of rows to cluster or classify\n",
"\n",
"The `replicate_columns` parameter specifies which metadata columns **define a unique condition**.\n",
"Profiles that share the same values in these columns are treated as replicates and collapsed\n",
"into a single consensus row."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "cell-consensus",
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-26T20:08:27.121466Z",
"iopub.status.busy": "2026-05-26T20:08:27.121301Z",
"iopub.status.idle": "2026-05-26T20:08:27.128233Z",
"shell.execute_reply": "2026-05-26T20:08:27.127943Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Profiles before consensus: 12 rows (one per well)\n",
"Profiles after consensus: 3 rows (one per treatment)\n",
"\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Metadata_treatment | \n",
" Metadata_cell_line | \n",
" Metadata_concentration_um | \n",
" Cells_AreaShape_BoundingBoxArea | \n",
" Cells_AreaShape_Eccentricity | \n",
" Cells_Intensity_MeanIntensity_Mito | \n",
" Cells_Texture_Correlation_RNA_3_0_256 | \n",
" Cytoplasm_AreaShape_Area | \n",
" Cytoplasm_Intensity_MeanIntensity_AGP | \n",
" Nuclei_AreaShape_Area | \n",
" Nuclei_AreaShape_Eccentricity | \n",
" Nuclei_Intensity_MeanIntensity_DNA | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Compound_A | \n",
" HeLa | \n",
" 10.0 | \n",
" 11.558693 | \n",
" -0.724044 | \n",
" 0.589681 | \n",
" 0.794162 | \n",
" 0.610831 | \n",
" 0.353012 | \n",
" 0.313659 | \n",
" -0.219728 | \n",
" -0.049983 | \n",
"
\n",
" \n",
" | 1 | \n",
" Compound_B | \n",
" HeLa | \n",
" 10.0 | \n",
" -7.189962 | \n",
" -1.316052 | \n",
" 0.624929 | \n",
" -0.656629 | \n",
" -0.462245 | \n",
" -0.835377 | \n",
" -0.379430 | \n",
" -0.302806 | \n",
" -0.737681 | \n",
"
\n",
" \n",
" | 2 | \n",
" DMSO | \n",
" HeLa | \n",
" 0.0 | \n",
" 0.148573 | \n",
" 0.155330 | \n",
" 0.160923 | \n",
" 0.041498 | \n",
" -0.131285 | \n",
" -0.446743 | \n",
" 0.228724 | \n",
" 0.140684 | \n",
" 0.097928 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Metadata_treatment Metadata_cell_line Metadata_concentration_um \\\n",
"0 Compound_A HeLa 10.0 \n",
"1 Compound_B HeLa 10.0 \n",
"2 DMSO HeLa 0.0 \n",
"\n",
" Cells_AreaShape_BoundingBoxArea Cells_AreaShape_Eccentricity \\\n",
"0 11.558693 -0.724044 \n",
"1 -7.189962 -1.316052 \n",
"2 0.148573 0.155330 \n",
"\n",
" Cells_Intensity_MeanIntensity_Mito Cells_Texture_Correlation_RNA_3_0_256 \\\n",
"0 0.589681 0.794162 \n",
"1 0.624929 -0.656629 \n",
"2 0.160923 0.041498 \n",
"\n",
" Cytoplasm_AreaShape_Area Cytoplasm_Intensity_MeanIntensity_AGP \\\n",
"0 0.610831 0.353012 \n",
"1 -0.462245 -0.835377 \n",
"2 -0.131285 -0.446743 \n",
"\n",
" Nuclei_AreaShape_Area Nuclei_AreaShape_Eccentricity \\\n",
"0 0.313659 -0.219728 \n",
"1 -0.379430 -0.302806 \n",
"2 0.228724 0.140684 \n",
"\n",
" Nuclei_Intensity_MeanIntensity_DNA \n",
"0 -0.049983 \n",
"1 -0.737681 \n",
"2 0.097928 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"consensus_profiles = consensus(\n",
" profiles=selected_profiles,\n",
" replicate_columns=[\n",
" \"Metadata_treatment\",\n",
" \"Metadata_cell_line\",\n",
" \"Metadata_concentration_um\",\n",
" ],\n",
" operation=\"median\",\n",
" features=\"infer\",\n",
")\n",
"\n",
"print(f\"Profiles before consensus: {selected_profiles.shape[0]} rows (one per well)\")\n",
"print(\n",
" f\"Profiles after consensus: {consensus_profiles.shape[0]} rows (one per treatment)\"\n",
")\n",
"print()\n",
"consensus_profiles"
]
},
{
"cell_type": "markdown",
"id": "cell-summary",
"metadata": {},
"source": [
"---\n",
"\n",
"## Summary\n",
"\n",
"You have now processed a Cell Painting dataset through the complete Pycytominer pipeline.\n",
"Here is a recap of what each step accomplished:\n",
"\n",
"| Step | Function | Rows | Features |\n",
"|------|----------|------|----------|\n",
"| Raw single-cell data | — | 1,200 cells | 11 |\n",
"| After `aggregate()` | pool cells per well | 12 wells | 11 |\n",
"| After `annotate()` | add treatment labels | 12 wells | 11 |\n",
"| After `normalize()` | z-score vs DMSO | 12 wells | 11 |\n",
"| After `feature_select()` | remove uninformative features | 12 wells | 9 |\n",
"| After `consensus()` | collapse replicates | **3 conditions** | **9** |\n",
"\n",
"The final `consensus_profiles` DataFrame contains one row per biological treatment condition\n",
"and nine informative morphological features — a compact, analysis-ready representation of\n",
"how each treatment changed the appearance of cells.\n",
"\n",
"### Saving Your Profiles\n",
"\n",
"Pycytominer provides `cyto_utils.output()` as its canonical function for writing profiles to disk — the same function each pipeline step calls internally when you pass an `output_file` argument. It handles compression, format selection, and file naming in one call, and supports four output types:\n",
"\n",
"| `output_type` | Extension | Best for |\n",
"|---|---|---|\n",
"| `\"csv\"` (default) | `.csv.gz` | Gzip-compressed, readable by any tool |\n",
"| `\"parquet\"` | `.parquet` | Faster reads and smaller files for large screens |\n",
"| `\"anndata_h5ad\"` | `.h5ad` | [AnnData](https://anndata.readthedocs.io/) / [scanpy](https://scanpy.readthedocs.io/) workflows |\n",
"| `\"anndata_zarr\"` | `.zarr` | Cloud-native AnnData storage |\n",
"\n",
"```python\n",
"from pycytominer.cyto_utils import output\n",
"\n",
"# Gzip-compressed CSV (default) — small footprint, readable by any tool\n",
"output(\n",
" df=consensus_profiles,\n",
" output_filename=\"consensus_profiles.csv.gz\",\n",
" output_type=\"csv\",\n",
")\n",
"\n",
"# Parquet — fast reads and efficient storage for large screens\n",
"output(\n",
" df=consensus_profiles,\n",
" output_filename=\"consensus_profiles.parquet\",\n",
" output_type=\"parquet\",\n",
")\n",
"\n",
"# AnnData HDF5 — ready for scanpy, scverse, and single-cell workflows\n",
"output(\n",
" df=consensus_profiles,\n",
" output_filename=\"consensus_profiles.h5ad\",\n",
" output_type=\"anndata_h5ad\",\n",
")\n",
"```\n",
"\n",
"> **Pro tip:** Every pipeline function accepts an `output_file` argument that writes directly to disk and returns the file path instead of a DataFrame. This avoids storing intermediate results in memory for large datasets:\n",
">\n",
"> ```python\n",
"> consensus_profiles = consensus(\n",
"> profiles=selected_profiles,\n",
"> replicate_columns=[\"Metadata_treatment\", \"Metadata_cell_line\", \"Metadata_concentration_um\"],\n",
"> operation=\"median\",\n",
"> features=\"infer\",\n",
"> output_file=\"consensus_profiles.parquet\",\n",
"> output_type=\"parquet\",\n",
"> )\n",
"> # consensus_profiles is now the file path string, not a DataFrame\n",
"> ```\n",
"\n",
"### What to Do Next\n",
"\n",
"With morphology profiles in hand, common next steps include:\n",
"\n",
"- **Phenotypic clustering** — group treatments by morphological similarity using\n",
" hierarchical clustering or UMAP\n",
"- **Similarity analysis** — identify compounds that produce the same cellular phenotype\n",
" using correlation or cosine similarity metrics\n",
"- **Classification** — train machine-learning models to predict a compound's mechanism\n",
" of action from its morphological profile\n",
"- **Dimensionality reduction** — visualise the morphological space of an entire compound\n",
" library in two dimensions using PCA or UMAP\n",
"- **Hit calling** — identify which compounds produce a statistically significant\n",
" morphological change relative to controls.\n",
" [copairs](https://github.com/cytomining/copairs) computes mean Average Precision (mAP)\n",
" to score phenotypic activity and consistency at the well/profile level;\n",
" [Buscar](https://github.com/WayScience/Buscar) operates directly on single-cell\n",
" distributions to capture cellular heterogeneity and flag off-target effects\n",
"\n",
"### Further Reading\n",
"\n",
"- [Pycytominer API Reference](https://pycytominer.readthedocs.io/en/latest/modules.html) —\n",
" full documentation for every function used in this tutorial\n",
"- [CytoTable tutorial](https://cytomining.github.io/CytoTable/tutorials/cellprofiler_to_parquet.html) —\n",
" how to convert raw CellProfiler output into the Parquet format that Pycytominer reads\n",
"- [Cell Painting Gallery](https://cellpaintinggallery.org/) —\n",
" a public repository of Cell Painting datasets ready for analysis\n",
"\n",
"---\n",
"\n",
"### Pycytominer in the Wild\n",
"\n",
"Pycytominer is used across some of the largest and most impactful image-based profiling initiatives in the world.\n",
"Here are a few to spark your curiosity:\n",
"\n",
"---\n",
"\n",
"**🧬 [JUMP-CP](https://jump-cellpainting.broadinstitute.org/) — Joint Undertaking for Morphological Profiling**\n",
"\n",
"The largest public Cell Painting dataset ever produced, generated by a consortium of 13 pharmaceutical companies and academic institutions (including AstraZeneca, Bayer, Pfizer, Merck KGaA, and the Broad Institute). JUMP-CP profiled over 116,000 compounds and ~15,000 genetic perturbations, with all profiles processed using Pycytominer. The resulting resource is used to predict compound activity, identify drug mechanisms, and match small molecules to disease phenotypes — at industrial scale.\n",
"\n",
"---\n",
"\n",
"**🔬 [LINCS Cell Painting](https://github.com/broadinstitute/lincs-cell-painting) — Library of Integrated Network-based Cellular Signatures**\n",
"\n",
"An NIH-funded initiative that profiled 1,571 bioactive compounds across six doses and five replicates in A549 lung cancer cells. Pycytominer was adopted as the **primary profiling tool** for this dataset, producing normalized and feature-selected profiles (Levels 3–5) that are publicly available for download. LINCS demonstrated that image-based profiles could serve as a systematic, reproducible reference map of cellular responses to chemical perturbation.\n",
"\n",
"---\n",
"\n",
"**🌍 [EU-OPENSCREEN](https://www.eu-openscreen.eu/) — European Chemical Biology Research Infrastructure**\n",
"\n",
"A distributed pan-European research infrastructure spanning 30 partner sites across eight countries. EU-OPENSCREEN has integrated Cell Painting into its screening platform, enabling European academic and industry researchers to access high-content imaging and morphological profiling as a service. Their contributions to the JUMP-CP consortium extended the reach of image-based profiling into the broader European drug discovery community.\n",
"\n",
"---\n",
"\n",
"**🖼️ [Cell Painting Gallery](https://registry.opendata.aws/cellpainting-gallery/) — Broad Institute Open Dataset Collection**\n",
"\n",
"A growing public repository of Cell Painting datasets, hosted on AWS as open data and maintained by the Carpenter–Singh and Cimini labs at the Broad Institute. The gallery spans tens of thousands of small-molecule treatments across diverse cell lines and experimental designs — all freely accessible and ready for analysis. It is the canonical reference point for new Cell Painting datasets produced by the community.\n",
"\n",
"---\n",
"\n",
"> These resources process their raw CellProfiler outputs through the same Pycytominer pipeline\n",
"> you just ran — the only difference is scale.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}