{ "cells": [ { "cell_type": "markdown", "id": "cell-title", "metadata": {}, "source": [ "# Single-cell image-based profiling\n", "\n", "## A complete single-cell processing pipeline with Pycytominer\n", "\n", "High-content microscopy experiments can produce thousands of single-cell\n", "measurements per image. Working at single-cell resolution (rather than first\n", "aggregating cells into well-level profiles) preserves the full diversity of\n", "cellular responses: rare subpopulations, bimodal distributions, and heterogeneous\n", "drug effects that vanish in the average.\n", "\n", "Single-cell profiling introduces a challenge that well-level profiling sidesteps:\n", "**not every detected object is a real, well-segmented cell.** Debris, out-of-focus\n", "objects, and fused cells contaminate the feature matrix and distort downstream\n", "analyses. A quality-control step is therefore essential before dimensionality\n", "reduction, clustering, or hit calling.\n", "\n", "This tutorial walks through a complete single-cell processing pipeline starting\n", "from [CytoTable](https://cytomining.github.io/CytoTable/) output.\n", "[coSMicQC](https://cytomining.github.io/coSMicQC/) is used here for QC:\n", "\n", "1. **Load**: read the joined single-cell Parquet file produced by CytoTable\n", "2. **Annotate**: attach experimental metadata and QC flags from coSMicQC\n", "3. **Normalize**: drop QC outliers and z-score features against DMSO controls\n", "4. **Feature select**: drop redundant and uninformative features\n", "\n", "The result is a clean, normalized single-cell feature matrix ready for\n", "dimensionality reduction, clustering, or further aggregation.\n", "\n", "> **New to pycytominer?** Read the\n", "> [Introduction to Pycytominer](introduction_to_pycytominer.ipynb) tutorial first.\n", "> This tutorial assumes familiarity with the core pipeline steps." 
] }, { "cell_type": "raw", "id": "cell-pipeline-diagram", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. mermaid::\n", " :align: center\n", "\n", " flowchart TD\n", " cytotable[\"CytoTable output<br/>single_cells.parquet, 1200 cells\"]\n", " qcfile[\"coSMicQC output<br/>qc.parquet, QC annotations\"]\n", " ann[\"annotate()<br/>Add platemap + QC flags\"]\n", " nor[\"normalize()<br/>Drop QC outliers · Z-score vs DMSO\"]\n", " fea[\"feature_select()<br/>Remove redundant features\"]\n", " output[\"Single-cell profiles<br/>~1174 cells, 10 features\"]\n", "\n", " cytotable --> ann\n", " qcfile --> ann\n", " ann --> nor --> fea --> output\n", "\n", " style cytotable fill:#f0d9fa,stroke:#88239A,color:#111\n", " style qcfile fill:#f0d9fa,stroke:#88239A,color:#111\n", " style output fill:#f0d9fa,stroke:#88239A,color:#111\n", " style ann fill:#ffffff,stroke:#88239A,color:#111\n", " style nor fill:#ffffff,stroke:#88239A,color:#111\n", " style fea fill:#ffffff,stroke:#88239A,color:#111" ] }, { "cell_type": "markdown", "id": "cell-prereqs", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "Install the required packages:\n", "\n", "```bash\n", "pip install pycytominer coSMicQC pyarrow pandas numpy\n", "```\n", "\n", "This tutorial uses **simulated data** that matches the exact schema produced by [CytoTable](https://cytomining.github.io/CytoTable/) and [coSMicQC](https://cytomining.github.io/coSMicQC/). In a real experiment, replace the simulation block with your own `single_cells.parquet` and `qc.parquet` files." 
] }, { "cell_type": "code", "execution_count": 1, "id": "cell-imports", "metadata": { "execution": { "iopub.execute_input": "2026-06-01T19:34:58.190044Z", "iopub.status.busy": "2026-06-01T19:34:58.189829Z", "iopub.status.idle": "2026-06-01T19:34:59.724211Z", "shell.execute_reply": "2026-06-01T19:34:59.723901Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Working directory: /var/folders/02/q30k_4wn2dqbz5pj_vvc8xn40000gp/T/tmp57clvnip\n" ] } ], "source": [ "import tempfile\n", "from pathlib import Path\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from pycytominer import annotate, feature_select, normalize\n", "\n", "# Reproducible random state used throughout the simulation\n", "rng = np.random.default_rng(42)\n", "\n", "# Temporary directory — stands in for the output directory on your filesystem\n", "tmp_dir = Path(tempfile.mkdtemp())\n", "print(f\"Working directory: {tmp_dir}\")" ] }, { "cell_type": "markdown", "id": "cell-cytotable-intro", "metadata": {}, "source": [ "## Input: CytoTable Single-Cell Data\n", "\n", "[CytoTable](https://cytomining.github.io/CytoTable/) converts CellProfiler SQLite or CSV output into a single analysis-ready Parquet file. Each row represents one segmented object (a cell), and columns fall into three groups:\n", "\n", "| Group | Example columns | Purpose |\n", "|---|---|---|\n", "| `Metadata_*` | `Metadata_Plate`, `Metadata_Well`, `Metadata_ImageNumber`, `Metadata_ObjectNumber` | Describe the experiment |\n", "| `cytotable_meta_*` | `cytotable_meta_source_path`, `cytotable_meta_offset` | CytoTable provenance. 
Pycytominer ignores these automatically |\n", "| Feature columns | `Cells_AreaShape_Area`, `Nuclei_Intensity_MeanIntensity_DNA` | Morphology measurements per single cell |\n", "\n", "Within a plate, `Metadata_ImageNumber` and `Metadata_ObjectNumber` together identify each cell. Combined with `Metadata_Plate` and `Metadata_Well`, they form the four-column **join key** between the single-cell data and the coSMicQC annotations.\n", "\n", "> **Note on `cytotable_meta_*` columns:** These provenance columns track source-file offsets for CytoTable's internal bookkeeping. Pycytominer's feature inference uses CellProfiler compartment prefixes (`Cells_`, `Cytoplasm_`, `Nuclei_`) and ignores them automatically. They pass through `annotate()` unchanged and are dropped at the `normalize()` step.\n", "\n", "The simulation code is available in the expandable block below. Skip it to go straight to the next step." ] }, { "cell_type": "raw", "id": "cell-simulate-toggle", "metadata": { "raw_mimetype": "text/restructuredtext", "vscode": { "languageId": "raw" } }, "source": [ ".. toggle::\n", "\n", " In a real experiment these files come from running\n", " `CytoTable <https://cytomining.github.io/CytoTable/>`__ and\n", " `coSMicQC <https://cytomining.github.io/coSMicQC/>`__ on your CellProfiler\n", " output. The functions below reproduce their output schemas using synthetic data.\n", "\n", " **Step A — simulate CytoTable single-cell data**\n", "\n", " .. 
code-block:: python\n", "\n", " WELLS = {\n", " \"B02\": \"DMSO\", \"C02\": \"DMSO\",\n", " \"B03\": \"Compound_A\", \"C03\": \"Compound_A\",\n", " \"B04\": \"Compound_B\", \"C04\": \"Compound_B\",\n", " }\n", " N_CELLS_PER_WELL = 100\n", "\n", " def simulate_cytotable(plate_id: str) -> pd.DataFrame:\n", " \"\"\"Generate a synthetic CytoTable-style single-cell DataFrame.\"\"\"\n", " rows = []\n", " for img_num, (well, treatment) in enumerate(WELLS.items(), start=1):\n", " is_a = float(treatment == \"Compound_A\")\n", " is_b = float(treatment == \"Compound_B\")\n", " cell_areas = rng.normal(500 + 180 * is_a - 90 * is_b, 120, N_CELLS_PER_WELL)\n", " nuclei_areas = rng.normal(195, 55, N_CELLS_PER_WELL)\n", " for obj_num in range(1, N_CELLS_PER_WELL + 1):\n", " rows.append( {\n", " # ── CytoTable metadata ──────────────────────────────────\n", " \"Metadata_Plate\": plate_id,\n", " \"Metadata_Well\": well,\n", " \"Metadata_ImageNumber\": img_num,\n", " \"Metadata_ObjectNumber\": obj_num,\n", " # CytoTable provenance columns\n", " \"cytotable_meta_source_path\": f\"/data/{plate_id}/images/\",\n", " \"cytotable_meta_offset\": (img_num - 1) * N_CELLS_PER_WELL + obj_num,\n", " \"cytotable_meta_rownum\": obj_num,\n", " # ── Feature columns ─────────────────────────────────────\n", " \"Cells_AreaShape_Area\": cell_areas[obj_num - 1],\n", " \"Cells_AreaShape_BoundingBoxArea\": cell_areas[obj_num - 1] * 1.3\n", " + rng.normal(0, 4),\n", " \"Cells_AreaShape_EulerNumber\": 1,\n", " \"Cells_AreaShape_Eccentricity\": float(\n", " np.clip(rng.normal(0.55, 0.12), 0, 1)\n", " ),\n", " \"Cells_Intensity_MeanIntensity_Mito\": rng.normal(0.30, 0.06),\n", " \"Cells_Texture_Correlation_RNA_3_0_256\": rng.normal(0.22, 0.06),\n", " \"Cytoplasm_AreaShape_Area\": rng.normal(310, 80),\n", " \"Cytoplasm_Intensity_MeanIntensity_AGP\": rng.normal(0.25, 0.07),\n", " \"Nuclei_AreaShape_Area\": nuclei_areas[obj_num - 1],\n", " \"Nuclei_AreaShape_Eccentricity\": float(\n", " np.clip(rng.normal(0.40, 
0.10), 0, 1)\n", " ),\n", " \"Nuclei_Intensity_MeanIntensity_DNA\": rng.normal(0.50, 0.08),\n", " \"Nuclei_Intensity_MassDisplacement_DNA\": abs(rng.normal(6, 4)),\n", " })\n", " return pd.DataFrame(rows)\n", "\n", " **Step B — simulate coSMicQC QC annotations**\n", "\n", " ``label_outliers(..., export_as_annotations=True)`` writes a compact Parquet\n", " with only join-key columns and boolean ``Metadata_cqc_*`` flags.\n", "\n", " .. code-block:: python\n", "\n", " def simulate_qc_parquet(sc_df: pd.DataFrame) -> pd.DataFrame:\n", " \"\"\"Reproduce the annotation schema produced by coSMicQC label_outliers().\"\"\"\n", "\n", " join_keys = [\n", " \"Metadata_Plate\",\n", " \"Metadata_Well\",\n", " \"Metadata_ImageNumber\",\n", " \"Metadata_ObjectNumber\",\n", " ]\n", "\n", " qc = sc_df[join_keys].copy()\n", "\n", " nuc_area = sc_df[\"Nuclei_AreaShape_Area\"]\n", " nuc_z = (nuc_area - nuc_area.mean()) / nuc_area.std()\n", "\n", " mass_disp = sc_df[\"Nuclei_Intensity_MassDisplacement_DNA\"]\n", " mass_disp_z = (mass_disp - mass_disp.mean()) / mass_disp.std()\n", "\n", " qc[\"Metadata_cqc_large_nuclear_size_is_outlier\"] = nuc_z > 2.5\n", " qc[\"Metadata_cqc_small_nuclear_size_is_outlier\"] = nuc_z < -2.5\n", " qc[\"Metadata_cqc_poor_segmentation_is_outlier\"] = mass_disp_z > 2.5\n", "\n", " return qc\n", "\n", " **Step C — build two plates and write to disk**\n", "\n", " .. 
code-block:: python\n", "\n", " plate1 = simulate_cytotable(\"Plate_1\")\n", " plate2 = simulate_cytotable(\"Plate_2\")\n", " single_cells_raw = pd.concat([plate1, plate2], ignore_index=True)\n", "\n", " qc_annotations_raw = simulate_qc_parquet(single_cells_raw)\n", "\n", " sc_path = tmp_dir / \"single_cells.parquet\"\n", " qc_path = tmp_dir / \"qc.parquet\"\n", " single_cells_raw.to_parquet(sc_path, index=False)\n", " qc_annotations_raw.to_parquet(qc_path, index=False)\n", "\n", " print(f\"single_cells.parquet {single_cells_raw.shape[0]:,} rows x {single_cells_raw.shape[1]} cols\")\n", " print(f\"qc.parquet {qc_annotations_raw.shape[0]:,} rows x {qc_annotations_raw.shape[1]} cols\")\n", " print(f\"\\nqc.parquet columns: {list(qc_annotations_raw.columns)}\")\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "cell-simulate", "metadata": { "execution": { "iopub.execute_input": "2026-06-01T19:34:59.726742Z", "iopub.status.busy": "2026-06-01T19:34:59.726457Z", "iopub.status.idle": "2026-06-01T19:34:59.852969Z", "shell.execute_reply": "2026-06-01T19:34:59.852647Z" }, "nbsphinx": "hidden" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "single_cells.parquet 1,200 rows x 19 cols\n", "qc.parquet 1,200 rows x 7 cols\n", "\n", "qc.parquet columns: ['Metadata_Plate', 'Metadata_Well', 'Metadata_ImageNumber', 'Metadata_ObjectNumber', 'Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']\n" ] } ], "source": [ "# ── Simulate CytoTable single-cell output ─────────────────────────────────\n", "#\n", "# In a real experiment this file is produced by:\n", "# import cytotable\n", "# cytotable.convert(source_path=\"...\", dest_path=\"single_cells.parquet\", ...)\n", "#\n", "# Here we generate synthetic data with the same column schema.\n", "\n", "WELLS = {\n", " \"B02\": \"DMSO\",\n", " \"C02\": \"DMSO\",\n", " \"B03\": \"Compound_A\",\n", " \"C03\": \"Compound_A\",\n", " 
\"B04\": \"Compound_B\",\n", " \"C04\": \"Compound_B\",\n", "}\n", "N_CELLS_PER_WELL = 100\n", "\n", "\n", "def simulate_cytotable(plate_id: str) -> pd.DataFrame:\n", " \"\"\"Generate a synthetic CytoTable-style single-cell DataFrame.\"\"\"\n", " rows = []\n", " for img_num, (well, treatment) in enumerate(WELLS.items(), start=1):\n", " is_a = float(treatment == \"Compound_A\")\n", " is_b = float(treatment == \"Compound_B\")\n", " cell_areas = rng.normal(500 + 180 * is_a - 90 * is_b, 120, N_CELLS_PER_WELL)\n", " nuclei_areas = rng.normal(195, 55, N_CELLS_PER_WELL)\n", " for obj_num in range(1, N_CELLS_PER_WELL + 1):\n", " rows.append({\n", " # ── CytoTable metadata ──────────────────────────────────\n", " \"Metadata_Plate\": plate_id,\n", " \"Metadata_Well\": well,\n", " \"Metadata_ImageNumber\": img_num,\n", " \"Metadata_ObjectNumber\": obj_num,\n", " # CytoTable provenance columns\n", " \"cytotable_meta_source_path\": f\"/data/{plate_id}/images/\",\n", " \"cytotable_meta_offset\": (img_num - 1) * N_CELLS_PER_WELL + obj_num,\n", " \"cytotable_meta_rownum\": obj_num,\n", " # ── Feature columns ─────────────────────────────────────\n", " \"Cells_AreaShape_Area\": cell_areas[obj_num - 1],\n", " \"Cells_AreaShape_BoundingBoxArea\": cell_areas[obj_num - 1] * 1.3\n", " + rng.normal(0, 4),\n", " \"Cells_AreaShape_EulerNumber\": 1,\n", " \"Cells_AreaShape_Eccentricity\": float(\n", " np.clip(rng.normal(0.55, 0.12), 0, 1)\n", " ),\n", " \"Cells_Intensity_MeanIntensity_Mito\": rng.normal(0.30, 0.06),\n", " \"Cells_Texture_Correlation_RNA_3_0_256\": rng.normal(0.22, 0.06),\n", " \"Cytoplasm_AreaShape_Area\": rng.normal(310, 80),\n", " \"Cytoplasm_Intensity_MeanIntensity_AGP\": rng.normal(0.25, 0.07),\n", " \"Nuclei_AreaShape_Area\": nuclei_areas[obj_num - 1],\n", " \"Nuclei_AreaShape_Eccentricity\": float(\n", " np.clip(rng.normal(0.40, 0.10), 0, 1)\n", " ),\n", " \"Nuclei_Intensity_MeanIntensity_DNA\": rng.normal(0.50, 0.08),\n", " \"Nuclei_Intensity_MassDisplacement_DNA\": 
abs(rng.normal(6, 4)),\n", " })\n", " return pd.DataFrame(rows)\n", "\n", "\n", "# ── Simulate coSMicQC annotation output (qc.parquet) ──────────────────────\n", "#\n", "# coSMicQC label_outliers(...) flags outliers using signed z-score thresholds\n", "# and writes a compact annotation file with Metadata_cqc_* boolean columns.\n", "# Here we reproduce that schema directly.\n", "\n", "\n", "def simulate_qc_parquet(sc_df: pd.DataFrame) -> pd.DataFrame:\n", " \"\"\"Reproduce the annotation schema produced by coSMicQC label_outliers().\"\"\"\n", "\n", " join_keys = [\n", " \"Metadata_Plate\",\n", " \"Metadata_Well\",\n", " \"Metadata_ImageNumber\",\n", " \"Metadata_ObjectNumber\",\n", " ]\n", "\n", " qc = sc_df[join_keys].copy()\n", "\n", " nuc_area = sc_df[\"Nuclei_AreaShape_Area\"]\n", " nuc_z = (nuc_area - nuc_area.mean()) / nuc_area.std()\n", "\n", " mass_disp = sc_df[\"Nuclei_Intensity_MassDisplacement_DNA\"]\n", " mass_disp_z = (mass_disp - mass_disp.mean()) / mass_disp.std()\n", "\n", " qc[\"Metadata_cqc_large_nuclear_size_is_outlier\"] = nuc_z > 2.5\n", " qc[\"Metadata_cqc_small_nuclear_size_is_outlier\"] = nuc_z < -2.5\n", " qc[\"Metadata_cqc_poor_segmentation_is_outlier\"] = mass_disp_z > 2.5\n", "\n", " return qc\n", "\n", "\n", "# Build two plates, concatenate, then write both files to disk\n", "plate1 = simulate_cytotable(\"Plate_1\")\n", "plate2 = simulate_cytotable(\"Plate_2\")\n", "single_cells_raw = pd.concat([plate1, plate2], ignore_index=True)\n", "\n", "qc_annotations_raw = simulate_qc_parquet(single_cells_raw)\n", "\n", "sc_path = tmp_dir / \"single_cells.parquet\"\n", "qc_path = tmp_dir / \"qc.parquet\"\n", "single_cells_raw.to_parquet(sc_path, index=False)\n", "qc_annotations_raw.to_parquet(qc_path, index=False)\n", "\n", "print(\n", " f\"single_cells.parquet {single_cells_raw.shape[0]:,} rows x {single_cells_raw.shape[1]} cols\"\n", ")\n", "print(\n", " f\"qc.parquet {qc_annotations_raw.shape[0]:,} rows x {qc_annotations_raw.shape[1]} cols\"\n", 
")\n", "print(f\"\\nqc.parquet columns: {list(qc_annotations_raw.columns)}\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "cell-load-inspect", "metadata": { "execution": { "iopub.execute_input": "2026-06-01T19:34:59.854384Z", "iopub.status.busy": "2026-06-01T19:34:59.854273Z", "iopub.status.idle": "2026-06-01T19:34:59.988267Z", "shell.execute_reply": "2026-06-01T19:34:59.987891Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded 1,200 single cells across 2 plates and 6 unique wells\n", "\n", "Feature columns (12): ['Cells_AreaShape_Area', 'Cells_AreaShape_BoundingBoxArea', 'Cells_AreaShape_EulerNumber', 'Cells_AreaShape_Eccentricity', 'Cells_Intensity_MeanIntensity_Mito', 'Cells_Texture_Correlation_RNA_3_0_256', 'Cytoplasm_AreaShape_Area', 'Cytoplasm_Intensity_MeanIntensity_AGP', 'Nuclei_AreaShape_Area', 'Nuclei_AreaShape_Eccentricity', 'Nuclei_Intensity_MeanIntensity_DNA', 'Nuclei_Intensity_MassDisplacement_DNA']\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Metadata_Plate Metadata_Well Metadata_ImageNumber Metadata_ObjectNumber \\\n", "0 Plate_1 B02 1 1 \n", "1 Plate_1 B02 1 2 \n", "2 Plate_1 B02 1 3 \n", "\n", " cytotable_meta_source_path cytotable_meta_offset cytotable_meta_rownum \\\n", "0 /data/Plate_1/images/ 1 1 \n", "1 /data/Plate_1/images/ 2 2 \n", "2 /data/Plate_1/images/ 3 3 \n", "\n", " Cells_AreaShape_Area Cells_AreaShape_BoundingBoxArea \\\n", "0 536.566050 698.886163 \n", "1 375.201907 486.425986 \n", "2 590.054143 766.452364 \n", "\n", " Cells_AreaShape_EulerNumber Cells_AreaShape_Eccentricity \\\n", "0 1 0.718898 \n", "1 1 0.659908 \n", "2 1 0.466487 \n", "\n", " Cells_Intensity_MeanIntensity_Mito Cells_Texture_Correlation_RNA_3_0_256 \\\n", "0 0.305435 0.258636 \n", "1 0.220416 0.221838 \n", "2 0.286568 0.234550 \n", "\n", " Cytoplasm_AreaShape_Area Cytoplasm_Intensity_MeanIntensity_AGP \\\n", "0 145.986232 0.246590 \n", "1 271.266445 0.227063 \n", "2 324.125869 0.174093 \n", "\n", " Nuclei_AreaShape_Area Nuclei_AreaShape_Eccentricity \\\n", "0 174.201060 0.315677 \n", "1 266.457556 0.500276 \n", "2 175.405482 0.409049 \n", "\n", " Nuclei_Intensity_MeanIntensity_DNA Nuclei_Intensity_MassDisplacement_DNA \n", "0 0.402495 2.487391 \n", "1 0.543049 11.349592 \n", "2 0.518258 16.069896 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the CytoTable parquet from disk\n", "single_cells = pd.read_parquet(sc_path)\n", "\n", "print(\n", " f\"Loaded {len(single_cells):,} single cells across \"\n", " f\"{single_cells['Metadata_Plate'].nunique()} plates and \"\n", " f\"{single_cells['Metadata_Well'].nunique()} unique wells\"\n", ")\n", "print(\n", " f\"\\nFeature columns ({len([c for c in single_cells.columns if not c.startswith('Metadata_') and not c.startswith('cytotable_')])}): \"\n", " f\"{[c for c in single_cells.columns if not c.startswith('Metadata_') and not c.startswith('cytotable_')]}\"\n", ")\n", "single_cells.head(3)" ] }, { 
"cell_type": "markdown", "id": "cell-cosmicqc-intro", "metadata": {}, "source": [ "## Background: Single-cell quality control with coSMicQC [Optional]\n", "\n", "[coSMicQC](https://github.com/WayScience/coSMicQC) ([GitHub](https://github.com/WayScience/coSMicQC) | [docs](https://cytomining.github.io/coSMicQC/) | [preprint](https://www.biorxiv.org/content/10.1101/2025.10.14.682427v1)) is a Python package from the Way Lab that systematically identifies segmentation artifacts, for example:\n", "\n", "| Artifact | Morphological signature | Biological cause |\n", "|---|---|---|\n", "| **Debris / background** | Very small nucleus; low DNA intensity | Out-of-focus plane, dust on coverslip |\n", "| **Over-segmented nucleus** | Nucleus area far above the population mean | One nucleus split into multiple objects |\n", "| **Touching / fused cells** | Very high mass displacement from multiple objects | Adjacent cells merged into a single object |\n", "\n", "### How coSMicQC flags outliers\n", "\n", "coSMicQC computes a **z-score** for each quality-relevant feature across the entire experiment. Cells whose z-scores fall outside user-defined thresholds are flagged as outliers. Thresholds are **signed**:\n", "\n", "- A **negative threshold** (e.g. `−2.5`) flags cells where the feature is *unusually small* (debris, broken nuclei).\n", "- A **positive threshold** (e.g. `+2.5`) flags cells where the feature is *unusually large* (fused or over-segmented objects).\n", "\n", "The main entry point is `label_outliers()`, which accepts a dictionary of **named QC conditions**. 
Each condition name becomes part of the output column name, making the reason for each flag explicit and auditable:\n", "\n", "```python\n", "import cosmicqc\n", "\n", "labeled = cosmicqc.label_outliers(\n", " df=single_cells,\n", " feature_thresholds={\n", " # Flag nuclei that are too small (debris)\n", " \"small_nuclear_size\": {\n", " \"Nuclei_AreaShape_Area\": -2.5,\n", " },\n", " # Flag nuclei that are too large (under-segmented: merged nuclei)\n", " \"large_nuclear_size\": {\n", " \"Nuclei_AreaShape_Area\": 2.5,\n", " },\n", " # Flag cells with an abnormally high nuclear mass displacement\n", " # (a hallmark of touching or merged nuclei in one object)\n", " \"poor_segmentation\": {\n", " \"Nuclei_Intensity_MassDisplacement_DNA\": 2.5,\n", " },\n", " },\n", " include_threshold_scores=True, # also write z-score columns for auditing\n", " export_path=\"qc.parquet\",\n", " export_as_annotations=True, # write compact annotation file only\n", " annotation_metadata_columns=[\n", " \"Metadata_Plate\", \"Metadata_Well\",\n", " \"Metadata_ImageNumber\", \"Metadata_ObjectNumber\",\n", " ],\n", ")\n", "```\n", "\n", "### The `qc.parquet` annotation file\n", "\n", "When `export_as_annotations=True`, coSMicQC writes a **compact annotation file** to `export_path` (here `qc.parquet`), which contains only the join-key metadata columns and the `Metadata_cqc_*` flag columns (not the full feature table). This makes `qc.parquet` lightweight and easy to share independently of the raw single-cell data.\n", "\n", "Each `Metadata_cqc_<condition>_is_outlier` column is a boolean: `True` = flagged, `False` = passes that QC check. A cell must pass **all** conditions to be included in downstream analysis." 
] }, { "cell_type": "code", "execution_count": 4, "id": "cell-apply-qc", "metadata": { "execution": { "iopub.execute_input": "2026-06-01T19:34:59.989678Z", "iopub.status.busy": "2026-06-01T19:34:59.989573Z", "iopub.status.idle": "2026-06-01T19:34:59.994761Z", "shell.execute_reply": "2026-06-01T19:34:59.994475Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "coSMicQC annotation columns:\n", " Metadata_Plate\n", " Metadata_Well\n", " Metadata_ImageNumber\n", " Metadata_ObjectNumber\n", " Metadata_cqc_large_nuclear_size_is_outlier\n", " Metadata_cqc_small_nuclear_size_is_outlier\n", " Metadata_cqc_poor_segmentation_is_outlier\n", "\n", " Metadata_cqc_large_nuclear_size_is_outlier: 5 cells flagged (0.4%)\n", " Metadata_cqc_small_nuclear_size_is_outlier: 9 cells flagged (0.8%)\n", " Metadata_cqc_poor_segmentation_is_outlier: 12 cells flagged (1.0%)\n" ] } ], "source": [ "# Load the coSMicQC annotation file and inspect its contents\n", "qc_annotations = pd.read_parquet(qc_path)\n", "\n", "print(\"coSMicQC annotation columns:\")\n", "for col in qc_annotations.columns:\n", " print(f\" {col}\")\n", "\n", "outlier_cols = [c for c in qc_annotations.columns if c.endswith(\"_is_outlier\")]\n", "print()\n", "for col in outlier_cols:\n", " n_flagged = qc_annotations[col].sum()\n", " print(\n", " f\" {col}: {n_flagged:,} cells flagged ({100 * n_flagged / len(qc_annotations):.1f}%)\"\n", " )" ] }, { "cell_type": "markdown", "id": "cell-annotate-intro", "metadata": {}, "source": [ "---\n", "\n", "## Step 1: Annotate\n", "\n", "`annotate()` does two jobs in a single call: the plate-map join is driven by `platemap` and `join_on`, and the QC merge by `external_metadata` and `external_join_on`:\n", "\n", "1. **Plate-map join** attaches the biological condition (treatment, cell line, concentration) recorded for each well to every cell in that well.\n", "2. **External metadata merge** merges any additional per-cell metadata DataFrame or file. 
The most common use case is a `qc.parquet` file from coSMicQC: passing it as `external_metadata` adds the `Metadata_cqc_*` flag columns directly to the annotated profiles.\n", "\n", "| Parameter | Description |\n", "|---|---|\n", "| `platemap` | Maps well positions to treatment conditions |\n", "| `join_on` | Column pair `[platemap_col, profiles_col]` for the well-position join |\n", "| `external_metadata` | Path to `qc.parquet` (or any additional metadata DataFrame) |\n", "| `external_join_on` | Column(s) shared by profiles and external metadata (here the four-part cell identity key) |\n", "\n", "After `annotate()` runs, the `Metadata_cqc_*` flag columns are present on every row and flow straight into `normalize()`, which applies the QC filter internally via `drop_cosmicqc_rows=True`." ] }, { "cell_type": "code", "execution_count": 5, "id": "cell-platemap", "metadata": { "execution": { "iopub.execute_input": "2026-06-01T19:34:59.996077Z", "iopub.status.busy": "2026-06-01T19:34:59.995987Z", "iopub.status.idle": "2026-06-01T19:35:00.000086Z", "shell.execute_reply": "2026-06-01T19:34:59.999817Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " well_position treatment cell_line concentration_um\n", "0 B02 DMSO HeLa 0.0\n", "1 C02 DMSO HeLa 0.0\n", "2 B03 Compound_A HeLa 10.0\n", "3 C03 Compound_A HeLa 10.0\n", "4 B04 Compound_B HeLa 5.0\n", "5 C04 Compound_B HeLa 5.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "platemap = pd.DataFrame({\n", " \"well_position\": [\"B02\", \"C02\", \"B03\", \"C03\", \"B04\", \"C04\"],\n", " \"treatment\": [\n", " \"DMSO\",\n", " \"DMSO\",\n", " \"Compound_A\",\n", " \"Compound_A\",\n", " \"Compound_B\",\n", " \"Compound_B\",\n", " ],\n", " \"cell_line\": [\"HeLa\"] * 6,\n", " \"concentration_um\": [0.0, 0.0, 10.0, 10.0, 5.0, 5.0],\n", "})\n", "platemap" ] }, { "cell_type": "code", "execution_count": 6, "id": "cell-annotate", "metadata": { "execution": { "iopub.execute_input": "2026-06-01T19:35:00.001369Z", "iopub.status.busy": "2026-06-01T19:35:00.001281Z", "iopub.status.idle": "2026-06-01T19:35:00.018843Z", "shell.execute_reply": "2026-06-01T19:35:00.018552Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New columns: ['Metadata_treatment', 'Metadata_cell_line', 'Metadata_concentration_um', 'Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']\n", "QC flag columns: ['Metadata_cqc_large_nuclear_size_is_outlier', 'Metadata_cqc_small_nuclear_size_is_outlier', 'Metadata_cqc_poor_segmentation_is_outlier']\n", "\n", "Cells flagged by any QC condition: 26 (2.2%)\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Metadata_treatment Metadata_cell_line Metadata_concentration_um \\\n", "0 DMSO HeLa 0.0 \n", "200 DMSO HeLa 0.0 \n", "400 Compound_A HeLa 10.0 \n", "600 Compound_A HeLa 10.0 \n", "800 Compound_B HeLa 5.0 \n", "\n", " Metadata_Plate Metadata_Well Metadata_ImageNumber Metadata_ObjectNumber \\\n", "0 Plate_1 B02 1 1 \n", "200 Plate_1 C02 2 1 \n", "400 Plate_1 B03 3 1 \n", "600 Plate_1 C03 4 1 \n", "800 Plate_1 B04 5 1 \n", "\n", " Metadata_cqc_large_nuclear_size_is_outlier \\\n", "0 False \n", "200 False \n", "400 False \n", "600 False \n", "800 False \n", "\n", " Metadata_cqc_small_nuclear_size_is_outlier \\\n", "0 False \n", "200 False \n", "400 False \n", "600 False \n", "800 False \n", "\n", " Metadata_cqc_poor_segmentation_is_outlier \n", "0 False \n", "200 False \n", "400 False \n", "600 False \n", "800 False " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "join_keys = [\n", " \"Metadata_Plate\",\n", " \"Metadata_Well\",\n", " \"Metadata_ImageNumber\",\n", " \"Metadata_ObjectNumber\",\n", "]\n", "\n", "# annotate() merges the plate map AND the QC annotation file in a single call.\n", "# The qc.parquet columns already carry the Metadata_ prefix, so they pass through\n", "# prepare_external_metadata_for_annotate() unchanged.\n", "annotated_cells = annotate(\n", " profiles=single_cells,\n", " platemap=platemap,\n", " join_on=[\"Metadata_well_position\", \"Metadata_Well\"],\n", " add_metadata_id_to_platemap=True,\n", " external_metadata=str(qc_path),\n", " external_join_on=join_keys,\n", ")\n", "\n", "new_cols = [c for c in annotated_cells.columns if c not in single_cells.columns]\n", "qc_cols = [c for c in new_cols if \"cqc\" in c]\n", "print(f\"New columns: {new_cols}\")\n", "print(f\"QC flag columns: {qc_cols}\")\n", "print(\n", " f\"\\nCells flagged by any QC condition: \"\n", " f\"{annotated_cells[qc_cols].any(axis=1).sum():,} \"\n", " f\"({100 * 
annotated_cells[qc_cols].any(axis=1).mean():.1f}%)\"\n", ")\n", "print()\n", "annotated_cells[\n", " [c for c in annotated_cells.columns if c.startswith(\"Metadata_\")]\n", "].drop_duplicates(subset=[\"Metadata_Well\"]).head()" ] }, { "cell_type": "markdown", "id": "cell-normalize-intro", "metadata": {}, "source": [ "---\n", "\n", "## Step 2: Normalize\n", "\n", "Raw CellProfiler features vary in scale (cell area in pixels², intensities in 0–1) and are influenced by plate-to-plate technical effects. Normalization places all features on a common scale and limits plate-to-plate variation by z-scoring each feature relative to the **DMSO control cells**.\n", "\n", "Passing `drop_cosmicqc_rows=True` tells `normalize()` to drop every row where any `Metadata_cqc_*` flag is `True` before computing the z-scores, so QC filtering and normalization happen in a single call." ] }, { "cell_type": "code", "execution_count": 7, "id": "cell-normalize", "metadata": { "execution": { "iopub.execute_input": "2026-06-01T19:35:00.020223Z", "iopub.status.busy": "2026-06-01T19:35:00.020114Z", "iopub.status.idle": "2026-06-01T19:35:00.034092Z", "shell.execute_reply": "2026-06-01T19:35:00.033754Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total cells 1,200\n", "Removed (QC outliers) 26 (2.2%)\n", "Retained 1,174\n", "\n", "Normalized shape: (1174, 22)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Metadata_treatment Metadata_cell_line Metadata_concentration_um \\\n", "0 DMSO HeLa 0.0 \n", "1 DMSO HeLa 0.0 \n", "3 DMSO HeLa 0.0 \n", "\n", " Metadata_Plate Metadata_Well Metadata_ImageNumber Metadata_ObjectNumber \\\n", "0 Plate_1 B02 1 1 \n", "1 Plate_1 B02 1 2 \n", "3 Plate_1 B02 1 4 \n", "\n", " Metadata_cqc_large_nuclear_size_is_outlier \\\n", "0 False \n", "1 False \n", "3 False \n", "\n", " Metadata_cqc_small_nuclear_size_is_outlier \\\n", "0 False \n", "1 False \n", "3 False \n", "\n", " Metadata_cqc_poor_segmentation_is_outlier ... \\\n", "0 False ... \n", "1 False ... \n", "3 False ... \n", "\n", " Cells_AreaShape_EulerNumber Cells_AreaShape_Eccentricity \\\n", "0 0.0 1.292772 \n", "1 0.0 0.819530 \n", "3 0.0 -0.883623 \n", "\n", " Cells_Intensity_MeanIntensity_Mito Cells_Texture_Correlation_RNA_3_0_256 \\\n", "0 0.100004 0.733196 \n", "1 -1.325156 0.121467 \n", "3 -0.280147 -1.368761 \n", "\n", " Cytoplasm_AreaShape_Area Cytoplasm_Intensity_MeanIntensity_AGP \\\n", "0 -2.107301 -0.009458 \n", "1 -0.481165 -0.293417 \n", "3 -0.591794 0.361401 \n", "\n", " Nuclei_AreaShape_Area Nuclei_AreaShape_Eccentricity \\\n", "0 -0.380694 -0.907252 \n", "1 1.319231 1.025113 \n", "3 0.749972 1.237712 \n", "\n", " Nuclei_Intensity_MeanIntensity_DNA Nuclei_Intensity_MassDisplacement_DNA \n", "0 -1.166556 -1.028890 \n", "1 0.607217 1.555226 \n", "3 -0.672132 -0.767620 \n", "\n", "[3 rows x 22 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop_cosmicqc_rows=True removes QC-flagged cells before z-scoring.\n", "normalized_cells = normalize(\n", " profiles=annotated_cells,\n", " features=\"infer\",\n", " meta_features=\"infer\",\n", " samples=\"Metadata_treatment == 'DMSO'\",\n", " method=\"standardize\",\n", " drop_cosmicqc_rows=True,\n", ")\n", "\n", "n_removed = len(annotated_cells) - len(normalized_cells)\n", "print(f\"{'Total cells':<22} {len(annotated_cells):>6,}\")\n", "print(\n", " 
f\"{'Removed (QC outliers)':<22} {n_removed:>6,} ({100 * n_removed / len(annotated_cells):.1f}%)\"\n", ")\n", "print(f\"{'Retained':<22} {len(normalized_cells):>6,}\")\n", "print()\n", "print(f\"Normalized shape: {normalized_cells.shape}\")\n", "normalized_cells.head(3)" ] }, { "cell_type": "markdown", "id": "cell-featsel-intro", "metadata": {}, "source": [ "---\n", "\n", "## Step 3: Feature Selection\n", "\n", "Even after QC and normalization, some features carry little information:\n", "\n", "- **Low-variance features** are nearly constant across all cells and cannot distinguish biological conditions.\n", "- **Highly correlated feature pairs** are redundant; keeping both double-weights that axis of variation in clustering and embeddings.\n", "- **Blocklisted features** are known to capture image artifacts rather than cell biology.\n", "\n", "`feature_select()` applies all three filters, producing a lean feature matrix ready for single-cell analyses such as UMAP or hierarchical clustering." ] }, { "cell_type": "code", "execution_count": 8, "id": "cell-featsel", "metadata": { "execution": { "iopub.execute_input": "2026-06-01T19:35:00.035826Z", "iopub.status.busy": "2026-06-01T19:35:00.035694Z", "iopub.status.idle": "2026-06-01T19:35:00.055266Z", "shell.execute_reply": "2026-06-01T19:35:00.054966Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features before selection: 12\n", "Features after selection: 10\n", "Features removed: {'Cells_AreaShape_Area', 'Cells_AreaShape_EulerNumber'}\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Metadata_treatment Metadata_cell_line Metadata_concentration_um \\\n", "0 DMSO HeLa 0.0 \n", "1 DMSO HeLa 0.0 \n", "3 DMSO HeLa 0.0 \n", "\n", " Metadata_Plate Metadata_Well Metadata_ImageNumber Metadata_ObjectNumber \\\n", "0 Plate_1 B02 1 1 \n", "1 Plate_1 B02 1 2 \n", "3 Plate_1 B02 1 4 \n", "\n", " Metadata_cqc_large_nuclear_size_is_outlier \\\n", "0 False \n", "1 False \n", "3 False \n", "\n", " Metadata_cqc_small_nuclear_size_is_outlier \\\n", "0 False \n", "1 False \n", "3 False \n", "\n", " Metadata_cqc_poor_segmentation_is_outlier Cells_AreaShape_BoundingBoxArea \\\n", "0 False 0.423873 \n", "1 False -1.020114 \n", "3 False 1.139881 \n", "\n", " Cells_AreaShape_Eccentricity Cells_Intensity_MeanIntensity_Mito \\\n", "0 1.292772 0.100004 \n", "1 0.819530 -1.325156 \n", "3 -0.883623 -0.280147 \n", "\n", " Cells_Texture_Correlation_RNA_3_0_256 Cytoplasm_AreaShape_Area \\\n", "0 0.733196 -2.107301 \n", "1 0.121467 -0.481165 \n", "3 -1.368761 -0.591794 \n", "\n", " Cytoplasm_Intensity_MeanIntensity_AGP Nuclei_AreaShape_Area \\\n", "0 -0.009458 -0.380694 \n", "1 -0.293417 1.319231 \n", "3 0.361401 0.749972 \n", "\n", " Nuclei_AreaShape_Eccentricity Nuclei_Intensity_MeanIntensity_DNA \\\n", "0 -0.907252 -1.166556 \n", "1 1.025113 0.607217 \n", "3 1.237712 -0.672132 \n", "\n", " Nuclei_Intensity_MassDisplacement_DNA \n", "0 -1.028890 \n", "1 1.555226 \n", "3 -0.767620 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "selected_cells = feature_select(\n", " profiles=normalized_cells,\n", " features=\"infer\",\n", " operation=[\"variance_threshold\", \"correlation_threshold\", \"blocklist\"],\n", ")\n", "\n", "feature_cols_before = [\n", " c for c in normalized_cells.columns if not c.startswith(\"Metadata_\")\n", "]\n", "feature_cols_after = [\n", " c for c in selected_cells.columns if not c.startswith(\"Metadata_\")\n", "]\n", "\n", "print(f\"Features before selection: 
{len(feature_cols_before)}\")\n", "print(f\"Features after selection: {len(feature_cols_after)}\")\n", "print(f\"Features removed: {set(feature_cols_before) - set(feature_cols_after)}\")\n", "selected_cells.head(3)" ] }, { "cell_type": "markdown", "id": "cell-summary", "metadata": {}, "source": [ "---\n", "\n", "## Summary\n", "\n", "You have processed a CytoTable single-cell dataset through a complete quality-control and normalization pipeline, preserving single-cell resolution throughout:\n", "\n", "| Step | Function | Input | Output |\n", "|---|---|---|---|\n", "| Load | `pd.read_parquet` | CytoTable Parquet | 1,200 single cells |\n", "| Annotate | `annotate()` | Single cells + platemap + `qc.parquet` | Cells with treatment labels and QC flags |\n", "| Normalize | `normalize(drop_cosmicqc_rows=True)` | Annotated cells | 1,174 passing cells, z-scored |\n", "| Feature select | `feature_select()` | 12 features | 10 features |\n", "\n", "The output is a **clean, normalized single-cell feature matrix**, `selected_cells`, where every row is one cell and every column is an informative morphological feature.\n", "\n", "### Next steps\n", "\n", "- **Embed**: run UMAP or t-SNE on `selected_cells` to visualize how treatments separate in morphological space at single-cell resolution.\n", "- **Cluster**: apply k-means or Leiden clustering to discover subpopulations within each treatment condition.\n", "- **Aggregate**: feed `selected_cells` into `aggregate()` if you need well-level profiles (e.g. for the consensus pipeline shown in the [Introduction to Image-based Profiling with Pycytominer](introduction_to_pycytominer.ipynb) tutorial).\n", "- **Hit calling**: identify which compounds produce a statistically significant morphological change relative to controls. [Buscar](https://github.com/WayScience/Buscar) operates directly on single-cell distributions to capture cellular heterogeneity and flag off-target effects." 
] } ], "metadata": { "kernelspec": { "display_name": "qc_env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }