{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-title",
   "metadata": {},
   "source": [
    "# Using Pycytominer from the command line interface (CLI)\n",
    "\n",
    "Pycytominer ships with a full-featured **command-line interface (CLI)**, so every pipeline step can be run directly from a terminal with no Python code required. This makes it easy to integrate pycytominer into shell scripts, Snakemake workflows, Nextflow pipelines, or any other automation tool that orchestrates file-based steps.\n",
    "\n",
    "This tutorial covers all five CLI commands:\n",
    "\n",
    "| Command | Equivalent Python function |\n",
    "|---|---|\n",
    "| `pycytominer aggregate` | `aggregate()` |\n",
    "| `pycytominer annotate` | `annotate()` |\n",
    "| `pycytominer normalize` | `normalize()` |\n",
    "| `pycytominer feature_select` | `feature_select()` |\n",
    "| `pycytominer consensus` | `consensus()` |\n",
    "\n",
    "> **New to pycytominer?** Read the [Introduction to Pycytominer](../tutorials/introduction_to_pycytominer.ipynb) tutorial first to understand the pipeline concepts before running them from the command line."
   ]
  },
  {
   "cell_type": "raw",
   "id": "cell-pipeline-diagram",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. mermaid::\n",
    "   :align: center\n",
    "\n",
    "   flowchart LR\n",
    "       sc[\"single_cells.parquet\"]\n",
    "       wp[\"well_profiles.parquet\"]\n",
    "       an[\"annotated.parquet\"]\n",
    "       no[\"normalized.parquet\"]\n",
    "       fs[\"selected.parquet\"]\n",
    "       co[\"consensus.parquet\"]\n",
    "\n",
    "       sc -->|\"aggregate\"| wp\n",
    "       wp -->|\"annotate\"| an\n",
    "       an -->|\"normalize\"| no\n",
    "       no -->|\"feature_select\"| fs\n",
    "       fs -->|\"consensus\"| co\n",
    "\n",
    "       style sc fill:#f0d9fa,stroke:#88239A,color:#111\n",
    "       style co fill:#f0d9fa,stroke:#88239A,color:#111\n",
    "       style wp fill:#ffffff,stroke:#88239A,color:#111\n",
    "       style an fill:#ffffff,stroke:#88239A,color:#111\n",
    "       style no fill:#ffffff,stroke:#88239A,color:#111\n",
    "       style fs fill:#ffffff,stroke:#88239A,color:#111\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-prereqs",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "```bash\n",
    "# Recommended — with uv (faster)\n",
    "uv pip install pycytominer\n",
    "\n",
    "# Or with standard pip\n",
    "pip install pycytominer\n",
    "```\n",
    "\n",
    "After installation, the `pycytominer` command is available in your shell. To verify that it is on your PATH and list the available sub-commands, run it with no arguments:\n",
    "\n",
    "```bash\n",
    "pycytominer\n",
    "```\n",
    "\n",
    "> **Tip — try before you install with `uvx`:** If you use [uv](https://docs.astral.sh/uv/), you can run any pycytominer CLI command immediately without a permanent install:\n",
    ">\n",
    "> ```bash\n",
    "> uvx pycytominer aggregate --help\n",
    "> ```\n",
    ">\n",
    "> `uvx` installs pycytominer into a temporary, isolated environment and runs the command in one step, without touching your current environment. It is the fastest way to explore the CLI or script a one-off pipeline step on a new machine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "cell-help-top",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:37.521279Z",
     "iopub.status.busy": "2026-06-04T16:59:37.521090Z",
     "iopub.status.idle": "2026-06-04T16:59:39.899475Z",
     "shell.execute_reply": "2026-06-04T16:59:39.898996Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1mNAME\u001b[0m\r\n",
      "    pycytominer - Command Line Interface for Pycytominer operations.\r\n",
      "\r\n",
      "\u001b[1mSYNOPSIS\u001b[0m\r\n",
      "    pycytominer \u001b[4mCOMMAND\u001b[0m\r\n",
      "\r\n",
      "\u001b[1mDESCRIPTION\u001b[0m\r\n",
      "    Command Line Interface for Pycytominer operations.\r\n",
      "\r\n",
      "\u001b[1mCOMMANDS\u001b[0m\r\n",
      "    \u001b[1m\u001b[4mCOMMAND\u001b[0m\u001b[0m is one of the following:\r\n",
      "\r\n",
      "     aggregate\r\n",
      "       Aggregate profiles from a file and write the results to disk.\r\n",
      "\r\n",
      "     annotate\r\n",
      "       Annotate profiles using a platemap file and write output.\r\n",
      "\r\n",
      "     consensus\r\n",
      "       Create consensus profiles from a file and write output.\r\n",
      "\r\n",
      "     feature_select\r\n",
      "       Select features from profiles and write the results to disk.\r\n",
      "\r\n",
      "     normalize\r\n",
      "       Normalize profiles from a file and write the results to disk.\r\n"
     ]
    }
   ],
   "source": [
    "# List all available sub-commands\n",
    "!pycytominer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "cell-help-aggregate",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:39.902013Z",
     "iopub.status.busy": "2026-06-04T16:59:39.901847Z",
     "iopub.status.idle": "2026-06-04T16:59:41.685161Z",
     "shell.execute_reply": "2026-06-04T16:59:41.684666Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO: Showing help with the command 'pycytominer aggregate -- --help'.\r\n",
      "\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1mNAME\u001b[0m\r\n",
      "    pycytominer aggregate - Aggregate profiles from a file and write the results to disk.\r\n",
      "\r\n",
      "\u001b[1mSYNOPSIS\u001b[0m\r\n",
      "    pycytominer aggregate \u001b[4mPROFILES\u001b[0m \u001b[4mOUTPUT_FILE\u001b[0m <flags>\r\n",
      "\r\n",
      "\u001b[1mDESCRIPTION\u001b[0m\r\n",
      "    Aggregate profiles from a file and write the results to disk.\r\n",
      "\r\n",
      "\u001b[1mPOSITIONAL ARGUMENTS\u001b[0m\r\n",
      "    \u001b[1m\u001b[4mPROFILES\u001b[0m\u001b[0m\r\n",
      "        Type: 'str'\r\n",
      "        Path to the input profiles file.\r\n",
      "    \u001b[1m\u001b[4mOUTPUT_FILE\u001b[0m\u001b[0m\r\n",
      "        Type: 'str'\r\n",
      "        Path to the output file to write.\r\n",
      "\r\n",
      "\u001b[1mFLAGS\u001b[0m\r\n",
      "    --strata=\u001b[4mSTRATA\u001b[0m\r\n",
      "        Type: 'str | Sequence[str]'\r\n",
      "        Default: 'Metadata_Plate,Metad...\r\n",
      "        Metadata columns to aggregate by.\r\n",
      "    --features=\u001b[4mFEATURES\u001b[0m\r\n",
      "        Type: 'str | Sequence[str]'\r\n",
      "        Default: 'infer'\r\n",
      "        Feature list or \"infer\" to infer CellProfiler features.\r\n",
      "    -i, --image_features=\u001b[4mIMAGE_FEATURES\u001b[0m\r\n",
      "        Type: 'bool'\r\n",
      "        Default: False\r\n",
      "        Whether inferred features should include numeric image features.\r\n",
      "    --operation=\u001b[4mOPERATION\u001b[0m\r\n",
      "        Type: 'str'\r\n",
      "        Default: 'median'\r\n",
      "        Aggregation operation (\"median\" or \"mean\").\r\n",
      "    --output_type=\u001b[4mOUTPUT_TYPE\u001b[0m\r\n",
      "        Type: \"Literal['csv', 'parquet', 'anndata_h5ad', 'anndata_zarr'] | None\"\r\n",
      "        Default: 'csv'\r\n",
      "        Output type to write.\r\n",
      "    --compute_object_count=\u001b[4mCOMPUTE_OBJECT_COUNT\u001b[0m\r\n",
      "        Type: 'bool'\r\n",
      "        Default: False\r\n",
      "        Whether to compute object counts.\r\n",
      "    --object_feature=\u001b[4mOBJECT_FEATURE\u001b[0m\r\n",
      "        Type: 'str'\r\n",
      "        Default: 'Metadata_ObjectNumber'\r\n",
      "        Column used for object counting.\r\n",
      "    --subset_data_file=\u001b[4mSUBSET_DATA_FILE\u001b[0m\r\n",
      "        Type: Optional['str | None']\r\n",
      "        Default: None\r\n",
      "        Optional path to a subset dataframe for filtering.\r\n",
      "    --compression_options=\u001b[4mCOMPRESSION_OPTIONS\u001b[0m\r\n",
      "        Type: Optional['str | di...\r\n",
      "        Default: None\r\n",
      "        Compression options for writing output.\r\n",
      "    --float_format=\u001b[4mFLOAT_FORMAT\u001b[0m\r\n",
      "        Type: Optional['str | None']\r\n",
      "        Default: None\r\n",
      "        Decimal precision for output formatting.\r\n",
      "\r\n",
      "\u001b[1mNOTES\u001b[0m\r\n",
      "    You can also use flags syntax for POSITIONAL ARGUMENTS\r\n"
     ]
    }
   ],
   "source": [
    "# Show all options for the aggregate sub-command\n",
    "!pycytominer aggregate --help"
   ]
  },
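  {
   "cell_type": "markdown",
   "id": "cell-cli-from-python",
   "metadata": {},
   "source": [
    "Outside a notebook, the same commands can be assembled programmatically, for example from a workflow script. The following is a minimal sketch: `build_aggregate_command` is a hypothetical helper written for this tutorial, not part of pycytominer, and it only uses flags that appear in the help output above.\n",
    "\n",
    "```python\n",
    "import shlex\n",
    "\n",
    "\n",
    "def build_aggregate_command(profiles, output_file, output_type=\"parquet\"):\n",
    "    # Hypothetical helper: returns the argument list for `pycytominer aggregate`.\n",
    "    return [\n",
    "        \"pycytominer\",\n",
    "        \"aggregate\",\n",
    "        \"--profiles\",\n",
    "        str(profiles),\n",
    "        \"--output_file\",\n",
    "        str(output_file),\n",
    "        \"--output_type\",\n",
    "        output_type,\n",
    "    ]\n",
    "\n",
    "\n",
    "cmd = build_aggregate_command(\"single_cells.parquet\", \"well_profiles.parquet\")\n",
    "\n",
    "# Print a shell-safe version of the command for logging or copy-pasting.\n",
    "print(shlex.join(cmd))\n",
    "```\n",
    "\n",
    "Passing the list to `subprocess.run(cmd, check=True)` would execute the step and raise an error if the command fails."
   ]
  },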
  {
   "cell_type": "markdown",
   "id": "cell-data-intro",
   "metadata": {},
   "source": [
    "## Sample Data\n",
    "\n",
    "The CLI reads and writes files; both CSV and Parquet are supported as input. Below we generate the same synthetic Cell Painting dataset used in the [Introduction to Pycytominer](../tutorials/introduction_to_pycytominer.ipynb) tutorial and save it to a temporary working directory as Parquet files.\n",
    "\n",
    "In a real experiment you would replace `single_cells.parquet` with the output from CellProfiler or CytoTable.\n",
    "\n",
    "The simulation code is in the expandable block below; skip ahead if you just want to follow the CLI steps."
   ]
  },
  {
   "cell_type": "raw",
   "id": "cell-generate-data-toggle",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. toggle::\n",
    "\n",
    "   .. code-block:: python\n",
    "\n",
    "      import tempfile\n",
    "      from pathlib import Path\n",
    "\n",
    "      import numpy as np\n",
    "      import pandas as pd\n",
    "\n",
    "      rng = np.random.default_rng(42)\n",
    "\n",
    "      # ── Temporary working directory ────────────────────────────────────────────\n",
    "      workdir = Path(tempfile.mkdtemp()).resolve()\n",
    "\n",
    "      # ── Synthetic single-cell data ─────────────────────────────────────────────\n",
    "      WELLS = {\n",
    "          \"B02\": \"DMSO\",       \"C02\": \"DMSO\",\n",
    "          \"B03\": \"Compound_A\", \"C03\": \"Compound_A\",\n",
    "          \"B04\": \"Compound_B\", \"C04\": \"Compound_B\",\n",
    "      }\n",
    "      N = 100\n",
    "\n",
    "      rows = []\n",
    "      for img_num, (well, treatment) in enumerate(WELLS.items(), start=1):\n",
    "          is_a = float(treatment == \"Compound_A\")\n",
    "          is_b = float(treatment == \"Compound_B\")\n",
    "          cell_areas = rng.normal(500 + 180 * is_a - 90 * is_b, 120, N)\n",
    "          for obj_num in range(1, N + 1):\n",
    "              rows.append({\n",
    "                  \"Metadata_Plate\": \"Plate_1\",\n",
    "                  \"Metadata_Well\":  well,\n",
    "                  \"Metadata_ImageNumber\": img_num,\n",
    "                  \"Metadata_ObjectNumber\": obj_num,\n",
    "                  \"Cells_AreaShape_Area\":          cell_areas[obj_num - 1],\n",
    "                  \"Cells_AreaShape_BoundingBoxArea\": cell_areas[obj_num - 1] * 1.3 + rng.normal(0, 4),\n",
    "                  \"Cells_AreaShape_EulerNumber\":    1,\n",
    "                  \"Cells_AreaShape_Eccentricity\":   float(np.clip(rng.normal(0.55, 0.12), 0, 1)),\n",
    "                  \"Cells_Intensity_MeanIntensity_Mito\":      rng.normal(0.30, 0.06),\n",
    "                  \"Cells_Texture_Correlation_RNA_3_0_256\":   rng.normal(0.22, 0.06),\n",
    "                  \"Cytoplasm_AreaShape_Area\":                rng.normal(310, 80),\n",
    "                  \"Cytoplasm_Intensity_MeanIntensity_AGP\":   rng.normal(0.25, 0.07),\n",
    "                  \"Nuclei_AreaShape_Area\":                   rng.normal(195, 55),\n",
    "                  \"Nuclei_AreaShape_Eccentricity\":  float(np.clip(rng.normal(0.40, 0.10), 0, 1)),\n",
    "                  \"Nuclei_Intensity_MeanIntensity_DNA\":      rng.normal(0.50, 0.08),\n",
    "              })\n",
    "\n",
    "      sc_path = workdir / \"single_cells.parquet\"\n",
    "      pd.DataFrame(rows).to_parquet(sc_path, index=False)\n",
    "      print(f\"Saved {len(rows):,} single cells to {sc_path.name}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "cell-generate-data",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:41.687008Z",
     "iopub.status.busy": "2026-06-04T16:59:41.686882Z",
     "iopub.status.idle": "2026-06-04T16:59:42.074016Z",
     "shell.execute_reply": "2026-06-04T16:59:42.073713Z"
    },
    "nbsphinx": "hidden"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Saved 600 single cells to single_cells.parquet\n"
     ]
    }
   ],
   "source": [
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "rng = np.random.default_rng(42)\n",
    "\n",
    "# ── Temporary working directory ────────────────────────────────────────────\n",
    "workdir = Path(tempfile.mkdtemp()).resolve()\n",
    "# ── Synthetic single-cell data ─────────────────────────────────────────────\n",
    "WELLS = {\n",
    "    \"B02\": \"DMSO\",\n",
    "    \"C02\": \"DMSO\",\n",
    "    \"B03\": \"Compound_A\",\n",
    "    \"C03\": \"Compound_A\",\n",
    "    \"B04\": \"Compound_B\",\n",
    "    \"C04\": \"Compound_B\",\n",
    "}\n",
    "N = 100\n",
    "\n",
    "rows = []\n",
    "for img_num, (well, treatment) in enumerate(WELLS.items(), start=1):\n",
    "    is_a = float(treatment == \"Compound_A\")\n",
    "    is_b = float(treatment == \"Compound_B\")\n",
    "    cell_areas = rng.normal(500 + 180 * is_a - 90 * is_b, 120, N)\n",
    "    for obj_num in range(1, N + 1):\n",
    "        rows.append({\n",
    "            \"Metadata_Plate\": \"Plate_1\",\n",
    "            \"Metadata_Well\": well,\n",
    "            \"Metadata_ImageNumber\": img_num,\n",
    "            \"Metadata_ObjectNumber\": obj_num,\n",
    "            \"Cells_AreaShape_Area\": cell_areas[obj_num - 1],\n",
    "            \"Cells_AreaShape_BoundingBoxArea\": cell_areas[obj_num - 1] * 1.3\n",
    "            + rng.normal(0, 4),\n",
    "            \"Cells_AreaShape_EulerNumber\": 1,\n",
    "            \"Cells_AreaShape_Eccentricity\": float(\n",
    "                np.clip(rng.normal(0.55, 0.12), 0, 1)\n",
    "            ),\n",
    "            \"Cells_Intensity_MeanIntensity_Mito\": rng.normal(0.30, 0.06),\n",
    "            \"Cells_Texture_Correlation_RNA_3_0_256\": rng.normal(0.22, 0.06),\n",
    "            \"Cytoplasm_AreaShape_Area\": rng.normal(310, 80),\n",
    "            \"Cytoplasm_Intensity_MeanIntensity_AGP\": rng.normal(0.25, 0.07),\n",
    "            \"Nuclei_AreaShape_Area\": rng.normal(195, 55),\n",
    "            \"Nuclei_AreaShape_Eccentricity\": float(\n",
    "                np.clip(rng.normal(0.40, 0.10), 0, 1)\n",
    "            ),\n",
    "            \"Nuclei_Intensity_MeanIntensity_DNA\": rng.normal(0.50, 0.08),\n",
    "        })\n",
    "\n",
    "sc_path = workdir / \"single_cells.parquet\"\n",
    "pd.DataFrame(rows).to_parquet(sc_path, index=False)\n",
    "print(f\"Saved {len(rows):,} single cells to {sc_path.name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-aggregate-intro",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 1: Aggregate\n",
    "\n",
    "`pycytominer aggregate` collapses single-cell rows into one representative profile per well by taking the **median** (or mean) of each feature across all cells in that well.\n",
    "\n",
    "**Key arguments:**\n",
    "- `--profiles`: input file (CSV or Parquet)\n",
    "- `--output_file`: where to write the result\n",
    "- `--strata`: comma-delimited metadata columns that define each group (default: `Metadata_Plate,Metadata_Well`)\n",
    "- `--operation`: aggregation function, `median` (default) or `mean`\n",
    "- `--output_type`: output format, one of `csv` (default), `parquet`, `anndata_h5ad`, or `anndata_zarr`"
   ]
  },
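  {
   "cell_type": "markdown",
   "id": "cell-aggregate-sketch",
   "metadata": {},
   "source": [
    "To make the operation concrete, here is a rough pandas sketch of what median aggregation by strata amounts to. It is an illustration on made-up values, not pycytominer's actual implementation, which also handles feature inference and the other options listed above.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy single-cell table: two wells with three cells each (values are made up).\n",
    "cells = pd.DataFrame({\n",
    "    \"Metadata_Plate\": [\"Plate_1\"] * 6,\n",
    "    \"Metadata_Well\": [\"B02\", \"B02\", \"B02\", \"B03\", \"B03\", \"B03\"],\n",
    "    \"Cells_AreaShape_Area\": [480.0, 500.0, 530.0, 650.0, 700.0, 690.0],\n",
    "})\n",
    "\n",
    "strata = [\"Metadata_Plate\", \"Metadata_Well\"]\n",
    "\n",
    "# One row per (plate, well); each feature becomes the median across its cells.\n",
    "well_profiles = cells.groupby(strata, as_index=False).median()\n",
    "print(well_profiles)\n",
    "```\n",
    "\n",
    "Swapping `.median()` for `.mean()` corresponds to `--operation mean`."
   ]
  },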
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "cell-aggregate-cmd",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:42.075429Z",
     "iopub.status.busy": "2026-06-04T16:59:42.075316Z",
     "iopub.status.idle": "2026-06-04T16:59:43.666563Z",
     "shell.execute_reply": "2026-06-04T16:59:43.665581Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wrote output file: well_profiles.parquet\r\n",
      "well_profiles.parquet\r\n"
     ]
    }
   ],
   "source": [
    "!pycytominer aggregate --profiles {workdir}/single_cells.parquet --output_file {workdir}/well_profiles.parquet --strata \"Metadata_Plate,Metadata_Well\" --operation median --output_type parquet 2>&1 | sed \"s|{workdir}/||g\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "cell-aggregate-inspect",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:43.669393Z",
     "iopub.status.busy": "2026-06-04T16:59:43.669237Z",
     "iopub.status.idle": "2026-06-04T16:59:43.711767Z",
     "shell.execute_reply": "2026-06-04T16:59:43.711466Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Well profiles: (6, 13)  (one row per well)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metadata_Plate</th>\n",
       "      <th>Metadata_Well</th>\n",
       "      <th>Cells_AreaShape_Area</th>\n",
       "      <th>Cells_AreaShape_BoundingBoxArea</th>\n",
       "      <th>Cells_AreaShape_EulerNumber</th>\n",
       "      <th>Cells_AreaShape_Eccentricity</th>\n",
       "      <th>Cells_Intensity_MeanIntensity_Mito</th>\n",
       "      <th>Cells_Texture_Correlation_RNA_3_0_256</th>\n",
       "      <th>Cytoplasm_AreaShape_Area</th>\n",
       "      <th>Cytoplasm_Intensity_MeanIntensity_AGP</th>\n",
       "      <th>Nuclei_AreaShape_Area</th>\n",
       "      <th>Nuclei_AreaShape_Eccentricity</th>\n",
       "      <th>Nuclei_Intensity_MeanIntensity_DNA</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>499.741578</td>\n",
       "      <td>646.410141</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.551590</td>\n",
       "      <td>0.305235</td>\n",
       "      <td>0.221010</td>\n",
       "      <td>309.361769</td>\n",
       "      <td>0.252230</td>\n",
       "      <td>191.121017</td>\n",
       "      <td>0.407695</td>\n",
       "      <td>0.492709</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B03</td>\n",
       "      <td>689.065353</td>\n",
       "      <td>895.200860</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.550686</td>\n",
       "      <td>0.304796</td>\n",
       "      <td>0.223964</td>\n",
       "      <td>319.691855</td>\n",
       "      <td>0.250131</td>\n",
       "      <td>190.228310</td>\n",
       "      <td>0.394803</td>\n",
       "      <td>0.508586</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B04</td>\n",
       "      <td>406.933246</td>\n",
       "      <td>529.871038</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.535506</td>\n",
       "      <td>0.287034</td>\n",
       "      <td>0.229690</td>\n",
       "      <td>330.455137</td>\n",
       "      <td>0.254138</td>\n",
       "      <td>189.392536</td>\n",
       "      <td>0.394729</td>\n",
       "      <td>0.509548</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Metadata_Plate Metadata_Well  Cells_AreaShape_Area  \\\n",
       "0        Plate_1           B02            499.741578   \n",
       "1        Plate_1           B03            689.065353   \n",
       "2        Plate_1           B04            406.933246   \n",
       "\n",
       "   Cells_AreaShape_BoundingBoxArea  Cells_AreaShape_EulerNumber  \\\n",
       "0                       646.410141                          1.0   \n",
       "1                       895.200860                          1.0   \n",
       "2                       529.871038                          1.0   \n",
       "\n",
       "   Cells_AreaShape_Eccentricity  Cells_Intensity_MeanIntensity_Mito  \\\n",
       "0                      0.551590                            0.305235   \n",
       "1                      0.550686                            0.304796   \n",
       "2                      0.535506                            0.287034   \n",
       "\n",
       "   Cells_Texture_Correlation_RNA_3_0_256  Cytoplasm_AreaShape_Area  \\\n",
       "0                               0.221010                309.361769   \n",
       "1                               0.223964                319.691855   \n",
       "2                               0.229690                330.455137   \n",
       "\n",
       "   Cytoplasm_Intensity_MeanIntensity_AGP  Nuclei_AreaShape_Area  \\\n",
       "0                               0.252230             191.121017   \n",
       "1                               0.250131             190.228310   \n",
       "2                               0.254138             189.392536   \n",
       "\n",
       "   Nuclei_AreaShape_Eccentricity  Nuclei_Intensity_MeanIntensity_DNA  \n",
       "0                       0.407695                            0.492709  \n",
       "1                       0.394803                            0.508586  \n",
       "2                       0.394729                            0.509548  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wp = pd.read_parquet(workdir / \"well_profiles.parquet\")\n",
    "print(f\"Well profiles: {wp.shape}  (one row per well)\")\n",
    "wp.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-annotate-intro",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 2: Annotate\n",
    "\n",
    "`pycytominer annotate` joins a **plate map** file onto the well profiles, adding columns such as treatment, cell line, and concentration. The plate map is a CSV (or any tabular format) where each row describes one well.\n",
    "\n",
    "**Key arguments:**\n",
    "- `--platemap`: path to the plate map file\n",
    "- `--join_on`: two comma-delimited column names, `platemap_col,profiles_col` (default: `Metadata_well_position,Metadata_Well`)\n",
    "- `--add_metadata_id_to_platemap`: prefix plate map columns with `Metadata_` (default: `True`), so a plate map column named `well_position` is referenced as `Metadata_well_position` in `--join_on`"
   ]
  },
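  {
   "cell_type": "markdown",
   "id": "cell-annotate-sketch",
   "metadata": {},
   "source": [
    "Conceptually, annotation is a table join. The sketch below approximates it with plain pandas on made-up values; the inner join and the dropped join column are assumptions chosen for illustration, not a statement of pycytominer's exact behavior.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "profiles = pd.DataFrame({\n",
    "    \"Metadata_Plate\": [\"Plate_1\", \"Plate_1\"],\n",
    "    \"Metadata_Well\": [\"B02\", \"B03\"],\n",
    "    \"Cells_AreaShape_Area\": [499.7, 689.1],\n",
    "})\n",
    "platemap = pd.DataFrame({\n",
    "    \"well_position\": [\"B02\", \"B03\"],\n",
    "    \"treatment\": [\"DMSO\", \"Compound_A\"],\n",
    "})\n",
    "\n",
    "# Prefix plate map columns so they are recognized as metadata downstream.\n",
    "platemap = platemap.add_prefix(\"Metadata_\")\n",
    "\n",
    "# Join plate map rows to profile rows: Metadata_well_position <-> Metadata_Well.\n",
    "annotated = platemap.merge(\n",
    "    profiles,\n",
    "    left_on=\"Metadata_well_position\",\n",
    "    right_on=\"Metadata_Well\",\n",
    "    how=\"inner\",\n",
    ").drop(columns=\"Metadata_well_position\")\n",
    "print(annotated.columns.tolist())\n",
    "```\n",
    "\n",
    "The two names passed to `left_on` and `right_on` play the role of the `platemap_col,profiles_col` pair given to `--join_on`."
   ]
  },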
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "cell-platemap",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:43.713203Z",
     "iopub.status.busy": "2026-06-04T16:59:43.713097Z",
     "iopub.status.idle": "2026-06-04T16:59:43.720327Z",
     "shell.execute_reply": "2026-06-04T16:59:43.719982Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>well_position</th>\n",
       "      <th>treatment</th>\n",
       "      <th>cell_line</th>\n",
       "      <th>concentration_um</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>B02</td>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>C02</td>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>B03</td>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>C03</td>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B04</td>\n",
       "      <td>Compound_B</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>C04</td>\n",
       "      <td>Compound_B</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  well_position   treatment cell_line  concentration_um\n",
       "0           B02        DMSO      HeLa               0.0\n",
       "1           C02        DMSO      HeLa               0.0\n",
       "2           B03  Compound_A      HeLa              10.0\n",
       "3           C03  Compound_A      HeLa              10.0\n",
       "4           B04  Compound_B      HeLa               5.0\n",
       "5           C04  Compound_B      HeLa               5.0"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create the plate map CSV\n",
    "platemap = pd.DataFrame({\n",
    "    \"well_position\": [\"B02\", \"C02\", \"B03\", \"C03\", \"B04\", \"C04\"],\n",
    "    \"treatment\": [\n",
    "        \"DMSO\",\n",
    "        \"DMSO\",\n",
    "        \"Compound_A\",\n",
    "        \"Compound_A\",\n",
    "        \"Compound_B\",\n",
    "        \"Compound_B\",\n",
    "    ],\n",
    "    \"cell_line\": [\"HeLa\"] * 6,\n",
    "    \"concentration_um\": [0.0, 0.0, 10.0, 10.0, 5.0, 5.0],\n",
    "})\n",
    "platemap.to_csv(workdir / \"platemap.csv\", index=False)\n",
    "platemap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "cell-annotate-cmd",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:43.721688Z",
     "iopub.status.busy": "2026-06-04T16:59:43.721574Z",
     "iopub.status.idle": "2026-06-04T16:59:45.350912Z",
     "shell.execute_reply": "2026-06-04T16:59:45.350189Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wrote output file: annotated.parquet\r\n",
      "annotated.parquet\r\n"
     ]
    }
   ],
   "source": [
    "!pycytominer annotate --profiles {workdir}/well_profiles.parquet --platemap {workdir}/platemap.csv --output_file {workdir}/annotated.parquet --join_on \"Metadata_well_position,Metadata_Well\" --output_type parquet 2>&1 | sed \"s|{workdir}/||g\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "cell-annotate-inspect",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:45.353095Z",
     "iopub.status.busy": "2026-06-04T16:59:45.352932Z",
     "iopub.status.idle": "2026-06-04T16:59:45.362690Z",
     "shell.execute_reply": "2026-06-04T16:59:45.362408Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Annotated profiles: (6, 16)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metadata_treatment</th>\n",
       "      <th>Metadata_cell_line</th>\n",
       "      <th>Metadata_concentration_um</th>\n",
       "      <th>Metadata_Plate</th>\n",
       "      <th>Metadata_Well</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>C02</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B03</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Metadata_treatment Metadata_cell_line  Metadata_concentration_um  \\\n",
       "0               DMSO               HeLa                        0.0   \n",
       "1               DMSO               HeLa                        0.0   \n",
       "2         Compound_A               HeLa                       10.0   \n",
       "\n",
       "  Metadata_Plate Metadata_Well  \n",
       "0        Plate_1           B02  \n",
       "1        Plate_1           C02  \n",
       "2        Plate_1           B03  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ann = pd.read_parquet(workdir / \"annotated.parquet\")\n",
    "print(f\"Annotated profiles: {ann.shape}\")\n",
    "ann[[c for c in ann.columns if c.startswith(\"Metadata_\")]].head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-normalize-intro",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 3: Normalize\n",
    "\n",
    "`pycytominer normalize` scales features to a common range and limits plate-to-plate technical variation. Z-scoring against DMSO control wells (`--samples`) is the most common approach.\n",
    "\n",
    "**Key arguments:**\n",
    "- `--samples`: a pandas query string selecting the normalization reference. Use `all` to normalize against the entire plate.\n",
    "- `--method`: normalization method, one of `standardize` (z-score, default), `robustize` (median and IQR), `mad_robustize` (median and MAD), or `spherize`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "cell-normalize-cmd",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:45.364223Z",
     "iopub.status.busy": "2026-06-04T16:59:45.364110Z",
     "iopub.status.idle": "2026-06-04T16:59:47.070659Z",
     "shell.execute_reply": "2026-06-04T16:59:47.070095Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wrote output file: normalized.parquet\r\n",
      "normalized.parquet\r\n"
     ]
    }
   ],
   "source": [
    "!pycytominer normalize --profiles {workdir}/annotated.parquet --output_file {workdir}/normalized.parquet --samples \"Metadata_treatment == 'DMSO'\" --method standardize --output_type parquet 2>&1 | sed \"s|{workdir}/||g\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "cell-normalize-inspect",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:47.072830Z",
     "iopub.status.busy": "2026-06-04T16:59:47.072652Z",
     "iopub.status.idle": "2026-06-04T16:59:47.085780Z",
     "shell.execute_reply": "2026-06-04T16:59:47.085286Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Normalized profiles: (6, 16)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metadata_treatment</th>\n",
       "      <th>Metadata_cell_line</th>\n",
       "      <th>Metadata_concentration_um</th>\n",
       "      <th>Metadata_Plate</th>\n",
       "      <th>Metadata_Well</th>\n",
       "      <th>Cells_AreaShape_Area</th>\n",
       "      <th>Cells_AreaShape_BoundingBoxArea</th>\n",
       "      <th>Cells_AreaShape_EulerNumber</th>\n",
       "      <th>Cells_AreaShape_Eccentricity</th>\n",
       "      <th>Cells_Intensity_MeanIntensity_Mito</th>\n",
       "      <th>Cells_Texture_Correlation_RNA_3_0_256</th>\n",
       "      <th>Cytoplasm_AreaShape_Area</th>\n",
       "      <th>Cytoplasm_Intensity_MeanIntensity_AGP</th>\n",
       "      <th>Nuclei_AreaShape_Area</th>\n",
       "      <th>Nuclei_AreaShape_Eccentricity</th>\n",
       "      <th>Nuclei_Intensity_MeanIntensity_DNA</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.00000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>C02</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.00000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.00000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B03</td>\n",
       "      <td>52.302035</td>\n",
       "      <td>42.69753</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.020332</td>\n",
       "      <td>0.694833</td>\n",
       "      <td>4.413158</td>\n",
       "      <td>2.71186</td>\n",
       "      <td>0.309624</td>\n",
       "      <td>0.829585</td>\n",
       "      <td>-1.378075</td>\n",
       "      <td>8.707708</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Metadata_treatment Metadata_cell_line  Metadata_concentration_um  \\\n",
       "0               DMSO               HeLa                        0.0   \n",
       "1               DMSO               HeLa                        0.0   \n",
       "2         Compound_A               HeLa                       10.0   \n",
       "\n",
       "  Metadata_Plate Metadata_Well  Cells_AreaShape_Area  \\\n",
       "0        Plate_1           B02             -1.000000   \n",
       "1        Plate_1           C02              1.000000   \n",
       "2        Plate_1           B03             52.302035   \n",
       "\n",
       "   Cells_AreaShape_BoundingBoxArea  Cells_AreaShape_EulerNumber  \\\n",
       "0                         -1.00000                          0.0   \n",
       "1                          1.00000                          0.0   \n",
       "2                         42.69753                          0.0   \n",
       "\n",
       "   Cells_AreaShape_Eccentricity  Cells_Intensity_MeanIntensity_Mito  \\\n",
       "0                      1.000000                            1.000000   \n",
       "1                     -1.000000                           -1.000000   \n",
       "2                      0.020332                            0.694833   \n",
       "\n",
       "   Cells_Texture_Correlation_RNA_3_0_256  Cytoplasm_AreaShape_Area  \\\n",
       "0                               1.000000                   1.00000   \n",
       "1                              -1.000000                  -1.00000   \n",
       "2                               4.413158                   2.71186   \n",
       "\n",
       "   Cytoplasm_Intensity_MeanIntensity_AGP  Nuclei_AreaShape_Area  \\\n",
       "0                               1.000000               1.000000   \n",
       "1                              -1.000000              -1.000000   \n",
       "2                               0.309624               0.829585   \n",
       "\n",
       "   Nuclei_AreaShape_Eccentricity  Nuclei_Intensity_MeanIntensity_DNA  \n",
       "0                       1.000000                            1.000000  \n",
       "1                      -1.000000                           -1.000000  \n",
       "2                      -1.378075                            8.707708  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "norm = pd.read_parquet(workdir / \"normalized.parquet\")\n",
    "print(f\"Normalized profiles: {norm.shape}\")\n",
    "norm.head(3)"
   ]
  },
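  {
   "cell_type": "markdown",
   "id": "cell-normalize-sketch",
   "metadata": {},
   "source": [
    "To make the `standardize` arithmetic concrete, the following is a minimal pandas sketch of z-scoring against a reference subset. It is not pycytominer's implementation: the toy values, the `Feature_x` column, and the use of the population standard deviation (`ddof=0`) are assumptions made for illustration.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy well-level profiles (hypothetical values, for illustration only).\n",
    "profiles = pd.DataFrame(\n",
    "    {\n",
    "        \"Metadata_treatment\": [\"DMSO\", \"DMSO\", \"Compound_A\"],\n",
    "        \"Feature_x\": [10.0, 12.0, 20.0],\n",
    "    }\n",
    ")\n",
    "\n",
    "# The same kind of query string that is passed to --samples.\n",
    "reference = profiles.query(\"Metadata_treatment == 'DMSO'\")\n",
    "\n",
    "# Center and scale every well using statistics from the reference wells only.\n",
    "# ddof=0 (population standard deviation) is an assumption of this sketch.\n",
    "mean = reference[\"Feature_x\"].mean()\n",
    "std = reference[\"Feature_x\"].std(ddof=0)\n",
    "z_scores = (profiles[\"Feature_x\"] - mean) / std\n",
    "\n",
    "print(z_scores.tolist())  # [-1.0, 1.0, 9.0]\n",
    "```\n",
    "\n",
    "Under these assumptions, two distinct reference values always land on -1 and +1, which is consistent with the DMSO rows in the normalized table above."
   ]
  },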
  {
   "cell_type": "markdown",
   "id": "cell-featsel-intro",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 4: Feature Select\n",
    "\n",
    "`pycytominer feature_select` removes uninformative features. Multiple operations can be applied in one call by passing a comma-delimited list.\n",
    "\n",
    "**Key arguments:**\n",
    "- `--operation`, comma-delimited list of operations to apply:\n",
    "  - `variance_threshold`, drop near-constant features\n",
    "  - `correlation_threshold`, drop one of each highly correlated pair\n",
    "  - `blocklist`, drop features known to be unreliable across assays\n",
    "  - `drop_na_columns`, drop columns with too many missing values\n",
    "  - `noise_removal`, remove features with low signal-to-noise ratio"
   ]
  },
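  {
   "cell_type": "markdown",
   "id": "cell-featsel-sketch",
   "metadata": {},
   "source": [
    "As a rough intuition for two of these operations, the sketch below drops constant features and then one member of each highly correlated pair using plain pandas. It is a simplification, not pycytominer's algorithm: the feature names, the values, and the `0.9` cutoff are hypothetical, and the real operations apply their own criteria and defaults.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Toy feature matrix (hypothetical values, for illustration only).\n",
    "features = pd.DataFrame(\n",
    "    {\n",
    "        \"Feature_a\": [1.0, 2.0, 3.0, 4.0],\n",
    "        \"Feature_b\": [2.0, 4.0, 6.0, 8.5],  # almost a multiple of Feature_a\n",
    "        \"Feature_c\": [4.0, 1.0, 3.0, 2.0],\n",
    "        \"Feature_d\": [5.0, 5.0, 5.0, 5.0],  # constant\n",
    "    }\n",
    ")\n",
    "threshold = 0.9  # assumed cutoff for this sketch\n",
    "\n",
    "# Step 1: drop features that never change.\n",
    "constant = [c for c in features.columns if features[c].nunique() <= 1]\n",
    "kept = features.drop(columns=constant)\n",
    "\n",
    "# Step 2: look only at the upper triangle of the absolute correlation\n",
    "# matrix so that each pair is considered once, then drop the later column.\n",
    "corr = kept.corr().abs()\n",
    "upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))\n",
    "redundant = [c for c in upper.columns if (upper[c] > threshold).any()]\n",
    "selected = kept.drop(columns=redundant)\n",
    "\n",
    "print(constant, redundant, list(selected.columns))\n",
    "```\n",
    "\n",
    "Which member of a correlated pair is dropped is a design choice; this sketch simply keeps the column that appears first."
   ]
  },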
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "cell-featsel-cmd",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:47.087400Z",
     "iopub.status.busy": "2026-06-04T16:59:47.087275Z",
     "iopub.status.idle": "2026-06-04T16:59:48.746330Z",
     "shell.execute_reply": "2026-06-04T16:59:48.745797Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wrote output file: selected.parquet\r\n",
      "selected.parquet\r\n"
     ]
    }
   ],
   "source": [
    "!pycytominer feature_select --profiles {workdir}/normalized.parquet --output_file {workdir}/selected.parquet --operation \"variance_threshold,correlation_threshold,blocklist\" --output_type parquet 2>&1 | sed \"s|{workdir}/||g\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "cell-featsel-inspect",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:48.748044Z",
     "iopub.status.busy": "2026-06-04T16:59:48.747919Z",
     "iopub.status.idle": "2026-06-04T16:59:48.758170Z",
     "shell.execute_reply": "2026-06-04T16:59:48.757880Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Features: 11 -> 8\n",
      "Removed:  {'Cells_AreaShape_Area', 'Cells_AreaShape_EulerNumber', 'Cells_Texture_Correlation_RNA_3_0_256'}\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metadata_treatment</th>\n",
       "      <th>Metadata_cell_line</th>\n",
       "      <th>Metadata_concentration_um</th>\n",
       "      <th>Metadata_Plate</th>\n",
       "      <th>Metadata_Well</th>\n",
       "      <th>Cells_AreaShape_BoundingBoxArea</th>\n",
       "      <th>Cells_AreaShape_Eccentricity</th>\n",
       "      <th>Cells_Intensity_MeanIntensity_Mito</th>\n",
       "      <th>Cytoplasm_AreaShape_Area</th>\n",
       "      <th>Cytoplasm_Intensity_MeanIntensity_AGP</th>\n",
       "      <th>Nuclei_AreaShape_Area</th>\n",
       "      <th>Nuclei_AreaShape_Eccentricity</th>\n",
       "      <th>Nuclei_Intensity_MeanIntensity_DNA</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B02</td>\n",
       "      <td>-1.00000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.00000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>C02</td>\n",
       "      <td>1.00000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.00000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>-1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "      <td>Plate_1</td>\n",
       "      <td>B03</td>\n",
       "      <td>42.69753</td>\n",
       "      <td>0.020332</td>\n",
       "      <td>0.694833</td>\n",
       "      <td>2.71186</td>\n",
       "      <td>0.309624</td>\n",
       "      <td>0.829585</td>\n",
       "      <td>-1.378075</td>\n",
       "      <td>8.707708</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Metadata_treatment Metadata_cell_line  Metadata_concentration_um  \\\n",
       "0               DMSO               HeLa                        0.0   \n",
       "1               DMSO               HeLa                        0.0   \n",
       "2         Compound_A               HeLa                       10.0   \n",
       "\n",
       "  Metadata_Plate Metadata_Well  Cells_AreaShape_BoundingBoxArea  \\\n",
       "0        Plate_1           B02                         -1.00000   \n",
       "1        Plate_1           C02                          1.00000   \n",
       "2        Plate_1           B03                         42.69753   \n",
       "\n",
       "   Cells_AreaShape_Eccentricity  Cells_Intensity_MeanIntensity_Mito  \\\n",
       "0                      1.000000                            1.000000   \n",
       "1                     -1.000000                           -1.000000   \n",
       "2                      0.020332                            0.694833   \n",
       "\n",
       "   Cytoplasm_AreaShape_Area  Cytoplasm_Intensity_MeanIntensity_AGP  \\\n",
       "0                   1.00000                               1.000000   \n",
       "1                  -1.00000                              -1.000000   \n",
       "2                   2.71186                               0.309624   \n",
       "\n",
       "   Nuclei_AreaShape_Area  Nuclei_AreaShape_Eccentricity  \\\n",
       "0               1.000000                       1.000000   \n",
       "1              -1.000000                      -1.000000   \n",
       "2               0.829585                      -1.378075   \n",
       "\n",
       "   Nuclei_Intensity_MeanIntensity_DNA  \n",
       "0                            1.000000  \n",
       "1                           -1.000000  \n",
       "2                            8.707708  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sel = pd.read_parquet(workdir / \"selected.parquet\")\n",
    "feat_before = [c for c in norm.columns if not c.startswith(\"Metadata_\")]\n",
    "feat_after = [c for c in sel.columns if not c.startswith(\"Metadata_\")]\n",
    "print(f\"Features: {len(feat_before)} -> {len(feat_after)}\")\n",
    "print(f\"Removed:  {set(feat_before) - set(feat_after)}\")\n",
    "sel.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-consensus-intro",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 5: Consensus\n",
    "\n",
    "`pycytominer consensus` collapses replicate wells into one profile per biological condition by taking the median (or modz) across replicates.\n",
    "\n",
    "**Key arguments:**\n",
    "- `--replicate_columns`, comma-delimited metadata columns that identify   a unique condition (replicates share all of these values)\n",
    "- `--operation`, `median` (default), `mean`, or `modz`   (moderated z-score, recommended for large screens)"
   ]
  },
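  {
   "cell_type": "markdown",
   "id": "cell-consensus-sketch",
   "metadata": {},
   "source": [
    "Conceptually, the `median` operation resembles a pandas group-by over the replicate columns. The sketch below illustrates that idea on toy data; the values and the `Feature_x` column are hypothetical, and `modz` is not covered here.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy replicate wells (hypothetical values, for illustration only).\n",
    "profiles = pd.DataFrame(\n",
    "    {\n",
    "        \"Metadata_treatment\": [\"DMSO\", \"DMSO\", \"Compound_A\", \"Compound_A\"],\n",
    "        \"Metadata_concentration_um\": [0.0, 0.0, 10.0, 10.0],\n",
    "        \"Feature_x\": [-1.0, 1.0, 4.0, 6.0],\n",
    "    }\n",
    ")\n",
    "\n",
    "# Wells that share all of these values are treated as replicates.\n",
    "replicate_columns = [\"Metadata_treatment\", \"Metadata_concentration_um\"]\n",
    "\n",
    "# One row per condition, with the per-feature median across its replicates.\n",
    "consensus = profiles.groupby(replicate_columns, as_index=False).median()\n",
    "\n",
    "print(consensus[\"Feature_x\"].tolist())  # [5.0, 0.0]\n",
    "```\n",
    "\n",
    "The group-by sorts conditions by their key values, so `Compound_A` comes before `DMSO` in this toy result."
   ]
  },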
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "cell-consensus-cmd",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:48.759546Z",
     "iopub.status.busy": "2026-06-04T16:59:48.759444Z",
     "iopub.status.idle": "2026-06-04T16:59:50.580493Z",
     "shell.execute_reply": "2026-06-04T16:59:50.580036Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wrote output file: consensus.parquet\r\n",
      "consensus.parquet\r\n"
     ]
    }
   ],
   "source": [
    "!pycytominer consensus --profiles {workdir}/selected.parquet --output_file {workdir}/consensus.parquet --replicate_columns \"Metadata_treatment,Metadata_cell_line,Metadata_concentration_um\" --operation median --output_type parquet 2>&1 | sed \"s|{workdir}/||g\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "cell-consensus-inspect",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-04T16:59:50.582357Z",
     "iopub.status.busy": "2026-06-04T16:59:50.582213Z",
     "iopub.status.idle": "2026-06-04T16:59:50.590306Z",
     "shell.execute_reply": "2026-06-04T16:59:50.590003Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Consensus profiles: (3, 11)  (one row per condition)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metadata_treatment</th>\n",
       "      <th>Metadata_cell_line</th>\n",
       "      <th>Metadata_concentration_um</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Compound_A</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Compound_B</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>DMSO</td>\n",
       "      <td>HeLa</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Metadata_treatment Metadata_cell_line  Metadata_concentration_um\n",
       "0         Compound_A               HeLa                       10.0\n",
       "1         Compound_B               HeLa                        5.0\n",
       "2               DMSO               HeLa                        0.0"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cons = pd.read_parquet(workdir / \"consensus.parquet\")\n",
    "print(f\"Consensus profiles: {cons.shape}  (one row per condition)\")\n",
    "cons[[c for c in cons.columns if c.startswith(\"Metadata_\")]]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-summary",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Summary\n",
    "\n",
    "You ran the full pycytominer pipeline using only command-line calls:\n",
    "\n",
    "```bash\n",
    "pycytominer aggregate    --profiles single_cells.csv  --output_file well_profiles.parquet  --strata \"Metadata_Plate,Metadata_Well\"\n",
    "pycytominer annotate     --profiles well_profiles.parquet --output_file annotated.parquet      --platemap platemap.csv\n",
    "pycytominer normalize    --profiles annotated.parquet     --output_file normalized.parquet     --samples \"Metadata_treatment == 'DMSO'\"\n",
    "pycytominer feature_select --profiles normalized.parquet  --output_file selected.parquet       --operation \"variance_threshold,correlation_threshold,blocklist\"\n",
    "pycytominer consensus    --profiles selected.parquet      --output_file consensus.parquet      --replicate_columns \"Metadata_treatment,Metadata_cell_line,Metadata_concentration_um\"\n",
    "```\n",
    "\n",
    "### Tips for scripting\n",
    "\n",
    "- List all commands with `pycytominer`; get full option docs with `pycytominer COMMAND --help`\n",
    "- Chain into Bash scripts or `Makefile` targets for reproducible pipelines\n",
    "- **Query strings** in `--samples` follow   [pandas query syntax](https://pandas.pydata.org/docs/user_guide/indexing.html#the-query-method)  , any valid pandas query expression works"
   ]
  }
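  ,
  {
   "cell_type": "markdown",
   "id": "cell-scripting-sketch",
   "metadata": {},
   "source": [
    "The same chaining can be driven from Python. The sketch below is one possible approach rather than a recommended one: the `STEPS` list, the `run_pipeline` helper, and its `dry_run` switch are hypothetical names, and the flags are copied from the condensed listing above. By default it only prints the commands.\n",
    "\n",
    "```python\n",
    "import shlex\n",
    "import subprocess\n",
    "\n",
    "# Subcommands and flags copied from the summary listing (relative paths).\n",
    "STEPS = [\n",
    "    [\"aggregate\", \"--profiles\", \"single_cells.csv\",\n",
    "     \"--output_file\", \"well_profiles.parquet\",\n",
    "     \"--strata\", \"Metadata_Plate,Metadata_Well\"],\n",
    "    [\"annotate\", \"--profiles\", \"well_profiles.parquet\",\n",
    "     \"--output_file\", \"annotated.parquet\",\n",
    "     \"--platemap\", \"platemap.csv\"],\n",
    "    [\"normalize\", \"--profiles\", \"annotated.parquet\",\n",
    "     \"--output_file\", \"normalized.parquet\",\n",
    "     \"--samples\", \"Metadata_treatment == 'DMSO'\"],\n",
    "    [\"feature_select\", \"--profiles\", \"normalized.parquet\",\n",
    "     \"--output_file\", \"selected.parquet\",\n",
    "     \"--operation\", \"variance_threshold,correlation_threshold,blocklist\"],\n",
    "    [\"consensus\", \"--profiles\", \"selected.parquet\",\n",
    "     \"--output_file\", \"consensus.parquet\",\n",
    "     \"--replicate_columns\",\n",
    "     \"Metadata_treatment,Metadata_cell_line,Metadata_concentration_um\"],\n",
    "]\n",
    "\n",
    "\n",
    "def run_pipeline(steps, dry_run=True):\n",
    "    # Print each command; execute it only when dry_run is False.\n",
    "    commands = [[\"pycytominer\", *step] for step in steps]\n",
    "    for command in commands:\n",
    "        print(shlex.join(command))\n",
    "        if not dry_run:\n",
    "            # check=True stops the pipeline at the first failing step.\n",
    "            subprocess.run(command, check=True)\n",
    "    return commands\n",
    "\n",
    "\n",
    "commands = run_pipeline(STEPS)\n",
    "```\n",
    "\n",
    "Passing each command as a list of arguments avoids shell quoting problems with the `--samples` query string."
   ]
  }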
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}