{ "cells": [ { "cell_type": "markdown", "id": "cell-title", "metadata": {}, "source": [ "# Using Pycytominer from the command line interface (CLI)\n", "\n", "Pycytominer ships with a full-featured **command-line interface (CLI)** so that every pipeline step can be run directly from a terminal, no Python code required. This makes it easy to integrate pycytominer into shell scripts, Snakemake workflows, Nextflow pipelines, or any other automation tool that orchestrates file-based steps.\n", "\n", "This tutorial covers all five CLI commands:\n", "\n", "| Command | Equivalent Python function |\n", "|---|---|\n", "| `pycytominer aggregate` | `aggregate()` |\n", "| `pycytominer annotate` | `annotate()` |\n", "| `pycytominer normalize` | `normalize()` |\n", "| `pycytominer feature_select` | `feature_select()` |\n", "| `pycytominer consensus` | `consensus()` |\n", "\n", "> **New to pycytominer?** Read the [Introduction to Pycytominer](../tutorials/introduction_to_pycytominer.ipynb) tutorial first to understand the pipeline concepts before running them from the command line." ] }, { "cell_type": "raw", "id": "cell-pipeline-diagram", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. 
mermaid::\n", " :align: center\n", "\n", " flowchart LR\n", " sc[\"single_cells.parquet\"]\n", " wp[\"well_profiles.parquet\"]\n", " an[\"annotated.parquet\"]\n", " no[\"normalized.parquet\"]\n", " fs[\"selected.parquet\"]\n", " co[\"consensus.parquet\"]\n", "\n", " sc -->|\"aggregate\"| wp\n", " wp -->|\"annotate\"| an\n", " an -->|\"normalize\"| no\n", " no -->|\"feature_select\"| fs\n", " fs -->|\"consensus\"| co\n", "\n", " style sc fill:#f0d9fa,stroke:#88239A,color:#111\n", " style co fill:#f0d9fa,stroke:#88239A,color:#111\n", " style wp fill:#ffffff,stroke:#88239A,color:#111\n", " style an fill:#ffffff,stroke:#88239A,color:#111\n", " style no fill:#ffffff,stroke:#88239A,color:#111\n", " style fs fill:#ffffff,stroke:#88239A,color:#111\n" ] }, { "cell_type": "markdown", "id": "cell-prereqs", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "```bash\n", "# Recommended — with uv (faster)\n", "uv pip install pycytominer\n", "\n", "# Or with standard pip\n", "pip install pycytominer\n", "```\n", "\n", "After installation the `pycytominer` command is available in your shell. Verify it is on your PATH and see all available sub-commands:\n", "\n", "```bash\n", "pycytominer\n", "```\n", "\n", "> **Tip — try before you install with `uvx`:** If you use [uv](https://docs.astral.sh/uv/), you can run any pycytominer CLI command immediately without a permanent install:\n", ">\n", "> ```bash\n", "> uvx pycytominer aggregate --help\n", "> ```\n", ">\n", "> `uvx` creates an isolated environment, installs pycytominer into it, runs the command, and discards the environment, all in one step. It is the fastest way to explore the CLI or script a one-off pipeline step on a new machine." 
] }, { "cell_type": "code", "execution_count": 1, "id": "cell-help-top", "metadata": { "execution": { "iopub.execute_input": "2026-06-04T16:59:37.521279Z", "iopub.status.busy": "2026-06-04T16:59:37.521090Z", "iopub.status.idle": "2026-06-04T16:59:39.899475Z", "shell.execute_reply": "2026-06-04T16:59:39.898996Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1mNAME\u001b[0m\r\n", " pycytominer - Command Line Interface for Pycytominer operations.\r\n", "\r\n", "\u001b[1mSYNOPSIS\u001b[0m\r\n", " pycytominer \u001b[4mCOMMAND\u001b[0m\r\n", "\r\n", "\u001b[1mDESCRIPTION\u001b[0m\r\n", " Command Line Interface for Pycytominer operations.\r\n", "\r\n", "\u001b[1mCOMMANDS\u001b[0m\r\n", " \u001b[1m\u001b[4mCOMMAND\u001b[0m\u001b[0m is one of the following:\r\n", "\r\n", " aggregate\r\n", " Aggregate profiles from a file and write the results to disk.\r\n", "\r\n", " annotate\r\n", " Annotate profiles using a platemap file and write output.\r\n", "\r\n", " consensus\r\n", " Create consensus profiles from a file and write output.\r\n", "\r\n", " feature_select\r\n", " Select features from profiles and write the results to disk.\r\n", "\r\n", " normalize\r\n", " Normalize profiles from a file and write the results to disk.\r\n" ] } ], "source": [ "# List all available sub-commands\n", "!pycytominer" ] }, { "cell_type": "code", "execution_count": 2, "id": "cell-help-aggregate", "metadata": { "execution": { "iopub.execute_input": "2026-06-04T16:59:39.902013Z", "iopub.status.busy": "2026-06-04T16:59:39.901847Z", "iopub.status.idle": "2026-06-04T16:59:41.685161Z", "shell.execute_reply": "2026-06-04T16:59:41.684666Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO: Showing help with the command 'pycytominer aggregate -- --help'.\r\n", "\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1mNAME\u001b[0m\r\n", " pycytominer aggregate - Aggregate profiles from a file and write the results to 
disk.\r\n", "\r\n", "\u001b[1mSYNOPSIS\u001b[0m\r\n", " pycytominer aggregate \u001b[4mPROFILES\u001b[0m \u001b[4mOUTPUT_FILE\u001b[0m \r\n", "\r\n", "\u001b[1mDESCRIPTION\u001b[0m\r\n", " Aggregate profiles from a file and write the results to disk.\r\n", "\r\n", "\u001b[1mPOSITIONAL ARGUMENTS\u001b[0m\r\n", " \u001b[1m\u001b[4mPROFILES\u001b[0m\u001b[0m\r\n", " Type: 'str'\r\n", " Path to the input profiles file.\r\n", " \u001b[1m\u001b[4mOUTPUT_FILE\u001b[0m\u001b[0m\r\n", " Type: 'str'\r\n", " Path to the output file to write.\r\n", "\r\n", "\u001b[1mFLAGS\u001b[0m\r\n", " --strata=\u001b[4mSTRATA\u001b[0m\r\n", " Type: 'str | Sequence[str]'\r\n", " Default: 'Metadata_Plate,Metad...\r\n", " Metadata columns to aggregate by.\r\n", " --features=\u001b[4mFEATURES\u001b[0m\r\n", " Type: 'str | Sequence[str]'\r\n", " Default: 'infer'\r\n", " Feature list or \"infer\" to infer CellProfiler features.\r\n", " -i, --image_features=\u001b[4mIMAGE_FEATURES\u001b[0m\r\n", " Type: 'bool'\r\n", " Default: False\r\n", " Whether inferred features should include numeric image features.\r\n", " --operation=\u001b[4mOPERATION\u001b[0m\r\n", " Type: 'str'\r\n", " Default: 'median'\r\n", " Aggregation operation (\"median\" or \"mean\").\r\n", " --output_type=\u001b[4mOUTPUT_TYPE\u001b[0m\r\n", " Type: \"Literal['csv', 'parquet', 'anndata_h5ad', 'anndata_zarr'] | None\"\r\n", " Default: 'csv'\r\n", " Output type to write.\r\n", " --compute_object_count=\u001b[4mCOMPUTE_OBJECT_COUNT\u001b[0m\r\n", " Type: 'bool'\r\n", " Default: False\r\n", " Whether to compute object counts.\r\n", " --object_feature=\u001b[4mOBJECT_FEATURE\u001b[0m\r\n", " Type: 'str'\r\n", " Default: 'Metadata_ObjectNumber'\r\n", " Column used for object counting.\r\n", " --subset_data_file=\u001b[4mSUBSET_DATA_FILE\u001b[0m\r\n", " Type: Optional['str | None']\r\n", " Default: None\r\n", " Optional path to a subset dataframe for filtering.\r\n", " --compression_options=\u001b[4mCOMPRESSION_OPTIONS\u001b[0m\r\n", " 
Type: Optional['str | di...\r\n", " Default: None\r\n", " Compression options for writing output.\r\n", " --float_format=\u001b[4mFLOAT_FORMAT\u001b[0m\r\n", " Type: Optional['str | None']\r\n", " Default: None\r\n", " Decimal precision for output formatting.\r\n", "\r\n", "\u001b[1mNOTES\u001b[0m\r\n", " You can also use flags syntax for POSITIONAL ARGUMENTS\r\n" ] } ], "source": [ "# Show all options for the aggregate sub-command\n", "!pycytominer aggregate --help" ] }, { "cell_type": "markdown", "id": "cell-data-intro", "metadata": {}, "source": [ "## Sample Data\n", "\n", "The CLI reads and writes files; both CSV and Parquet are supported as input. Below we generate the same synthetic Cell Painting dataset used in the [Introduction to Pycytominer](../tutorials/introduction_to_pycytominer.ipynb) tutorial and save it to a temporary working directory as a Parquet file.\n", "\n", "In a real experiment you would replace `single_cells.parquet` with the output from CellProfiler or CytoTable.\n", "\n", "The simulation code is in the expandable block below; skip ahead if you just want to follow the CLI steps." ] }, { "cell_type": "raw", "id": "cell-generate-data-toggle", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [
".. toggle::\n",
"\n",
"    .. code-block:: python\n",
"\n",
"        import tempfile\n",
"        from pathlib import Path\n",
"\n",
"        import numpy as np\n",
"        import pandas as pd\n",
"\n",
"        rng = np.random.default_rng(42)\n",
"\n",
"        # ── Temporary working directory ────────────────────────────────────────────\n",
"        workdir = Path(tempfile.mkdtemp()).resolve()\n",
"\n",
"        # ── Synthetic single-cell data ─────────────────────────────────────────────\n",
"        WELLS = {\n",
"            \"B02\": \"DMSO\", \"C02\": \"DMSO\",\n",
"            \"B03\": \"Compound_A\", \"C03\": \"Compound_A\",\n",
"            \"B04\": \"Compound_B\", \"C04\": \"Compound_B\",\n",
"        }\n",
"        N = 100\n",
"\n",
"        rows = []\n",
"        for img_num, (well, treatment) in enumerate(WELLS.items(), start=1):\n",
"            is_a = float(treatment == \"Compound_A\")\n",
"            is_b = float(treatment == \"Compound_B\")\n",
"            cell_areas = rng.normal(500 + 180 * is_a - 90 * is_b, 120, N)\n",
"            for obj_num in range(1, N + 1):\n",
"                rows.append({\n",
"                    \"Metadata_Plate\": \"Plate_1\",\n",
"                    \"Metadata_Well\": well,\n",
"                    \"Metadata_ImageNumber\": img_num,\n",
"                    \"Metadata_ObjectNumber\": obj_num,\n",
"                    \"Cells_AreaShape_Area\": cell_areas[obj_num - 1],\n",
"                    \"Cells_AreaShape_BoundingBoxArea\": cell_areas[obj_num - 1] * 1.3 + rng.normal(0, 4),\n",
"                    \"Cells_AreaShape_EulerNumber\": 1,\n",
"                    \"Cells_AreaShape_Eccentricity\": float(np.clip(rng.normal(0.55, 0.12), 0, 1)),\n",
"                    \"Cells_Intensity_MeanIntensity_Mito\": rng.normal(0.30, 0.06),\n",
"                    \"Cells_Texture_Correlation_RNA_3_0_256\": rng.normal(0.22, 0.06),\n",
"                    \"Cytoplasm_AreaShape_Area\": rng.normal(310, 80),\n",
"                    \"Cytoplasm_Intensity_MeanIntensity_AGP\": rng.normal(0.25, 0.07),\n",
"                    \"Nuclei_AreaShape_Area\": rng.normal(195, 55),\n",
"                    \"Nuclei_AreaShape_Eccentricity\": float(np.clip(rng.normal(0.40, 0.10), 0, 1)),\n",
"                    \"Nuclei_Intensity_MeanIntensity_DNA\": rng.normal(0.50, 0.08),\n",
"                })\n",
"\n",
"        sc_path = workdir / \"single_cells.parquet\"\n",
"        pd.DataFrame(rows).to_parquet(sc_path, index=False)\n",
"        print(f\"Saved {len(rows):,}
single cells to {sc_path.name}\")\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "cell-generate-data", "metadata": { "execution": { "iopub.execute_input": "2026-06-04T16:59:41.687008Z", "iopub.status.busy": "2026-06-04T16:59:41.686882Z", "iopub.status.idle": "2026-06-04T16:59:42.074016Z", "shell.execute_reply": "2026-06-04T16:59:42.073713Z" }, "nbsphinx": "hidden" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Saved 600 single cells to single_cells.parquet\n" ] } ], "source": [
"import tempfile\n",
"from pathlib import Path\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"rng = np.random.default_rng(42)\n",
"\n",
"# ── Temporary working directory ────────────────────────────────────────────\n",
"workdir = Path(tempfile.mkdtemp()).resolve()\n",
"\n",
"# ── Synthetic single-cell data ─────────────────────────────────────────────\n",
"WELLS = {\n",
"    \"B02\": \"DMSO\",\n",
"    \"C02\": \"DMSO\",\n",
"    \"B03\": \"Compound_A\",\n",
"    \"C03\": \"Compound_A\",\n",
"    \"B04\": \"Compound_B\",\n",
"    \"C04\": \"Compound_B\",\n",
"}\n",
"N = 100\n",
"\n",
"rows = []\n",
"for img_num, (well, treatment) in enumerate(WELLS.items(), start=1):\n",
"    is_a = float(treatment == \"Compound_A\")\n",
"    is_b = float(treatment == \"Compound_B\")\n",
"    cell_areas = rng.normal(500 + 180 * is_a - 90 * is_b, 120, N)\n",
"    for obj_num in range(1, N + 1):\n",
"        rows.append({\n",
"            \"Metadata_Plate\": \"Plate_1\",\n",
"            \"Metadata_Well\": well,\n",
"            \"Metadata_ImageNumber\": img_num,\n",
"            \"Metadata_ObjectNumber\": obj_num,\n",
"            \"Cells_AreaShape_Area\": cell_areas[obj_num - 1],\n",
"            \"Cells_AreaShape_BoundingBoxArea\": cell_areas[obj_num - 1] * 1.3\n",
"            + rng.normal(0, 4),\n",
"            \"Cells_AreaShape_EulerNumber\": 1,\n",
"            \"Cells_AreaShape_Eccentricity\": float(\n",
"                np.clip(rng.normal(0.55, 0.12), 0, 1)\n",
"            ),\n",
"            \"Cells_Intensity_MeanIntensity_Mito\": rng.normal(0.30, 0.06),\n",
"            \"Cells_Texture_Correlation_RNA_3_0_256\": rng.normal(0.22, 0.06),\n",
"            \"Cytoplasm_AreaShape_Area\": rng.normal(310, 80),\n",
"            \"Cytoplasm_Intensity_MeanIntensity_AGP\": rng.normal(0.25, 0.07),\n",
"            \"Nuclei_AreaShape_Area\": rng.normal(195, 55),\n",
"            \"Nuclei_AreaShape_Eccentricity\": float(\n",
"                np.clip(rng.normal(0.40, 0.10), 0, 1)\n",
"            ),\n",
"            \"Nuclei_Intensity_MeanIntensity_DNA\": rng.normal(0.50, 0.08),\n",
"        })\n",
"\n",
"sc_path = workdir / \"single_cells.parquet\"\n",
"pd.DataFrame(rows).to_parquet(sc_path, index=False)\n",
"print(f\"Saved {len(rows):,} single cells to {sc_path.name}\")"
] }, { "cell_type": "markdown", "id": "cell-aggregate-intro", "metadata": {}, "source": [ "---\n", "\n", "## Step 1: Aggregate\n", "\n", "`pycytominer aggregate` collapses single-cell rows into one representative profile per well by taking the **median** (or mean) of each feature across all cells in that well.\n", "\n", "**Key arguments:**\n", "- `--profiles`, input file (CSV or Parquet)\n", "- `--output_file`, where to write the result\n", "- `--strata`, comma-delimited metadata columns that define each group (default: `Metadata_Plate,Metadata_Well`)\n", "- `--operation`, aggregation function: `median` (default) or `mean`\n", "- `--output_type`, output format: `csv` (default), `parquet`, `anndata_h5ad`, or `anndata_zarr`" ] }, { "cell_type": "code", "execution_count": 4, "id": "cell-aggregate-cmd", "metadata": { "execution": { "iopub.execute_input": "2026-06-04T16:59:42.075429Z", "iopub.status.busy": "2026-06-04T16:59:42.075316Z", "iopub.status.idle": "2026-06-04T16:59:43.666563Z", "shell.execute_reply": "2026-06-04T16:59:43.665581Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wrote output file: well_profiles.parquet\r\n", "well_profiles.parquet\r\n" ] } ], "source": [ "!pycytominer aggregate --profiles {workdir}/single_cells.parquet --output_file {workdir}/well_profiles.parquet --strata \"Metadata_Plate,Metadata_Well\" --operation median --output_type parquet 2>&1 | sed \"s|{workdir}/||g\"" ] }, { "cell_type": "code", "execution_count":
5, "id": "cell-aggregate-inspect", "metadata": { "execution": { "iopub.execute_input": "2026-06-04T16:59:43.669393Z", "iopub.status.busy": "2026-06-04T16:59:43.669237Z", "iopub.status.idle": "2026-06-04T16:59:43.711767Z", "shell.execute_reply": "2026-06-04T16:59:43.711466Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Well profiles: (6, 13) (one row per well)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Metadata_Plate Metadata_Well Cells_AreaShape_Area \\\n", "0 Plate_1 B02 499.741578 \n", "1 Plate_1 B03 689.065353 \n", "2 Plate_1 B04 406.933246 \n", "\n", " Cells_AreaShape_BoundingBoxArea Cells_AreaShape_EulerNumber \\\n", "0 646.410141 1.0 \n", "1 895.200860 1.0 \n", "2 529.871038 1.0 \n", "\n", " Cells_AreaShape_Eccentricity Cells_Intensity_MeanIntensity_Mito \\\n", "0 0.551590 0.305235 \n", "1 0.550686 0.304796 \n", "2 0.535506 0.287034 \n", "\n", " Cells_Texture_Correlation_RNA_3_0_256 Cytoplasm_AreaShape_Area \\\n", "0 0.221010 309.361769 \n", "1 0.223964 319.691855 \n", "2 0.229690 330.455137 \n", "\n", " Cytoplasm_Intensity_MeanIntensity_AGP Nuclei_AreaShape_Area \\\n", "0 0.252230 191.121017 \n", "1 0.250131 190.228310 \n", "2 0.254138 189.392536 \n", "\n", " Nuclei_AreaShape_Eccentricity Nuclei_Intensity_MeanIntensity_DNA \n", "0 0.407695 0.492709 \n", "1 0.394803 0.508586 \n", "2 0.394729 0.509548 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wp = pd.read_parquet(workdir / \"well_profiles.parquet\")\n", "print(f\"Well profiles: {wp.shape} (one row per well)\")\n", "wp.head(3)" ] }, { "cell_type": "markdown", "id": "cell-annotate-intro", "metadata": {}, "source": [ "---\n", "\n", "## Step 2: Annotate\n", "\n", "`pycytominer annotate` joins a **plate map** file onto the well profiles, adding columns such as treatment, cell line, and concentration. 
The plate map is a CSV (or any tabular format) where each row describes one well.\n", "\n", "**Key arguments:**\n", "- `--platemap`, path to the plate map file\n", "- `--join_on`, two comma-delimited column names: `platemap_col,profiles_col` (default: `Metadata_well_position,Metadata_Well`)\n", "- `--add_metadata_id_to_platemap`, prefix new columns with `Metadata_` (default: `True`)" ] }, { "cell_type": "code", "execution_count": 6, "id": "cell-platemap", "metadata": { "execution": { "iopub.execute_input": "2026-06-04T16:59:43.713203Z", "iopub.status.busy": "2026-06-04T16:59:43.713097Z", "iopub.status.idle": "2026-06-04T16:59:43.720327Z", "shell.execute_reply": "2026-06-04T16:59:43.719982Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " well_position treatment cell_line concentration_um\n", "0 B02 DMSO HeLa 0.0\n", "1 C02 DMSO HeLa 0.0\n", "2 B03 Compound_A HeLa 10.0\n", "3 C03 Compound_A HeLa 10.0\n", "4 B04 Compound_B HeLa 5.0\n", "5 C04 Compound_B HeLa 5.0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create the plate map CSV\n", "platemap = pd.DataFrame({\n", " \"well_position\": [\"B02\", \"C02\", \"B03\", \"C03\", \"B04\", \"C04\"],\n", " \"treatment\": [\n", " \"DMSO\",\n", " \"DMSO\",\n", " \"Compound_A\",\n", " \"Compound_A\",\n", " \"Compound_B\",\n", " \"Compound_B\",\n", " ],\n", " \"cell_line\": [\"HeLa\"] * 6,\n", " \"concentration_um\": [0.0, 0.0, 10.0, 10.0, 5.0, 5.0],\n", "})\n", "platemap.to_csv(workdir / \"platemap.csv\", index=False)\n", "platemap" ] }, { "cell_type": "code", "execution_count": 7, "id": "cell-annotate-cmd", "metadata": { "execution": { "iopub.execute_input": "2026-06-04T16:59:43.721688Z", "iopub.status.busy": "2026-06-04T16:59:43.721574Z", "iopub.status.idle": "2026-06-04T16:59:45.350912Z", "shell.execute_reply": "2026-06-04T16:59:45.350189Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wrote output file: annotated.parquet\r\n", "annotated.parquet\r\n" ] } ], "source": [ "!pycytominer annotate --profiles {workdir}/well_profiles.parquet --platemap {workdir}/platemap.csv --output_file {workdir}/annotated.parquet --join_on \"Metadata_well_position,Metadata_Well\" --output_type parquet 2>&1 | sed \"s|{workdir}/||g\"" ] }, { "cell_type": "code", "execution_count": 8, "id": "cell-annotate-inspect", "metadata": { "execution": { "iopub.execute_input": "2026-06-04T16:59:45.353095Z", "iopub.status.busy": "2026-06-04T16:59:45.352932Z", "iopub.status.idle": "2026-06-04T16:59:45.362690Z", "shell.execute_reply": "2026-06-04T16:59:45.362408Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Annotated profiles: (6, 16)\n" ] }, { "data": { "text/html": 
[ "
" ], "text/plain": [ " Metadata_treatment Metadata_cell_line Metadata_concentration_um \\\n", "0 DMSO HeLa 0.0 \n", "1 DMSO HeLa 0.0 \n", "2 Compound_A HeLa 10.0 \n", "\n", " Metadata_Plate Metadata_Well \n", "0 Plate_1 B02 \n", "1 Plate_1 C02 \n", "2 Plate_1 B03 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ann = pd.read_parquet(workdir / \"annotated.parquet\")\n", "print(f\"Annotated profiles: {ann.shape}\")\n", "ann[[c for c in ann.columns if c.startswith(\"Metadata_\")]].head(3)" ] }, { "cell_type": "markdown", "id": "cell-normalize-intro", "metadata": {}, "source": [ "---\n", "\n", "## Step 3: Normalize\n", "\n", "`pycytominer normalize` scales features to a common range and limits plate-to-plate technical variation. Z-scoring against DMSO control wells (`--samples`) is the most common approach.\n", "\n", "**Key arguments:**\n", "- `--samples`, a pandas query string selecting the normalization reference. Use `all` to normalize against the entire plate.\n", "- `--method`, normalization method: `standardize` (z-score, default), `robustize` (MAD-based), or `spherize`" ] }, { "cell_type": "code", "execution_count": 9, "id": "cell-normalize-cmd", "metadata": { "execution": { "iopub.execute_input": "2026-06-04T16:59:45.364223Z", "iopub.status.busy": "2026-06-04T16:59:45.364110Z", "iopub.status.idle": "2026-06-04T16:59:47.070659Z", "shell.execute_reply": "2026-06-04T16:59:47.070095Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wrote output file: normalized.parquet\r\n", "normalized.parquet\r\n" ] } ], "source": [ "!pycytominer normalize --profiles {workdir}/annotated.parquet --output_file {workdir}/normalized.parquet --samples \"Metadata_treatment == 'DMSO'\" --method standardize --output_type parquet 2>&1 | sed \"s|{workdir}/||g\"" ] }, { "cell_type": "code", "execution_count": 10, "id": "cell-normalize-inspect", "metadata": { "execution": { "iopub.execute_input": 
"2026-06-04T16:59:47.072830Z", "iopub.status.busy": "2026-06-04T16:59:47.072652Z", "iopub.status.idle": "2026-06-04T16:59:47.085780Z", "shell.execute_reply": "2026-06-04T16:59:47.085286Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Normalized profiles: (6, 16)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Metadata_treatment Metadata_cell_line Metadata_concentration_um \\\n", "0 DMSO HeLa 0.0 \n", "1 DMSO HeLa 0.0 \n", "2 Compound_A HeLa 10.0 \n", "\n", " Metadata_Plate Metadata_Well Cells_AreaShape_Area \\\n", "0 Plate_1 B02 -1.000000 \n", "1 Plate_1 C02 1.000000 \n", "2 Plate_1 B03 52.302035 \n", "\n", " Cells_AreaShape_BoundingBoxArea Cells_AreaShape_EulerNumber \\\n", "0 -1.00000 0.0 \n", "1 1.00000 0.0 \n", "2 42.69753 0.0 \n", "\n", " Cells_AreaShape_Eccentricity Cells_Intensity_MeanIntensity_Mito \\\n", "0 1.000000 1.000000 \n", "1 -1.000000 -1.000000 \n", "2 0.020332 0.694833 \n", "\n", " Cells_Texture_Correlation_RNA_3_0_256 Cytoplasm_AreaShape_Area \\\n", "0 1.000000 1.00000 \n", "1 -1.000000 -1.00000 \n", "2 4.413158 2.71186 \n", "\n", " Cytoplasm_Intensity_MeanIntensity_AGP Nuclei_AreaShape_Area \\\n", "0 1.000000 1.000000 \n", "1 -1.000000 -1.000000 \n", "2 0.309624 0.829585 \n", "\n", " Nuclei_AreaShape_Eccentricity Nuclei_Intensity_MeanIntensity_DNA \n", "0 1.000000 1.000000 \n", "1 -1.000000 -1.000000 \n", "2 -1.378075 8.707708 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "norm = pd.read_parquet(workdir / \"normalized.parquet\")\n", "print(f\"Normalized profiles: {norm.shape}\")\n", "norm.head(3)" ] }, { "cell_type": "markdown", "id": "cell-featsel-intro", "metadata": {}, "source": [ "---\n", "\n", "## Step 4: Feature Select\n", "\n", "`pycytominer feature_select` removes uninformative features. 
Multiple operations can be applied in one call by passing a comma-delimited list.\n", "\n", "**Key arguments:**\n", "- `--operation`, comma-delimited list of operations to apply:\n", " - `variance_threshold`, drop near-constant features\n", " - `correlation_threshold`, drop one of each highly correlated pair\n", " - `blocklist`, drop features known to be unreliable across assays\n", " - `drop_na_columns`, drop columns with too many missing values\n", " - `noise_removal`, remove features with low signal-to-noise ratio" ] }, { "cell_type": "code", "execution_count": 11, "id": "cell-featsel-cmd", "metadata": { "execution": { "iopub.execute_input": "2026-06-04T16:59:47.087400Z", "iopub.status.busy": "2026-06-04T16:59:47.087275Z", "iopub.status.idle": "2026-06-04T16:59:48.746330Z", "shell.execute_reply": "2026-06-04T16:59:48.745797Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wrote output file: selected.parquet\r\n", "selected.parquet\r\n" ] } ], "source": [ "!pycytominer feature_select --profiles {workdir}/normalized.parquet --output_file {workdir}/selected.parquet --operation \"variance_threshold,correlation_threshold,blocklist\" --output_type parquet 2>&1 | sed \"s|{workdir}/||g\"" ] }, { "cell_type": "code", "execution_count": 12, "id": "cell-featsel-inspect", "metadata": { "execution": { "iopub.execute_input": "2026-06-04T16:59:48.748044Z", "iopub.status.busy": "2026-06-04T16:59:48.747919Z", "iopub.status.idle": "2026-06-04T16:59:48.758170Z", "shell.execute_reply": "2026-06-04T16:59:48.757880Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features: 11 -> 8\n", "Removed: {'Cells_AreaShape_Area', 'Cells_AreaShape_EulerNumber', 'Cells_Texture_Correlation_RNA_3_0_256'}\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Metadata_treatment Metadata_cell_line Metadata_concentration_um \\\n", "0 DMSO HeLa 0.0 \n", "1 DMSO HeLa 0.0 \n", "2 Compound_A HeLa 10.0 \n", "\n", " Metadata_Plate Metadata_Well Cells_AreaShape_BoundingBoxArea \\\n", "0 Plate_1 B02 -1.00000 \n", "1 Plate_1 C02 1.00000 \n", "2 Plate_1 B03 42.69753 \n", "\n", " Cells_AreaShape_Eccentricity Cells_Intensity_MeanIntensity_Mito \\\n", "0 1.000000 1.000000 \n", "1 -1.000000 -1.000000 \n", "2 0.020332 0.694833 \n", "\n", " Cytoplasm_AreaShape_Area Cytoplasm_Intensity_MeanIntensity_AGP \\\n", "0 1.00000 1.000000 \n", "1 -1.00000 -1.000000 \n", "2 2.71186 0.309624 \n", "\n", " Nuclei_AreaShape_Area Nuclei_AreaShape_Eccentricity \\\n", "0 1.000000 1.000000 \n", "1 -1.000000 -1.000000 \n", "2 0.829585 -1.378075 \n", "\n", " Nuclei_Intensity_MeanIntensity_DNA \n", "0 1.000000 \n", "1 -1.000000 \n", "2 8.707708 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sel = pd.read_parquet(workdir / \"selected.parquet\")\n", "feat_before = [c for c in norm.columns if not c.startswith(\"Metadata_\")]\n", "feat_after = [c for c in sel.columns if not c.startswith(\"Metadata_\")]\n", "print(f\"Features: {len(feat_before)} -> {len(feat_after)}\")\n", "print(f\"Removed: {set(feat_before) - set(feat_after)}\")\n", "sel.head(3)" ] }, { "cell_type": "markdown", "id": "cell-consensus-intro", "metadata": {}, "source": [ "---\n", "\n", "## Step 5: Consensus\n", "\n", "`pycytominer consensus` collapses replicate wells into one profile per biological condition by taking the median (or modz) across replicates.\n", "\n", "**Key arguments:**\n", "- `--replicate_columns`, comma-delimited metadata columns that identify a unique condition (replicates share all of these values)\n", "- `--operation`, `median` (default), `mean`, or `modz` (moderated z-score, recommended for large screens)" ] }, { "cell_type": "code", "execution_count": 13, "id": "cell-consensus-cmd", "metadata": { 
"execution": { "iopub.execute_input": "2026-06-04T16:59:48.759546Z", "iopub.status.busy": "2026-06-04T16:59:48.759444Z", "iopub.status.idle": "2026-06-04T16:59:50.580493Z", "shell.execute_reply": "2026-06-04T16:59:50.580036Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wrote output file: consensus.parquet\r\n", "consensus.parquet\r\n" ] } ], "source": [ "!pycytominer consensus --profiles {workdir}/selected.parquet --output_file {workdir}/consensus.parquet --replicate_columns \"Metadata_treatment,Metadata_cell_line,Metadata_concentration_um\" --operation median --output_type parquet 2>&1 | sed \"s|{workdir}/||g\"" ] }, { "cell_type": "code", "execution_count": 14, "id": "cell-consensus-inspect", "metadata": { "execution": { "iopub.execute_input": "2026-06-04T16:59:50.582357Z", "iopub.status.busy": "2026-06-04T16:59:50.582213Z", "iopub.status.idle": "2026-06-04T16:59:50.590306Z", "shell.execute_reply": "2026-06-04T16:59:50.590003Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Consensus profiles: (3, 11) (one row per condition)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Metadata_treatment Metadata_cell_line Metadata_concentration_um\n", "0 Compound_A HeLa 10.0\n", "1 Compound_B HeLa 5.0\n", "2 DMSO HeLa 0.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cons = pd.read_parquet(workdir / \"consensus.parquet\")\n", "print(f\"Consensus profiles: {cons.shape} (one row per condition)\")\n", "cons[[c for c in cons.columns if c.startswith(\"Metadata_\")]]" ] }, { "cell_type": "markdown", "id": "cell-summary", "metadata": {}, "source": [ "---\n", "\n", "## Summary\n", "\n", "You ran the full pycytominer pipeline using only command-line calls:\n", "\n", "```bash\n", "pycytominer aggregate --profiles single_cells.parquet --output_file well_profiles.parquet --strata \"Metadata_Plate,Metadata_Well\" --output_type parquet\n", "pycytominer annotate --profiles well_profiles.parquet --output_file annotated.parquet --platemap platemap.csv --output_type parquet\n", "pycytominer normalize --profiles annotated.parquet --output_file normalized.parquet --samples \"Metadata_treatment == 'DMSO'\" --output_type parquet\n", "pycytominer feature_select --profiles normalized.parquet --output_file selected.parquet --operation \"variance_threshold,correlation_threshold,blocklist\" --output_type parquet\n", "pycytominer consensus --profiles selected.parquet --output_file consensus.parquet --replicate_columns \"Metadata_treatment,Metadata_cell_line,Metadata_concentration_um\" --output_type parquet\n", "```\n", "\n", "### Tips for scripting\n", "\n", "- List all commands with `pycytominer`; get full option docs with `pycytominer COMMAND --help`\n", "- Chain into Bash scripts or `Makefile` targets for reproducible pipelines\n", "- **Query strings** in `--samples` follow [pandas query syntax](https://pandas.pydata.org/docs/user_guide/indexing.html#the-query-method); any valid pandas query expression works" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension":
".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }