Feature Select

Select features to use in downstream analysis based on the specified selection method.

pycytominer.feature_select.feature_select(profiles: str | DataFrame, features: str | list[str] = 'infer', image_features: bool = False, samples: str = 'all', operation: str | list[str] = 'variance_threshold', output_file: str | None = None, output_type: Literal['csv', 'parquet', 'anndata_h5ad', 'anndata_zarr'] | None = 'csv', na_cutoff: float = 0.05, corr_threshold: float = 0.9, corr_method: str = 'pearson', freq_cut: float = 0.05, unique_cut: float = 0.01, compression_options: str | dict[str, Any] | None = None, float_format: str | None = None, blocklist: str | list[str] | Blocklist | None = None, blocklist_name: str | list[str] | None = None, blocklist_file: str | None = None, outlier_cutoff: float = 500.0, noise_removal_perturb_groups: str | list[str] | None = None, noise_removal_stdev_cutoff: float | None = None) → DataFrame | str

Performs feature selection based on the given operation.

Parameters:
  • profiles (pd.DataFrame or file) – DataFrame or file of profiles.

  • features (list, default "infer") – A list of strings corresponding to feature measurement column names in the profiles DataFrame. All features listed must be found in profiles. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.

  • image_features (bool, default False) – Whether to include inferred Image_* feature columns. When True, pycytominer preserves numeric image-level measurements while excluding non-numeric Image_* columns, which helps avoid treating image payload columns as profile features in mixed tables such as OME-Arrow-backed inputs.

  • samples (str, default "all") – Samples on which to perform the operation.

  • operation (list of str or str, default "variance_threshold") – Operations to perform on the input profiles.

  • output_file (str, optional) – If provided, will write feature selected profiles to file. If not specified, will return the feature selected profiles as output. We recommend that this output file be suffixed with “_normalized_variable_selected.csv”.

  • output_type (str, optional) – If provided, will write feature selected profiles as the specified file type (one of “csv”, “parquet”, “anndata_h5ad”, or “anndata_zarr”). If not specified and output_file is provided, then the file will be written as CSV by default.

  • na_cutoff (float, default 0.05) – Proportion of missing values in a column to tolerate before removing.

  • corr_threshold (float, default 0.9) – Value in the range (0, 1). Features are excluded when any two features are correlated above this threshold.

  • corr_method (str, default "pearson") – Correlation type to compute. Allowed methods are “spearman”, “kendall” and “pearson”.

  • freq_cut (float, default 0.05) – Ratio (2nd most common feature val / most common). Must range between 0 and 1. Remove features whose ratio is lower than freq_cut. A low freq_cut will remove features that have a large difference between the most common feature value and the second most common feature value. (e.g. this will remove a feature: [1, 1, 1, 1, 0.01, 0.01, …])

  • unique_cut (float, default 0.01) – Ratio (num unique feature values / num samples). Must range between 0 and 1. Remove features whose ratio is less than unique_cut. A low unique_cut will remove features that have very few different measurements compared to the number of samples.

  • compression_options (str or dict, optional) – Contains compression options as input to pd.DataFrame.to_csv(compression=compression_options). Requires pandas version >= 1.2.

  • float_format (str, optional) – Format string to use when writing floats to the output file, passed as input to pd.DataFrame.to_csv(float_format=float_format). For example, use “%.3g” for 3 significant digits.

  • blocklist (str, list of str, or Blocklist, optional) – Features to exclude when operation includes "blocklist". Accepts a feature name string, a list of feature name strings, or a Blocklist object. When None and blocklist_name is also None, the packaged default blocklist is applied automatically. For advanced usage — custom YAML registries, combining named lists with explicit features — construct a Blocklist directly and pass it here.

  • blocklist_name (str or list of str, optional) – Name(s) of packaged blocklists to use when blocklist is None. Each name is a top-level YAML key in the packaged blocklist registry (for example, default in default_blocklists.yaml). If None and blocklist is also None, the packaged default blocklist is loaded. Use "default" to load that registry entry explicitly. Multiple names are loaded in the order provided.

  • blocklist_file (str, optional) –

    Deprecated since version 2.0: Use blocklist (a list of feature names or a Blocklist object) instead. Previously accepted a path to a CSV file with a single blocklist column. This parameter will be removed in a future release.

  • outlier_cutoff (float, default 500) – The threshold at which the maximum or minimum value of a feature across a full experiment is excluded. Note that this procedure is typically applied after normalization.

  • noise_removal_perturb_groups (str or list of str, optional) – Perturbation groups corresponding to rows in profiles or the name of the metadata column containing this information.

  • noise_removal_stdev_cutoff (float, optional) – Maximum mean feature standard deviation to be kept for noise removal, grouped by the identity of the perturbation from noise_removal_perturb_groups. The data must already be normalized so that this cutoff can apply to all columns.
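
The ratios behind na_cutoff, freq_cut, and unique_cut can be illustrated with a minimal pure-Python sketch. This is not pycytominer's implementation: the helper names (na_proportion, frequency_ratio, unique_ratio) are hypothetical, the frequency ratio is assumed to compare occurrence counts, and whether each comparison against a cutoff is strict is also an assumption.

```python
from collections import Counter
import math


def na_proportion(values):
    """Proportion of missing (None or NaN) entries in one feature column."""
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


def frequency_ratio(values):
    """Count of the 2nd most common value divided by count of the most common."""
    counts = Counter(values).most_common(2)
    if len(counts) < 2:
        return 0.0  # assumption: a constant column has no second value
    return counts[1][1] / counts[0][1]


def unique_ratio(values):
    """Number of distinct values divided by the number of samples."""
    return len(set(values)) / len(values)


# Hypothetical feature column: 99 identical measurements and one different value.
column = [1.0] * 99 + [0.01]

# Defaults taken from the signature above.
na_cutoff, freq_cut, unique_cut = 0.05, 0.05, 0.01

print("exceeds na_cutoff:", na_proportion(column) > na_cutoff)
print("below freq_cut:", frequency_ratio(column) < freq_cut)
print("below unique_cut:", unique_ratio(column) < unique_cut)
```

In this sketch the column has no missing values and a unique ratio of 2/100, so only the frequency ratio (1/99) falls below its default cutoff.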

Returns:

pd.DataFrame:

The feature selected profile DataFrame. Returned only if output_file=None. If output_file is specified, the profiles are written to file and the DataFrame is not returned.

str:

If output_file is provided, then the function returns the path to the output file.

Return type:

str or pd.DataFrame

See also

pycytominer.cyto_utils.blocklist.Blocklist

Full reference for blocklist construction, custom YAML registries, and combining named lists with explicit feature exclusions.