Load

Module for loading profiles from files or dataframes.

pycytominer.cyto_utils.load.infer_delim(file: str | Path | Any) → str

Sniff the delimiter used in the given file.

Parameters: file (str) – File name
Returns: the delimiter used in the file (typically a tab or a comma)
Return type: str
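As a rough illustration of delimiter sniffing, the sketch below uses the standard library's csv.Sniffer, which the load_platemap() notes on this page say infer_delim() relies on. The helper name sniff_delimiter and the 4096-character sample size are assumptions made for this example and are not part of pycytominer.

```python
import csv
import tempfile
from pathlib import Path


def sniff_delimiter(path, sample_chars=4096):
    """Guess the delimiter of a delimited text file (hypothetical helper)."""
    with open(path, newline="") as handle:
        sample = handle.read(sample_chars)
    # Restrict the candidates to comma and tab, the two delimiters named above.
    return csv.Sniffer().sniff(sample, delimiters=",\t").delimiter


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "platemap.csv"
    csv_path.write_text("well,gene\nA01,EGFR\nA02,TP53\n")
    detected = sniff_delimiter(csv_path)
```

Restricting the candidate delimiters keeps the sniffer from settling on an unrelated character that happens to appear once per line.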

pycytominer.cyto_utils.load.is_path_a_parquet_dataset_dir(file: str | Path) → bool

Check whether a path is a parquet dataset directory.

Parameters: file (Union[str, pathlib.Path]) – Path to inspect.
Returns: Returns True when the path is a directory, contains at least one direct file child, and all direct file children are parquet files.
Return type: bool
Raises: FileNotFoundError – Raised if the provided path does not exist.

Notes

If file is not a string or path-like object, the function prints a message and returns False rather than raising TypeError.
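The three conditions in the return description can be sketched with pathlib alone. This is not pycytominer's implementation: the helper name looks_like_parquet_dataset_dir is hypothetical, and recognizing parquet files by a case-insensitive .parquet extension is an assumption.

```python
import tempfile
from pathlib import Path


def looks_like_parquet_dataset_dir(path):
    """Apply the directory checks described above (hypothetical helper)."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"No such path: {path}")
    if not path.is_dir():
        return False
    # Only direct children that are files count; subdirectories are ignored.
    files = [child for child in path.iterdir() if child.is_file()]
    # Assumption: a parquet file is recognized by its .parquet extension.
    return bool(files) and all(f.suffix.lower() == ".parquet" for f in files)


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset"
    dataset.mkdir()
    empty_result = looks_like_parquet_dataset_dir(dataset)  # no files yet
    (dataset / "part-0.parquet").touch()
    parquet_only_result = looks_like_parquet_dataset_dir(dataset)
    (dataset / "notes.txt").touch()
    mixed_result = looks_like_parquet_dataset_dir(dataset)  # one non-parquet file
```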

pycytominer.cyto_utils.load.is_path_a_parquet_file(file: str | Path) → bool

Check whether the provided path is a parquet file.

Parquet files are identified by their file extension: the function returns True if the path ends with .parquet and False otherwise.

Parameters: file (Union[str, pathlib.Path]) – Path to the candidate parquet file.
Returns: Returns True if the file path ends with .parquet, else False.
Return type: bool
Raises: FileNotFoundError – Raised if the provided path does not exist.

Notes

If file is not a string or path-like object, the function prints a message and returns False rather than raising TypeError.

pycytominer.cyto_utils.load.load_cytotable_profiles(warehouse_path: str | Path | PurePath, table_name: str = 'joined_profiles', namespace: str = 'profiles') → DataFrame

Load a profile table from a CytoTable-style warehouse layout.

This helper loads profile data stored as parquet fragments within an Iceberg-style table directory, typically under warehouse/<namespace>/<table_name>/data, where namespace is typically profiles. It is intended for CytoTable-style local outputs that organize tables by namespace and table name for downstream Pycytominer processing.

Parameters:

warehouse_path (path-like) – Path to either the warehouse root or the project directory that contains a warehouse/ directory.
table_name (str, default "joined_profiles") – Table name to load from within the namespace. The default, joined_profiles, is the conventional CytoTable table that joins object-level profile measurements across compartments into one profile table.
namespace (str, default "profiles") – Iceberg namespace that contains the table. For profile data this is typically profiles.

Returns:

Loaded table as a pandas dataframe.

Return type:

pd.DataFrame

Raises:

FileNotFoundError – Raised when the requested table cannot be resolved to a parquet dataset.
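The warehouse/<namespace>/<table_name>/data layout described above can be illustrated with a small path-resolution sketch. The helper cytotable_table_data_dir is hypothetical and only locates the data directory; it does not read parquet fragments, and it is not how load_cytotable_profiles() is necessarily implemented.

```python
import tempfile
from pathlib import Path


def cytotable_table_data_dir(
    warehouse_path, table_name="joined_profiles", namespace="profiles"
):
    """Locate <root>/<namespace>/<table_name>/data (hypothetical helper)."""
    root = Path(warehouse_path)
    # Accept either the warehouse root or a project directory containing warehouse/.
    if (root / "warehouse").is_dir():
        root = root / "warehouse"
    data_dir = root / namespace / table_name / "data"
    if not data_dir.is_dir():
        raise FileNotFoundError(f"No table data directory at {data_dir}")
    return data_dir


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    data_dir = project / "warehouse" / "profiles" / "joined_profiles" / "data"
    data_dir.mkdir(parents=True)
    from_project = cytotable_table_data_dir(project)
    from_root = cytotable_table_data_dir(project / "warehouse")
    same_target = from_project == from_root == data_dir
    try:
        cytotable_table_data_dir(project, table_name="missing_table")
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True
```

Both the project directory and the warehouse root lead to the same data directory, which mirrors the two accepted forms of warehouse_path.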

pycytominer.cyto_utils.load.load_npz_features(npz_file: str, fallback_feature_prefix: str = 'DP', metadata: bool = True) → DataFrame

Load an npz file storing features and, sometimes, metadata.

The function first searches the .npz file for a metadata field called “Metadata_Model”. If the field exists, the function uses its value as the feature prefix. Otherwise, it uses fallback_feature_prefix.

If the npz file does not exist, this function returns an empty dataframe.

Parameters:

npz_file (str) – file path to the compressed output (typically DeepProfiler output)
fallback_feature_prefix (str) – a string to prefix all features [default: “DP”].
metadata (bool) – whether or not to load metadata [default: True]

Returns:

df – pandas DataFrame of profiles

Return type:

pd.DataFrame
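The prefix fallback described above can be sketched in plain Python. Both helper names are hypothetical, the metadata is modeled as an ordinary dict, the model name "efficientnet" is a made-up value, and the "<prefix>_<index>" naming format is an assumption, since this page does not state how the prefix is joined to each feature.

```python
def choose_feature_prefix(metadata, fallback_feature_prefix="DP"):
    """Use "Metadata_Model" when present, else the fallback (hypothetical helper)."""
    return metadata.get("Metadata_Model", fallback_feature_prefix)


def prefixed_feature_names(n_features, prefix):
    # Assumption: features are named "<prefix>_<index>".
    return [f"{prefix}_{i}" for i in range(n_features)]


# Metadata that names a model: the model name becomes the prefix.
with_model = prefixed_feature_names(
    3, choose_feature_prefix({"Metadata_Model": "efficientnet"})
)
# Metadata without "Metadata_Model": the fallback prefix is used.
without_model = prefixed_feature_names(2, choose_feature_prefix({}))
```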

pycytominer.cyto_utils.load.load_npz_locations(npz_file: str, location_x_col_index: int = 0, location_y_col_index: int = 1) → DataFrame

Load an npz file storing locations and, sometimes, metadata.

The function reads the entry called “locations” from the .npz file and uses the columns at location_x_col_index and location_y_col_index as the x and y coordinates.

If the npz file does not exist, this function returns an empty dataframe.

Parameters:

npz_file (str) – file path to the compressed output (typically DeepProfiler output)
location_x_col_index (int) – index of the x location column (which column in DP output has X coords)
location_y_col_index (int) – index of the y location column (which column in DP output has Y coords)

Returns:

df – pandas DataFrame of locations

Return type:

pd.DataFrame

pycytominer.cyto_utils.load.load_platemap(platemap: str | DataFrame, add_metadata_id: bool = True, sep: str | None = None) → DataFrame

Load the given platemap from a file path, unless a dataframe is provided directly.

Parameters:

platemap (pd.DataFrame or str) – Location or actual pd.DataFrame of platemap file.
add_metadata_id (bool, default True) – Whether Metadata_ should be prepended to all platemap columns.
sep (str, optional) – The column delimiter used in the platemap file (e.g. "," for CSV, "\t" for TSV). Only relevant when platemap is a file path; it has no effect when a DataFrame is passed directly.

When None (the default), the delimiter is detected automatically via infer_delim(). Automatic detection relies on Python’s csv.Sniffer, which can be unreliable on Windows for tab-separated files (see cpython#119123). If you are on Windows and loading a TSV platemap, pass sep="\t" explicitly.

Returns:

platemap – pandas DataFrame of platemap.

Return type:

pd.DataFrame
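To make the effect of sep and add_metadata_id concrete, here is a standard-library sketch that parses tab-separated platemap text with an explicit delimiter and prepends Metadata_ to the column names. It returns a list of dicts rather than a DataFrame, the helper read_platemap_rows is hypothetical, and leaving already-prefixed columns unchanged is an assumption rather than documented behavior.

```python
import csv
import io


def read_platemap_rows(text, sep, add_metadata_id=True):
    """Parse delimited platemap text into row dicts (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    rows = []
    for row in reader:
        if add_metadata_id:
            # Assumption: columns already starting with "Metadata_" are kept as-is.
            row = {
                (key if key.startswith("Metadata_") else f"Metadata_{key}"): value
                for key, value in row.items()
            }
        rows.append(row)
    return rows


tsv_text = "well_position\tgene\nA01\tEGFR\nA02\tTP53\n"
# Passing the delimiter explicitly avoids relying on automatic detection.
rows = read_platemap_rows(tsv_text, sep="\t")
```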

pycytominer.cyto_utils.load.load_profiles(profiles: str | Path | PurePath | DataFrame | AnnDataLike) → DataFrame

Load the given profiles from a file path, unless a dataframe is provided directly.

This loader supports direct files, parquet dataset directories, AnnData inputs, and unambiguous CytoTable-style warehouse roots that contain a single parquet-backed table under profiles/*/data. This is the entry point used by higher-level functions such as normalize() and annotate() when they receive a path-like profiles input. If a warehouse path contains multiple profile tables, this loader will not guess which one to use; call load_cytotable_profiles() directly with an explicit table_name and namespace instead.

Parameters:

profiles (str, pathlib.Path, pathlib.PurePath, pandas.DataFrame, or ad.AnnData) – File location, warehouse root, or in-memory profile data.

Returns:

pandas DataFrame of profiles

Raises:

FileNotFoundError – Raised if the provided profile does not exist.

pycytominer.cyto_utils.load.resolve_cytotable_profiles_target(warehouse_path: str | Path | PurePath) → tuple[Path, str, str] | None

Resolve a single profile table from a CytoTable-style warehouse.

This helper only auto-resolves a target when exactly one parquet-backed profile table is present under the expected profile namespace layout. It does not infer which table to use based on downstream pycytominer operations or processing level; callers must be explicit when multiple profile tables are available.

Parameters: warehouse_path (path-like) – Path to either the warehouse root or a project directory that contains a warehouse/ directory.
Returns: The resolved warehouse root path, namespace, and table name when exactly one parquet-backed profile table can be identified under the profile namespace. Returns None when the path does not expose a profile namespace in either <root>/profiles/<table> or <root>/warehouse/profiles/<table> form.
Return type: tuple[pathlib.Path, str, str] or None
Raises: ValueError – Raised when multiple parquet-backed profile tables are found and the intended target is ambiguous. In that case, use load_cytotable_profiles() with an explicit namespace and table name.
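The "exactly one table" rule can be sketched as follows. This is a simplified stand-in, not the library's code: the helper names are hypothetical, normalized_profiles is a made-up second table name, and returning None for a profile namespace with no parquet-backed table is an assumption.

```python
import tempfile
from pathlib import Path


def _has_parquet_data(table_dir):
    """True when <table_dir>/data holds at least one .parquet file."""
    data_dir = table_dir / "data"
    return data_dir.is_dir() and any(
        p.suffix == ".parquet" for p in data_dir.iterdir() if p.is_file()
    )


def resolve_single_profile_table(warehouse_path, namespace="profiles"):
    """Return (root, namespace, table) for a single table (hypothetical helper)."""
    path = Path(warehouse_path)
    # Try <root>/profiles/<table> first, then <root>/warehouse/profiles/<table>.
    for root in (path, path / "warehouse"):
        namespace_dir = root / namespace
        if not namespace_dir.is_dir():
            continue
        tables = sorted(
            t.name for t in namespace_dir.iterdir()
            if t.is_dir() and _has_parquet_data(t)
        )
        if len(tables) > 1:
            raise ValueError(f"Ambiguous profile tables: {tables}")
        # Assumption: a namespace without parquet-backed tables resolves to None.
        return (root, namespace, tables[0]) if tables else None
    return None


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    first = project / "warehouse" / "profiles" / "joined_profiles" / "data"
    first.mkdir(parents=True)
    (first / "part-0.parquet").touch()
    single = resolve_single_profile_table(project)
    expected_single = (project / "warehouse", "profiles", "joined_profiles")
    no_namespace = resolve_single_profile_table(Path(tmp))

    # A second table makes the target ambiguous.
    second = project / "warehouse" / "profiles" / "normalized_profiles" / "data"
    second.mkdir(parents=True)
    (second / "part-0.parquet").touch()
    try:
        resolve_single_profile_table(project)
        ambiguous_raises = False
    except ValueError:
        ambiguous_raises = True
```

Raising on ambiguity instead of picking the first table keeps the choice of profile table explicit for the caller.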

pycytominer.cyto_utils.load.resolve_parquet_path(path_like: str | Path | PurePath) → Path | None

Resolve file and dataset paths that pandas can read via parquet.

Parameters: path_like (path-like) – Path to inspect.
Returns: Resolved parquet file or dataset directory, including Iceberg-style table directories whose parquet data lives under a data/ child directory (such as CytoTable warehouse tables). Returns None when the path does not point to a parquet-backed source.
Return type: pathlib.Path or None