Copilot Instructions

Project Overview

Repo covers a lightweight unsupervised anomaly detection pipeline focused on historian CSV exports.
Core package lives in src/; CLI entry points train.py and detect.py orchestrate data prep, feature engineering, and model scoring.
requirements.txt pins the analytics stack (pandas, scikit-learn, numpy, joblib); assume Python 3.10+ in a virtualenv, as described in the README.

Reuse src/data_loader.py::load_timeseries with DataLoadConfig to ensure consistent timestamp parsing, optional timezone localization, and feature inference.
When adding new ingestion logic, funnel it through load_timeseries or extend it; downstream code relies on df.attrs["feature_columns"] being populated for inference overrides.
Raise DataValidationError for user-facing data issues instead of generic exceptions so CLIs can surface clear messages.
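
The loading contract above can be sketched as follows. This is a minimal illustration, not the code in src/data_loader.py: LoaderOptions is a hypothetical stand-in for DataLoadConfig (whose real fields may differ), load_timeseries_sketch is an illustrative name, and the DataValidationError defined here only mimics the project's exception.

```python
from dataclasses import dataclass
from io import StringIO
from typing import Optional

import pandas as pd


class DataValidationError(ValueError):
    """Stand-in for the project's exception of the same name."""


@dataclass
class LoaderOptions:
    """Hypothetical stand-in for DataLoadConfig; real field names may differ."""
    timestamp_column: str = "timestamp"
    timezone: Optional[str] = None


def load_timeseries_sketch(source, options: LoaderOptions) -> pd.DataFrame:
    df = pd.read_csv(source)

    # User-facing problems raise DataValidationError so a CLI can print them cleanly.
    if options.timestamp_column not in df.columns:
        raise DataValidationError(
            f"Timestamp column '{options.timestamp_column}' not found; "
            f"available columns: {list(df.columns)}"
        )
    timestamps = pd.to_datetime(df[options.timestamp_column], errors="coerce")
    if timestamps.isna().any():
        raise DataValidationError(
            f"{int(timestamps.isna().sum())} row(s) have unparseable timestamps"
        )

    # Optional localization: only applied to naive timestamps.
    if options.timezone is not None and timestamps.dt.tz is None:
        timestamps = timestamps.dt.tz_localize(options.timezone)

    df[options.timestamp_column] = timestamps
    df = df.sort_values(options.timestamp_column).reset_index(drop=True)

    # Feature inference (assumed rule): every numeric column is a feature.
    df.attrs["feature_columns"] = list(df.select_dtypes(include="number").columns)
    return df


csv_text = StringIO(
    "timestamp,tag_a,unit\n"
    "2024-01-01 00:05:00,2.0,A\n"
    "2024-01-01 00:00:00,1.0,A\n"
)
df = load_timeseries_sketch(csv_text, LoaderOptions(timezone="UTC"))
print(df.attrs["feature_columns"])
```

Setting df.attrs last matters in this sketch because pandas does not guarantee that attrs survive every intermediate operation.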

feature_engineering.build_feature_matrix is the single entry point for derived features; it controls rolling stats (add_rolling_statistics) and rate-of-change (add_rate_of_change).
Rolling windows are expressed as pandas offset aliases (defaults: 5T, 15T, 60T; note that pandas 2.2+ deprecates the T alias in favor of min); keep new feature names suffix-based so persisted artifacts stay discoverable.
Always pass through timestamp_column and any id_columns; the helper filters non-numeric fields automatically.
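
A rough sketch of suffix-based rolling and rate-of-change features, under stated assumptions: build_features_sketch is an illustrative name, the _mean_/_std_/_roc suffixes are hypothetical and may not match what add_rolling_statistics and add_rate_of_change emit, and per-id grouping is omitted for brevity. The window uses the "min" alias rather than "T".

```python
import pandas as pd


def build_features_sketch(df, timestamp_column, id_columns=(), windows=("10min",)):
    # sort_values returns a new frame, so the caller's input is never mutated.
    out = df.sort_values(timestamp_column).reset_index(drop=True)

    # Timestamp and id columns pass through; only numeric columns become features.
    passthrough = {timestamp_column, *id_columns}
    numeric = [
        c for c in out.select_dtypes(include="number").columns if c not in passthrough
    ]

    # Time-based rolling windows need a sorted datetime index.
    indexed = out.set_index(timestamp_column)[numeric]
    for window in windows:
        means = indexed.rolling(window).mean()
        stds = indexed.rolling(window).std()
        for col in numeric:
            out[f"{col}_mean_{window}"] = means[col].to_numpy()
            out[f"{col}_std_{window}"] = stds[col].to_numpy()

    # Rate of change per minute: value delta divided by elapsed minutes.
    elapsed_min = out[timestamp_column].diff().dt.total_seconds() / 60.0
    for col in numeric:
        out[f"{col}_roc"] = out[col].diff() / elapsed_min
    return out


raw = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 00:10"]
        ),
        "unit": ["A", "A", "A"],
        "tag_a": [1.0, 3.0, 5.0],
    }
)
feats = build_features_sketch(raw, "timestamp", id_columns=["unit"])
print(feats.columns.tolist())
```

Because every derived column carries its source name as a prefix and the statistic as a suffix, a saved artifact's feature list can be traced back to raw tags by string matching alone.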

The training CLI (train.py) expects PowerShell-friendly invocation (^ line continuations) and creates artifact bundles containing the pipeline plus metadata.
fit_pipeline wraps StandardScaler + IsolationForest with configurable contamination, estimator count, and random state; extend via the existing Pipeline to avoid breaking saved artifacts.
generate_scores writes anomaly flags plus ranking; extra columns must come from the non-feature portion of feature_df.
Outputs default to ml/anomaly_detection/models/ and ml/anomaly_detection/outputs/; use ensure_parent_dir before writing new files.
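
For orientation, a minimal sketch of the training side, assuming a scaler-plus-IsolationForest Pipeline stored in a dict alongside metadata. The function names, the dict keys ("pipeline", "metadata", "feature_columns"), and the default hyperparameters are illustrative assumptions; the real fit_pipeline and artifact schema may differ.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fit_pipeline_sketch(features, contamination=0.05, n_estimators=100, random_state=42):
    # Scaling and the model live in one Pipeline so they are persisted together.
    pipeline = Pipeline(
        [
            ("scaler", StandardScaler()),
            (
                "model",
                IsolationForest(
                    contamination=contamination,
                    n_estimators=n_estimators,
                    random_state=random_state,
                ),
            ),
        ]
    )
    pipeline.fit(features)
    return pipeline


def save_artifact_sketch(pipeline, feature_columns, path):
    path = Path(path)
    # Equivalent in spirit to calling ensure_parent_dir before writing.
    path.parent.mkdir(parents=True, exist_ok=True)
    bundle = {
        "pipeline": pipeline,
        "metadata": {"feature_columns": list(feature_columns)},  # hypothetical schema
    }
    joblib.dump(bundle, path)
    return path


rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(200, 2)), columns=["tag_a", "tag_b"])
pipeline = fit_pipeline_sketch(features)

with tempfile.TemporaryDirectory() as tmp:
    artifact_path = save_artifact_sketch(
        pipeline, features.columns, Path(tmp) / "models" / "artifact.joblib"
    )
    restored = joblib.load(artifact_path)

# IsolationForest.predict returns 1 for inliers and -1 for anomalies.
labels = restored["pipeline"].predict(features)
```

Adding a new step to the existing Pipeline, rather than wrapping it in another object, keeps the top-level artifact keys unchanged for the detection CLI.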

The detection CLI (detect.py) loads the joblib artifact and rehydrates its config (rolling flags, windows) when building features; keep the artifact schema stable across changes.
It supports overrides for timestamp, feature, and id columns; mirror the training CLI's option names when adding parameters so the two stay in parity.
--keep-features toggles whether engineered columns are retained in the scored CSV; preserve this pattern when expanding outputs.
If you add new anomaly criteria, integrate with the existing alert_threshold / top_n flow instead of inventing parallel mechanisms.
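
One way the alert_threshold / top_n flow and the --keep-features toggle could fit together is sketched below. The semantics here are assumptions, not the behavior of detect.py: higher scores are treated as more anomalous, the two criteria are combined with OR, and the column names anomaly_score, anomaly_rank, and is_anomaly are hypothetical.

```python
import pandas as pd


def finalize_scores_sketch(
    scored,
    engineered_columns,
    score_column="anomaly_score",
    alert_threshold=None,
    top_n=None,
    keep_features=False,
):
    out = scored.copy()  # never mutate the caller's frame

    # Rank 1 is the most anomalous row (assumes higher score = more anomalous).
    out["anomaly_rank"] = (
        out[score_column].rank(ascending=False, method="first").astype(int)
    )

    # Each criterion only adds flags; a new criterion would be one more |= line.
    flagged = pd.Series(False, index=out.index)
    if alert_threshold is not None:
        flagged |= out[score_column] >= alert_threshold
    if top_n is not None:
        flagged |= out["anomaly_rank"] <= top_n
    out["is_anomaly"] = flagged

    # Mirror of --keep-features: drop engineered columns unless asked to keep them.
    if not keep_features:
        out = out.drop(columns=list(engineered_columns))
    return out


scored = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 00:10"]
        ),
        "tag_a_roc": [0.0, 0.4, 0.1],
        "anomaly_score": [0.1, 0.9, 0.5],
    }
)
result = finalize_scores_sketch(
    scored, engineered_columns=["tag_a_roc"], alert_threshold=0.8, top_n=2
)
print(result)
```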

Scripts use local imports (e.g., from data_loader import ...); when creating new modules keep them under src/ and import similarly to run via python ml/anomaly_detection/src/<script>.py.
Favor pandas-native operations and avoid mutating input frames in place; the helpers copy data before augmentation.
Gracefully handle timezone-aware timestamps by checking dtype (pandas.api.types) as done in feature helpers.
There are no bundled tests; when adding features, demonstrate usage via docstrings or README snippets and validate with small CSV fixtures.
README commands assume Windows PowerShell; prefer caret continuations and backslash paths when documenting new CLI usage.
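
The timezone guidance above might look like this in practice. It is a small sketch only: normalize_timestamps_sketch is an illustrative helper name, and normalizing to UTC is an assumption rather than what the feature helpers necessarily do.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def normalize_timestamps_sketch(series, assume_tz="UTC"):
    # Parse strings or objects first; datetime columns pass through unchanged.
    if not is_datetime64_any_dtype(series):
        series = pd.to_datetime(series)

    # tz-aware columns carry a DatetimeTZDtype; naive ones are plain datetime64.
    if isinstance(series.dtype, pd.DatetimeTZDtype):
        return series.dt.tz_convert("UTC")

    # Naive timestamps are assumed to be in assume_tz, then converted to UTC.
    return series.dt.tz_localize(assume_tz).dt.tz_convert("UTC")


naive = pd.Series(pd.to_datetime(["2024-01-01 12:00:00"]))
aware = pd.Series(pd.to_datetime(["2024-01-01 12:00:00+02:00"]))

naive_utc = normalize_timestamps_sketch(naive)
aware_utc = normalize_timestamps_sketch(aware)
```

Both branches return a new Series, which keeps the helper consistent with the no-in-place-mutation convention.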