controls-web/ai_agents/anomaly_detection/.github/copilot-instructions.md

# Copilot Instructions
## Project Overview
- Repo covers a lightweight unsupervised anomaly detection pipeline focused on historian CSV exports.
- Core package lives in `src/`; CLI entry points `train.py` and `detect.py` orchestrate data prep, feature engineering, and model scoring.
- `requirements.txt` pins the analytics stack (`pandas`, `scikit-learn`, `numpy`, `joblib`); assume Python 3.10+ with a virtualenv per the README.
## Data Loading & Validation
- Reuse `src/data_loader.py::load_timeseries` with `DataLoadConfig` to ensure consistent timestamp parsing, optional timezone localization, and feature inference.
- When adding new ingestion logic, funnel it through `load_timeseries` or extend it; downstream code relies on `df.attrs["feature_columns"]` being populated for inference overrides.
- Raise `DataValidationError` for user-facing data issues instead of generic exceptions so CLIs can surface clear messages.
## Feature Engineering Patterns
- `feature_engineering.build_feature_matrix` is the single entry point for derived features; it controls rolling stats (`add_rolling_statistics`) and rate-of-change (`add_rate_of_change`).
- Rolling windows are expressed with pandas offset aliases (default `5T`, `15T`, `60T`; newer pandas spells these `5min`, `15min`, `60min`); keep new feature names suffix-based so persisted artifacts stay discoverable.
- Always pass through `timestamp_column` and any `id_columns`; the helper filters non-numeric fields automatically.
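A sketch of the suffix-based naming and offset-alias windows described above. It does not reproduce the real `add_rolling_statistics` internals; the function body and the `_rolling_mean_` suffix are assumptions for illustration:

```python
import pandas as pd


def add_rolling_statistics(
    df: pd.DataFrame,
    timestamp_column: str,
    windows: tuple[str, ...] = ("5min", "15min", "60min"),
) -> pd.DataFrame:
    # Copy first so the input frame is never mutated in place (project convention).
    out = df.copy()
    numeric = out.select_dtypes("number").columns
    indexed = out.set_index(timestamp_column)
    for window in windows:
        rolled = indexed[numeric].rolling(window).mean()
        for col in numeric:
            # Suffix-based names keep persisted artifacts discoverable.
            out[f"{col}_rolling_mean_{window}"] = rolled[col].to_numpy()
    return out
```

Non-numeric columns (ids, labels) are filtered by the `select_dtypes` call, mirroring the helper's behavior of skipping non-numeric fields automatically.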
## Training Workflow (`src/train.py`)
- CLI expects PowerShell-friendly invocation (`^` line continuations) and creates artifact bundles with pipeline + metadata.
- `fit_pipeline` wraps `StandardScaler` + `IsolationForest` with configurable contamination, estimator count, and random state; extend via the existing Pipeline to avoid breaking saved artifacts.
- `generate_scores` writes anomaly flags plus ranking; extra columns must come from the non-feature portion of `feature_df`.
- Outputs default to `ml/anomaly_detection/models/` and `ml/anomaly_detection/outputs/`; use `ensure_parent_dir` before writing new files.
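A hedged sketch of the scaler-plus-isolation-forest pipeline implied above. The parameter names and defaults are assumptions; the authoritative version is `fit_pipeline` in `src/train.py`:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fit_pipeline(
    features: np.ndarray,
    contamination: float = 0.05,
    n_estimators: int = 200,
    random_state: int = 42,
) -> Pipeline:
    # Extend this Pipeline (e.g. insert steps) rather than replacing it,
    # so previously saved joblib artifacts keep loading.
    pipeline = Pipeline(
        [
            ("scaler", StandardScaler()),
            (
                "model",
                IsolationForest(
                    contamination=contamination,
                    n_estimators=n_estimators,
                    random_state=random_state,
                ),
            ),
        ]
    )
    pipeline.fit(features)
    return pipeline
```

The fitted pipeline is what gets bundled (with metadata) via `joblib` into `ml/anomaly_detection/models/`.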
## Detection Workflow (`src/detect.py`)
- Loads the joblib artifact and rehydrates config (rolling flags, windows) when building features; keep artifact schema stable across changes.
- Supports overrides for timestamp, features, and id columns; mirror these option names when adding parameters to maintain parity with the training CLI.
- `--keep-features` toggles whether engineered columns are retained in the scored CSV; preserve this pattern when expanding outputs.
- If you add new anomaly criteria, integrate with the existing `alert_threshold` / `top_n` flow instead of inventing parallel mechanisms.
## Project Conventions & Tips
- Scripts use local imports (e.g., `from data_loader import ...`); keep new modules under `src/` and import them the same way so scripts still run via `python ml/anomaly_detection/src/<script>.py`.
- Favor pandas-native operations and avoid mutating input frames in place—helpers copy data before augmentation.
- Gracefully handle timezone-aware timestamps by checking dtype (`pandas.api.types`) as done in feature helpers.
- There are no bundled tests; when adding features, demonstrate usage via docstrings or README snippets and validate with small CSV fixtures.
- Readme commands assume Windows PowerShell; prefer caret continuations and backslash paths when documenting new CLI usage.
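The timezone tip above boils down to a dtype check before stripping or converting zone info. A small sketch, assuming a helper name that does not exist in the repo:

```python
import pandas as pd


def ensure_naive_timestamps(series: pd.Series) -> pd.Series:
    """Return tz-naive timestamps; copies first so the input is not mutated."""
    out = series.copy()
    # pandas.api.types.is_datetime64tz_dtype does the same check, but is
    # deprecated in newer pandas in favor of the isinstance form.
    if isinstance(out.dtype, pd.DatetimeTZDtype):
        # Convert to UTC before dropping the zone so wall clocks stay comparable.
        out = out.dt.tz_convert("UTC").dt.tz_localize(None)
    return out
```

The same dtype check also guards `.dt.tz_localize` calls, which raise on already-aware series.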