Copilot Instructions
Project Overview
- Repo covers a lightweight unsupervised anomaly detection pipeline focused on historian CSV exports.
- Core package lives in `src/`; CLI entry points `train.py` and `detect.py` orchestrate data prep, feature engineering, and model scoring.
- `requirements.txt` pins the analytics stack (`pandas`, `scikit-learn`, `numpy`, `joblib`); assume Python 3.10+ with a virtualenv per the README.
Data Loading & Validation
- Reuse `src/data_loader.py::load_timeseries` with `DataLoadConfig` to ensure consistent timestamp parsing, optional timezone localization, and feature inference.
- When adding new ingestion logic, funnel it through `load_timeseries` or extend it; downstream code relies on `df.attrs["feature_columns"]` being populated for inference overrides.
- Raise `DataValidationError` for user-facing data issues instead of generic exceptions so CLIs can surface clear messages.
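As a rough illustration of the validation contract above, the following sketch uses stand-ins rather than the project's real code. `DataValidationError` is redefined locally, and `load_timeseries_sketch` is a hypothetical, simplified loader; its signature is an assumption, since the real `load_timeseries` takes a `DataLoadConfig` whose fields are not described here. It shows a user-facing error for a missing timestamp column and how `df.attrs["feature_columns"]` might be populated.

```python
import io

import pandas as pd


class DataValidationError(ValueError):
    """Local stand-in for the project's exception class (assumption)."""


def load_timeseries_sketch(source, timestamp_column="timestamp"):
    """Hypothetical simplified loader; not the real load_timeseries signature."""
    df = pd.read_csv(source)
    if timestamp_column not in df.columns:
        # User-facing problem: raise the domain error, not a bare KeyError.
        raise DataValidationError(
            f"Timestamp column '{timestamp_column}' not found; "
            f"available columns: {list(df.columns)}"
        )
    df[timestamp_column] = pd.to_datetime(df[timestamp_column], errors="coerce")
    if df[timestamp_column].isna().any():
        raise DataValidationError(f"Unparseable values in '{timestamp_column}'.")
    df = df.sort_values(timestamp_column).reset_index(drop=True)
    # Infer features as the numeric columns and record them for downstream code.
    numeric = df.select_dtypes(include="number").columns
    df.attrs["feature_columns"] = [c for c in numeric if c != timestamp_column]
    return df


csv_text = "timestamp,pump_flow,tag\n2024-01-01 00:05,1.5,a\n2024-01-01 00:00,1.0,b\n"
frame = load_timeseries_sketch(io.StringIO(csv_text))
print(frame.attrs["feature_columns"])
```

Subclassing `ValueError` here is a design guess; it lets a CLI catch the domain error specifically while still behaving like an ordinary value error elsewhere.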
Feature Engineering Patterns
- `feature_engineering.build_feature_matrix` is the single entry point for derived features; it controls rolling stats (`add_rolling_statistics`) and rate-of-change (`add_rate_of_change`).
- Rolling windows are expressed with pandas offset aliases (default `5T`, `15T`, `60T`); keep new feature names suffix-based so persisted artifacts stay discoverable.
- Always pass through `timestamp_column` and any `id_columns`; the helper filters non-numeric fields automatically.
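The patterns above can be sketched roughly as follows. `add_rolling_mean_sketch` is a hypothetical helper, not the project's `add_rolling_statistics`, and the `_roll_mean_<window>` suffix is an assumed naming scheme chosen only to illustrate suffix-based names. The windows use the `min` spelling of the offset alias, which recent pandas releases prefer over `T`.

```python
import pandas as pd


def add_rolling_mean_sketch(df, timestamp_column, feature_columns,
                            windows=("5min", "15min")):
    """Hypothetical helper: time-based rolling means with suffix-based names."""
    out = df.copy()  # work on a copy so the caller's frame is not mutated
    # Offset-based windows need a sorted datetime index.
    indexed = out.set_index(timestamp_column)
    for window in windows:
        rolled = indexed[feature_columns].rolling(window).mean()
        for col in feature_columns:
            # The suffix keeps the source column and window recoverable.
            out[f"{col}_roll_mean_{window}"] = rolled[col].to_numpy()
    return out


raw = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=4, freq="5min"),
    "unit": ["A"] * 4,  # id column, passed through untouched
    "flow": [1.0, 3.0, 5.0, 7.0],
})
features = add_rolling_mean_sketch(raw, "timestamp", ["flow"], windows=("10min",))
print(features.columns.tolist())
```

The id and timestamp columns survive unchanged because only the listed numeric features are rolled.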
Training Workflow (src/train.py)
- CLI expects PowerShell-friendly invocation (`^` line continuations) and creates artifact bundles with pipeline + metadata.
- `fit_pipeline` wraps `StandardScaler` + `IsolationForest` with configurable contamination, estimators, and random state; extend via the existing `Pipeline` to avoid breaking saved artifacts.
- `generate_scores` writes anomaly flags plus ranking; extra columns must come from the non-feature portion of `feature_df`.
- Outputs default to `ml/anomaly_detection/models/` and `ml/anomaly_detection/outputs/`; use `ensure_parent_dir` before writing new files.
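A minimal sketch of the training flow described above, under stated assumptions: `fit_pipeline_sketch` and `ensure_parent_dir_sketch` are hypothetical stand-ins for the project's `fit_pipeline` and `ensure_parent_dir`, and the artifact dictionary layout (`pipeline` plus `metadata`) is a guess at the bundle schema, not the real one. The scikit-learn and joblib calls themselves are standard.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fit_pipeline_sketch(X, contamination=0.05, n_estimators=100, random_state=42):
    """Hypothetical stand-in: scale features, then fit an isolation forest."""
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("model", IsolationForest(contamination=contamination,
                                  n_estimators=n_estimators,
                                  random_state=random_state)),
    ])
    return pipeline.fit(X)


def ensure_parent_dir_sketch(path):
    """Hypothetical stand-in: create the parent directory if it is missing."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:5] += 8.0  # inject a few obvious outliers

pipeline = fit_pipeline_sketch(X)
labels = pipeline.predict(X)            # -1 marks anomalies, 1 marks inliers
scores = pipeline.decision_function(X)  # lower means more anomalous

# Assumed bundle layout: the fitted pipeline plus the config needed to rebuild features.
artifact = {
    "pipeline": pipeline,
    "metadata": {"feature_columns": ["f0", "f1"], "rolling_windows": ["5T", "15T", "60T"]},
}
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "bundle.joblib"
    ensure_parent_dir_sketch(target)
    joblib.dump(artifact, target)
    restored = joblib.load(target)
```

Keeping the scaler and model inside one `Pipeline` means the saved artifact carries its own preprocessing, which is presumably why extending the existing pipeline is safer than adding steps outside it.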
Detection Workflow (src/detect.py)
- Loads the joblib artifact and rehydrates config (rolling flags, windows) when building features; keep artifact schema stable across changes.
- Supports overrides for timestamp, features, and id columns—mirror option names if adding parameters to maintain parity with training CLI.
- `--keep-features` toggles whether engineered columns are retained in the scored CSV; preserve this pattern when expanding outputs.
- If you add new anomaly criteria, integrate with the existing `alert_threshold`/`top_n` flow instead of inventing parallel mechanisms.
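To make the `alert_threshold`/`top_n` idea concrete, here is one possible shape for such a flow. Everything in it is an assumption for illustration: the function name `flag_anomalies_sketch`, the column names, the convention that a higher score means more anomalous, and the choice to flag a row when either criterion matches. The real `detect.py` may combine the two differently.

```python
import pandas as pd


def flag_anomalies_sketch(scored, score_column="anomaly_score",
                          alert_threshold=None, top_n=None):
    """Hypothetical: rank rows by score, then flag by threshold and/or top-N."""
    out = scored.copy()  # leave the input frame untouched
    # Assumption: higher score means more anomalous, so rank 1 is the worst row.
    out["anomaly_rank"] = (
        out[score_column].rank(ascending=False, method="first").astype(int)
    )
    flagged = pd.Series(False, index=out.index)
    if alert_threshold is not None:
        flagged |= out[score_column] >= alert_threshold
    if top_n is not None:
        flagged |= out["anomaly_rank"] <= top_n
    out["is_anomaly"] = flagged
    return out


demo = pd.DataFrame({"anomaly_score": [0.1, 0.9, 0.4, 0.7]})
result = flag_anomalies_sketch(demo, alert_threshold=0.8, top_n=2)
print(result)
```

A new criterion would then be one more `flagged |= ...` clause inside the same function rather than a separate flagging path.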
Project Conventions & Tips
- Scripts use local imports (e.g., `from data_loader import ...`); when creating new modules keep them under `src/` and import similarly to run via `python ml/anomaly_detection/src/<script>.py`.
- Favor pandas-native operations and avoid mutating input frames in place; helpers copy data before augmentation.
- Gracefully handle timezone-aware timestamps by checking dtype (`pandas.api.types`) as done in feature helpers.
- There are no bundled tests; when adding features, demonstrate usage via docstrings or README snippets and validate with small CSV fixtures.
- README commands assume Windows PowerShell; prefer caret continuations and backslash paths when documenting new CLI usage.
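The timezone tip above might look something like the sketch below. `to_naive_utc_sketch` is a hypothetical helper, and converting to naive UTC is only one possible policy; how the project's feature helpers actually treat aware timestamps is not specified here. It uses `pandas.api.types.is_datetime64_any_dtype` for the datetime check and an `isinstance` test against `pd.DatetimeTZDtype` for timezone awareness, and it returns a copy rather than mutating its input.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def to_naive_utc_sketch(series):
    """Hypothetical helper: return a tz-naive UTC copy of a datetime Series."""
    if not is_datetime64_any_dtype(series):
        raise TypeError("expected a datetime-like Series")
    if isinstance(series.dtype, pd.DatetimeTZDtype):
        # Aware timestamps: convert to UTC first, then drop the tz info.
        return series.dt.tz_convert("UTC").dt.tz_localize(None)
    # Already naive: hand back a copy so callers can modify it freely.
    return series.copy()


aware = pd.Series(
    pd.date_range("2024-01-01 12:00", periods=2, freq="60min", tz="Europe/Berlin")
)
naive = to_naive_utc_sketch(aware)
print(naive.dtype)
```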