add all files

This commit is contained in:
Rucus
2026-02-17 09:29:34 -06:00
parent b8c8d67c67
commit 782d203799
21925 changed files with 2433086 additions and 0 deletions

View File

@@ -0,0 +1,35 @@
# Copilot Instructions

## Project Overview
- Repo covers a lightweight unsupervised anomaly detection pipeline focused on historian CSV exports.
- Core package lives in `src/`; CLI entry points `train.py` and `detect.py` orchestrate data prep, feature engineering, and model scoring.
- `requirements.txt` pins the analytics stack (`pandas`, `scikit-learn`, `numpy`, `joblib`) plus plotting and config extras; assume Python 3.10+ with a virtualenv per the README.

## Data Loading & Validation
- Reuse `src/data_loader.py::load_timeseries` with `DataLoadConfig` to ensure consistent timestamp parsing, optional timezone localization, and feature inference.
- When adding new ingestion logic, funnel it through `load_timeseries` or extend it; downstream code relies on `df.attrs["feature_columns"]` being populated for inference overrides.
- Raise `DataValidationError` for user-facing data issues instead of generic exceptions so CLIs can surface clear messages.
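A minimal sketch of the validation pattern these bullets describe, mirroring the shape of `load_timeseries` without importing the package (the sample frame is illustrative):

```python
import pandas as pd


class DataValidationError(RuntimeError):
    """User-facing data problem, matching the repo's exception type."""


def check_timestamp_column(df: pd.DataFrame, timestamp_column: str) -> pd.DataFrame:
    # Mirror load_timeseries: fail loudly if the column is absent...
    if timestamp_column not in df.columns:
        raise DataValidationError(
            f"Timestamp column '{timestamp_column}' missing from dataset"
        )
    # ...then coerce unparseable values to NaT and drop them.
    df = df.copy()
    df[timestamp_column] = pd.to_datetime(df[timestamp_column], errors="coerce")
    return df.dropna(subset=[timestamp_column]).sort_values(timestamp_column)


df = pd.DataFrame(
    {"timestamp": ["2026-02-17 09:00", "not-a-date"], "steam_tph": [10.0, 11.0]}
)
clean = check_timestamp_column(df, "timestamp")
print(len(clean))  # the unparseable row is dropped
```

CLIs can catch `DataValidationError` specifically and print its message, while genuine bugs still surface as tracebacks.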

## Feature Engineering Patterns
- `feature_engineering.build_feature_matrix` is the single entry point for derived features; it controls rolling stats (`add_rolling_statistics`) and rate-of-change (`add_rate_of_change`).
- Rolling windows are expressed with pandas offset aliases (default `5T`, `15T`, `60T`); note pandas 2.2+ deprecates `T` in favour of `min` (e.g. `5min`), so expect FutureWarnings on newer installs. Keep new feature names suffix-based so persisted artifacts stay discoverable.
- Always pass through `timestamp_column` and any `id_columns`; the helper filters non-numeric fields automatically.
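The suffix convention can be sketched with plain pandas; this uses the `5min` alias (the newer spelling of the repo's `5T` window), and the suffix simply echoes whatever window string is passed:

```python
import pandas as pd

idx = pd.date_range("2026-02-17 09:00", periods=10, freq="1min")
df = pd.DataFrame({"timestamp": idx, "steam_tph": range(10)})

# Time-based rolling windows require a sorted datetime index.
rolled = df.set_index("timestamp")[["steam_tph"]].rolling("5min", min_periods=3)
mean_df = rolled.mean().add_suffix("__rolling_mean_5min")
std_df = rolled.std(ddof=0).add_suffix("__rolling_std_5min")
out = pd.concat([df.set_index("timestamp"), mean_df, std_df], axis=1).reset_index()
print(out.columns.tolist())
```

Because the derived names embed both the statistic and the window, a saved artifact's `feature_columns` list stays self-describing.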

## Training Workflow (`src/train.py`)
- CLI expects PowerShell-friendly invocation (`^` line continuations) and creates artifact bundles with pipeline + metadata.
- `fit_pipeline` wraps `StandardScaler` + `IsolationForest` with configurable contamination, estimator count, and random state; extend via the existing Pipeline to avoid breaking saved artifacts.
- `generate_scores` writes anomaly flags plus ranking; extra columns must come from the non-feature portion of `feature_df`.
- Outputs default to `ml/anomaly_detection/models/` and `ml/anomaly_detection/outputs/`; use `ensure_parent_dir` before writing new files.
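`fit_pipeline`'s scaler-plus-forest structure, sketched on synthetic data (the contamination, sizes, and planted outliers are illustrative, not tuned):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(0.0, 1.0, size=(500, 3))
X[:5] += 8.0  # plant a few obvious outliers

# Same two-step Pipeline shape as fit_pipeline in train.py.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", IsolationForest(
        contamination=0.02, n_estimators=200, random_state=42, n_jobs=-1
    )),
])
pipeline.fit(X)

scores = pipeline.decision_function(X)  # lower = more anomalous
flags = pipeline.predict(X) == -1       # True marks an anomaly
print(int(flags.sum()))
```

Keeping the scaler inside the Pipeline means the joblib artifact carries its fitted statistics, so inference never has to re-derive them.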

## Detection Workflow (`src/detect.py`)
- Loads the joblib artifact and rehydrates config (rolling flags, windows) when building features; keep artifact schema stable across changes.
- Supports overrides for timestamp, feature, and id columns; mirror option names when adding parameters to maintain parity with the training CLI.
- `--keep-features` toggles whether engineered columns are retained in the scored CSV; preserve this pattern when expanding outputs.
- If you add new anomaly criteria, integrate with the existing `alert_threshold` / `top_n` flow instead of inventing parallel mechanisms.
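The alert flow reduces to a few array operations; a sketch with made-up scores (variable names mirror the CLI options):

```python
import numpy as np
import pandas as pd

scores = np.array([0.12, -0.05, 0.30, -0.20, 0.01])
predictions = scores < 0.0  # stand-in for pipeline.predict(...) == -1

alert_threshold = -0.1      # mirrors --alert-threshold
# Manual threshold wins when supplied; otherwise fall back to model predictions.
is_anomaly = scores < alert_threshold if alert_threshold is not None else predictions

# Rank 1 = most anomalous (lowest score), same convention as the scored CSV.
rank = pd.Series(scores).rank(method="first", ascending=True).astype(int)
print(int(is_anomaly.sum()), rank.tolist())
```

New criteria should feed this same `is_anomaly` / `rank` pair rather than adding a second flagging column.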

## Project Conventions & Tips
- Scripts use local imports (e.g., `from data_loader import ...`); keep new modules under `src/` and import them the same way so scripts still run via `python ml/anomaly_detection/src/<script>.py`.
- Favor pandas-native operations and avoid mutating input frames in place—helpers copy data before augmentation.
- Gracefully handle timezone-aware timestamps by checking dtype (`pandas.api.types`) as done in feature helpers.
- There are no bundled tests; when adding features, demonstrate usage via docstrings or README snippets and validate with small CSV fixtures.
- README commands assume Windows PowerShell; prefer caret continuations and backslash paths when documenting new CLI usage.
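The timezone-dtype check mentioned above, in isolation (using `pandas.api.types` as the feature helpers do):

```python
import pandas as pd
from pandas.api import types as pdt

naive = pd.Series(pd.date_range("2026-02-17", periods=3, freq="D"))
aware = naive.dt.tz_localize("UTC")

# Both pass the broad datetime check used in the feature helpers...
assert pdt.is_datetime64_any_dtype(naive) and pdt.is_datetime64_any_dtype(aware)

# ...but only the localized series carries timezone information.
print(isinstance(aware.dtype, pd.DatetimeTZDtype), isinstance(naive.dtype, pd.DatetimeTZDtype))
```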

View File

@@ -0,0 +1,75 @@
# Anomaly Detection Starter Kit
This module seeds a machine-learning workflow for flagging unusual behavior in LASUCA's process data (steam flow, turbine RPM, conveyor load cells, etc.). It focuses on unsupervised anomaly detection so you can start surfacing outliers without labeled fault data.

## Project structure
```
ml/anomaly_detection/
├── README.md # Project overview and next steps
├── requirements.txt # Python dependencies for the pipeline
└── src/
├── __init__.py # Marks the package
├── data_loader.py # Helpers for reading & validating time-series data
├── feature_engineering.py # Domain feature transformations and rolling stats
├── train.py # CLI script to fit an Isolation Forest model
└── detect.py # CLI script to score new data with the trained model
```

## Quick start
1. **Create a virtual environment** inside the repository root and install dependencies:
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r ml/anomaly_detection/requirements.txt
```
2. **Prepare a CSV export** with at least the following columns:
- `timestamp`: ISO 8601 timestamp or anything `pandas.to_datetime` can parse.
- Sensor columns: numerical fields such as `steam_tph`, `turbine_rpm`, `conveyor_tph`.
Additional metadata columns (e.g., `area`, `equipment`) are optional and help slice metrics later.
3. **Train a baseline model**:
```powershell
python ml/anomaly_detection/src/train.py ^
--data data/clean/process_snapshot.csv ^
--timestamp-column timestamp ^
--features steam_tph turbine_rpm conveyor_tph ^
--model-out ml/anomaly_detection/models/isolation_forest.joblib
```
The script standardizes numeric columns, fits an Isolation Forest, and saves the pipeline along with a CSV of anomaly scores.
4. **Score fresh data** (e.g., streaming batch or another day):
```powershell
python ml/anomaly_detection/src/detect.py ^
--data data/clean/process_snapshot_new.csv ^
--model ml/anomaly_detection/models/isolation_forest.joblib ^
--timestamp-column timestamp ^
--features steam_tph turbine_rpm conveyor_tph ^
--output data/clean/process_snapshot_new_scored.csv
```

## Roadmap ideas
| Phase | Goal | Details |
|-------|------|---------|
| Baseline | Clean data + isolation forest | Validate signals, calculate rolling mean/std, track top anomalies per asset & shift. |
| Enhancements | Context-aware detection | Separate models per unit (boiler, milling line), include load-based normalization, add feedback loop for dismissed alerts. |
| Advanced | Forecast + residual alerts | Train LSTM/Prophet forecasts and alert on residuals, integrate maintenance work orders. |

## Data tips
- **Resample** fast signals to a consistent cadence (e.g., 1 min) to smooth control jitter.
- **Align units** (e.g., convert all steam flows to TPH) before feeding models.
- **Label known events** (downs, maintenance) to benchmark the detector and reduce false positives.
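The first two tips can be sketched together, assuming a 10-second raw cadence and kg/h source units (both illustrative):

```python
import pandas as pd

idx = pd.date_range("2026-02-17 09:00:00", periods=12, freq="10s")
fast = pd.DataFrame({"timestamp": idx, "steam_kgph": [10_000.0] * 12}).set_index("timestamp")

# Resample 10-second readings down to a 1-minute cadence...
per_min = fast.resample("1min").mean()

# ...and convert kg/h to TPH so all steam flows share one unit.
per_min["steam_tph"] = per_min["steam_kgph"] / 1000.0
print(per_min[["steam_tph"]])
```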

## Next steps
1. Pull a week of reconciled historian data into `data/clean/`.
2. Run `train.py` to create an initial anomaly score CSV.
3. Visualize results in the existing dashboards or a Jupyter notebook (e.g., scatter of anomaly score vs. timestamp grouped by equipment).
4. Iterate on feature engineering: rolling gradients, energy-per-ton, turbine slip ratios, etc.
5. Deploy: schedule the detection script (cron/Windows Task Scheduler) and push alerts via email or dashboard badges.
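The per-equipment review in step 3 can start as a plain pandas aggregation before any dashboard work (the data below is made up):

```python
import pandas as pd

scored = pd.DataFrame({
    "equipment": ["boiler", "boiler", "mill", "mill", "mill"],
    "anomaly_score": [0.20, -0.15, 0.05, -0.30, 0.10],
})

# Lowest score = most anomalous; sorting first makes .first() grab
# the top hit per equipment group.
top_per_asset = (
    scored.sort_values("anomaly_score")
          .groupby("equipment", as_index=False)
          .first()
)
print(top_per_asset)
```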
Feel free to extend the pipeline with deep-learning models, model registry integration, or streaming inference as the project matures.

View File

@@ -0,0 +1,7 @@
pandas>=2.1,<3.0
numpy>=1.24,<3.0
scikit-learn>=1.4,<2.0
joblib>=1.3,<2.0
matplotlib>=3.8,<4.0
seaborn>=0.13,<1.0
pyyaml>=6.0,<7.0

View File

@@ -0,0 +1 @@
"""Anomaly detection toolkit for LASUCA controls data."""

View File

@@ -0,0 +1,104 @@
"""Utilities for loading and validating historian exports.
The goal is to keep data ingestion consistent across training and
inference flows. Functions here assume CSV input but can be expanded to
support SQL queries or APIs later.
"""
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List, Optional, Sequence
import pandas as pd


@dataclass
class DataLoadConfig:
"""Configuration describing the structure of the raw dataset."""
timestamp_column: str
feature_columns: Optional[Sequence[str]] = None
id_columns: Sequence[str] = ()
dropna: bool = True
timezone: Optional[str] = None


class DataValidationError(RuntimeError):
"""Raised when the provided dataset cannot be prepared as requested."""


def _normalize_columns(columns: Iterable[str]) -> List[str]:
return [str(col) for col in columns]


def load_timeseries(csv_path: Path, config: DataLoadConfig) -> pd.DataFrame:
"""Load and clean a historian export.
Parameters
----------
csv_path:
Path to the CSV file to read.
config:
DataLoadConfig describing timestamp and required feature columns.
Returns
-------
pandas.DataFrame
Sorted dataframe with parsed timestamps and selected columns.
"""
csv_path = Path(csv_path)
if not csv_path.exists():
raise FileNotFoundError(f"Data file not found: {csv_path}")
df = pd.read_csv(csv_path)
if config.timestamp_column not in df.columns:
raise DataValidationError(
f"Timestamp column '{config.timestamp_column}' missing from dataset"
)
df[config.timestamp_column] = pd.to_datetime(
df[config.timestamp_column], errors="coerce"
)
df = df.dropna(subset=[config.timestamp_column])
df = df.sort_values(config.timestamp_column).reset_index(drop=True)
    if config.timezone:
        ts = df[config.timestamp_column]
        if isinstance(ts.dtype, pd.DatetimeTZDtype):
            # Export is already timezone-aware: convert only.
            df[config.timestamp_column] = ts.dt.tz_convert(config.timezone)
        else:
            # Naive timestamps are assumed UTC, matching prior behaviour.
            df[config.timestamp_column] = ts.dt.tz_localize("UTC").dt.tz_convert(config.timezone)
selected_features: Sequence[str]
if config.feature_columns:
missing = set(config.feature_columns) - set(df.columns)
if missing:
raise DataValidationError(
"Missing feature columns: " + ", ".join(sorted(missing))
)
selected_features = list(config.feature_columns)
else:
selected_features = _normalize_columns(
df.select_dtypes(include=["number"]).columns
)
selected_features = [
col
for col in selected_features
if col not in (config.timestamp_column, *config.id_columns)
]
if not selected_features:
raise DataValidationError(
"No numeric feature columns detected; specify them explicitly"
)
columns_to_keep = [config.timestamp_column, *config.id_columns, *selected_features]
df = df[columns_to_keep]
if config.dropna:
df = df.dropna(axis=0, how="any", subset=selected_features)
df = df.reset_index(drop=True)
df.attrs["feature_columns"] = selected_features
return df

View File

@@ -0,0 +1,167 @@
"""Score new historian data with a previously trained anomaly detector."""
from __future__ import annotations
import argparse
from pathlib import Path
from typing import Sequence
import joblib
import numpy as np
import pandas as pd
from data_loader import DataLoadConfig, load_timeseries
from feature_engineering import build_feature_matrix


def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Run anomaly detection on new data using a saved model."
)
parser.add_argument("--data", required=True, help="Path to the CSV file to score.")
parser.add_argument(
"--model",
required=True,
help="Path to the joblib artifact generated by train.py.",
)
parser.add_argument(
"--timestamp-column",
help="Timestamp column name. Defaults to the value stored in the model artifact.",
)
parser.add_argument(
"--features",
nargs="*",
help="Optional override for base feature columns. Defaults to the model's training configuration.",
)
parser.add_argument(
"--id-columns",
nargs="*",
help="Optional override for identifier columns to retain in outputs.",
)
parser.add_argument(
"--output",
required=True,
help="Where to write the scored CSV (timestamp, ids, anomaly_score, is_anomaly, rank).",
)
parser.add_argument(
"--keep-features",
action="store_true",
help="Include engineered feature columns in the output file for deeper analysis.",
)
parser.add_argument(
"--alert-threshold",
type=float,
help="Optional manual threshold on the anomaly score. If set, flag rows with score < threshold.",
)
parser.add_argument(
"--top-n",
type=int,
default=None,
help="Optional count of the most anomalous rows to print to stdout for quick review.",
)
return parser.parse_args()


def prepare_feature_frame(
df: pd.DataFrame,
timestamp_column: str,
base_features: Sequence[str],
id_columns: Sequence[str],
artifact_config: dict,
):
feature_df, feature_columns = build_feature_matrix(
df,
timestamp_column=timestamp_column,
base_feature_columns=base_features,
include_rolling=artifact_config.get("include_rolling", True),
include_rate=artifact_config.get("include_rate", True),
rolling_windows=artifact_config.get("rolling_windows", ["5T", "15T", "60T"]),
id_columns=id_columns,
)
return feature_df, feature_columns


def main() -> None:
args = parse_args()
artifact = joblib.load(Path(args.model))
pipeline = artifact["pipeline"]
artifact_features = artifact["feature_columns"]
timestamp_column = args.timestamp_column or artifact.get("timestamp_column")
if not timestamp_column:
raise RuntimeError("Timestamp column must be provided via --timestamp-column or stored in the artifact.")
id_columns = list(
args.id_columns if args.id_columns is not None else artifact.get("id_columns", [])
)
base_features = list(
args.features if args.features is not None else artifact.get("base_features", [])
)
config = DataLoadConfig(
timestamp_column=timestamp_column,
feature_columns=base_features if base_features else None,
id_columns=id_columns,
)
raw_df = load_timeseries(Path(args.data), config)
if not base_features:
base_features = list(raw_df.attrs.get("feature_columns", []))
if not base_features:
raise RuntimeError(
"Unable to infer base feature columns. Provide them via --features."
)
feature_df, feature_columns = prepare_feature_frame(
raw_df,
timestamp_column=timestamp_column,
base_features=base_features,
id_columns=id_columns,
artifact_config=artifact.get("config", {}),
)
missing = set(artifact_features) - set(feature_columns)
if missing:
raise RuntimeError(
"Engineered features required by the model are missing: "
+ ", ".join(sorted(missing))
)
matrix = feature_df[artifact_features]
scores = pipeline.decision_function(matrix)
predictions = pipeline.predict(matrix) == -1
if args.alert_threshold is not None:
is_anomaly = scores < args.alert_threshold
else:
is_anomaly = predictions
rank = pd.Series(scores).rank(method="first", ascending=True).astype(int)
if args.keep_features:
context_cols = list(feature_df.columns)
else:
context_cols = [timestamp_column, *id_columns]
output_df = feature_df[context_cols].copy()
output_df["anomaly_score"] = scores
output_df["is_anomaly"] = is_anomaly
output_df["anomaly_rank"] = rank.values
output_path = Path(args.output)
output_path.parent.mkdir(parents=True, exist_ok=True)
output_df.to_csv(output_path, index=False)
total_anomalies = int(output_df["is_anomaly"].sum())
print(f"Wrote {len(output_df)} scored rows to {output_path}")
print(f"Flagged {total_anomalies} anomalies")
if args.top_n:
top_rows = output_df.nsmallest(args.top_n, "anomaly_score")
print("Top anomalous rows:")
print(top_rows[[timestamp_column, *id_columns, "anomaly_score", "is_anomaly"]])


if __name__ == "__main__":
main()

View File

@@ -0,0 +1,124 @@
"""Domain-specific feature engineering helpers.
These routines create additional signals such as rolling averages,
standard deviations, and gradients that often improve anomaly detection
performance on process data.
"""
from __future__ import annotations
from typing import Iterable, List, Sequence, Tuple
import numpy as np
import pandas as pd
from pandas.api import types as pdt


def _ensure_sorted(df: pd.DataFrame, timestamp_column: str) -> pd.DataFrame:
if not df[timestamp_column].is_monotonic_increasing:
return df.sort_values(timestamp_column).reset_index(drop=True)
return df


def add_rolling_statistics(
df: pd.DataFrame,
feature_columns: Sequence[str],
timestamp_column: str,
windows: Iterable[str] = ("5T", "15T", "60T"),
min_periods: int = 3,
) -> pd.DataFrame:
"""Append rolling mean and std for each feature.
Parameters
----------
df:
DataFrame containing the source data.
feature_columns:
Columns to summarize with rolling statistics.
timestamp_column:
Name of the timestamp column (must be datetime-like).
windows:
Pandas offset aliases representing rolling windows (default minutes).
min_periods:
Minimum data points required within each window.
"""
df = df.copy()
df = _ensure_sorted(df, timestamp_column)
if not pdt.is_datetime64_any_dtype(df[timestamp_column]):
raise TypeError("timestamp_column must be datetime-like")
df = df.set_index(timestamp_column)
for window in windows:
rolling_obj = df[feature_columns].rolling(window=window, min_periods=min_periods)
mean_df = rolling_obj.mean().add_suffix(f"__rolling_mean_{window}")
std_df = rolling_obj.std(ddof=0).add_suffix(f"__rolling_std_{window}")
df = pd.concat([df, mean_df, std_df], axis=1)
df = df.reset_index()
return df


def add_rate_of_change(
df: pd.DataFrame,
feature_columns: Sequence[str],
timestamp_column: str,
) -> pd.DataFrame:
"""Append the first derivative (per-minute change) for each feature."""
df = df.copy()
df = _ensure_sorted(df, timestamp_column)
df = df.set_index(timestamp_column)
deltas = df.index.to_series().diff().dt.total_seconds().fillna(0)
seconds = deltas.to_numpy()
seconds[seconds == 0] = np.nan # Avoid division by zero
for col in feature_columns:
rate_col = f"{col}__rate_per_min"
df[rate_col] = df[col].diff() / seconds * 60.0
df[rate_col] = df[rate_col].fillna(0.0)
df = df.reset_index()
return df


def build_feature_matrix(
df: pd.DataFrame,
timestamp_column: str,
base_feature_columns: Sequence[str],
include_rolling: bool = True,
include_rate: bool = True,
rolling_windows: Iterable[str] = ("5T", "15T", "60T"),
id_columns: Sequence[str] | None = None,
) -> Tuple[pd.DataFrame, List[str]]:
"""Produce an enriched feature matrix for model consumption."""
id_columns = list(id_columns or [])
enriched = df.copy()
if include_rolling:
enriched = add_rolling_statistics(
enriched,
feature_columns=base_feature_columns,
timestamp_column=timestamp_column,
windows=rolling_windows,
)
if include_rate:
enriched = add_rate_of_change(
enriched,
feature_columns=base_feature_columns,
timestamp_column=timestamp_column,
)
feature_columns = [
col
for col in enriched.columns
if col not in {timestamp_column, *id_columns} and pdt.is_numeric_dtype(enriched[col])
]
columns_to_return = [timestamp_column, *id_columns, *feature_columns]
return enriched[columns_to_return], feature_columns

View File

@@ -0,0 +1,217 @@
"""Command-line entry point for training an anomaly detection model.
Example usage (PowerShell):
python ml/anomaly_detection/src/train.py ^
--data data/clean/process_snapshot.csv ^
--timestamp-column timestamp ^
--features steam_tph turbine_rpm conveyor_tph ^
--model-out ml/anomaly_detection/models/isolation_forest.joblib
"""
from __future__ import annotations
import argparse
from pathlib import Path
from typing import Sequence
import joblib
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from data_loader import DataLoadConfig, load_timeseries
from feature_engineering import build_feature_matrix


def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Train an Isolation Forest on historian data to detect anomalies."
)
parser.add_argument(
"--data",
required=True,
help="Path to the CSV file containing historical measurements.",
)
parser.add_argument(
"--timestamp-column",
required=True,
help="Name of the timestamp column in the dataset.",
)
parser.add_argument(
"--features",
nargs="*",
default=None,
help="Optional explicit list of feature columns to use. Defaults to all numeric columns.",
)
parser.add_argument(
"--id-columns",
nargs="*",
default=(),
help="Columns that should be preserved in outputs but excluded from modelling.",
)
parser.add_argument(
"--contamination",
type=float,
default=0.02,
help="Expected fraction of anomalies in the dataset (Isolation Forest parameter).",
)
parser.add_argument(
"--n-estimators",
type=int,
default=200,
help="Number of trees for the Isolation Forest.",
)
parser.add_argument(
"--random-state",
type=int,
default=42,
help="Random seed for reproducibility.",
)
parser.add_argument(
"--model-out",
type=str,
default="ml/anomaly_detection/models/isolation_forest.joblib",
help="Where to save the fitted model pipeline.",
)
parser.add_argument(
"--scores-out",
type=str,
default="ml/anomaly_detection/outputs/training_scores.csv",
help="Optional path to write anomaly scores for the training data.",
)
parser.add_argument(
"--no-rolling",
action="store_true",
help="Disable rolling-window feature augmentation.",
)
parser.add_argument(
"--no-rate",
action="store_true",
help="Disable rate-of-change feature augmentation.",
)
parser.add_argument(
"--rolling-windows",
nargs="*",
default=["5T", "15T", "60T"],
help="Rolling window sizes (pandas offset aliases) for derived features.",
)
return parser.parse_args()


def ensure_parent_dir(path: Path) -> None:
path.parent.mkdir(parents=True, exist_ok=True)


def fit_pipeline(feature_matrix: pd.DataFrame, feature_columns: Sequence[str], args: argparse.Namespace) -> Pipeline:
scaler = StandardScaler()
model = IsolationForest(
contamination=args.contamination,
n_estimators=args.n_estimators,
random_state=args.random_state,
n_jobs=-1,
)
pipeline = Pipeline([("scaler", scaler), ("model", model)])
pipeline.fit(feature_matrix[feature_columns])
return pipeline


def generate_scores(
pipeline: Pipeline,
feature_matrix: pd.DataFrame,
feature_columns: Sequence[str],
) -> pd.DataFrame:
transformed = feature_matrix[feature_columns]
decision_scores = pipeline.decision_function(transformed)
predictions = pipeline.predict(transformed) # 1 for normal, -1 for anomaly
scores_df = feature_matrix[[col for col in feature_matrix.columns if col not in feature_columns]].copy()
scores_df["anomaly_score"] = decision_scores
scores_df["is_anomaly"] = predictions == -1
scores_df["anomaly_rank"] = (
pd.Series(decision_scores).rank(method="first", ascending=True).astype(int).values
)
return scores_df


def main() -> None:
args = parse_args()
config = DataLoadConfig(
timestamp_column=args.timestamp_column,
feature_columns=args.features,
id_columns=args.id_columns,
)
raw_df = load_timeseries(Path(args.data), config)
if config.feature_columns:
base_features = list(config.feature_columns)
else:
base_features = list(raw_df.attrs.get("feature_columns", []))
if not base_features:
non_feature_columns = {args.timestamp_column, *config.id_columns}
base_features = [
col
for col in raw_df.columns
if col not in non_feature_columns and pd.api.types.is_numeric_dtype(raw_df[col])
]
if not base_features:
raise RuntimeError(
"Unable to determine feature columns. Specify them explicitly via --features."
)
feature_df, feature_columns = build_feature_matrix(
raw_df,
timestamp_column=args.timestamp_column,
base_feature_columns=base_features,
include_rolling=not args.no_rolling,
include_rate=not args.no_rate,
rolling_windows=args.rolling_windows,
id_columns=config.id_columns,
)
if not feature_columns:
raise RuntimeError("No feature columns available after feature engineering.")
pipeline = fit_pipeline(feature_df, feature_columns, args)
artifact = {
"pipeline": pipeline,
"feature_columns": list(feature_columns),
"timestamp_column": args.timestamp_column,
"id_columns": list(config.id_columns),
"base_features": list(base_features),
"config": {
"contamination": args.contamination,
"n_estimators": args.n_estimators,
"random_state": args.random_state,
"rolling_windows": args.rolling_windows,
"include_rolling": not args.no_rolling,
"include_rate": not args.no_rate,
},
}
model_path = Path(args.model_out)
ensure_parent_dir(model_path)
joblib.dump(artifact, model_path)
scores_df = generate_scores(pipeline, feature_df, feature_columns)
scores_path = Path(args.scores_out)
ensure_parent_dir(scores_path)
scores_df.to_csv(scores_path, index=False)
print(f"Model saved to {model_path}")
print(f"Training scores written to {scores_path}")
print(
f"Detected {(scores_df['is_anomaly'].sum())} anomalies out of {len(scores_df)} records"
)


if __name__ == "__main__":
main()