Anomaly Detection Starter Kit

This module seeds a machine-learning workflow for flagging unusual behavior in LASUCA's process data (steam flow, turbine RPM, conveyor load cells, etc.). It focuses on unsupervised anomaly detection so you can start surfacing outliers without labeled fault data.

Project structure

ml/anomaly_detection/
├── README.md                # Project overview and next steps
├── requirements.txt         # Python dependencies for the pipeline
└── src/
    ├── __init__.py          # Marks the package
    ├── data_loader.py       # Helpers for reading & validating time-series data
    ├── feature_engineering.py # Domain feature transformations and rolling stats
    ├── train.py             # CLI script to fit an Isolation Forest model
    └── detect.py            # CLI script to score new data with the trained model

Quick start

  1. Create a virtual environment inside the repository root and install dependencies:

    python -m venv .venv
    .\.venv\Scripts\Activate.ps1   # on macOS/Linux: source .venv/bin/activate
    pip install -r ml/anomaly_detection/requirements.txt
    
  2. Prepare a CSV export with at least the following columns:

    • timestamp: ISO 8601 timestamp or anything pandas.to_datetime can parse.
    • Sensor columns: numerical fields such as steam_tph, turbine_rpm, conveyor_tph.

    Additional metadata columns (e.g., area, equipment) are optional and help slice metrics later.
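
    A quick way to sanity-check an export before training (a minimal sketch, assuming pandas; the path and column names are the examples above):

    import pandas as pd

    df = pd.read_csv("data/clean/process_snapshot.csv")
    df["timestamp"] = pd.to_datetime(df["timestamp"])  # must parse cleanly

    # The sensor columns you plan to train on should be numeric.
    for col in ["steam_tph", "turbine_rpm", "conveyor_tph"]:
        assert pd.api.types.is_numeric_dtype(df[col]), f"{col} is not numeric"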

  3. Train a baseline model:

    python ml/anomaly_detection/src/train.py `
        --data data/clean/process_snapshot.csv `
        --timestamp-column timestamp `
        --features steam_tph turbine_rpm conveyor_tph `
        --model-out ml/anomaly_detection/models/isolation_forest.joblib
    

    The script standardizes numeric columns, fits an Isolation Forest, and saves the pipeline along with a CSV of anomaly scores.
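
    Under the hood, the baseline amounts to roughly the following sketch (illustrative only; see train.py for the actual implementation, and treat the hyperparameters as placeholders):

    import joblib
    import pandas as pd
    from sklearn.ensemble import IsolationForest
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    features = ["steam_tph", "turbine_rpm", "conveyor_tph"]
    df = pd.read_csv("data/clean/process_snapshot.csv")

    pipeline = Pipeline([
        ("scale", StandardScaler()),  # standardize numeric columns
        ("forest", IsolationForest(contamination=0.01, random_state=42)),
    ])
    pipeline.fit(df[features])

    joblib.dump(pipeline, "ml/anomaly_detection/models/isolation_forest.joblib")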

  4. Score fresh data (e.g., a streaming batch or another day's export):

    python ml/anomaly_detection/src/detect.py `
        --data data/clean/process_snapshot_new.csv `
        --model ml/anomaly_detection/models/isolation_forest.joblib `
        --timestamp-column timestamp `
        --features steam_tph turbine_rpm conveyor_tph `
        --output data/clean/process_snapshot_new_scored.csv
    
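
    Scoring with the saved pipeline boils down to something like this (a sketch; detect.py wraps the equivalent logic behind the CLI above, and the output column names here are illustrative):

    import joblib
    import pandas as pd

    features = ["steam_tph", "turbine_rpm", "conveyor_tph"]
    pipeline = joblib.load("ml/anomaly_detection/models/isolation_forest.joblib")

    new = pd.read_csv("data/clean/process_snapshot_new.csv")
    new["anomaly_score"] = pipeline.decision_function(new[features])  # lower = more anomalous
    new["is_anomaly"] = pipeline.predict(new[features]) == -1         # -1 marks outliers
    new.to_csv("data/clean/process_snapshot_new_scored.csv", index=False)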

Roadmap ideas

Phase        | Goal                          | Details
-------------|-------------------------------|------------------------------------------------------------
Baseline     | Clean data + isolation forest | Validate signals, calculate rolling mean/std, track top anomalies per asset & shift.
Enhancements | Context-aware detection       | Separate models per unit (boiler, milling line), include load-based normalization, add feedback loop for dismissed alerts.
Advanced     | Forecast + residual alerts    | Train LSTM/Prophet forecasts and alert on residuals, integrate maintenance work orders.
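
The rolling mean/std features in the baseline phase are a natural fit for feature_engineering.py; a minimal pandas sketch (the window size is illustrative, so tune it to each signal's dynamics):

    import pandas as pd

    df = pd.read_csv("data/clean/process_snapshot.csv", parse_dates=["timestamp"])
    df = df.sort_values("timestamp")

    window = 30  # samples; e.g., 30 minutes at a 1-minute cadence
    for col in ["steam_tph", "turbine_rpm", "conveyor_tph"]:
        df[f"{col}_roll_mean"] = df[col].rolling(window, min_periods=1).mean()
        df[f"{col}_roll_std"] = df[col].rolling(window, min_periods=1).std()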

Data tips

  • Resample fast signals to a consistent cadence (e.g., 1 min) to smooth control jitter; see the sketch after this list.
  • Align units (e.g., convert all steam flows to TPH) before feeding models.
  • Label known events (downtime, maintenance) to benchmark the detector and reduce false positives.
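
For the resampling tip, pandas handles the downsampling in one chain of calls (a sketch; the 1-minute cadence and column names are just the examples above):

    import pandas as pd

    df = pd.read_csv("data/clean/process_snapshot.csv", parse_dates=["timestamp"])
    resampled = (
        df.set_index("timestamp")
          .resample("1min")[["steam_tph", "turbine_rpm", "conveyor_tph"]]
          .mean()         # average within each minute to smooth control jitter
          .interpolate()  # fill short gaps left by slower signals
          .reset_index()
    )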

Next steps

  1. Pull a week of reconciled historian data into data/clean/.
  2. Run train.py to create an initial anomaly score CSV.
  3. Visualize results in the existing dashboards or a Jupyter notebook (e.g., scatter of anomaly score vs. timestamp grouped by equipment; see the sketch after this list).
  4. Iterate on feature engineering: rolling gradients, energy-per-ton, turbine slip ratios, etc.
  5. Deploy: schedule the detection script (cron/Windows Task Scheduler) and push alerts via email or dashboard badges.
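
For step 3, a notebook cell like the following produces the scatter described above (a sketch; it assumes the scored CSV and anomaly_score column from the detect.py example and an optional equipment metadata column):

    import matplotlib.pyplot as plt
    import pandas as pd

    scored = pd.read_csv("data/clean/process_snapshot_new_scored.csv",
                         parse_dates=["timestamp"])

    fig, ax = plt.subplots(figsize=(12, 4))
    for name, group in scored.groupby("equipment"):
        ax.scatter(group["timestamp"], group["anomaly_score"], s=8, label=name)
    ax.set_xlabel("timestamp")
    ax.set_ylabel("anomaly score")
    ax.legend(title="equipment")
    plt.tight_layout()
    plt.show()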

Feel free to extend the pipeline with deep-learning models, model registry integration, or streaming inference as the project matures.