Anomaly Detection Starter Kit

This module seeds a machine-learning workflow for flagging unusual behavior in LASUCA's process data (steam flow, turbine RPM, conveyor load cells, etc.). It focuses on unsupervised anomaly detection so you can start surfacing outliers without labeled fault data.

Project structure

ml/anomaly_detection/
├── README.md                # Project overview and next steps
├── requirements.txt         # Python dependencies for the pipeline
└── src/
    ├── __init__.py          # Marks the package
    ├── data_loader.py       # Helpers for reading & validating time-series data
    ├── feature_engineering.py # Domain feature transformations and rolling stats
    ├── train.py             # CLI script to fit an Isolation Forest model
    └── detect.py            # CLI script to score new data with the trained model

Quick start

  1. Create a virtual environment inside the repository root and install dependencies:

    python -m venv .venv
    .\.venv\Scripts\Activate.ps1   # on macOS/Linux: source .venv/bin/activate
    pip install -r ml/anomaly_detection/requirements.txt
    
  2. Prepare a CSV export with at least the following columns:

    • timestamp: ISO 8601 timestamp or anything pandas.to_datetime can parse.
    • Sensor columns: numerical fields such as steam_tph, turbine_rpm, conveyor_tph.

    Additional metadata columns (e.g., area, equipment) are optional and help slice metrics later.
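
    A quick way to sanity-check an export before training (a minimal sketch, assuming pandas; the path and column names are the examples above):

    import pandas as pd

    df = pd.read_csv("data/clean/process_snapshot.csv")
    df["timestamp"] = pd.to_datetime(df["timestamp"])  # must parse cleanly

    # The sensor columns you plan to train on should be numeric.
    for col in ["steam_tph", "turbine_rpm", "conveyor_tph"]:
        assert pd.api.types.is_numeric_dtype(df[col]), f"{col} is not numeric"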

  3. Train a baseline model:

    python ml/anomaly_detection/src/train.py `
        --data data/clean/process_snapshot.csv `
        --timestamp-column timestamp `
        --features steam_tph turbine_rpm conveyor_tph `
        --model-out ml/anomaly_detection/models/isolation_forest.joblib
    

    The script standardizes numeric columns, fits an Isolation Forest, and saves the pipeline along with a CSV of anomaly scores.
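
    Under the hood, the baseline amounts to roughly the following sketch (illustrative only; see train.py for the actual implementation, and treat the hyperparameters as placeholders):

    import joblib
    import pandas as pd
    from sklearn.ensemble import IsolationForest
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    features = ["steam_tph", "turbine_rpm", "conveyor_tph"]
    df = pd.read_csv("data/clean/process_snapshot.csv")

    pipeline = Pipeline([
        ("scale", StandardScaler()),  # standardize numeric columns
        ("forest", IsolationForest(contamination=0.01, random_state=42)),
    ])
    pipeline.fit(df[features])

    joblib.dump(pipeline, "ml/anomaly_detection/models/isolation_forest.joblib")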

  4. Score fresh data (e.g., a streaming batch or another day's export):

    python ml/anomaly_detection/src/detect.py `
        --data data/clean/process_snapshot_new.csv `
        --model ml/anomaly_detection/models/isolation_forest.joblib `
        --timestamp-column timestamp `
        --features steam_tph turbine_rpm conveyor_tph `
        --output data/clean/process_snapshot_new_scored.csv
    
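
    Scoring with the saved pipeline boils down to something like this (a sketch; detect.py wraps the equivalent logic behind the CLI above, and the output column names here are illustrative):

    import joblib
    import pandas as pd

    features = ["steam_tph", "turbine_rpm", "conveyor_tph"]
    pipeline = joblib.load("ml/anomaly_detection/models/isolation_forest.joblib")

    new = pd.read_csv("data/clean/process_snapshot_new.csv")
    new["anomaly_score"] = pipeline.decision_function(new[features])  # lower = more anomalous
    new["is_anomaly"] = pipeline.predict(new[features]) == -1         # -1 marks outliers
    new.to_csv("data/clean/process_snapshot_new_scored.csv", index=False)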

Roadmap ideas

Phase        | Goal                          | Details
-------------|-------------------------------|------------------------------------------------------------
Baseline     | Clean data + isolation forest | Validate signals, calculate rolling mean/std, track top anomalies per asset & shift.
Enhancements | Context-aware detection       | Separate models per unit (boiler, milling line), include load-based normalization, add feedback loop for dismissed alerts.
Advanced     | Forecast + residual alerts    | Train LSTM/Prophet forecasts and alert on residuals, integrate maintenance work orders.
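
The rolling mean/std features in the baseline phase are a natural fit for feature_engineering.py; a minimal pandas sketch (the window size is illustrative, so tune it to each signal's dynamics):

    import pandas as pd

    df = pd.read_csv("data/clean/process_snapshot.csv", parse_dates=["timestamp"])
    df = df.sort_values("timestamp")

    window = 30  # samples; e.g., 30 minutes at a 1-minute cadence
    for col in ["steam_tph", "turbine_rpm", "conveyor_tph"]:
        df[f"{col}_roll_mean"] = df[col].rolling(window, min_periods=1).mean()
        df[f"{col}_roll_std"] = df[col].rolling(window, min_periods=1).std()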

Data tips

  • Resample fast signals to a consistent cadence (e.g., 1 min) to smooth control jitter; see the sketch after this list.
  • Align units (e.g., convert all steam flows to TPH) before feeding models.
  • Label known events (downtime, maintenance) to benchmark the detector and reduce false positives.
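
For the resampling tip, pandas handles the downsampling in one chain of calls (a sketch; the 1-minute cadence and column names are just the examples above):

    import pandas as pd

    df = pd.read_csv("data/clean/process_snapshot.csv", parse_dates=["timestamp"])
    resampled = (
        df.set_index("timestamp")
          .resample("1min")[["steam_tph", "turbine_rpm", "conveyor_tph"]]
          .mean()         # average within each minute to smooth control jitter
          .interpolate()  # fill short gaps left by slower signals
          .reset_index()
    )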

Next steps

  1. Pull a week of reconciled historian data into data/clean/.
  2. Run train.py to create an initial anomaly score CSV.
  3. Visualize results in the existing dashboards or a Jupyter notebook (e.g., scatter of anomaly score vs. timestamp grouped by equipment; see the sketch after this list).
  4. Iterate on feature engineering: rolling gradients, energy-per-ton, turbine slip ratios, etc.
  5. Deploy: schedule the detection script (cron/Windows Task Scheduler) and push alerts via email or dashboard badges.
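
For step 3, a notebook cell like the following produces the scatter described above (a sketch; it assumes the scored CSV and anomaly_score column from the detect.py example and an optional equipment metadata column):

    import matplotlib.pyplot as plt
    import pandas as pd

    scored = pd.read_csv("data/clean/process_snapshot_new_scored.csv",
                         parse_dates=["timestamp"])

    fig, ax = plt.subplots(figsize=(12, 4))
    for name, group in scored.groupby("equipment"):
        ax.scatter(group["timestamp"], group["anomaly_score"], s=8, label=name)
    ax.set_xlabel("timestamp")
    ax.set_ylabel("anomaly score")
    ax.legend(title="equipment")
    plt.tight_layout()
    plt.show()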

Feel free to extend the pipeline with deep-learning models, model registry integration, or streaming inference as the project matures.