7.7

Data-prep pipeline v0.3 delivered
- Added preprocess_incremental.py
  - YAML-driven (paths, target_len, scaler, shard_size)
  - Welford online stats (mean / std / min / max)
  - Incremental mode – processes only new CSV files and updates stats
  - Writes per-sample .npy files (keeps raw folder hierarchy)
  - Optional shard output: shard_####.pt (disabled by default)
  - Progress bars via tqdm; generates summary.txt for later plotting
- Successfully processed 1,951 CSV files (~31 GB) end-to-end
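
Sketch of the Welford + incremental idea above, for reference. This is not the code in preprocess_incremental.py: the class name RunningStats, its field names, and the use of population std are assumptions made for illustration. It shows that stats saved after one run can be restored and updated with new samples, giving the same result as a single pass over all the data.

```python
import math
from dataclasses import dataclass, asdict


@dataclass
class RunningStats:
    """Hypothetical container for Welford online statistics."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0              # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, x: float) -> None:
        # Welford update: one pass, no need to keep earlier samples in memory.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def std(self) -> float:
        # Population std (assumption); use m2 / (count - 1) for sample std.
        return math.sqrt(self.m2 / self.count) if self.count else math.nan


# First run: process the files available so far, then persist the state.
first_run = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0]:
    first_run.update(value)
saved_state = asdict(first_run)   # could be written to disk between runs

# Later run: restore the state and feed only the new samples.
resumed = RunningStats(**saved_state)
for value in [5.0, 6.0]:
    resumed.update(value)

# Reference: a single pass over all samples.
one_pass = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]:
    one_pass.update(value)
```

Only count, mean, and m2 (plus min / max) need to be stored between runs, which is what makes the incremental mode cheap.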
Bug fixes / tweaks
- Explicit encoding="utf-8" when reading YAML → fixed UnicodeDecodeError
- Cast shard_size to int (or write it as an unquoted number in YAML) → fixed TypeError on int // str
- Added .copy() before torch.from_numpy to silence the “array not writable” warning (optional; costs one extra copy per array)
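
Sketch of the shard_size fix, for reference. The helper name normalize_config and the example values are hypothetical; only the keys target_len and shard_size come from the config described above. A quoted value in YAML arrives as a str, so integer division against it fails until it is cast.

```python
def normalize_config(cfg: dict) -> dict:
    """Return a copy of cfg with shard_size coerced to a positive int."""
    out = dict(cfg)
    out["shard_size"] = int(out["shard_size"])   # "1000" -> 1000
    if out["shard_size"] <= 0:
        raise ValueError("shard_size must be positive")
    return out


# What a YAML loader would hand back for: shard_size: "1000" (quoted).
raw = {"target_len": 1024, "shard_size": "1000"}

cfg = normalize_config(raw)
full_shards = 2500 // cfg["shard_size"]   # works now that shard_size is an int
```

Casting in code keeps the script tolerant of both quoted and unquoted values in the YAML file.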
Version control
- Added .gitignore (ignores data/, *.npy, *.pt, etc.)
- Used VS Code Source Control:
  - staged & committed changes
  - Sync Changes pushed to origin/main
- No data files pushed; only code, scripts, configs
Next steps
- (Optional) create plot_stats.py to visualize stats.npz
- Use NpyLazyDataset for training; enable shard export later if needed
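
Sketch of lazy loading over the per-sample .npy files, for reference. This is not the project's NpyLazyDataset: the class name LazyNpyFolder and its constructor argument are assumptions. It only indexes file paths up front and memory-maps each array on access, and it follows the __len__ / __getitem__ protocol that map-style PyTorch datasets use.

```python
from pathlib import Path

import numpy as np


class LazyNpyFolder:
    """Hypothetical lazy reader: lists .npy files once, loads them on demand."""

    def __init__(self, root):
        # Recurse so the raw folder hierarchy is preserved; sort for a stable order.
        self.paths = sorted(Path(root).rglob("*.npy"))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> np.ndarray:
        # mmap_mode="r" maps the file read-only instead of reading it eagerly.
        return np.load(self.paths[idx], mmap_mode="r")
```

Arrays returned this way are read-only, so a training loop would likely still need a .copy() (or np.array(...)) before handing them to torch.from_numpy without the warning noted above.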

Outcome: The incremental preprocessing pipeline is up and running; the code is pushed to GitLab, while the ~31 GB of data remains local.