7.7

  • Data-prep pipeline v0.3 delivered
    • Added preprocess_incremental.py
      • YAML-driven (paths, target_len, scaler, shard_size)
      • Welford online stats (mean / std / min / max)
      • Incremental mode – processes only new CSV files and updates stats
      • Writes per-sample .npy files (keeps raw folder hierarchy)
      • Optional shard output: shard_####.pt (disabled by default)
      • Progress bars via tqdm; generates summary.txt for later plotting
    • Successfully processed 1,951 CSV files (~31 GB) end-to-end
  • Bug fixes / tweaks
    • Explicit encoding="utf-8" when reading YAML → fixed UnicodeDecodeError
    • Cast shard_size to int (or write it as a bare, unquoted number in YAML) → fixed the TypeError raised by int // str
    • Added .copy() before torch.from_numpy to silence the "array is not writable" warning (optional; costs one extra copy per array)
  • Version control
    • Added .gitignore (ignores data/, *.npy, *.pt, etc.)
    • Used VS Code Source Control:
      • staged & committed changes
      • Sync Changes pushed to origin/main
    • No data files pushed; only code, scripts, configs
  • Next steps
    • (Optional) create plot_stats.py to visualize stats.npz
    • Use NpyLazyDataset for training; enable shard export later if needed
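
The Welford online statistics listed above can be sketched as follows. This is a minimal illustration, not the code in preprocess_incremental.py: the class name RunningStats, its fields, and the choice of population (rather than sample) standard deviation are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running count / mean / std / min / max via Welford's online algorithm.

    Hypothetical helper for illustration only.
    """
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, values):
        """Fold an iterable of numbers into the running statistics."""
        for x in values:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            # Second factor uses the already-updated mean (Welford's trick).
            self.m2 += delta * (x - self.mean)
            self.minimum = min(self.minimum, x)
            self.maximum = max(self.maximum, x)

    @property
    def std(self):
        # Population standard deviation (assumption; use count - 1 for sample std).
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


stats = RunningStats()
stats.update([1.0, 2.0, 3.0])  # values from a first file
stats.update([4.0, 5.0])       # values from a file that arrives later
```

Because only count, mean, m2, min, and max are carried between calls, the statistics can be saved after one run and extended when new files appear, without re-reading the earlier data.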

Outcome: A robust, incremental preprocessing pipeline is in place; the code is pushed to GitLab, while the ~31 GB of data remains local.
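
One way the incremental mode could decide which CSV files are new is to keep a manifest of files that have already been processed. The sketch below assumes a JSON manifest of paths relative to the raw folder; the function names find_new_csvs and mark_done and the manifest format are hypothetical and may differ from what preprocess_incremental.py actually does.

```python
import json
from pathlib import Path


def _load_manifest(manifest_path):
    """Return the set of relative paths recorded as processed (empty if none)."""
    manifest_path = Path(manifest_path)
    if not manifest_path.exists():
        return set()
    return set(json.loads(manifest_path.read_text(encoding="utf-8")))


def find_new_csvs(raw_dir, manifest_path):
    """List CSV files under raw_dir that the manifest does not mention yet."""
    raw_dir = Path(raw_dir)
    done = _load_manifest(manifest_path)
    return [
        p for p in sorted(raw_dir.rglob("*.csv"))
        if p.relative_to(raw_dir).as_posix() not in done
    ]


def mark_done(raw_dir, manifest_path, files):
    """Record the given files as processed so later runs skip them."""
    raw_dir = Path(raw_dir)
    done = _load_manifest(manifest_path)
    done.update(Path(f).relative_to(raw_dir).as_posix() for f in files)
    Path(manifest_path).write_text(
        json.dumps(sorted(done), indent=2), encoding="utf-8"
    )
```

A run would call find_new_csvs, process each returned file, and then call mark_done. Storing relative paths keeps the manifest valid if the raw folder is moved, and it matches the idea of preserving the raw folder hierarchy in the output.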