7.7
- Data-prep pipeline v0.3 delivered
  - Added `preprocess_incremental.py`
    - YAML-driven (paths, `target_len`, scaler, `shard_size`; example config below)
    - Welford online stats (mean / std / min / max; sketch below)
    - Incremental mode – processes only new CSV files and updates stats
    - Writes per-sample `.npy` files (keeps raw folder hierarchy)
    - Optional shard output: `shard_####.pt` (disabled by default)
    - Progress bars via `tqdm`; generates `summary.txt` for later plotting
  - Successfully processed 1,951 CSV files (~31 GB) end-to-end
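For reference, a minimal sketch of the driving config; only `paths`, `target_len`, `scaler`, and `shard_size` are confirmed keys, the sub-keys and values below are placeholders:

```yaml
# config.yaml – hypothetical layout; sub-keys and values are placeholders
paths:
  raw_dir: data/raw
  out_dir: data/processed
target_len: 4096
scaler: standard
shard_size: 512   # bare number so PyYAML parses it as int (see bug fixes)
```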
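And a bare-bones version of the Welford accumulator, as a sketch of the idea rather than the script's actual (presumably vectorized) implementation:

```python
import numpy as np


class RunningStats:
    """Welford-style online accumulator for mean / std / min / max."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0              # running sum of squared deviations from the mean
        self.min = float("inf")
        self.max = float("-inf")

    def update(self, x) -> None:
        """Fold one CSV's worth of values into the running stats."""
        values = np.asarray(x, dtype=np.float64).ravel()
        for value in values:
            self.n += 1
            delta = value - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (value - self.mean)
        self.min = min(self.min, float(values.min()))
        self.max = max(self.max, float(values.max()))

    @property
    def std(self) -> float:
        return float(np.sqrt(self.m2 / self.n)) if self.n else 0.0
```

The point of the Welford update is that one pass over the 1,951 CSVs is enough; the full 31 GB never has to sit in memory and no second pass is needed for the variance.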
- Bug fixes / tweaks (combined snippet below)
  - Explicit `encoding="utf-8"` when reading YAML → fixed `UnicodeDecodeError`
  - Cast `shard_size` to `int` (or set it as a bare number in YAML) → fixed `int // str` error
  - Added `.copy()` before `torch.from_numpy` to silence the “array not writable” warning (optional)
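The three tweaks condensed into one illustrative snippet (file names are placeholders):

```python
import numpy as np
import torch
import yaml

# 1) Explicit encoding avoids UnicodeDecodeError from the platform's default codec.
with open("config.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# 2) A quoted YAML value arrives as str; cast before using it in integer division.
shard_size = int(cfg["shard_size"])

# 3) torch.from_numpy warns on read-only arrays; copying first silences it.
arr = np.load("sample.npy", mmap_mode="r")
tensor = torch.from_numpy(arr.copy())
```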
- Version control
  - Added `.gitignore` (ignores `data/`, `*.npy`, `*.pt`, etc.; example below)
  - Used VS Code Source Control:
    - staged & committed changes
    - Sync Changes pushed to `origin/main`
  - No data files pushed; only code, scripts, configs
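The relevant `.gitignore` entries are roughly (the real file may list more patterns):

```
# keep the 31 GB of raw/processed data out of the repo
data/
*.npy
*.pt
```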
- Next steps
  - (Optional) create `plot_stats.py` to visualize `stats.npz` (sketch below)
  - Use `NpyLazyDataset` for training (sketch below); enable shard export later if needed
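A possible starting point for `plot_stats.py`; the array names inside `stats.npz` (`mean`, `std`) are assumptions and should be checked against `stats.files` first:

```python
"""plot_stats.py – rough sketch; key names inside stats.npz are assumptions."""
import matplotlib.pyplot as plt
import numpy as np

stats = np.load("stats.npz")
print("arrays in archive:", stats.files)   # inspect before assuming key names

# Assuming 1-D per-channel arrays named "mean" and "std" were saved:
fig, ax = plt.subplots()
ax.errorbar(range(len(stats["mean"])), stats["mean"], yerr=stats["std"], fmt="o")
ax.set_xlabel("channel")
ax.set_ylabel("value")
ax.set_title("Per-channel mean ± std")
fig.savefig("stats_overview.png", dpi=150)
```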
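`NpyLazyDataset` itself isn't shown in this log, so here is only a sketch of what such a lazy per-file dataset could look like (constructor argument and return format are assumptions):

```python
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import Dataset


class NpyLazyDataset(Dataset):
    """Loads one pre-processed .npy sample per __getitem__ instead of
    holding everything in RAM; mirrors the preserved folder hierarchy."""

    def __init__(self, root: str):
        self.files = sorted(Path(root).rglob("*.npy"))

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int) -> torch.Tensor:
        arr = np.load(self.files[idx])        # only this sample is read from disk
        return torch.from_numpy(arr)
```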
Outcome: Robust, incremental preprocessing pipeline is online; code is safely pushed to GitLab while 31 GB of data remain local.