Automatic Data Versioning
Automatically track every version of data used in ML. A cloud-scale, efficient, zero copy implementation.
ML data stored in cloud object stores is automatically versioned at the file system level.
All changes are recorded and every version of data is perpetually preserved.
A consistent view of data is presented to the ML program.
Changes made to the data after the program starts are not visible to the program, thus preventing irreproducible results.
This consistent view is preserved forever and a parameter is logged to the MLflow run.
In the future, Data Scientists can go back and view the exact data that was used in the ML run.
- These versions are read-only and cannot be modified
- In the future, if the Data Scientist wants to make small changes to the data and re-run the experiment, then they can copy the version that the original run refers to, make the changes to the copy and re-run the experiment.
Data Provenance
See the exact data used for training and inferencing.