Data Management for MLflow

Data provenance, data versioning, and time-based data selection.

Automatic Data Versioning

Automatically track every version of data used in ML, with a cloud-scale, efficient, zero-copy implementation.

  • ML data stored in cloud object stores is automatically versioned at the file system level.

    • All changes are recorded and every version of data is perpetually preserved.

  • A consistent view of data is presented to the ML program.

    • Changes made to the data after the program starts are not visible to the program, thus preventing irreproducible results.

  • This consistent view is preserved forever, and a parameter identifying it is logged to the MLflow run.

    • Data scientists can later go back and view the exact data that was used in the ML run.

      • These versions are read-only and cannot be modified.
      • To make small changes to the data and re-run the experiment, a data scientist can copy the version that the original run refers to, change the copy, and run the experiment against it.
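
The behavior described in the list above can be illustrated with a minimal in-memory sketch. It is not the product's implementation: `VersionedStore` and its methods are hypothetical names, and this toy copies a dictionary on every write rather than being zero-copy. It only shows the semantics: each change creates a new version, a pinned version never changes, versions are read-only, and edits are made on a copy.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Dict, List, Mapping


@dataclass
class VersionedStore:
    """Hypothetical in-memory stand-in for a versioned object store."""

    # Version 0 is the empty store; every write appends a new version.
    _versions: List[Mapping[str, bytes]] = field(
        default_factory=lambda: [MappingProxyType({})]
    )

    def write(self, path: str, data: bytes) -> int:
        """Record a change as a new version; earlier versions are untouched."""
        latest = dict(self._versions[-1])
        latest[path] = data
        self._versions.append(MappingProxyType(latest))
        return len(self._versions) - 1

    def latest_version(self) -> int:
        return len(self._versions) - 1

    def snapshot(self, version: int) -> Mapping[str, bytes]:
        """Return a read-only view of one version."""
        return self._versions[version]

    def copy_of(self, version: int) -> Dict[str, bytes]:
        """Return a mutable copy of a version for follow-up experiments."""
        return dict(self._versions[version])


store = VersionedStore()
store.write("train.csv", b"a,b\n1,2\n")

# Pin the version when the program starts. An identifier like this is the
# kind of value that could be logged to the run, for example with
# mlflow.log_param("data_version", pinned) (the key name is an assumption).
pinned = store.latest_version()
view = store.snapshot(pinned)

# A later change creates a new version and is not visible through the view.
store.write("train.csv", b"a,b\n9,9\n")
assert view["train.csv"] == b"a,b\n1,2\n"

# To experiment with modified data, edit a copy instead of the version.
working = store.copy_of(pinned)
working["train.csv"] += b"3,4\n"
```

Attempting `view["train.csv"] = ...` raises `TypeError`, which mirrors the read-only guarantee on preserved versions.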

MLflow Auth Based Data Access

Provide access to MLflow artifacts and input data, gated by MLflow authentication and authorization.

Data Provenance

See the exact data used for training and inference.

Data Versioning

Automatically track every version of data used in ML.

Data Selection

Fine-grained snapshot technology for selecting data as of a given point in time. Patent pending.
A cloud-scale, enriched access API.

Time and File Metadata

What new data was acquired?
What data was acquired during any past time interval?
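
Both questions reduce to filtering file metadata by acquisition time. The sketch below shows one way to express that, assuming each file carries a timestamp; `FileRecord`, `acquired_at`, and `acquired_between` are hypothetical names for illustration and are not the product's access API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class FileRecord:
    path: str
    acquired_at: datetime  # hypothetical metadata field


def acquired_between(
    records: Iterable[FileRecord], start: datetime, end: datetime
) -> List[FileRecord]:
    """Return records acquired in the half-open interval [start, end), oldest first."""
    selected = [r for r in records if start <= r.acquired_at < end]
    return sorted(selected, key=lambda r: r.acquired_at)


utc = timezone.utc
records = [
    FileRecord("images/a.png", datetime(2024, 1, 1, 9, 0, tzinfo=utc)),
    FileRecord("images/b.png", datetime(2024, 1, 2, 9, 0, tzinfo=utc)),
    FileRecord("images/c.png", datetime(2024, 1, 3, 9, 0, tzinfo=utc)),
]

# "What new data was acquired?": everything since the last checkpoint.
last_checkpoint = datetime(2024, 1, 2, 0, 0, tzinfo=utc)
now = datetime(2024, 1, 4, 0, 0, tzinfo=utc)
new_files = acquired_between(records, last_checkpoint, now)

# "What data was acquired during a past interval?": any earlier window.
past_files = acquired_between(
    records,
    datetime(2024, 1, 1, 0, 0, tzinfo=utc),
    datetime(2024, 1, 2, 0, 0, tzinfo=utc),
)
```

The half-open interval is a design choice in this sketch: consecutive windows can share a boundary without selecting the same file twice.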

Micro Batching

Each batch is fed to ICE, where it is used for inference or retraining.
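
As a rough illustration of micro batching, the sketch below groups timestamped files into consecutive fixed-width time windows and hands each batch to a consumer. The function name `micro_batches`, the five-minute window, and the `consume` callback (a stand-in for whatever receives the batch, such as ICE) are all assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def micro_batches(
    items: List[Tuple[datetime, str]], window: timedelta
) -> List[List[str]]:
    """Group (timestamp, path) pairs into consecutive fixed-width time windows."""
    if not items:
        return []
    ordered = sorted(items)
    origin = ordered[0][0]  # the first window starts at the earliest timestamp
    buckets: Dict[int, List[str]] = {}
    for ts, path in ordered:
        index = (ts - origin) // window  # integer window number
        buckets.setdefault(index, []).append(path)
    # Empty windows produce no batch.
    return [buckets[i] for i in sorted(buckets)]


t0 = datetime(2024, 1, 1, 12, 0)
arrivals = [
    (t0, "a.json"),
    (t0 + timedelta(minutes=2), "b.json"),
    (t0 + timedelta(minutes=7), "c.json"),
    (t0 + timedelta(minutes=12), "d.json"),
]
batches = micro_batches(arrivals, window=timedelta(minutes=5))

processed: List[List[str]] = []


def consume(batch: List[str]) -> None:
    """Hypothetical consumer; a real one would run inference or retraining."""
    processed.append(batch)


for batch in batches:
    consume(batch)
```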