Data Management for MLflow

Automatic Data Versioning, Access Control and Metadata-Based Selection

Automatic Data Versioning

Automatically track every version of data used in ML. A cloud-scale, efficient, zero-copy implementation.

  • ML data stored in cloud object stores is automatically versioned at the file system level.

    • All changes are recorded and every version of data is perpetually preserved.

  • A consistent view of data is presented to the ML program.

    • Changes made to the data after the program starts are not visible to the program, thus preventing irreproducible results.

  • This consistent view is preserved permanently, and a parameter identifying it is logged to the MLflow run.

    • In the future, Data Scientists can go back and view the exact data that was used in the ML run.

      • These versions are read-only and cannot be modified.
      • To make small changes to the data and re-run the experiment, the Data Scientist can copy the version that the original run refers to, modify the copy, and re-run the experiment against it.
  • Data Provenance

    • See the exact data used for training and inference.
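
The versioning behavior described above can be illustrated with a small in-memory model. This is a sketch only, not InfinStor's implementation: the `VersionedStore` class, its methods and the `data_snapshot_time` parameter name are hypothetical, integer timestamps stand in for real ones, and a plain dict stands in for the parameters logged to an MLflow run.

```python
from types import MappingProxyType


class VersionedStore:
    """Toy model of a versioned object store (hypothetical, for illustration)."""

    def __init__(self):
        # path -> list of (timestamp, content); versions are appended, never removed
        self._versions = {}

    def put(self, path, content, timestamp):
        """Record a new version of `path`; earlier versions are preserved."""
        self._versions.setdefault(path, []).append((timestamp, content))

    def snapshot(self, as_of):
        """Return a read-only view of the newest version of each path at `as_of`."""
        view = {}
        for path, versions in self._versions.items():
            visible = [(ts, c) for ts, c in versions if ts <= as_of]
            if visible:
                view[path] = max(visible, key=lambda v: v[0])[1]
        return MappingProxyType(view)


store = VersionedStore()
store.put("train.csv", "v1", timestamp=100)

# The program starts at t=150 and pins its view of the data.
run_params = {"data_snapshot_time": 150}  # stand-in for a parameter logged to the run
view = store.snapshot(run_params["data_snapshot_time"])

# A later change is recorded but is not visible through the pinned view.
store.put("train.csv", "v2", timestamp=200)

# To experiment with modified data, copy the snapshot and edit the copy.
working_copy = dict(view)
working_copy["train.csv"] = "v1-edited"
```

Because the snapshot is wrapped in `MappingProxyType`, an attempt to assign through `view` raises `TypeError`, which mirrors the read-only guarantee; the editable `working_copy` mirrors the copy-then-modify workflow.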

Access Control

Provide Access to MLflow Artifacts and Input Data gated by MLflow Authentication and Authorization

No Need to Manage S3 Access Keys

With open source MLflow, users need S3 Access Keys to access MLflow artifacts, which creates an administrative nightmare. InfinStor solves that problem elegantly: once users sign in to the InfinStor MLflow platform, they automatically have the appropriate access (read or write) to the experiment's artifact repository.
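
The idea of deriving artifact access from platform sign-in and per-experiment permissions, rather than from distributed storage keys, might be modeled roughly as follows. Everything here is a hypothetical illustration: the `Access` enum, the `EXPERIMENT_PERMISSIONS` table and the helper functions are assumptions made for the sketch and are not part of MLflow's or InfinStor's API.

```python
from enum import Enum


class Access(Enum):
    NONE = "none"
    READ = "read"
    WRITE = "write"


# Hypothetical per-experiment permissions, keyed by (user, experiment id).
EXPERIMENT_PERMISSIONS = {
    ("alice", "exp-1"): Access.WRITE,
    ("bob", "exp-1"): Access.READ,
}


def artifact_access(user, experiment_id, signed_in):
    """Return the artifact-repository access a user gets for an experiment."""
    if not signed_in:
        return Access.NONE
    return EXPERIMENT_PERMISSIONS.get((user, experiment_id), Access.NONE)


def can_write_artifact(user, experiment_id, signed_in=True):
    """True only when the signed-in user holds write access."""
    return artifact_access(user, experiment_id, signed_in) is Access.WRITE


def can_read_artifact(user, experiment_id, signed_in=True):
    """True for read or write access; write implies read in this sketch."""
    granted = artifact_access(user, experiment_id, signed_in)
    return granted in (Access.READ, Access.WRITE)
```

In this model no user ever holds a storage credential; the only inputs are the user's identity, the sign-in state and the experiment.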

Data Selection

Fine-grained snapshot technology for a given point in time. Patent pending.
A cloud-scale, enriched access API.

Time and File Metadata

What new data was acquired?
What data was acquired in any given time interval in the past?
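
Both questions reduce to filtering file metadata by acquisition time. The sketch below shows that filtering over an in-memory list; the `FileRecord` type and the two helper functions are hypothetical names chosen for illustration and do not represent the InfinStor access API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileRecord:
    path: str
    acquired_at: datetime


def acquired_between(records, start, end):
    """Files acquired in the half-open interval [start, end)."""
    return [r for r in records if start <= r.acquired_at < end]


def acquired_since(records, last_seen):
    """Files acquired strictly after `last_seen` (the 'new data' question)."""
    return [r for r in records if r.acquired_at > last_seen]


records = [
    FileRecord("a.jpg", datetime(2021, 1, 1, 9, 0)),
    FileRecord("b.jpg", datetime(2021, 1, 1, 12, 0)),
    FileRecord("c.jpg", datetime(2021, 1, 2, 8, 0)),
]

# Everything acquired on 1 January 2021.
jan_1 = acquired_between(records, datetime(2021, 1, 1), datetime(2021, 1, 2))

# Everything acquired after the last file already processed.
new_files = acquired_since(records, datetime(2021, 1, 1, 12, 0))
```

A half-open interval is used so that adjacent intervals never select the same file twice.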

Micro Batching

Each batch is fed to ICE, where it is used for inference or retraining.
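
The page does not say how batches are formed, so the following sketch assumes simple fixed-size batches over a list of file paths. The `micro_batches` and `process` functions are hypothetical; `process` is only a placeholder for handing a batch to the inference or retraining step.

```python
from itertools import islice


def micro_batches(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process(batch):
    """Placeholder for the downstream step; here it just reports the batch size."""
    return len(batch)


paths = [f"file-{i}.csv" for i in range(5)]
sizes = [process(batch) for batch in micro_batches(paths, batch_size=2)]
```

Because `micro_batches` is a generator, batches are produced one at a time, so the full list of files never needs to be split up front.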