Quick Start DVC (III): Data and Model Versioning

This is the 8th day of my participation in the First Challenge 2022

Data and model versioning is the base layer of DVC for managing large files, datasets, and machine learning models. You keep your regular Git workflow but do not store large files in the Git repository. Large data files are stored separately, which allows efficient sharing. Imagine how cool it would be if Git handled arbitrarily large files and directories with the same performance as small code files. For example, you could run `git clone` and see the data files and machine learning models in your workspace, or use `git checkout` to switch to a different version of a 100 GB file in less than a second.

The foundation of DVC consists of commands that are run in conjunction with Git to track large files, directories, or ML model files. In short, DVC is “Git for managing data.”

I created and initialized a Git project in my previous article, Quick Start DVC (II): Installation and Initialization. Now we use `dvc add` to track files or directories.

Download files (`dvc get`)

Before tracking the file, we first download the data file using `dvc get`.

# Go to the project directory you initialized earlier
$ cd example-get-started

$ dvc get https://github.com/iterative/dataset-registry \
          get-started/data.xml -o data/data.xml


Track files (`dvc add`)

Next, track the file.

$ dvc add data/data.xml

With the command above, DVC stores information about the added file (or directory) in a special .dvc file named data/data.xml.dvc (a small text file in a human-readable format). This metadata file is a placeholder for the original data file and can be versioned with Git as easily as source code. Meanwhile, the original data file is listed in .gitignore so that Git does not track it.

$ cat data/.gitignore

/data.xml

Add metadata files to Git for tracking.

$ git add data/data.xml.dvc data/.gitignore
$ git commit -m "Add raw data"

`dvc add` moves the data to the project’s cache and links it back to the workspace.

$ tree .dvc/cache
.dvc/cache
└── a3
    └── 04afb96060aad90176268345e10355

The hash value of the data.xml file we just added (a304afb…) determines the cache path above.

If you check data/data.xml.dvc, you’ll find the same hash there:

$ cat data/data.xml.dvc  

outs:
  - md5: a304afb96060aad90176268345e10355
    path: data.xml    
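
To make the link between the hash and the cache path concrete, here is a minimal Python sketch. It is not DVC's own implementation. It assumes the cache layout shown in the `tree` output above (the first two hex characters become a directory, the rest become the file name) and a plain MD5 over the raw file bytes. `file_md5` and `cache_path_for` are hypothetical helper names, and other DVC versions may lay out the cache differently.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(md5_hex: str, cache_dir: str = ".dvc/cache") -> Path:
    """Hypothetical helper: map a hash to <cache_dir>/<first 2 chars>/<rest>."""
    return Path(cache_dir) / md5_hex[:2] / md5_hex[2:]


# The hash recorded in data/data.xml.dvc above
print(cache_path_for("a304afb96060aad90176268345e10355").as_posix())
```

For the hash from this article, the sketch prints `.dvc/cache/a3/04afb96060aad90176268345e10355`, which matches the cache listing.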

Store data files to a remote repository (`dvc push`)

You can use `dvc push` to upload data or model files tracked by DVC and store them safely in remote storage. This also means that you can later restore them in other environments using `dvc pull`.

First, we need to set up the address of a remote repository. DVC supports many remote storage types, including Amazon S3, SSH, Google Drive, Azure Blob Storage, and HDFS. As you can see from the image above, the code is stored separately from the model and data files.

# Configure the remote Git repository
$ git remote add origin https://gitee.com/xxxx/dvc-samples.git

# Configure the remote data repository (note: for simplicity, I am using other local folders as remote repositories, not recommended)
$ mkdir -p /home/lgd/dvc/local_remote_data_register
$ dvc remote add -d local_remote /home/lgd/dvc/local_remote_data_register

# check configuration
$ cat .dvc/config

[core]
    remote = local_remote
['remote "local_remote"']
    url = /home/lgd/dvc/local_remote_data_register


# Add the DVC configuration file to your local Git repository
$ git add .dvc/config
$ git commit -m "Configure local remote storage"

# Or use the following command instead of the two commands above
# git commit .dvc/config -m "Configure local remote"

A DVC remote lets you store a copy of the data tracked by DVC outside the local cache (typically in a cloud storage service).
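
The `.dvc/config` file shown above is a small INI-style text file, so it can be inspected from a script. The following sketch uses Python's standard `configparser` to look up the URL of the default remote. This is an assumption made for illustration: DVC has its own config loader, and `default_remote_url` is a hypothetical helper name. The sketch only relies on the sample contents printed by `cat .dvc/config`.

```python
import configparser

# Contents copied from the `cat .dvc/config` output above
CONFIG_TEXT = """\
[core]
    remote = local_remote
['remote "local_remote"']
    url = /home/lgd/dvc/local_remote_data_register
"""


def default_remote_url(config_text: str) -> str:
    """Return the URL of the remote named by core.remote."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # The section header keeps its quotes: 'remote "<name>"'
    section = f"'remote \"{name}\"'"
    return parser[section]["url"]


print(default_remote_url(CONFIG_TEXT))
```

With the sample config, this prints `/home/lgd/dvc/local_remote_data_register`, the folder created with `mkdir -p` earlier.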

Note: I use a “local remote” in this demo. While the term may seem like an oxymoron, it isn’t one. “Local” means that the storage location is another folder in the local file system. “Remote” is what DVC calls the storage for a project’s data. Here it is essentially a local backup of the data.

We then use `dvc push` to copy the locally cached data to the remote storage we set up earlier.

$ dvc push

You can check if the data is stored in the DVC remote repository using the following command:

$ ls -R /home/lgd/dvc/local_remote_data_register

/home/lgd/dvc/local_remote_data_register/:
a3

/home/lgd/dvc/local_remote_data_register/a3:
04afb96060aad90176268345e10355
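
The listing shows that the remote folder mirrors the layout of the local cache. As a rough mental model, a push to a local remote can be sketched as "copy every cached object that the remote does not have yet". The Python below is only such a sketch under that assumption, not DVC's code; `push_objects` is a hypothetical name, and the demo runs in a temporary directory with made-up file contents.

```python
import shutil
import tempfile
from pathlib import Path


def push_objects(cache_dir: Path, remote_dir: Path) -> list[str]:
    """Copy every cached object the remote lacks; return the copied hashes."""
    copied = []
    for obj in sorted(cache_dir.glob("??/*")):
        target = remote_dir / obj.parent.name / obj.name
        if target.exists():
            continue  # already stored remotely, nothing to upload
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(obj, target)
        copied.append(obj.parent.name + obj.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    remote = Path(tmp) / "remote"
    (cache / "a3").mkdir(parents=True)
    # Placeholder bytes; the file name reuses the hash from this article
    (cache / "a3" / "04afb96060aad90176268345e10355").write_bytes(b"<data/>")
    remote.mkdir()

    first = push_objects(cache, remote)
    second = push_objects(cache, remote)  # nothing left to copy
    print(first, second)
```

The second call copies nothing because the object is already present, which is why repeated pushes of unchanged data are cheap.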

The .dvc metadata file is committed to the local Git repository and pushed to the remote Git repository, as shown in the following example:

$ git push origin main

Download and Recover data (`dvc pull`)

Once the DVC-tracked data and model files are stored in a remote repository, we can use `dvc pull` to download them into other copies of the project as needed. Normally, we run it after `git clone` and `git pull`.

Download project data in another environment as shown in the following example:

$ git clone https://gitee.com/xxxx/dvc-samples.git

$ cd dvc-samples

$ git pull origin main

$ dvc pull

The following example restores data in this project:

$ cd example-get-started

# Suppose we delete locally cached data
$ rm -rf .dvc/cache
$ rm -f data/data.xml

# Restore data
$ dvc pull
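
Conceptually, the restore works because the .dvc file still records the hash of the data even after the cache and the workspace copy are deleted. The sketch below illustrates that idea in Python under simplifying assumptions: a single-output .dvc file in the format shown earlier, a local-folder remote, and plain copying. The real command also repopulates the cache. `read_pointer` and `restore` are hypothetical names, and the demo data is made up.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def read_pointer(dvc_file: Path) -> dict[str, str]:
    """Read the md5 and path fields from a single-output .dvc file."""
    fields = {}
    for line in dvc_file.read_text().splitlines():
        key, sep, value = line.strip().lstrip("- ").partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def restore(dvc_file: Path, remote_dir: Path) -> Path:
    """Copy the object named in the .dvc file from the remote to the workspace."""
    pointer = read_pointer(dvc_file)
    md5 = pointer["md5"]
    source = remote_dir / md5[:2] / md5[2:]
    target = dvc_file.parent / pointer["path"]
    shutil.copyfile(source, target)
    # Verify that the restored bytes match the recorded hash
    if hashlib.md5(target.read_bytes()).hexdigest() != md5:
        raise ValueError(f"hash mismatch for {target}")
    return target


with tempfile.TemporaryDirectory() as tmp:
    content = b"<posts></posts>\n"  # made-up stand-in for data.xml
    md5 = hashlib.md5(content).hexdigest()
    remote = Path(tmp) / "remote"
    (remote / md5[:2]).mkdir(parents=True)
    (remote / md5[:2] / md5[2:]).write_bytes(content)

    data_dir = Path(tmp) / "project" / "data"
    data_dir.mkdir(parents=True)
    dvc_file = data_dir / "data.xml.dvc"
    dvc_file.write_text(f"outs:\n  - md5: {md5}\n    path: data.xml\n")

    restored = restore(dvc_file, remote)
    restored_ok = restored.read_bytes() == content
    print(restored.name, restored_ok)
```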

Modifying data files

When you make changes to a file or folder, run `dvc add` again to track the latest version.

# Simulate modifying data by doubling the data set
$ cp data/data.xml /tmp/data.xml
$ cat /tmp/data.xml >> data/data.xml

# Keep track of the latest version
$ dvc add data/data.xml

Typically, you also need to run `git commit` and `dvc push` to save the changes.

$ git commit data/data.xml.dvc -m "Dataset updates"
$ dvc push

Switching between versions (`dvc checkout`)

The usual workflow is to run `git checkout` first (to switch branches or check out a different version of a .dvc file) and then run `dvc checkout` to synchronize the data.

# First, get the previous version of the dataset by restoring the previous version of its .dvc file
$ git checkout HEAD~1 data/data.xml.dvc

# Synchronize data
$ dvc checkout
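
The reason switching versions is cheap can be sketched with a toy content-addressed cache: every version that was ever added stays in the cache under its own hash, so "checking out" only means making the workspace file match the hash currently recorded in the .dvc file. The Python below is an illustration under those assumptions, not DVC's implementation. `add` and `checkout` are hypothetical stand-ins for the commands of the same name, and the sketch copies files where DVC may link them.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add(workspace_file: Path, cache_dir: Path) -> str:
    """Store a copy of the file in the cache under its MD5; return the hash."""
    md5 = hashlib.md5(workspace_file.read_bytes()).hexdigest()
    target = cache_dir / md5[:2] / md5[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(workspace_file, target)
    return md5


def checkout(md5: str, workspace_file: Path, cache_dir: Path) -> None:
    """Make the workspace file match the cached object with the given hash."""
    shutil.copyfile(cache_dir / md5[:2] / md5[2:], workspace_file)


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    data = Path(tmp) / "data.xml"

    data.write_bytes(b"<row/>\n")
    v1 = add(data, cache)  # original version

    data.write_bytes(b"<row/>\n<row/>\n")
    v2 = add(data, cache)  # doubled dataset, stored under a new hash

    checkout(v1, data, cache)  # go back to the original version
    after_checkout = data.read_bytes()
    print(v1 != v2, after_checkout == b"<row/>\n")
```

Both versions remain in the cache, so moving between them needs no network access.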

After that, let’s commit the .dvc file to the local Git repository (no `dvc push` is required this time because the original version of the dataset is already stored in the local cache and in the remote repository).

$ git commit data/data.xml.dvc -m "Revert dataset updates"

$ git push origin main

Conclusion

In fact, DVC is technically not even a version control system! The contents of the .dvc metadata files define the versions of the data files, while Git provides the actual version control. DVC creates these .dvc files, updates them, and efficiently synchronizes the DVC-tracked data in the workspace to match them.
