Why Use Version Control?


On a large team that collaborates on many models, it is hard to keep track of which dataset was used with which model. Moreover, several datasets may be tried on the same model in search of the best performance, and each dataset can have new or improved data appended daily.

It goes without saying that developers need a reliable way to keep track of such data.

What Does Version Control Offer?

So, what are the benefits of using version control? When a dataset is accessed by multiple developers, a clear way of categorizing and ordering its versions keeps everyone working from the same reference point and prevents confusion.

Additionally, version control tools let you create your own isolated data branch: each developer can test and experiment freely on their own branch without changing the data other users rely on. This matters because experiments may change the schema of the data or even remove parts of it.

Specialized open-source tools have been created specifically for data version control. They can help you maintain and control all the different datasets in your project; one such tool is LakeFS.

Using an Open-Source Tool like LakeFS


LakeFS is an open-source software tool that provides Git-like capabilities for your ML data sets, allowing you to test, experiment, and even collaborate with multiple developers on your ML, AI, and data science projects.

LakeFS offers Git-like benefits but is designed for far larger data volumes. Each contributor can create their own dataset branch to work on freely, then commit their changes and merge them into the main branch when needed.
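As a rough sketch of that workflow using the lakectl CLI (installed later in this tutorial), with a hypothetical example-repo repository and my-experiment branch:

# Create an isolated branch from main to experiment on
lakectl branch create lakefs://example-repo/my-experiment --source lakefs://example-repo/main

# Commit the changes made on the experiment branch
lakectl commit lakefs://example-repo/my-experiment -m "Clean and re-partition the training data"

# Merge the experiment branch back into main once the data looks good
lakectl merge lakefs://example-repo/my-experiment lakefs://example-repo/main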

Developers can also revert previous commits until they are back at the most recent error-free version of the dataset.
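As a minimal sketch, again assuming the hypothetical example-repo repository and my-experiment branch, stepping back from a bad change could look like this with lakectl:

# Show the commit history of the branch to find the offending commit
lakectl log lakefs://example-repo/my-experiment

# Undo the changes introduced by that commit (the commit ID here is illustrative)
lakectl branch revert lakefs://example-repo/my-experiment <commit-id>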

Starting with LakeFS

To try LakeFS, go to their website and click on the “try without installing” button. Use the given access and secret keys to sign in to your playground.

To get started with LakeFS, we will cover several ways of importing data into and exporting data out of your repository. Each method has its own merits and suits different situations.

To work with your LakeFS data, you can use tools such as lakectl, the AWS CLI, or Spark. In this tutorial, we will use lakectl and Spark.

Creating a Repo with LakeFS

A starter repo is created for you when you log in to your account. You can create more repos with the “Create Repository” button at the top right of the page.
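If you prefer the command line, a repository can also be created with lakectl (installed later in this tutorial). A minimal sketch, assuming a hypothetical repository name and storage namespace:

# Create a repository backed by an object-store path (both names are illustrative)
lakectl repo create lakefs://example-repo s3://example-bucket/example-repo-storage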

Importing Data Using the S3 CLI and Apache DistCp

If you need a simple way to import data, you can use the AWS S3 CLI or Apache DistCp. Both are easy to use and work well for copying data into LakeFS.

S3 CLI

aws --profile lakefs \
  --endpoint-url https://lakefs.example.com \
  s3 cp /path/to/local/file s3://example-repo/main/example-file-1

Apache DistCp

hadoop distcp \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.bucket.example-repo.access.key="AKIAIOSFODNN7EXAMPLE" \
  -Dfs.s3a.bucket.example-repo.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -Dfs.s3a.bucket.example-repo.endpoint="https://lakefs.example.com" \
  -Dfs.s3a.bucket.example-bucket.access.key="AKIAIOSFODNN3EXAMPLE" \
  -Dfs.s3a.bucket.example-bucket.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
  "s3a://example-bucket/example-file.parquet" \
  "s3a://example-repo/main/example-file.parquet"

lakectl

1. Installing lakectl

Install the lakectl version that matches your operating system. I am currently using a Mac (Intel).

For Mac (Intel)
sudo wget http://treeverse-clients-us-east.s3-website-us-east-1.amazonaws.com/lakectl/0.60.0/darwin_amd64/lakectl -O /usr/local/bin/lakectl
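The binary downloaded with wget is typically not executable, so you will most likely also need to mark it as executable:

sudo chmod +x /usr/local/bin/lakectl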

2. Configure lakectl

Here, you configure lakectl by setting your access key and secret key. Note that each user receives a unique access key and secret key when creating their account.

cat <<EOF > ~/.lakectl.yaml
credentials:
  access_key_id: <Access Key>
  secret_access_key: <Secret Key>
server:
  endpoint_url: https://glad-boa.lakefs-demo.io/api/v1
EOF

3. List your Repositories

To check that everything is working, list the available repos:

lakectl repo list

4. Importing Data Using the lakectl ingest Command

After installing lakectl and signing in to your account, choose the source bucket you want to import data from and the repository and branch you want to import it into.

lakectl ingest \
  --from s3://bucket/optional/prefix/ \
  --to lakefs://my-repo/ingest-branch/optional/path/

Note that ingest is used when copying the data itself is not practical: the objects are added to the repository's metadata without copying the underlying data. Because lakectl ingest lists every source object from the client side, it can still take a long time on very large buckets, so it is best suited to smaller datasets. For more details, see the LakeFS zero-copy import documentation.

5. Importing Larger Datasets Using the lakefs import Command

For bigger datasets, you can use the lakefs import command, which works from an S3 Inventory manifest instead of listing objects one by one.

lakefs import lakefs://example-repo \
  -m s3://example-bucket/path/to/inventory/YYYY-MM-DDT00-00Z/manifest.json \
  --config config.yaml

Spark

You can also read data from and write data to repositories using the Spark engine.

1. Configuring Spark

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<Access Key>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<Secret Key>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "<End Point>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")

Note that each user will have a different access and secret key.

2. Reading using Spark

val df = spark.read.parquet(s"s3a://${repo}/${branch}/example-path/example-file.parquet")

Note that repo and branch are string values holding the repository and branch you want to read from.

3. Writing using Spark

df.write.partitionBy("example-column").parquet(s"s3a://${repo}/${branch}/output-path/")

Here, repo and branch hold the repository and branch you want to write to. You can also partition the output by a given column, as shown with partitionBy.
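After writing from Spark, the new objects exist only as uncommitted changes on the branch. A minimal sketch, assuming the repository is called example-repo and the branch is main, of reviewing and committing them with lakectl:

# Show the uncommitted changes written by Spark
lakectl diff lakefs://example-repo/main

# Commit them so they become part of the branch history
lakectl commit lakefs://example-repo/main -m "Add partitioned output written from Spark"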

Conclusion

As ML and data analysis projects grow bigger and more complex, a dedicated way of tracking and organizing data, such as LakeFS, makes developers' lives considerably easier.

In this tutorial, we learned why version control matters for the large datasets typically used in ML models, and how to get started with open-source tools such as LakeFS.
