
DagsHub → Github for Data Science

Last Updated on September 9, 2021 by Editorial Team

Author(s): Shubham Saboo

Data Science

Data Scientists deserve to browse, preview, share, fork, and merge data & models alongside code.

DAGsHub → A collaborative Data Science Platform!

What is DAGsHub?

DagsHub is a data science and machine learning collaboration platform, built on open-source tools, that lets you quickly build, scale, and deploy machine learning projects by leveraging Git (source code versioning) and DVC (Data Version Control).

DAGsHub combines data, models, and code → All in one place!

Since the inception of the field, handling code and data together has been a key pain point for data professionals. Unlike conventional software engineering projects, where you only have to track code, ML projects require you to track data and models along with the code, which is a complex task in itself.

If you have ever worked on an enterprise-grade ML project, you will recognize the many components that come into play, such as code, data, and monitoring. Putting all those pieces together and making them work in tandem is a dreadful task, mainly because standard code versioning platforms like GitHub, Bitbucket, or GitLab are not designed for pushing and pulling vast amounts of data.

Conventional Solution

The conventional solution for managing data and code is to push the code to a standard code versioning platform like GitHub and push the data and models to either on-premises storage or cloud storage such as AWS or Google Cloud.

Storing your data, code, and models in different places brings a lot of problems, first and foremost the connection, or bridge, between them. You need to tie all of them together efficiently so that they work in tandem for your ML project to function properly. Another problem you can face is latency between the separate stores, which can slow down your application at runtime.

DAGsHub Storage — The Way Forward

The invention of DVC (Data Version Control), which manages data much as Git manages code, ushered in an era of efficient platforms and approaches for managing end-to-end ML projects!
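The idea of versioning data "like Git does with code" can be sketched in a few lines. The following is a simplified illustration, not DVC's actual implementation: the large file is stored in a cache under its content hash, and only a tiny pointer record (the part that would be committed to Git) refers to it. The names `snapshot` and `restore` and the cache layout are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def snapshot(data_file: Path, cache_dir: Path) -> dict:
    """Copy data_file into a content-addressed cache and return a small pointer."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copyfile(data_file, cached)
    # Only this small record would be tracked by Git, not the data itself.
    return {"path": data_file.name, "hash": digest, "size": data_file.stat().st_size}


def restore(pointer: dict, cache_dir: Path, target_dir: Path) -> Path:
    """Recreate the data file that a pointer (i.e. a given version) refers to."""
    target = target_dir / pointer["path"]
    shutil.copyfile(cache_dir / pointer["hash"], target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "data.csv"
        data.write_text("a,b\n1,2\n")
        v1 = snapshot(data, root / "cache")  # pointer for version 1
        data.write_text("a,b\n1,3\n")
        v2 = snapshot(data, root / "cache")  # pointer for version 2
        print(v1["hash"] != v2["hash"])  # the two versions get different pointers
```

Because each pointer is just a few bytes of text, it can be committed, diffed, and checked out alongside the code that used that version of the data.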

DAGsHub Storage is built on top of DVC to bring together the essence of data science, i.e., data and code. It works as a DVC remote that requires zero configuration and works out of the box. It makes sharing data and models as easy as sharing a link, which allows easy collaboration and a free flow of ideas within data teams.

DAGsHub Storage uses DVC to version data and models the way Git versions code, so they can easily be tracked and compared across versions. Within the repository interface, DAGsHub automatically generates a pipeline view that visualizes the components of the project and how they link together, allowing everyone on the team to understand the workflow of the project irrespective of their technical background. It also lets you compare data side by side and supports multiple modalities such as text, images, audio, and tables.
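The remote needs no setup on the storage side, but a local DVC client still has to be pointed at it. The helper below is a rough sketch that only builds the relevant DVC commands as strings so they can be reviewed before running. The URL pattern `https://dagshub.com/<user>/<repo>.dvc`, the remote name `origin`, and the function name `dagshub_dvc_commands` are assumptions for illustration; the repository page should be treated as the source of truth for the exact values.

```python
import shlex


def dagshub_dvc_commands(user: str, repo: str, remote: str = "origin") -> list[str]:
    """Build (but do not run) the DVC commands for an assumed DAGsHub remote."""
    # Assumed URL pattern; verify it against the repository page.
    url = f"https://dagshub.com/{user}/{repo}.dvc"
    steps = [
        ["dvc", "remote", "add", remote, url],
        # --local keeps credentials out of the config file committed to Git.
        ["dvc", "remote", "modify", remote, "--local", "auth", "basic"],
        ["dvc", "remote", "modify", remote, "--local", "user", user],
        # A password or token would be set the same way; it is omitted here
        # so that no secret ends up in a printed command.
        ["dvc", "push", "-r", remote],
    ]
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    # "alice" and "demo" are placeholder names.
    for command in dagshub_dvc_commands("alice", "demo"):
        print(command)
```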

Automated Pipeline generated by DAGsHub | Side-by-Side Image data comparison
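A pipeline view like this is essentially a directed acyclic graph (DAG) of stages connected by the files they consume and produce. Here is a minimal sketch of that idea, using hypothetical stage and file names rather than anything DAGsHub-specific:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages: each lists the files it depends on and the files it outputs.
STAGES = {
    "prepare": {"deps": ["raw.csv"], "outs": ["clean.csv"]},
    "featurize": {"deps": ["clean.csv"], "outs": ["features.csv"]},
    "train": {"deps": ["features.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "features.csv"], "outs": ["metrics.json"]},
}


def build_graph(stages: dict) -> dict:
    """Map each stage to the set of stages whose outputs it consumes."""
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    return {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }


def execution_order(stages: dict) -> list:
    """Return the stages in an order that respects their dependencies."""
    return list(TopologicalSorter(build_graph(stages)).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order(STAGES)))
```

The edges come from matching one stage's outputs to another stage's dependencies, which is why such a graph can be derived from the project files rather than drawn by hand.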

Where Does DAGsHub Shine?

DAGsHub allows you to quickly build, share, and reuse machine learning and data science projects, sparing data teams the toil of starting from scratch every time. The following features make DAGsHub stand out from traditional platforms:

Built-in remotes for tools like Git (for source code tracking), DVC (for data version tracking), and MLflow (for experiment tracking) that allow you to connect everything in one place with zero configuration.

DAGsHub allows you to track and monitor the ML experiments performed by different individuals with the comfort of a beautiful user interface. Every experiment within an ML project can be tracked and linked to its specific version of data, code, and models!

Experiment Tracking Dashboard

In addition to tracking your experiments, you can also compare different experiments side-by-side and understand the difference in performance metrics and hyperparameters through the recorded values for each experiment and intuitive visualizations provided out-of-the-box by DAGsHub.

Experiment Comparison
Interactive visualizations to compare experiments
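Conceptually, a side-by-side comparison boils down to diffing the recorded hyperparameters and metrics of two runs. The sketch below illustrates that idea only; the record layout, the function name `compare_experiments`, and the values are made up for the example and do not reflect how DAGsHub stores experiments.

```python
def compare_experiments(a: dict, b: dict) -> dict:
    """Return {section: {key: (value_in_a, value_in_b)}} for values that differ."""
    diff = {}
    for section in ("params", "metrics"):
        left, right = a.get(section, {}), b.get(section, {})
        changed = {
            key: (left.get(key), right.get(key))
            for key in sorted(set(left) | set(right))
            if left.get(key) != right.get(key)
        }
        if changed:
            diff[section] = changed
    return diff


# Made-up runs purely for illustration; these are not real results.
baseline = {"params": {"lr": 0.01, "epochs": 10}, "metrics": {"accuracy": 0.90}}
candidate = {"params": {"lr": 0.001, "epochs": 10}, "metrics": {"accuracy": 0.92}}

for section, changes in compare_experiments(baseline, candidate).items():
    for key, (old, new) in changes.items():
        print(f"{section}.{key}: {old} -> {new}")
```

Unchanged values such as `epochs` are left out of the result, which keeps the comparison focused on what actually differs between the two runs.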

Conclusion

As data science projects grow and involve multiple stakeholders, it becomes really difficult for traditional source code management platforms to manage code and artifacts together in a way that is efficient, collaborative, and shareable → DAGsHub to the rescue!

Going forward, platforms like DAGsHub are likely to become mainstream and play a major role in collaborative data projects across sectors and organizations, enabling data teams to quickly build, collaborate on, and share their data science and machine learning projects.

If you would like to learn more or want me to write more on this subject, feel free to reach out.



DagsHub → Github for Data Science was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
