A Data Scientist Is More Than Just a Data Scientist

Last Updated on July 26, 2023 by Editorial Team

Author(s): Shanmukh Dara

Originally published on Towards AI.

My thoughts on the best way to enter and advance in the field of data science…

Photo by UX Indonesia on Unsplash

Hello there, may I ask you a question? What are the top skills a data scientist needs to be successful? I can guarantee that the answer differs from person to person and from firm to firm, and I must admit that there is no single objective answer. But, as a data scientist, I’ve always wondered why. If we can build driverless cars and forecast the future, why can’t we answer this question objectively? So let me explain why it is so difficult to answer, and share my thoughts on what skills a data scientist should have and the best way to develop them.

This blog is not intended to provide technical resources. Rather, my focus will be on changing your perspective and steering your journey to enter and grow in the data science domain.

So, to return to my original point, why is it so difficult to determine the abilities necessary for a data scientist? This, in my opinion, is attributable to three major factors:

  1. The dilution of the phrase “data scientist” in recent years
  2. The company’s culture
  3. The division of labor

Data scientist is without a doubt the sexiest job in the twenty-first century (Image Source)

Anyone who has been following this domain for a few years would undoubtedly agree with me on the first two points. Indeed, the term “data scientist” has become diluted in recent years; now, a data scientist can play any role, ranging from business problem formulation to model deployment and monitoring. Second, the role of a data scientist can be influenced by the company’s culture and data maturity. After working with a number of established firms and a few startups, I’ve discovered that working as a data scientist at a startup in its early stages necessitates more business acumen than technical skills.

Finally, and most importantly, there is the “Division of Labor.” Adam Smith uses the vivid example of a pin factory assembly line in The Wealth of Nations to explain how the division of labor is the primary source of productivity gains. Data analysis tasks, like pin-making, necessitate numerous processes, which is why organizations typically hire specialists such as data engineers, experimentation scientists, machine learning experts, and so on. A product manager oversees the work and handles hand-offs between functions.

Because of this division of labor, many data scientists wind up doing a lot of data modeling, which fosters the impression that data scientists only need data-related skills and nothing else. Let me tell you that this is entirely incorrect, which is why I titled the blog “A data scientist is more than just a data scientist.”

This is why, rather than being a specialist in only one domain, a data scientist’s knowledge should take the form of a π shape, with good horizontal knowledge across the entire end-to-end process and in-depth knowledge in one or two particular domains. A data scientist must be more of a generalist than a specialist.

When you ask a couple of data scientists for advice on how to become an expert in this domain, the most common response is “Kaggle competitions.” However, the harsh reality is that Kaggle competitions will not prepare you for the real world. Without a doubt, Kaggle is a good place for newcomers and those looking to build a profile. However, after the first learning phase, Kaggle fails to provide a sense of real-world challenges.

Is it hard to believe? I learned this the hard way when I began my practicum project at Kiva Organization as part of my MSBA coursework. Here are a few reasons why experiential learning is superior to Kaggle competitions.

1. Problem-solving thought process

Real-world projects are not like Kaggle competitions. Kaggle gives you a clear view of the problem, the data at hand, the solution required, and sometimes even what needs to be done, so you’re left with almost nothing to brainstorm or think about.

However, in the real world, your problem statements are most often not precisely defined or are open-ended business problems. In most cases, the analysis begins with transforming the business problem into an analytics challenge and then experimenting with various analytics methodologies to address it.

As part of my practicum project at Kiva, we were given a very vague business problem; it took us approximately 3–4 weeks to fully understand Kiva’s business, then the business problem, and ultimately scope the business problem. Then, to ensure that we were on the right track, we explained our understanding and suggested numerous approaches to the problem. That’s when we came to an agreement on a problem and the approach. This provided us with a safe environment in which to brainstorm, generate innovative and creative ideas, do a fast sanity check, and, if necessary, kill the ideas.

2. Data collection and cleaning

Datasets are already available in Kaggle competitions, and they are frequently clean and well-structured. This limits your thinking; you grasp the problem and try different methods to discover which one works the best. In the real world, however, it is our responsibility as data scientists to comprehend the problem and find the key list of data attributes that would be useful from the massive amounts of data that exist in data warehouses. In some cases, the data is not readily available and must be gathered from a variety of sources using web scraping.

Before deciding on a set of features to use at Kiva, we had to understand all of the accessible data, its quality, and quantity. This required a great deal of trial and error. Furthermore, most of the data is not clean in general, so it is our responsibility to clean and fix the data before analyzing it.
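As a rough illustration of that first look at data quality and quantity, here is a minimal sketch in plain Python. The records and the `profile_columns` helper are made up for this example (they are not Kiva's data or tooling); the idea is simply to see, per attribute, how much is missing and how many distinct values there are before deciding which features are worth keeping.

```python
def profile_columns(rows):
    """Return, per column, the share of missing values and the distinct count."""
    # Collect every column name that appears in at least one record.
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        # A key that is absent from a record counts as missing (None).
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None and v != ""]
        missing = len(values) - len(present)
        report[col] = {
            "missing_share": missing / len(rows) if rows else 0.0,
            "distinct": len(set(present)),
        }
    return report


# Hypothetical records, invented purely for illustration.
records = [
    {"record_id": 1, "country": "Kenya", "amount": 500},
    {"record_id": 2, "country": "", "amount": 250},
    {"record_id": 3, "country": "Peru", "amount": None},
    {"record_id": 4, "country": "Kenya"},
]

report = profile_columns(records)
for col, stats in report.items():
    print(col, stats)
```

A summary like this will not clean the data for you, but it can make the trial and error faster by showing which attributes are too sparse to be useful.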

3. Performance vs Business impact

In the classroom and at other competitions, our success metric is the model’s performance, or how well it predicts unseen data. However, in most real-world scenarios, we are more concerned with the business impact than with performance. The business impact can range from increased bottom-line profit to increased sales to decreased expenditure.

Focusing on business impact means making sure that you thoroughly grasp the client’s problem (the client is often the marketing or product team), that the client understands exactly what problem you are trying to solve, and that they are eager to use the resulting predictions.
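To make the gap between model performance and business impact concrete, here is a small sketch with invented numbers. It assumes a hypothetical marketing campaign in which everyone the model flags is contacted at a fixed cost and only true positives convert; the two confusion matrices, `value_per_conversion`, and `cost_per_contact` are all assumptions chosen for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def campaign_profit(tp, fp, value_per_conversion, cost_per_contact):
    """Profit when every predicted positive is contacted and only true positives convert."""
    return tp * value_per_conversion - (tp + fp) * cost_per_contact


# Hypothetical confusion matrices for two models scored on the same 1,000 customers.
model_a = {"tp": 40, "fp": 10, "fn": 60, "tn": 890}   # cautious: flags few customers
model_b = {"tp": 80, "fp": 120, "fn": 20, "tn": 780}  # aggressive: flags many customers

# Assumed economics of the campaign (illustrative values only).
value_per_conversion = 50
cost_per_contact = 5

accuracy_a = accuracy(**model_a)
accuracy_b = accuracy(**model_b)
profit_a = campaign_profit(model_a["tp"], model_a["fp"], value_per_conversion, cost_per_contact)
profit_b = campaign_profit(model_b["tp"], model_b["fp"], value_per_conversion, cost_per_contact)

print(f"Model A: accuracy={accuracy_a:.2f}, profit={profit_a}")
print(f"Model B: accuracy={accuracy_b:.2f}, profit={profit_b}")
```

Under these assumptions the model with the lower accuracy earns the higher profit, because missing a customer who would have converted costs far more than contacting one who will not. With different costs the ranking could flip, which is exactly why the success criteria need to be agreed with the client.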

4. Communicating with a non-technical audience

As data scientists, we are required to collaborate closely with marketing, product, sales, and engineering teams. Most of these individuals are non-technical, which means you cannot communicate with them in the same manner that you would with a fellow data scientist. You must clearly understand what each team is interested in and communicate only the necessary information, with minimal jargon.

5. Miscellaneous

Aside from everything mentioned above, here are a few other skills that the practicum project helped me improve.

  • Storytelling — Another key talent that every data scientist requires is storytelling. Since the majority of the people we interact with on a daily basis are non-technical, this skill will come in handy when communicating your impact to others.
  • Prioritization — On a daily basis, we are inundated with ad hoc analysis requests, and there may be a large number of projects in the pipeline. It is our responsibility to evaluate the impact and prioritize the projects.

So what now?

A good data scientist is much more than a data scientist: a detective who can spot problems, a very good storyteller, a magician who can solve problems using machine learning approaches, a good team member who can collaborate, and a good project manager who can prioritize tasks.

And, in order to become a good data scientist, you should focus on holistic growth rather than on data tools alone. The best way to do that is to work on end-to-end projects similar to my practicum project at Kiva.

So what next?

  • Try working on end-to-end problems
  • Start with an open-ended business problem, brainstorm, and scope it
  • Scrape/Collect data if possible
  • Think of success criteria

Thanks for Reading!

Do you agree/disagree with me? Let me know in the comments below.

Further reading

  • Why Data Science Teams Need Generalists, Not Specialists (hbr.org)
  • Machine Learning is Kaggle Competitions (machinelearningmastery.com)

Published via Towards AI