Author(s): Kelvin Lu
Originally published on Towards AI.
The Data Scientist Show, by Daliana Liu, is one of my favorite YouTube channels. Unlike many other data science programs that are very technical and require concentration to follow, Daliana’s talk show strikes a delicate balance between professionalism and relaxation. I used to listen to the shows while walking, on the train, or washing dishes. I am interested in how other data scientists and engineers solve real-world problems and how they feel about their work. They share their successes and also some awkward moments. The latter kind of insight is widely neglected, and I find it priceless.
In one of her latest shows, “The future of Data Science Teams,” the speakers talked about their experience as research data scientists who are not very handy with engineering work. They had to pair with an engineer when they joined a new team, and they were advised that:
“You need to be a full-stack data scientist.”
Hmm, this was the first time I had heard the term full-stack data scientist. I have to admit that it is an interesting idea. I did some research on it, and I hope my insights can inspire more data science practitioners.
Definition of a full-stack data scientist
The sibling relationship between data science and software development has led to the borrowing of many concepts from the software development domain into data science practice. For example, DevOps became MLOps, and CI/CD became the 5Cs: continuous testing, continuous integration, continuous deployment, continuous evaluation, and continuous training. The concept of a full-stack data scientist looks like the latest transplant, this time of the title of full-stack software engineer.
Modern software architecture convention holds that a software solution stack comprises multiple tiers, with the database at the bottom. On top of the database, there is a backend system and a frontend system. Prior to the emergence of full-stack developers, backend developers and frontend developers were two separate positions. Occasionally, even database development was a dedicated role. The concept of a full-stack developer means that the same person is in charge of developing everything, from database design to backend functions to frontend websites and services. The benefits are obvious: lower staffing costs, better productivity, less communication overhead, better quality, and faster roll-out.
The full-stack data scientist is about the same concept: encouraging data scientists to expand their skills to cover more areas. However, this concept was forged only very recently. It doesn’t even have a commonly agreed-upon definition yet. The broadest, though ambiguous, definition might be “engage in all stages of the data science lifecycle,” while the narrowest definition would be “must be able to develop models, test and validate them, deploy them to production, refine the model, and test again," which looks no different from the traditional machine learning engineer role. The most commonly cited skills of a full-stack data scientist are data science and data engineering. No one expects a data scientist to develop back-end and front-end applications.
Based on these varying role descriptions, we can develop a picture of the full-stack data scientist.
- Engineering-oriented, which is the opposite of pure research-oriented data scientist roles.
- Has certain data engineering skills. Capable of handling big data and traditional relational data sources independently.
- Knows MLOps practices. Can confidently perform model training and deployment in production.
A failed example
Quite a few years ago, I had the opportunity to work with a very smart PhD data scientist. He was really, really smart, to a jaw-dropping degree. He was not only doing his machine learning analysis but also building a big-data processing engine in a niche language, Haskell. That was when Apache Spark had just become a popular name, and his system could perform as efficiently as Spark.
If we match his skills against the description of a full-stack data scientist, we find that he ticked all the boxes: he was a data scientist with very strong engineering skills. He knew big data processing in every detail. He managed the model production environment effortlessly.
Unfortunately, his smarts and his outstanding achievement in building the Spark-like system didn’t result in a tangible business outcome. In fact, the startup found that it could not afford to keep supporting the development of such a complex system without immediate financial benefit. And the versatility of that data scientist didn’t accelerate the company’s project delivery.
It was an emotional moment when he left the company. A palpable cloud of sadness hung over the office as he said his goodbyes, and tears were shed by both his colleagues and himself. After he left, the company still had trouble finding its direction and struggled to stay financially afloat for several years. It looks like the full-stack data scientist was not an accelerator for the company, but neither was he the sole reason for the unsuccessful business. In this case, why do we ask for a full-stack data scientist?
Read between the lines
Let’s have a look at how a software development team is structured. A full software development team has multiple different roles:
- product owner
- project manager
- business analyst
After more than 70 years of software development practice, the gap between different roles is not very significant. So it is very common for a single person to wear multiple hats.
By contrast, the structure of a machine learning team is more complex, even though the team itself is usually more compact than a software development team. The emerging requirement for full-stack data scientists reflects the fact that a machine learning team can’t be staffed with all the skills it needs. That’s why it requires its data scientists to be full-stack.
The point is that a successful machine-learning team requires quite a few different skills, and all of them require a decent level of machine-learning knowledge. It is not just about data science and data engineering. Have you ever seen a job description for a business analyst with machine-learning experience, a project manager with machine-learning experience, an architect with machine-learning experience, or a machine-learning QA?
These different roles are also very important in machine-learning teams. However, most machine-learning teams haven’t evolved to such a high level of engineering yet. In fact, a lack of engineering capability characterizes most machine learning teams, as far as I know. Most machine learning teams still behave more like research organizations than industrial organizations. The key questions that separate the two are:
- Do they have a clear sense of budget, quality, and schedule?
- Do they have the power to drive requirement definitions based on business values?
- Do they believe it’s important to minimise issues caused by human mistakes?
- Do they automate repetitive routines?
- Do they place their priorities on model performance or system stability?
So far, many machine learning teams are more focused on model building. They haven’t yet developed the abilities needed to survive in the market. Yes, surviving in the market is a capability that needs to be developed. Industrialization is essential for companies to survive. Why did that aforementioned full-stack data scientist fail to benefit his company, and why did the company still struggle after he left? The reason was that the whole team wasn’t ready to run at full speed. It was not an individual's problem.
The old wisdom
In 1986, Frederick P. Brooks published his historic paper “No Silver Bullet: Essence and Accident in Software Engineering,” based on his experience managing the development of the IBM System/360 family of mainframes and its software. This paper is an important milestone in modern software engineering. It lays out Brooks’ framework for understanding complexity, which draws inspiration from Aristotle. He identifies two key types:
1. Accidental Complexity: This is self-inflicted complexity, introduced by our design choices and engineering limitations. Imagine writing code directly in assembly! Thankfully, modern languages abstract away many such hurdles, like optimizing every instruction or waiting for batch processing. While progress has been made, other forms of accidental complexity remain, waiting to be streamlined.
2. Essential Complexity: This is inherent to the problem itself. It arises from the core functionality users demand. Think of a program that has to perform 30 different tasks. No matter how cleverly designed, it can’t escape the inherent complexity of handling all 30 functions effectively.
Brooks believed that accidental complexity had already been greatly reduced. This means programmers today spend less time wrestling with clunky systems and more time tackling the inherent challenges of the problem itself, also known as “essential complexity.”
However, Brooks cautions that completely eliminating accidental complexity won’t be a game-changer. The real breakthroughs will come from addressing essential complexity, which arises from the very nature of the task at hand. He compares it to building a complex machine: no matter how well-designed, it still needs all its intricate parts to function.
While there’s no magic solution (“silver bullet”), Brooks believes a series of targeted innovations can make a big difference. Take high-level programming languages like Ada as an example. They’ve simplified complex tasks, freeing programmers to focus on the core challenges.
Brooks advocates “growing” software organically through incremental development. He suggests devising and implementing the main and subprograms right at the beginning, filling in the working sub-sections later. He believes that programming this way excites the engineers and provides a working system at every stage of development.
Brooks was right! The evolution of software engineering proved that there is no single magic way to dramatically enhance our productivity. We have to go through a series of minor improvements. When we talk about full-stack software development, does it surprise you that people once developed the backend and frontend separately? Full-stack developers emerged because modern technology greatly reduced the complexity of developing both frontend and backend programs, giving people enough bandwidth to cover more fields. So, if an experienced full-stack developer traveled 20 years back in time, they would probably still have to choose between being a frontend developer or a backend developer, because none of the tools they rely on would have been created yet.
No silver bullet
Software engineering didn’t improve at the snap of a finger. It took decades of thought and practice to reach its current level, and it still has vast room for improvement.
So far, we have revisited software engineering history. Then how about data science engineering?
Unfortunately, there isn’t a lot of discussion about data science engineering yet. MLOps falls in that area, but I’m expecting a broader discussion than that. And even the concept of MLOps isn’t widely adopted at the moment. There are quite a few ML platforms promoting one-stop solutions, but only time will tell whether they deliver.
In a different interview, a former co-founder of an AutoML solution provider admitted that they had committed a huge team (100+ people) to fighting COVID as a use case for their product. Eventually, the team found they had to handcraft a new solution without using their own system. He said, “AutoML is more suitable for the happy path.”
So, you know what? We can’t expect our lives to be much happier by adopting a single concept or buying into a new system. We can only enhance machine learning productivity gradually.
Brooks is right: there's no silver bullet!
There are several areas where data science engineering can be improved:
- Education: future data scientists need to be equipped with an engineering mindset rather than only academic data science knowledge.
- Research: new theories need to be developed to guide practitioners in dealing with the uncertainties of machine learning applications. For instance, using ARIMA or Prophet in time series analysis is very common, right? But what if you have to forecast more than 50,000 time series efficiently? And can you discover causal relationships among that many time series?
- Tooling: When asked when to stop tuning a deep learning model, the tutor said: “Stop when you run out of money or get exhausted.” We may take it as a necessary inconvenience, but who knows? People also once thought that making weapons from obsidian was the only option. So, never say never. This is an example of how primitive current machine learning practices are. We need a lot of new tools to improve our productivity.
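The “stop when you run out of money” advice in the tooling point can at least be written down as an explicit rule. The following is only a minimal sketch, not a recommended tuning method: it stops when a trial budget is spent or when the validation score has not improved for a given number of trials. The function name `tune_with_stopping_rule`, the `run_trial` callback, and the parameters `max_trials`, `patience`, and `min_delta` are all hypothetical names introduced for illustration.

```python
from typing import Callable, Tuple


def tune_with_stopping_rule(
    run_trial: Callable[[int], float],
    max_trials: int,
    patience: int,
    min_delta: float = 0.0,
) -> Tuple[int, float, int]:
    """Run tuning trials until the budget is spent or progress stalls.

    run_trial(i) is a hypothetical callback that trains and evaluates
    trial i and returns a validation score (higher is better).
    Returns (best_trial_index, best_score, trials_run).
    """
    if max_trials < 1 or patience < 1:
        raise ValueError("max_trials and patience must be at least 1")

    best_index = -1
    best_score = float("-inf")
    stale = 0        # consecutive trials without a meaningful improvement
    trials_run = 0

    for i in range(max_trials):          # the "money" budget
        score = run_trial(i)
        trials_run += 1
        if score > best_score + min_delta:
            best_index, best_score, stale = i, score, 0
        else:
            stale += 1
            if stale >= patience:        # the "exhaustion" rule
                break

    return best_index, best_score, trials_run


# Usage with made-up scores standing in for real training runs.
fake_scores = [0.70, 0.74, 0.75, 0.75, 0.74, 0.75, 0.80]
result = tune_with_stopping_rule(
    lambda i: fake_scores[i], max_trials=7, patience=3
)
print(result)  # (2, 0.75, 6)
```

Note that the rule stops after the sixth trial and never sees the 0.80 score at the end of the made-up list, which is exactly the kind of blind spot that better tooling would need to address.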
The idea of a full-stack data scientist knocks on the door of data science engineering. Once you walk in, you will find plenty of options. You don’t have to be a machine learning specialist or data engineer; you can be an architect with machine learning know-how; you can be a product owner who knows how machine learning could help in real businesses; you can be a business analyst who bridges the knowledge gap between your data scientists and the non-technical users; and you can be a machine learning QA who specializes in red-teaming machine learning models. You can do anything to help your team and also make your work exciting. You don’t have to be a full-stack data scientist if you don’t want to.
In his famous paper, Brooks also raises a crucial point:
Not all designers are created equal. He emphasizes that programming’s inherent creativity fosters varying design capabilities. He goes as far as suggesting a tenfold difference in potential exists between the “ordinary” and “great.” Brooks proposes elevating “star designers” to match the treatment traditionally reserved for “star managers.” This suggests not just equal pay but also perks associated with higher status, such as spacious offices, support staff, and travel allowances.
I wish you all to be tenfold data scientists, with spacious offices, support staff, and travel allowances. 😀