5 Pitfalls of the Modern Data Stack For Startups
Last Updated on July 25, 2023 by Editorial Team
Author(s): Luhui Hu
Originally published on Towards AI.
Practical lessons from and for startups using the modern data stack
The modern data stack is a collection of cloud-native, open data platforms and services. This is its era: it is widely backed by venture capital and evolving rapidly.
Data has become more critical than ever, and more and more startups are adopting the modern data stack to accelerate their business. It is pressing to understand what it is and how to use it. Here I share five pitfalls drawn from, and meant for, startups using the modern data stack.
Pitfall 1: "Modern" in the modern data stack means "advanced"
The term "modern" in the modern data stack refers to using recent technology. It doesn't mean it's advanced in data management and analysis.
With this in mind, there are three areas to examine closely to avoid falling into this first trap.
- Backward compatible: A startup might not carry much legacy burden, but many have been running for years. Data platforms have evolved quickly over that time, for example, computing from Hadoop MapReduce to Spark and storage from HDFS to object stores (like S3). Backward-compatible adoption can minimize system and business disruption.
- Easy to upgrade or migrate: Business growth may be the priority for a startup. When it is time to embrace the modern data stack, there are often several options. It is vital to choose one that is easy to upgrade or migrate so that the business is not held back. Later, when the startup has time to upgrade, it can reassess the latest modern data stack, which will likely have evolved at the same rate as the business.
- Long-term development: A startup should focus on business growth and customer satisfaction. But every company is becoming a data-driven company, and data platforms will play an increasingly critical role, so it is necessary to plan for long-term data development. Above all, we should consider the surrounding ecosystem and community, given the nature of data platforms. That is, we should start from a leading cloud provider (such as AWS, Azure, GCP, or Aliyun) or an emerging data platform company (such as Snowflake or Databricks) and consider a multi-cloud strategy for the future. Choosing an ecosystem with an active community lets us take part in it and get help with challenging problems quickly.
To analyze the above areas, we need to understand the recent technology used in the modern data stack. This includes, but is not limited to, cloud computing, distributed systems, and containerization, along with practices such as data governance (quality, security, and compliance), automation (low code/no code), and machine learning.
Cloud-native design and operational excellence should be the foundation of the modern data stack. Together they are re-architecting data platforms and improving their performance and scalability.
The modern data stack offers many benefits compared to traditional data platforms, including scalability, flexibility, and ease of maintenance. These make it well-suited for startups looking to take advantage of the cloud and build data-driven applications.
Pitfall 2: It's enough to be in the cloud or to use a cloud-based service
The cloud is arguably the first catalyst for the evolution of the modern data stack. But simply using a cloud-based platform, such as Amazon EMR or Azure Databricks, is not enough. These are cloud-hosted solutions.
A data stack needs three cloud practices to earn the label "modern".
- Cloud-native: Recently, the cloud-native practice has reimagined the cloud space. I won't define cloud-native here because many data platforms, such as Amazon Redshift and Aurora, are cloud-native today.
- Cloud security: Security is increasingly crucial when adopting a cloud data platform. It may be one of the necessary criteria for evaluating a data stack. That is why we should choose a leading cloud provider or platform company as the framework. Within that framework, we can add open-source tools for more features.
- Multi-cloud support: Multi-cloud support is a growing trend for both users and providers. Leading cloud providers are giant silos, but they have also begun to support multi-cloud setups. For instance, Microsoft offers multi-cloud protection across the industry's top three cloud platforms.
Pitfall 3: The modern data stack is more affordable
We were impressed by the cloud's pay-per-use model and the prospect of cheap cloud resources. But cheap is not what we get with the modern cloud and data stack, because we're constantly pushing for higher performance and the latest technology.
The cost covers not just cloud resources but also the added value of the modern data stack. For example, Snowflake can be significantly cheaper than many traditional data warehouses in terms of overall price-performance. But with data volume and complexity growing, the data cloud has still emerged as one of the highest-grossing platforms, and that revenue comes from its customers' bills.
So we have to optimize cost as much as possible. We can use data partitioning and retention policies to balance cost and speed. For instance, we can use Redshift as a data warehouse, keeping active data on SSD and less active data in S3.
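The partition-and-retention idea can be sketched as a small planning function. This is a minimal illustration, not a Redshift feature: the `Partition` type, the 30-day hot window, the 365-day retention limit, and both prices are hypothetical values chosen only to make the trade-off visible.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical prices in USD per GB-month; real pricing varies by provider and region.
HOT_PRICE_PER_GB = 0.10   # assumed SSD-backed warehouse storage
COLD_PRICE_PER_GB = 0.02  # assumed object storage


@dataclass
class Partition:
    day: date       # the date this partition covers
    size_gb: float  # stored size in GB


def plan_tiers(partitions, today, hot_days=30, retention_days=365):
    """Split partitions into hot, cold, and expired by age in days."""
    plan = {"hot": [], "cold": [], "expired": []}
    for p in partitions:
        age = (today - p.day).days
        if age > retention_days:
            plan["expired"].append(p)   # past retention: candidate for deletion
        elif age > hot_days:
            plan["cold"].append(p)      # rarely queried: cheaper storage
        else:
            plan["hot"].append(p)       # actively queried: fast storage
    return plan


def monthly_cost(plan):
    """Estimate the monthly storage bill for the hot and cold tiers."""
    hot_gb = sum(p.size_gb for p in plan["hot"])
    cold_gb = sum(p.size_gb for p in plan["cold"])
    return hot_gb * HOT_PRICE_PER_GB + cold_gb * COLD_PRICE_PER_GB


today = date(2023, 7, 25)
partitions = [Partition(today - timedelta(days=d), 10.0) for d in (1, 20, 90, 400)]
plan = plan_tiers(partitions, today)
print({tier: len(items) for tier, items in plan.items()})
print(round(monthly_cost(plan), 2))
```

Shortening the hot window lowers the bill but sends more queries to the slower tier, so the right thresholds depend on how often older data is actually read.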
Pitfall 4: Move all data to the data cloud for in-place processing and transforming
Unlimited data lakes and cloud data warehouses are attractive. While moving data to the cloud for in-place processing and transformation can offer benefits such as easier access and greater scalability, it also has some potential drawbacks and trade-offs.
Firstly, moving large amounts of data to the cloud takes time and effort. Depending on the data volume and the connection speed, the transfer can take long enough to slow down the overall processing and transformation. Also, many cloud providers charge for data transfer, so moving large amounts of data can quickly become costly. In this case, near-data computing, which processes data close to where it is produced before everything is brought together, is one solution. This can apply to edge-computing and distributed web3 startups.
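A back-of-the-envelope calculation makes the time and cost concrete. The sketch below assumes a flat per-GB transfer price, an 80% link efficiency, and a 95% volume reduction from processing near the data; all three numbers are illustrative assumptions, not quotes from any provider.

```python
def transfer_hours(size_gb, bandwidth_mbps, efficiency=0.8):
    """Hours to move size_gb over a link of bandwidth_mbps (megabits per second).

    efficiency is an assumed fraction of the nominal bandwidth actually achieved.
    """
    size_megabits = size_gb * 1000 * 8  # decimal units: 1 GB = 8000 megabits
    seconds = size_megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


def transfer_cost(size_gb, price_per_gb):
    """Transfer charge at a flat per-GB price (hypothetical; real tiers vary)."""
    return size_gb * price_per_gb


raw_gb = 10_000             # 10 TB of raw data (assumed)
reduced_gb = raw_gb * 0.05  # assumed: near-data processing keeps 5% of the volume

for label, gb in (("raw", raw_gb), ("pre-processed", reduced_gb)):
    hours = transfer_hours(gb, bandwidth_mbps=1000)
    cost = transfer_cost(gb, price_per_gb=0.05)
    print(f"{label}: {hours:.1f} h, ${cost:.2f}")
```

Under these assumptions, both the transfer time and the transfer bill shrink in proportion to the data volume, which is the case for reducing data before it leaves its source.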
Secondly, it may increase the complexity of data security and privacy. If sensitive or confidential data is transferred to the cloud, it is vital to ensure that it is protected and that only authorized users can access it. This may require additional security measures, such as encryption and authentication, which add complexity and overhead to the overall process.
So moving data to the cloud for in-place processing and transformation should be considered carefully, weighing the potential benefits against the drawbacks. It may not always be the best option. In some cases, it may be more appropriate to pre-process and transform data locally or to use a hybrid approach that combines cloud and on-premises infrastructure.
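As one illustration of local pre-processing, raw events can be collapsed into daily summaries before anything is uploaded. The event fields (`ts`, `user`, `amount`) and the per-day, per-user grain are hypothetical; the point is only that the uploaded rows are fewer than the raw events.

```python
from collections import defaultdict


def aggregate_daily(events):
    """Collapse raw events into one row per (day, user) with a count and a total."""
    summary = defaultdict(lambda: {"events": 0, "amount": 0.0})
    for e in events:
        key = (e["ts"][:10], e["user"])  # ISO timestamps: first 10 chars are the date
        summary[key]["events"] += 1
        summary[key]["amount"] += e["amount"]
    return [
        {"day": day, "user": user, **vals}
        for (day, user), vals in sorted(summary.items())
    ]


raw_events = [
    {"ts": "2023-07-24T09:00:00", "user": "a", "amount": 5.0},
    {"ts": "2023-07-24T17:30:00", "user": "a", "amount": 7.5},
    {"ts": "2023-07-24T11:15:00", "user": "b", "amount": 2.0},
    {"ts": "2023-07-25T08:45:00", "user": "a", "amount": 1.0},
]

rows = aggregate_daily(raw_events)
print(len(raw_events), "raw events ->", len(rows), "rows to upload")
```

The trade-off is that detail dropped locally cannot be recovered in the cloud, so a hybrid setup might keep the raw events on-premises and send only the summaries.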
Pitfall 5: The data lakehouse will be all you need down the road
The data lakehouse is nascent but already demonstrates the combined benefits of the data lake and the data warehouse. It may become a unified solution for all data (both structured and unstructured) and all OLAP use cases (including BI and AI).
It can also eliminate the Lambda architecture and messaging queues and so simplify data platforms. But at least three key areas complement the data lakehouse, and together they complete the picture of the data landscape.
- AI engineering: AI engineering coordinates the lifecycle of data and AI. It brings engineering discipline to data quality, model optimization, user effectiveness, and the governance and operation of data and models. Through this end-to-end engineering principle, we can maximize the value of the growing unified data platform.
- Data fabric and data mesh: Data fabric and data mesh are data architectures, whereas the data lakehouse is a data platform. They are designed to centralize or decentralize data management and analytics using different mechanisms. This can help startups keep existing systems and processes while staying flexible and scalable.
- Purpose-built data platforms: The data lakehouse is an innovative unified platform, but it is still an OLAP platform. We still need other data platforms to complete the modern data stack, such as graph stores and engines, search stores, and HTAP stores.
TL;DR
The modern data stack is an emerging technology, and it's imperative for startups to adopt it. But it is not a cure-all, nor is it as simple to pick up as a handy kitchen tool. If you don't pay attention, the ride will be bumpy.
It is essential to understand its fundamentals in order to meet data-driven business goals. Everything else may be secondary. The five areas above cover the modern data stack, from concepts to key concerns and trends.
Published via Towards AI