
Understanding Distributed Computing

Last Updated on April 7, 2024 by Editorial Team

Author(s): Renu Gehring

Originally published on Towards AI.


I am having coffee and a slice of poundcake one afternoon with my imaginary friend, Mr. Pound, in his charming and entirely fictitious bakery, “Pound Cakes and More”. Business has been great, my friend tells me, and he is thinking of expanding.

A note about my friend, Mr. Pound. In addition to being an astute businessman and an excellent confectioner, he is really into data science, data, and technology. In his spare time, he analyzes data that he collects meticulously. I really like talking to him because hey, who can turn down an offer of free pound cake with a side of cool techie talk?

Back to expanding “Pound Cakes and More”. Mr. Pound has one small oven and a mixer, and both are optimized to make four cakes at the same time. He believes that he can expand in two ways. Option (1), Go Big, is to purchase an XXL oven and mixer that will make 12 cakes concurrently. Option (2), Distributed Baking, is to purchase 3 additional pairs of small ovens and mixers. With both options, he will be able to make 12 additional cakes at the same time, but each option has its advantages and disadvantages.

With Go Big, Mr. Pound is worried about cake quality since he is not sure about the evenness of the temperature in the XXL oven. “Maybe the cakes in the corners will burn and the ones in the center will be under-done”, he frets. With Distributed Baking, he will need to carefully set baking times with four different ovens. He might, he muses, be able to hire his nephew to help.

Suddenly, Mr. Pound’s face lights up. “I think I have figured out what distributed computing is,” he declares.


Mr. Pound continues excitedly. “It is like me trying to clean the two seating areas in my bakery. Instead of just me, I would hire two people and each one would clean one room. Then I would supervise their results. The supervision would be extra effort, but the cleaning would get done in about half the time”.

As I help myself to another melt-in-your-mouth slice of poundcake, I interject, “Mr. Pound, have you considered the task of sorting numbers? How might you sort three random numbers? And then one hundred numbers? And finally, a thousand numbers?”

Well-used to my leading questions, Mr. Pound responds affably. “With three numbers, it is easy. I would simply sort them in my head. All in-memory processing, no distribution required,” he says with a smile.

“With one hundred numbers, I would grab a pencil and a piece of paper. I would peruse my numbers to find the lowest, write it down, go through the remaining 99 numbers to find the next-smallest number, and repeat this process until I had a new, perfectly ordered list of one hundred numbers. Isn’t this similar to building a bigger computer and utilizing in-memory processing as well as temporary writing to disk?” Mr. Pound says with a chuckle.
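Mr. Pound’s pencil-and-paper method is, in effect, selection sort: repeatedly scan the remaining numbers for the smallest one and move it to the ordered list. A minimal sketch in Python (the function name is mine, not from the conversation):

```python
def selection_sort(numbers):
    """Sort by repeatedly pulling out the smallest remaining number,
    just as Mr. Pound does with pencil and paper."""
    remaining = list(numbers)  # work on a copy, like a scratch sheet
    ordered = []
    while remaining:
        smallest = min(remaining)   # scan the whole sheet for the lowest value
        remaining.remove(smallest)  # cross it off the scratch sheet
        ordered.append(smallest)    # write it on the ordered list
    return ordered

print(selection_sort([42, 7, 19, 3, 88]))  # [3, 7, 19, 42, 88]
```

Each pass rescans everything that remains, which is exactly why the approach becomes tedious as the list grows.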

“With a thousand numbers…I don’t think I want to do this by myself,” Mr. Pound says. He pauses, thinks for a bit, and resumes, “I would hire ten helpers and give them one hundred numbers each. They sort their own hundred numbers and hand in their work. So now I have ten sheets of one hundred sorted numbers each.” Mr. Pound pauses again and looks thoughtful. He continues, “I look at the first row of the ten lists and write down the minimum number. I can then look at the same row and some additional rows to find the next-smallest number. Tedious, but doable,” he declares.
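Merging the helpers’ ten sorted sheets is a classic k-way merge, which Python’s standard library provides as `heapq.merge`. A sketch of the thousand-number scheme (the helper count, chunk size, and random numbers are just illustrative):

```python
import heapq
import random

random.seed(0)
numbers = [random.randint(0, 9999) for _ in range(1000)]

# Hand each of ten helpers one hundred numbers...
chunks = [numbers[i:i + 100] for i in range(0, 1000, 100)]

# ...each helper sorts their own sheet independently...
sorted_sheets = [sorted(chunk) for chunk in chunks]

# ...and Mr. Pound merges the ten sorted sheets into one ordered list,
# always taking the smallest number still at the front of any sheet.
merged = list(heapq.merge(*sorted_sheets))

assert merged == sorted(numbers)
```

The helpers’ sorting happens in parallel in principle; only the final merge is Mr. Pound’s own work, which is the essence of the divide-and-combine idea he is describing.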

Having finished my second scrumptious slice, I chime in, “Mr. Pound, this is why distributed computing began. As data grew, so did computers. But data grew faster. So, Doug Cutting and his brilliant friends created a system that connected computers to each other and divided work among them. Work was distributed across worker computers (executors), but there was also a “master” computer that collected intermediate results, did additional work, and then presented the completed task to the end user. Doug Cutting christened this new system Hadoop after his young son’s toy elephant,” I say, smiling.
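The division of labor described here, executors doing local work while a master combines their intermediate results, is the map-reduce pattern at the heart of Hadoop. A toy, single-machine sketch (the word-count task and the sample documents are my example, not from the article):

```python
from collections import Counter
from functools import reduce

documents = [
    "pound cake and more",
    "more cake please",
    "cake cake cake",
]

# "Map": each executor counts words in its own document, independently.
partial_counts = [Counter(doc.split()) for doc in documents]

# "Reduce": the master merges the intermediate counts into one result.
total = reduce(lambda a, b: a + b, partial_counts, Counter())

print(total["cake"])  # 5
```

In a real cluster the map step runs on many machines at once; the structure of the computation, local work followed by a combining step, is the same.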

“This system proved itself quickly. To sort one TB of data, Hadoop took three and a half minutes whereas the closest non-distributed system took three and a half days!”, I say excitedly.

Because Mr. Pound looks a bit confused, I add, “One TB of data is 1,000,000,000,000 bytes: a one followed by twelve zeroes.”

“But the problem was that this early version of Hadoop was difficult to use for those not trained in computer science. Many new tools and languages popped up, most named after animals. There was Hive, Pig, Impala, even Zookeeper,” I say, laughing.

“So did the old code work on the new distributed system?” Mr. Pound asks.

“No”, I reply. “Code had to be adapted. Many programming paradigms went out the window.”

Mr. Pound jumps in, “So if I wrote a loop, that would not distribute well because it works on one row at a time, right? And table indexing would not translate well because it assumes that the entire table is in one place.”
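Mr. Pound’s intuition can be made concrete: a running total depends on the previous row, so it must be computed sequentially, while a plain sum is associative, so each worker can sum its own chunk and the partial sums can be combined afterward. A small Python sketch (the chunk size is arbitrary):

```python
import itertools

numbers = list(range(1, 101))

# Inherently sequential: each running total depends on the previous one,
# so this computation cannot be split across workers.
running_totals = list(itertools.accumulate(numbers))

# Associative: each "worker" sums its own chunk, and the partial sums
# are combined at the end. This distributes cleanly.
chunks = [numbers[i:i + 25] for i in range(0, 100, 25)]
partial_sums = [sum(chunk) for chunk in chunks]
distributed_total = sum(partial_sums)

assert distributed_total == sum(numbers) == running_totals[-1]  # 5050
```

This is why row-at-a-time loops and position-based indexing translate poorly to distributed systems, while chunk-friendly operations translate well.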

“Yes, that is right,” I say, adding, “But Hadoop evolved to become easier to use. And a new system called Spark was created. Spark integrates well with Hadoop, plays nicely with different types of hardware, and is generally easy to use. It has become dominant today. And now there are interfaces like PySpark, Spark’s Python API, that work much better with distributed data than, say, Pandas.”

Mr. Pound interjects, “So I cannot use Pandas’ iloc construct?” Knowing that Mr. Pound loves Pandas for analyzing his data, I assure him, “Not directly, but PySpark is easy to pick up. You can even use Generative AI to help convert your Pandas code to PySpark.”

I continue, “Distributed computing made advances in Generative AI possible. Well, that and Transformer models, which are designed to process language on a massively parallel scale.”

Mr. Pound smiles and gets up. “I think I have made up my mind. I am going with Distributed Baking”, he says. I get up, too, knowing that my friend wants to go home, cook dinner, and await his wife, who is expecting their first child. We say our goodbyes and as I leave, Mr. Pound calls out laughing, “Hey, what do you think the chances are of us figuring out how to grow a baby in one month with 9 people?”.

Some things, I thought, are best left undistributed.


