Understanding Distributed Computing
Last Updated on April 7, 2024 by Editorial Team
Author(s): Renu Gehring
Originally published on Towards AI.
I am having coffee and a slice of pound cake one afternoon with my imaginary friend, Mr. Pound, in his charming and entirely fictitious bakery, “Pound Cakes and More”. Business has been great, my friend tells me, and he is thinking of expanding.
A note about my friend, Mr. Pound. In addition to being an astute businessman and an excellent confectioner, he is really into data science and technology. In his spare time, he analyzes data that he collects meticulously. I really like talking to him because, hey, who can turn down an offer of free pound cake with a side of cool techie talk?
Back to expanding “Pound Cakes and More”. Mr. Pound has one small oven and a mixer, and both are optimized to make four cakes at the same time. He believes that he can expand in two ways. Option (1), Go Big, is to purchase an XXL oven and mixer that will make 12 cakes concurrently. Option (2), Distributed Baking, is to purchase 3 additional pairs of small ovens and mixers. With both options, he will be able to make 12 additional cakes at the same time, but each option has its advantages and disadvantages.
With Go Big, Mr. Pound is worried about cake quality since he is not sure about the evenness of the temperature in the XXL oven. “Maybe the cakes in the corners will burn and the ones in the center will be underdone”, he frets. With Distributed Baking, he will need to carefully coordinate baking times across four different ovens. He might, he muses, be able to hire his nephew to help.
Suddenly, Mr. Pound’s face lights up. “I think that I have figured out what distributed computing is,” he declares.
Mr. Pound continues excitedly. “It is like me trying to clean the two seating areas in my bakery. Instead of just me, I would hire two people and each one would clean one room. Then I would supervise their results. The supervision would be extra effort, but the cleaning would get done in about half the time”.
As I help myself to another melt-in-your-mouth slice of pound cake, I interject, “Mr. Pound, have you considered the task of sorting numbers? How might you sort three random numbers? And then one hundred numbers? And finally, a thousand numbers?”
Well used to my leading questions, Mr. Pound responds affably. “With three numbers, it is easy. I would simply do them in my head. All in-memory processing and no distribution required”, he says with a smile.
“With one hundred numbers, I would grab a pencil and a piece of paper. I would peruse my numbers to find the lowest, write it down, go through the remaining 99 numbers to find the next higher number, repeating this process until I had a new perfectly ordered list of one hundred numbers. Isn’t this similar to building a bigger computer and utilizing in-memory processing as well as temporary writing to disk?” Mr. Pound says with a chuckle.
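For readers who like to see it in code, Mr. Pound's pencil-and-paper method is essentially what programmers call a selection sort. The Python below is only a minimal sketch of that idea; the function name `selection_sort` and the sample numbers are made up for illustration.

```python
def selection_sort(numbers):
    """Return a new sorted list by repeatedly picking the smallest remaining number."""
    remaining = list(numbers)  # copy, so the caller's list is left untouched
    ordered = []
    while remaining:
        smallest = min(remaining)   # peruse the remaining numbers for the lowest
        remaining.remove(smallest)  # cross it off the original list
        ordered.append(smallest)    # write it down on the new list
    return ordered


print(selection_sort([42, 7, 19, 3, 25]))  # prints [3, 7, 19, 25, 42]
```

Every pass rescans whatever is left, so the effort grows roughly with the square of the list length, which is why one hundred numbers is tolerable and a thousand is not.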
“With a thousand numbers… I don’t think I want to do this by myself”, Mr. Pound says. He pauses, thinks for a bit, and resumes, “I would hire ten helpers. And give them one hundred numbers each. They order their own hundred numbers and hand in their work. So now I have ten sheets with one hundred sorted numbers.” Mr. Pound pauses again and looks thoughtful. He continues, “I look at the first row of the 10 lists and write down the minimum number. I then cross it off, so the next number on that sheet moves to the top, and compare the tops of all ten sheets again to find the next highest number. Tedious, but do-able”, he declares.
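Mr. Pound's plan with ten helpers can be sketched in Python as "split, sort each piece, then merge". This is an illustration under some assumptions rather than how any particular distributed system does it: the names `split_into_chunks` and `distributed_sort` are hypothetical, the helpers are played by threads on a single machine (which in standard Python will not actually speed up sorting), and the final step uses the standard library's `heapq.merge`, which repeatedly takes the smallest number from the tops of the sorted sheets.

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(numbers, n_helpers):
    """Deal the numbers out into at most n_helpers roughly equal sheets."""
    chunk_size = -(-len(numbers) // n_helpers)  # ceiling division
    return [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]


def distributed_sort(numbers, n_helpers=10):
    """Sort each sheet independently, then merge the sorted sheets into one list."""
    if not numbers:
        return []
    chunks = split_into_chunks(numbers, n_helpers)
    # Each helper sorts only its own sheet and hands it back.
    with ThreadPoolExecutor(max_workers=n_helpers) as helpers:
        sorted_chunks = list(helpers.map(sorted, chunks))
    # The supervisor compares the tops of all sheets and writes down the smallest.
    return list(heapq.merge(*sorted_chunks))


random.seed(0)
numbers = [random.randint(0, 9999) for _ in range(1000)]
print(distributed_sort(numbers) == sorted(numbers))  # prints True
```

The merge is the supervisor's extra effort: it still touches every number once, but it never has to sort anything itself.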
Having finished my second scrumptious slice, I chime in, “Mr. Pound, this is why distributed computing began. As data grew, so did computers. But data grew faster. So, Doug Cutting and his brilliant friends created a system of connecting computers to each other and dividing work across these connected computers. Now work was distributed among these computers (the workers), but there was also a ‘master’ computer that collected intermediate results, did additional work, and then presented the completed task to the end user. Doug Cutting christened this new system Hadoop after his young son’s toy elephant,” I say, smiling.
“This system proved itself quickly. To sort one TB of data, Hadoop took three and a half minutes whereas the closest non-distributed system took three and a half days!” I say excitedly.
Because Mr. Pound looks a bit confused, I add, “One TB of data is close to 1,000,000,000,000 bytes, and that is a one followed by twelve zeroes.”
“But the problem was that this early version of Hadoop was difficult to use for those not trained in computer science. Many new tools and languages popped up, many of them with animal-themed names. There was Hive, Pig, Impala, even ZooKeeper”, I say, laughing.
“So did the old code work on the new distributed system?” Mr. Pound asks.
“No”, I reply. “Code had to be adapted. Many programming paradigms went out the window.”
Mr. Pound jumps in, “So if I wrote a loop, that would not distribute well because it works on one row at a time, right? And table indexing would not translate well because it assumes that the entire table is in one place.”
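Mr. Pound's hunch can be illustrated with a toy example in plain Python. Everything here is an assumption for the sake of illustration: the three inner lists stand in for slices of one table held on three separate machines, the cake-sales figures are invented, and `distributed_total` and `row_at` are hypothetical names. A total distributes well because each machine can add up its own rows; looking up "row number 4" does not, because nobody knows where that row lives without first asking how many rows every earlier machine holds.

```python
# Three "machines", each holding one slice of a table of daily cake sales.
partitions = [
    [12, 15, 9],
    [20, 18],
    [7, 11, 14, 16],
]


def distributed_total(partitions):
    """Each partition sums its own rows; only the small partial sums are combined."""
    partial_sums = [sum(part) for part in partitions]  # could run on separate machines
    return sum(partial_sums)


def row_at(partitions, position):
    """Positional lookup has to walk the partition sizes to locate the row."""
    if position < 0:
        raise IndexError("position out of range")
    for part in partitions:
        if position < len(part):
            return part[position]
        position -= len(part)  # skip past this partition's rows
    raise IndexError("position out of range")


print(distributed_total(partitions))  # prints 122
print(row_at(partitions, 4))          # prints 18
```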
“Yes, that is right”, I say, adding, “But Hadoop evolved to become easier to use. And a new system called Spark was created. Spark integrates well with Hadoop, plays nicely with different types of hardware, and is generally easy to use. It has become dominant today. And now there are tools like PySpark that work much better with distributed data than, say, Pandas.”
Mr. Pound interjects, “So I cannot use the Pandas iloc construct?” Knowing that Mr. Pound loves Pandas for analyzing his data, I assure him, “But PySpark is easy. Hey, you can even use Generative AI to help convert your Pandas code to PySpark.”
I continue, “Distributed computing made advances in Generative AI possible. Well, that and the Transformer models, which are designed to iterate to a closer understanding of language on a massively parallel scale.”
Mr. Pound smiles and gets up. “I think I have made up my mind. I am going with Distributed Baking”, he says. I get up, too, knowing that my friend wants to go home, cook dinner, and await his wife, who is expecting their first child. We say our goodbyes and as I leave, Mr. Pound calls out, laughing, “Hey, what do you think the chances are of us figuring out how to grow a baby in one month with 9 people?”
Some things, I think, are best left undistributed.
Published via Towards AI