Understanding Distributed Computing

Last Updated on April 7, 2024 by Editorial Team

Author(s): Renu Gehring

Originally published on Towards AI.

Photo by charlesdeluvio on Unsplash

I am having coffee and a slice of pound cake one afternoon with my imaginary friend, Mr. Pound, in his charming and entirely fictitious bakery, “Pound Cakes and More”. Business has been great, my friend tells me, and he is thinking of expanding.

A note about my friend, Mr. Pound. In addition to being an astute businessman and an excellent confectioner, he is really into data science, data, and technology. In his spare time, he analyzes data that he collects meticulously. I really like talking to him because hey, who can turn down an offer of free pound cake with a side of cool techie talk?

Back to expanding “Pound Cakes and More”. Mr. Pound has one small oven and a mixer, and both are optimized to make four cakes at the same time. He believes that he can expand in two ways. Option (1), Go Big, is to purchase an XXL oven and mixer that will make 12 cakes concurrently. Option (2), Distributed Baking, is to purchase 3 additional pairs of small ovens and mixers. With either option, he will be able to make 12 additional cakes at the same time, but each option has its advantages and disadvantages.

With Go Big, Mr. Pound is worried about cake quality since he is not sure about the evenness of the temperature in the XXL oven. “Maybe the cakes in the corners will burn and the ones in the center will be under-done”, he frets. With Distributed Baking, he will need to carefully set baking times with four different ovens. He might, he muses, be able to hire his nephew to help.

Suddenly, Mr. Pound’s face lights up. “I think that I have figured out what distributed computing is”, he declares.

Photo by CDC on Unsplash

Mr. Pound continues excitedly. “It is like me trying to clean the two seating areas in my bakery. Instead of just me, I would hire two people and each one would clean one room. Then I would supervise their results. The supervision would be extra effort, but the cleaning would get done in about half the time”.

As I help myself to another melt-in-your-mouth slice of pound cake, I interject, “Mr. Pound, have you considered the task of sorting numbers? How might you sort three random numbers? And then one hundred numbers? And finally, a thousand numbers?”

Well used to my leading questions, Mr. Pound responds affably. “With three numbers, it is easy. I would simply do them in my head. All in-memory processing and no distribution required”, he says with a smile.

“With one hundred numbers, I would grab a pencil and a piece of paper. I would peruse my numbers to find the lowest, write it down, go through the remaining 99 numbers to find the next higher number, repeating this process until I had a new perfectly ordered list of one hundred numbers. Isn’t this similar to building a bigger computer and utilizing in-memory processing as well as temporary writing to disk?”, Mr. Pound says with a chuckle.
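
Mr. Pound's pencil-and-paper routine is essentially what programmers call a selection sort. The snippet below is only a minimal sketch of that idea on a single machine; the function name `selection_sort` and the sample numbers are made up for illustration.

```python
def selection_sort(numbers):
    """Return a new ascending list by repeatedly pulling out the lowest remaining value."""
    remaining = list(numbers)      # copy, so the caller's list is left untouched
    ordered = []
    while remaining:
        lowest = min(remaining)    # peruse everything that is still unsorted
        remaining.remove(lowest)   # cross it off the original sheet
        ordered.append(lowest)     # write it down on the new sheet
    return ordered


print(selection_sort([42, 7, 19, 7, 3]))  # [3, 7, 7, 19, 42]
```

Each pass rescans all the numbers that are left, which is why the approach gets tedious quickly as the list grows.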

“With a thousand numbers… I don’t think I want to do this by myself”, Mr. Pound says. He pauses, thinks for a bit, and resumes, “I would hire ten helpers. And give them one hundred numbers each. They order their own hundred numbers and hand in their work. So now I have ten sheets with one hundred sorted numbers.” Mr. Pound pauses again and looks thoughtful. He continues, “I look at the first row of the 10 lists and write down the minimum number. I can then look at the same row and some additional rows to find the next highest number. Tedious, but do-able”, he declares.
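
The ten-helper plan can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular distributed system is implemented: the function name `distributed_sort` is hypothetical, the helpers work one after another rather than in parallel, and the standard library's `heapq.merge` plays the role of Mr. Pound comparing the top rows of the sorted sheets.

```python
import heapq
import random


def distributed_sort(numbers, helpers=10):
    """Split the numbers among helpers, sort each share, then merge the sorted sheets."""
    # Ceiling division gives the share per helper; max() guards against an empty input.
    share = max(1, -(-len(numbers) // helpers))
    # Each helper sorts only its own share (simulated sequentially here).
    sheets = [sorted(numbers[i:i + share]) for i in range(0, len(numbers), share)]
    # The supervisor repeatedly takes the smallest value at the top of any sheet.
    return list(heapq.merge(*sheets))


random.seed(0)
numbers = [random.randint(1, 10_000) for _ in range(1000)]
print(distributed_sort(numbers) == sorted(numbers))  # True
```

The merge step is the "supervision" cost: it is extra work that a single sorter would not have, but each helper's task is ten times smaller.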

Having finished my second scrumptious slice, I chime in, “Mr. Pound, this is why distributed computing began. As data grew, so did computers. But data grew faster. So, Doug Cutting and his brilliant friends created a system of connecting computers to each other and dividing work across these connected computers. Now work was distributed among these computers (executors) but there was also a “master” computer that collected intermediate results, did additional work, and then presented the complete task to the end user. Doug Cutting christened this new system Hadoop after his young son’s toy elephant,” I say smiling.

“This system proved itself quickly. To sort one TB of data, Hadoop took three and a half minutes whereas the closest non-distributed system took three and a half days!”, I say excitedly.

Because Mr. Pound looks a bit confused, I add, “One TB of data is close to 1,000,000,000,000 bytes and that is twelve zeroes.”

“But the problem was that this early version of Hadoop was difficult to use by those not trained in computer science. Many new tools and languages popped up, many of them named for animals. There was Hive, Pig, Impala, even ZooKeeper”, I say, laughing.

“So did the old code work on the new distributed system?” Mr. Pound asks.

“No”, I reply. “Code had to be adapted. Many programming paradigms went out the window.”

Mr. Pound jumps in, “So if I wrote a loop, that would not distribute well because it works on one row at a time, right? And table indexing would not translate well because it assumes that the entire table is in one place.”
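
Mr. Pound's two observations can be made concrete with a toy example. Everything here is an assumption for illustration: the `sales` figures are invented, two threads stand in for two separate computers, and plain Python lists stand in for partitions of a distributed table.

```python
from concurrent.futures import ThreadPoolExecutor

sales = [12, 8, 15, 9, 11, 14, 10, 13]   # hypothetical cakes sold per day
partitions = [sales[0:4], sales[4:8]]    # the same table split across two "computers"

# Single-machine style: one loop visits every row in order.
total_loop = 0
for cakes in sales:
    total_loop += cakes

# Distributed style: each worker sums its own partition,
# then the master combines the partial results.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_sums = list(pool.map(sum, partitions))
total_distributed = sum(partial_sums)

print(total_loop, partial_sums, total_distributed)  # 92 [44, 48] 92

# Positional indexing: row 5 of the whole table is row 1 of partition 1,
# so a global position only makes sense if you know how the table was split.
row = 5
print(sales[row], partitions[row // 4][row % 4])  # 14 14
```

The loop needs the whole table in one place, while the partitioned version only ever asks each worker about its own rows. That is the shift in thinking that distributed tools ask of the programmer.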

“Yes, that is right”, I say, adding, “But Hadoop evolved to become easier to use. And a new system called Spark was created. Spark integrates well with Hadoop, plays nicely with different types of hardware, and is generally easy to use. It has become dominant today. And now there are tools like PySpark that work much better with distributed data than, say, Pandas.”

Mr. Pound interjects, “So I cannot use the Pandas iloc construct?” Knowing that Mr. Pound loves Pandas for analyzing his data, I assure him, “But PySpark is easy. Hey, you can even use Generative AI to help convert your Pandas code to PySpark.”

I continue, “Distributed computing made advances in Generative AI possible. Well, that and the Transformer models, which are designed to iterate to a closer understanding of language on a massively parallel scale.”

Mr. Pound smiles and gets up. “I think I have made up my mind. I am going with Distributed Baking”, he says. I get up, too, knowing that my friend wants to go home, cook dinner, and await his wife, who is expecting their first child. We say our goodbyes and as I leave, Mr. Pound calls out laughing, “Hey, what do you think the chances are of us figuring out how to grow a baby in one month with 9 people?”

Some things, I think, are best left undistributed.
