Blockchain and Decentralization: Paving the Path to True Open-Source AI
Last Updated on November 3, 2024 by Editorial Team
Author(s): Laszlo Fazekas
Originally published on Towards AI.
The key players in AI can be broadly divided into two camps: advocates of open-source AI and proponents of closed, proprietary AI.
Interestingly, one of the strongest advocates for closed AI is OpenAI, which, despite its name, does not release the source code for its models. Instead, it only offers access to them. Their reasoning is often framed around safety concerns, claiming that releasing these models could pose risks similar to nuclear technology, necessitating centralized control. While this argument has merit, it's also clear that business interests play a significant role. After all, if ChatGPT's source code were freely available, who would pay for its services?
On the other hand, open-source AI supporters, such as Meta (Facebook), argue that closed AI stifles innovation and that embracing open-source is the way forward. However, business considerations are also at play here. For Meta, AI is not a core product but a tool, so sharing the models doesn't harm their business; it benefits them, as they can leverage improvements made by the community. Still, there's a catch: this approach isn't entirely open-source.
An AI model is a complex mathematical function with adjustable parameters determined during the training process. When a company claims to offer open-source AI, it typically means that these parameters are made publicly available, allowing others to run the model on their systems. However, this doesn't represent full open-source accessibility.
In AI, training is akin to the build process in traditional software development. In this analogy, the model's parameters are like a compiled binary file. So when companies like Meta or X (formerly Twitter) make their models available as open-source, what they're sharing is just the final product.
What we receive is a set of parameters for a fixed architecture. If we want to modify or enhance the architecture (for instance, swapping the Transformer architecture for a Mamba architecture), we would need to retrain the model from scratch, which isn't possible without access to the original training data. As a result, these models can only be fine-tuned, not fundamentally altered or further developed.
The so-called open-source models are not truly open-source, as the architecture is fixed. These models can only be fine-tuned but not further developed, as that would require the training set as well. True open-source AI consists of both the model and the training set!
"Open-source" AI models are usually developed by large corporations. This makes sense, as training massive models demands immense computational power and significant financial investment. Only big companies possess the necessary resources, leading to the centralization of AI development.
Just as blockchain technology in the form of Bitcoin created the possibility of decentralized money, it also allows us to create truly open-source AI that is owned by the community instead of a company.
This article presents a concept for developing a truly open-source, community-driven AI by leveraging blockchain technology.
As mentioned earlier, the core of a truly open-source AI lies in having an open dataset, which is the most valuable asset. For example, ChatGPT's language model was initially trained on publicly available datasets like Common Crawl, then fine-tuned through human oversight (using reinforcement learning with human feedback, or RLHF). This fine-tuning process is particularly costly due to the human labor involved, but it's what gives ChatGPT its distinctive capabilities. The model architecture itself is likely a general transformer or a variant like the Mixture of Experts, which involves multiple parallel transformers. In essence, the architecture isn't groundbreaking; what sets models like ChatGPT apart is the quality of the dataset, which drives the model's performance.
AI training datasets often span several terabytes, and the content they include varies depending on cultural and group-specific perspectives. The choice of data is crucial, as it can define the "personality" of a large language model. For instance, major scandals have occurred when AI models from companies like Google and Microsoft exhibited racist behavior, largely due to poor dataset selection. Since dataset requirements can differ by culture, creating multiple forks may be necessary. Decentralized, content-addressed storage solutions like IPFS or Ethereum Swarm are ideal for managing these versioned, multi-fork datasets. These systems function similarly to the Git version control system, where each file is referenced by a hash generated from its content. Forks can be created efficiently in such systems, as only the differences are stored, while the shared portions of the datasets are stored just once.
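The deduplication property described above can be sketched in a few lines. `ChunkStore` is a toy stand-in for illustration only, not the actual Swarm or IPFS API: content is keyed by its SHA-256 hash, so two dataset forks that share most chunks physically store the shared portion once.

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store: each chunk is keyed by the SHA-256
    of its content, so identical chunks across forks are stored once."""
    def __init__(self):
        self.chunks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.chunks[key] = data  # idempotent: same content, same key
        return key

    def get(self, key: str) -> bytes:
        return self.chunks[key]

store = ChunkStore()
# An original dataset and a fork that changes only the last chunk.
original = [b"common crawl part 1", b"common crawl part 2", b"original extra data"]
fork     = [b"common crawl part 1", b"common crawl part 2", b"curated replacement"]

original_refs = [store.put(c) for c in original]
fork_refs     = [store.put(c) for c in fork]

# Six references in total, but only four unique chunks are stored.
print(len(original_refs) + len(fork_refs))  # 6
print(len(store.chunks))                    # 4
```

Storing a fork therefore costs only as much as the data it actually changes, which is what makes community forking of terabyte-scale datasets economically feasible.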
Once we have the right datasets, we can move forward with training the model.
As mentioned earlier, an AI model is essentially a large mathematical function with many adjustable parameters. Generally, the more parameters a model has, the more βintelligentβ it is, which is why the number of parameters is often included in the modelβs name. For instance, llma-2β7b indicates that the model is based on the llma-2 architecture and has 7 billion parameters. During training, these parameters are adjusted using the dataset to ensure the model produces the correct output for a given input. This process is accomplished through backpropagation, which uses partial derivatives to optimize the parameters.
The dataset is divided into batches during training. For each step, a batch provides input and expected output values, and backpropagation is used to adjust the modelβs parameters so it can correctly predict the output based on the input. This process is repeated many times until the model reaches the desired level of accuracy, which is verified using a test dataset.
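The batched loop described above can be illustrated with a deliberately tiny example: fitting a one-parameter model y = w·x by gradient descent. The hand-derived gradient stands in for backpropagation; real frameworks such as PyTorch or JAX compute these partial derivatives automatically.

```python
# Toy "dataset": pairs (x, y) where the true relationship is y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 11)]
# Divide the dataset into batches, as described above.
batches = [data[i:i + 5] for i in range(0, len(data), 5)]

w, lr = 0.0, 0.01  # single adjustable parameter and learning rate
for epoch in range(50):
    for batch in batches:
        # dL/dw for the batch loss L = mean((w*x - y)^2)
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad  # one "backpropagation" step per batch

print(round(w, 3))  # converges to 3.0
```

Each batch step nudges the parameter toward values that predict the batch's outputs from its inputs; repeating over many epochs drives the error toward zero, exactly as in the full-scale process, just with billions of parameters instead of one.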
Large companies often train models using vast GPU clusters due to the high computational demands of the process. In decentralized systems, however, a key challenge is that individual nodes are unreliable, and this unreliability comes at a cost. This is similar to how Bitcoinβs Proof of Work consensus mechanism consumes massive amounts of energy β trust in the system is achieved by relying on computational power instead of individual node reliability. In contrast, Ethereumβs Proof of Stake mechanism reduces energy consumption by using staked assets to ensure reliability rather than computational power.
In decentralized AI training, there needs to be a way to ensure trust between the training node and the requester. One possible solution is for the training node to log the entire training process, while a third-party validator node performs random checks on the log. If the validator is satisfied with the results, the training node receives payment. The validator does not check the entire log, as that would require duplicating the computations, which would be as resource-intensive as the original training.
Another option is an optimistic approach, where we assume the node performed the computation correctly, but allow a challenge period in which anyone can dispute the result. In this case, the node performing the computation stakes a larger amount (penalty), and the node requesting the computation also stakes an amount (reward). The node performs the computation and then publishes the result. After the node publishes its result, a challenge period (e.g., one day) follows. If an error is found during this time, the challenger receives the penalty, and the requester gets their reward back. If no errors are found, the computing node receives the reward.
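The optimistic flow above can be sketched as a small state machine. This is an illustrative model with made-up names (`OptimisticTask` and its methods), not a real smart contract; in practice the stakes, result, and challenge window would live on-chain.

```python
class OptimisticTask:
    """Sketch of optimistic verification: stakes, result, challenge window."""
    def __init__(self, reward: int, penalty_stake: int, challenge_period: int):
        self.reward = reward                  # locked by the requester
        self.penalty = penalty_stake          # locked by the computing node
        self.challenge_period = challenge_period
        self.result = None
        self.published_at = None
        self.challenged = False

    def publish_result(self, result, now: int):
        self.result = result
        self.published_at = now               # challenge window starts here

    def challenge(self, proof_of_error: bool, now: int) -> str:
        if now > self.published_at + self.challenge_period:
            return "challenge period over"
        if proof_of_error:
            self.challenged = True
            return "challenger receives penalty; requester refunded reward"
        return "invalid challenge"

    def settle(self, now: int) -> str:
        if self.challenged:
            return "already settled in challenger's favor"
        if now <= self.published_at + self.challenge_period:
            return "challenge period still open"
        return "computing node receives reward"

# Happy path: result published, one-day window passes unchallenged.
task = OptimisticTask(reward=100, penalty_stake=500, challenge_period=86400)
task.publish_result("model_weights_hash", now=0)
print(task.settle(now=90000))  # computing node receives reward
```

Note the asymmetry: the computing node's penalty stake is larger than the reward, so cheating risks more than honest work can earn.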
There's also a class of zero-knowledge proofs, known as zkSNARKs, that can verify computations. The benefit of this method is that verification is inexpensive, but generating the proof is computationally demanding. For AI training, this method is currently too resource-intensive, requiring more computation than the training itself. However, zkML is a growing field of research, and in the future, smart contracts could potentially replace third-party validators by verifying the computation using zkSNARKs.
From the above, itβs clear there are multiple methods for verifying computations. Based on this, letβs explore how a blockchain-based decentralized training system would be constructed.
In this system, datasets are collectively owned by the community through DAOs (Decentralized Autonomous Organizations). The DAO determines which data should be included in the dataset. If a subset of members disagrees with these decisions, they can split off to form a new DAO, where they can fork the existing dataset and continue developing it independently. In this way, both the DAO and dataset can be forked. Since the dataset is stored in decentralized, content-addressed storage (like Ethereum Swarm), forking is cost-effective, and the community funds the dataset's storage.
The training process is also managed by a DAO. Training nodes that wish to offer their spare computational capacity can register through the DAO. To qualify, each node must stake an amount in a smart contract, which they risk losing if they attempt to cheat during the computation.
A requester selects the dataset and model they want to train and offers a reward for the task. This offer is public, allowing any training node to apply. The training node then logs the entire training process, with each entry corresponding to the training of a batch. The log includes the input, output, weight matrix, and all relevant parameters (such as the random seed used by the dropout layer). This allows the entire computation to be fully reproduced from the log.
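Because each log entry fixes every source of randomness (seed included), a validator can replay any single batch step and compare the resulting weights. The sketch below is an assumed entry structure for illustration, with a trivial stand-in for the actual backpropagation update:

```python
import random

def train_step(weights, batch, seed):
    """Stand-in for one batch update; the seed fixes dropout-style randomness."""
    random.seed(seed)
    noise = random.random()
    return [w - 0.01 * (w * x - y) * x * noise
            for w, (x, y) in zip(weights, batch)]

def log_entry(step, weights_before, batch, seed):
    """One log entry: everything needed to replay this step deterministically."""
    weights_after = train_step(weights_before, batch, seed)
    return {"step": step, "seed": seed, "batch": batch,
            "weights_before": weights_before, "weights_after": weights_after}

def validate(entry) -> bool:
    """A validator re-runs the logged step and checks the claimed result."""
    recomputed = train_step(entry["weights_before"], entry["batch"], entry["seed"])
    return recomputed == entry["weights_after"]

entry = log_entry(0, [0.5, 0.5], [(1.0, 3.0), (2.0, 6.0)], seed=42)
print(validate(entry))  # True: the step replays exactly
```

Spot-checking random entries this way costs the validator only a tiny fraction of the full training run, which is what makes third-party verification economical.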
As previously mentioned, there are several ways to verify computations. The simplest method is the optimistic approach. In this case, the requester locks the reward in a smart contract, and the training node publishes the training log. A designated period (e.g., one day) is then available for verification. If the requester or anyone else can prove a specific step is incorrect during this time, the training node loses its stake, and the requester retrieves their reward. The person who submits the correct proof earns the stake, incentivizing the community to check the computations. If no proof is submitted, the training node receives the reward when the period ends.
In summary, this is how the system operates, though a few questions remain.
Who will pay for the cost of training and storing the datasets?
The business model of this system mirrors that of many free and open-source solutions, like the Linux model. If a company needs an AI model and is comfortable with it being free and open-source, it's far more cost-effective to invest in this shared system than to train a proprietary model from scratch. For example, if 10 companies require the same language model and don't mind it being open, it's much cheaper for each to cover one-tenth of the training cost than for each to bear the full expense individually. The same principle applies to the datasets used for training. Additionally, crowdfunding campaigns could be organized for training models, allowing future users to contribute to their development.
Isn't it cheaper to train models in the cloud?
Since prices in such a system are governed by market forces, it's challenging to give a definitive answer. The cost will depend on how much unused computational capacity users have available. We've already witnessed the power of community-driven systems with Bitcoin, where the network's computational power exceeds that of any supercomputer. Unlike cloud providers, who need to generate profit, users in a decentralized system can offer their spare computational resources. For example, someone with a high-performance gaming PC can contribute unused capacity when they're not gaming. If the service generates even slightly more than the energy costs, it becomes worthwhile for the user. Additionally, there is a significant amount of wasted energy worldwide that traditional methods can't harness. Take thermal energy from volcanoes, for instance: these areas often lack the infrastructure for generating electricity. Some startups are already using this energy for Bitcoin mining, so why not apply it to "intelligence mining"? With virtually free energy, the only real cost is the hardware. Various factors therefore suggest that training models in such a decentralized system could be far cheaper than using cloud services.
What about inference?
When it comes to running AI models, privacy is a critical concern. While large service providers assure us that our data is handled securely, how can we be certain that no one is listening in on our interactions with models like ChatGPT? Techniques such as homomorphic encryption allow servers to process encrypted data, but they come with significant performance costs. The most secure option is to run models locally. Fortunately, hardware is becoming more powerful, and there are already specialized solutions for running AI models on local devices. The models themselves are also evolving rapidly. Research shows that performance often remains high even after quantization, with minimal degradation, even in extreme cases where weights are represented with as little as 1.58 bits. This is particularly promising because it eliminates multiplication, which is the most resource-intensive operation. As model and hardware advancements continue, we are likely to be able to run models locally that surpass human-level performance. Additionally, with tools like LoRA, we'll have the ability to customize these models to our personal preferences.
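Why quantizing weights to roughly 1.58 bits eliminates multiplication is easy to see: with weights restricted to the three values {-1, 0, +1}, a dot product (the core of every matrix multiplication) reduces to additions and subtractions. A minimal sketch:

```python
def ternary_dot(weights, activations):
    """Dot product with ternary weights: no multiplication needed."""
    total = 0.0
    for w, a in zip(weights, activations):
        if w == 1:
            total += a   # add instead of multiply
        elif w == -1:
            total -= a   # subtract instead of multiply
        # w == 0 contributes nothing
    return total

weights = [1, -1, 0, 1]
activations = [0.5, 2.0, 7.0, 1.5]
print(ternary_dot(weights, activations))  # 0.5 - 2.0 + 1.5 = 0.0
```

The three-valued weight carries log2(3) ≈ 1.58 bits of information, hence the name; hardware only needs adders, which are far cheaper and more energy-efficient than multipliers.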
The enormous potential of binarized and 1.58-bit neural networks
Quantization is a frequently used method to reduce the memory and computational capacity requirements of our machine…
medium.datadriveninvestor.com
Distributed knowledge
A highly promising approach is retrieval-augmented generation (RAG), where "lexical knowledge" is stored in a vector database, and the language model retrieves the relevant context from this database to answer questions. This mimics how humans process information: no one memorizes an entire dictionary. Instead, when faced with a question, it's enough to know where to find the necessary information; by reading and interpreting relevant sources, we can formulate coherent answers. This approach offers several advantages. First, it allows the use of a smaller model, which is easier to run locally. Second, it helps reduce hallucination, a common issue in language models. Additionally, expanding the model's knowledge becomes simple: there's no need for retraining; you just add new information to the vector database. Ethereum Swarm is well-suited for creating such a database, as it serves not only as decentralized storage but also as a communication platform. For instance, Swarm can facilitate group messaging, allowing the creation of a distributed vector database: a node can publish a search query, and other nodes respond with the relevant knowledge.
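The retrieval half of RAG can be sketched in a few lines. The embeddings below are hand-made 3-dimensional toy vectors purely for illustration; a real system would use a learned embedding model and a distributed store such as the Swarm-based database described above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy vector database: document text -> embedding vector.
vector_db = {
    "Ethereum Swarm is decentralized storage.": [0.9, 0.1, 0.0],
    "Bitcoin uses Proof of Work.":              [0.1, 0.9, 0.0],
    "LoRA fine-tunes models cheaply.":          [0.0, 0.1, 0.9],
}

def retrieve(query_vec, k=1):
    """Return the k documents closest to the query embedding."""
    ranked = sorted(vector_db.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]

# A query embedding close to the "storage" document retrieves it as context.
print(retrieve([0.8, 0.2, 0.1]))  # ['Ethereum Swarm is decentralized storage.']
```

The retrieved passages are then prepended to the prompt, so a small local model can answer from fresh, community-maintained knowledge instead of memorized parameters.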
A Concept of Distributed Artificial Intelligence aka a New Type of World Wide Web
Do you know how large corporate document databases work, where you can ask questions about the content of millions of…
medium.datadriveninvestor.com
Summary: Implementation of LLM OS over Ethereum and Swarm
The concept of LLM OS was first introduced by Andrej Karpathy on Twitter. It envisions a hypothetical operating system built around a large language model. In the context of our blockchain-based distributed system, this can be thought of as an agent running on a user's node. This agent can communicate with other agents as well as traditional Software 1.0 tools, such as a calculator, Python interpreter, or even control devices like robots, cars, or smart home systems. In our setup, the file system is represented by Ethereum Swarm, along with a vector database built on top of Swarm, where shared knowledge is stored. The entire system, made up of these agents, can be seen as a form of collective intelligence.
https://x.com/karpathy/status/1723140519554105733
I believe that in the future, artificial intelligence will be deeply integrated into our daily lives, far beyond what we experience today. AI will essentially become a part of who we are. Instead of relying on mobile phones, we'll wear smart glasses equipped with cameras that record everything and microphones that capture every sound. We'll engage in continuous dialogues with language models and other agents running locally, which will fine-tune themselves to meet our needs over time. These agents won't just interact with us; they'll also communicate with each other, constantly tapping into the collective knowledge generated by the entire community. This system will transform humanity into a form of collective intelligence, which is incredibly powerful. This collective intelligence mustn't become controlled by a single company or entity. That's why systems like the ones discussed, or similar alternatives, are so essential!
Published via Towards AI