Do Your Data Right by Web3

Without self-sovereignty, your data is never your data. Can web3 be the game changer in democratizing data ownership and monetization?

May 20, 2023

Medieval peasants working the land — Gilles de Rome, CC BY-SA 4.0, via Wikimedia Commons

If you use web2 services, as I bet most of us do, your digital identity is inevitably exposed. That exposure is treated as essential to maintaining law and order in the traditional world. Due diligence, or know-your-customer (KYC), keeps tightening because authorities ride herd on terrorism and other criminal activity, which in turn sacrifices the privacy of the innocent. While we fixate on fighting and preventing wrongdoing, the rights of ordinary people are ignored.

Good apples pay for the bad apples. Yet cybercrime does not dissipate; it only breeds. Bad actors steal the profiles of well-known figures, such as company executives, government officials, or influencers, and impersonate them to place advertisements and defraud other users on social media.

It is a classic dysfunction of web2 platforms: personal data is easily tampered with, and service providers are slack about eliminating bad actors, or are even incentivized to collude with them when those actors contribute to platform income, such as advertising revenue.

On the other hand, advertisers with legitimate business intentions spend huge sums reaching irrelevant audiences. Their advertisements are often perceived as annoying on the receiving end, and many are delivered indiscriminately based on demographic classification alone, which is not the same as individual preference. The result is wasted resources and a mismatch between supply and demand.

Web2 platforms see this too, so they have been pushing hard to collect digital footprints as a proxy for users' preferences. With the aid of AI, platforms inspect users' browsing histories thoroughly and resell them to advertisers, promising that AI will tailor advertisements to boost their clients' sales. Still, irritated users look for loopholes in the system, from legitimately going incognito to more deceptive means such as creating fake personas or setting up firewalls to avoid identity abuse and theft, leaving advertisers to inflate their budgets year after year in vain.

This daunting inefficiency haunts web2, forcing platform operators to "upgrade their technology" endlessly as a marketing pitch to reduce churn. Some have been crushed, and the survivors have picked up their remains and moved on. Consolidation has continued until only a few contenders are left. They dictate the web2 ecosystem, harvesting the data of ordinary people for their own gain, like predatory overlords over their serfs in medieval times.

Proprietary AI Hampers

Big Tech companies are investing heavily in AI, piling up capital as their moat, much as they did to build their past successes. According to rumors from alleged insiders, OpenAI ran up a USD 540 million loss in 2022 alone. Sam Altman, CEO of OpenAI, privately suggested raising USD 100 billion in funding to develop a more advanced AI model than ChatGPT, which would make his company the "most capital-intensive startup in Silicon Valley history." Creating, running, and maintaining a gargantuan AI model is never cheap: human capital and computing power are costly to amass, and data aggregators are waking up and starting to charge Big Tech for the data used to train its AI models.

However, pursuing AI commercially does not guarantee success. An internal note from a Google engineer has leaked to the public, highlighting how fast open-source AI is catching up in the arms race. Since LLaMA, Meta's proprietary large language model, was released in February 2023 and leaked online just a week later, various fine-tuned, open-source versions have emerged.

Moreover, as more researchers join in, looking for cheaper datasets and alternative sources of computing power, it is becoming plausible to develop models whose performance matches, or at least approaches, that of the sophisticated models crafted by Big Tech. Some cost only a few hundred US dollars to train.

Although the internal note does not represent the consensus of Google as a whole, and we cannot verify whether it genuinely came from inside the company, some professionals are uneasy, wary of open source snapping at their heels.

Rob Reich, the associate director of Stanford's Institute for Human-Centered Artificial Intelligence, warns of the dangers that lie in open-source AI models. These models, while appearing to be benign and democratic, could be weaponized by bad actors looking to cause harm to our society. The impact such a scenario could have on humanity is akin to providing villains with nuclear devices, opening the door to existential risks.

While on the surface it may seem dangerous to allow unrestricted access to AI models, that belief is too superficial. The potential harm to humans does not stem from developers' choice to open-source their models. A few proprietary models dominate mainstream usage, yet misbehavior still occurs frequently. We have seen the same with traditional web2 products, such as social media and search engines, whose source code Big Tech keeps as a trade secret. So AI safety does not depend on whether a model is open source.

Transparency is important. As long as control remains in the hands of a few oligopolists, we need countermeasures urgently. Open source is one of them.

Protect User Data First

Rather than focusing on enclosing the models, we should prioritize the protection of user data. AI models derive their immense power not from the code but from the data scraped from the internet. Data is forced to be open in the web2 setup, which enables AI models to thrive in the first place. Our anxieties intensify as we realize these companies profit from the data they capture, manipulating it against our wishes, and we have to bear the consequences of their barbarity.

Average Joes without technical expertise have little hope of safeguarding their data. Their bargaining power is weak, leaving them no choice but to lean on web2 services or shut down their digital presence completely. Big Tech companies are forming a cartel instead of competing with each other because data is so valuable to their businesses.

Don't get me wrong: arms races always exist within the coalition, but none of its members advocates freeing personal data under the capitalist regime. Instead, Big Tech companies polish their end products, such as the user interface and experience, to gain traction and retention, constantly trading usability for data.

Web2 is saturated. It is too crowded to expand further, with too little residual value left to extract. A quarter of a century after the dot-com bubble, we are returning to the fundamental question of the internet: how to let users own their data. Though the motive is still more economic than anything else, activists are hastening the web3 rebellion, longing for a reshuffle of the cards.

We Don't Need Another Web2 Data Agent

"Data is the new oil" is a frequently discussed notion, insofar as data is regarded as a precious asset, at least in the eye of the beholder. Like oil, data is controlled by a handful of giant organizations. The difference is that an operator usually needs only a license in one location to explore for petroleum, but must comply with multiple data sovereignty policies across countries to handle user data, which lacks a shared, common standard.

Indeed, it is not ideal for corporations and users to be limited by physical geographical boundaries in a supposedly borderless digital world. A new set of data usage mechanisms, user-centric and self-sovereign, is thus needed.

Worldcoin, another capital-intensive startup co-founded by OpenAI's Sam Altman, has the grand ambition of creating a global digital identity system by scanning everyone's eyeballs. Despite its emphasis on decentralization, what prominently sets it apart from a genuine web3 project is its treatment of our biometric data. The iris code, essentially a numerical representation of our iris images, will be stored in its database, making it a web2 platform with no self-sovereignty, camouflaged by airdrops of cryptocurrency.

Throwing money at technology, particularly AI, to resolve the problem of data sovereignty is a misdirection. Though this article is not meant to whine about how Worldcoin, OpenAI, or other web2 Big Tech companies exploit the data of the less powerful, we have to face up to the problem. Over the years, we have depended so heavily on intermediaries as agents for our data that these agents have grown powerful enough to enslave us. What we intend no longer matches what our agents have in mind.

Some have suggested setting up charitable agencies to handle personal data as substitutes. Even if these middlemen or council-like organizations are compliant in theory, data creators may not trust such hired agents.

Web3 and Self Identity

In a self-sovereign setup, no central authority, whether a company, a government, or an institution, owns or controls people's digital identities. Individuals create their own unique digital identities and must verify and authenticate them independently, without the help of third-party agents. Users have the final say over their digital identity data, which is not tied to any particular service, platform, or organization. Of course, this self-reliant design fosters a more private, secure, and trustworthy environment at the cost of user experience. For example, users need a way to recover digital identities they accidentally lose in a world of self-verification.

Whether you call it a protocol, an ecosystem design, or a consensus mechanism, the rules of the game, deep-rooted since web2 flourished, have to be reinvented.

Economist Eric Glen Weyl, researcher Puja Ohlhaver, and Ethereum co-founder Vitalik Buterin introduced the idea of the soulbound token, a proof-of-humanity identity represented as an on-chain, non-transferable token in web3. The tokenized credentials are inscribed in a "Decentralized Society," where souls (individuals or entities that issue and hold soulbound tokens from their web3 wallets) and communities can interact in various ways, such as defining copyright of works, obtaining credit, making attestations, tracing provenance, and distributing rewards, without intervention from any agents or authorities.
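
To make "non-transferable" concrete, here is a minimal sketch in Python. It is an in-memory toy, not an on-chain contract and not the authors' specification: the class names, the addresses, and the claim text are all hypothetical, chosen only to illustrate a credential that can be issued to a soul but never moved to another.

```python
from dataclasses import dataclass, field


class TransferNotAllowed(Exception):
    """Raised on any attempt to move a soulbound token to another soul."""


@dataclass(frozen=True)
class SoulboundToken:
    token_id: int
    issuer: str  # address of the soul that issued the credential
    holder: str  # address of the soul the credential is bound to
    claim: str   # what the credential attests


@dataclass
class SoulboundRegistry:
    """Hypothetical in-memory stand-in for an on-chain token contract."""
    tokens: dict = field(default_factory=dict)

    def issue(self, issuer: str, holder: str, claim: str) -> SoulboundToken:
        # Tokens are never deleted in this sketch, so the count gives a fresh id.
        token = SoulboundToken(len(self.tokens) + 1, issuer, holder, claim)
        self.tokens[token.token_id] = token
        return token

    def transfer(self, token_id: int, new_holder: str) -> None:
        # The defining property: the token never leaves its original holder.
        raise TransferNotAllowed(f"token {token_id} is bound to its holder")

    def credentials_of(self, holder: str) -> list:
        return [t for t in self.tokens.values() if t.holder == holder]


# Illustrative addresses and claim, assumed for this example only.
registry = SoulboundRegistry()
degree = registry.issue("0xUniversity", "0xAlice", "BSc Computer Science")

try:
    registry.transfer(degree.token_id, "0xBob")
except TransferNotAllowed as err:
    print(err)

print(registry.credentials_of("0xAlice") == [degree])
```

Because the token cannot change hands, anyone reading the registry can treat it as a statement about the holder rather than about whoever last bought it.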

This does not mean eliminating trust; rather, it is a matter of whom we should trust and how we appoint them in the transition to web3. Vitalik has detailed the concept of social recovery in a blog post, laying the foundation for a new form of trust that shifts from agent hegemony to user autonomy. Users can freely choose, and later change, the guardians they trust in advance to help them recover their identities if their souls are lost.
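
The guardian idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the design from the blog post: plain strings stand in for cryptographic keys and signatures, and the class name, the method names, and the threshold rule are hypothetical. It shows only the core logic, where the key holder picks guardians and enough of them together can rotate a lost key.

```python
from dataclasses import dataclass, field


@dataclass
class RecoverableIdentity:
    signing_key: str  # placeholder for the user's real signing key
    guardians: set    # identities the user has chosen to trust
    threshold: int    # approvals needed to rotate the key
    _approvals: dict = field(default_factory=dict, init=False, repr=False)

    def __post_init__(self) -> None:
        self.guardians = set(self.guardians)
        self._check_threshold(self.guardians, self.threshold)

    @staticmethod
    def _check_threshold(guardians: set, threshold: int) -> None:
        if not 1 <= threshold <= len(guardians):
            raise ValueError("threshold must be between 1 and the number of guardians")

    def set_guardians(self, caller_key: str, guardians: set, threshold: int) -> None:
        # Only the current key holder may choose or change guardians.
        if caller_key != self.signing_key:
            raise PermissionError("only the current key holder may change guardians")
        guardians = set(guardians)
        self._check_threshold(guardians, threshold)
        self.guardians, self.threshold = guardians, threshold
        self._approvals.clear()  # pending recoveries no longer apply

    def approve_recovery(self, guardian: str, new_key: str) -> bool:
        # A guardian votes for a new key; True means the key was rotated.
        if guardian not in self.guardians:
            raise PermissionError(f"{guardian} is not a guardian")
        voters = self._approvals.setdefault(new_key, set())
        voters.add(guardian)
        if len(voters) < self.threshold:
            return False
        self.signing_key = new_key
        self._approvals.clear()
        return True


# Illustrative names, assumed for this example only.
identity = RecoverableIdentity("key-lost", {"alice", "bob", "carol"}, threshold=2)
print(identity.approve_recovery("alice", "key-new"))  # one approval is not enough
print(identity.approve_recovery("bob", "key-new"))    # second approval rotates the key
print(identity.signing_key)
```

No single guardian can take over the identity, and the user can replace guardians at any time while still holding the key, which is the shift from agent hegemony to user autonomy described above.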

The role of intermediaries is to streamline information exchange and facilitate trading. However, because of their ample power and control, they are tempted to act corruptly for their own profit rather than perform their appointed duty.

Web2 has flaws. It has reinforced a dark side in which our identities, freedom, and equity are controlled by a small group that is supposed to serve us loyally. By enabling self-sovereign identity, web3 provides a promising avenue to democratize data ownership, where individuals are empowered to control and monetize their data. It is more a paradigm shift in the governance mindset than a technological innovation.

Under web3, creators rather than intermediaries get their data back. They can set a price for sharing their data or keep it private. These activities should be driven by individual decisions and market forces in peer-to-peer interactions governed by consensus, not by manipulation from any authority.

Calling for Infrastructure, not Funding

Admittedly, web3 is still in its infancy, and people will judge it to be merely a pipe dream because the underlying infrastructure has not caught up yet. I expect its trajectory to resemble that of other emerging technologies: initial hype, a setback, a winter, a recovery, and a tipping point.

As a precedent, AI waited over six decades, and is considered to have gone through two prolonged winters, before moving from concepts and papers to commercial and personal use and being widely adopted in everyday life. We can be more optimistic about the development timeline of web3, as we live in a much more technologically advanced age than the 1950s.

Vested interests have been comfortable because web3 is fragmented and provides an awful user experience that repels users. Moving to web3 feels obscure, unsafe, and creepy; thus, most users cannot help but hand Big Tech their personal data in exchange for services, sustaining the web2 network effect.

But those days are coming to an end. The momentum towards greater data ownership and self-sovereignty is undeniable: web3 communities are gathering, and the technologies are works in progress. Even if Big Tech tries to hold on to its dominance by pouring hefty capital into AI development, that does not guarantee victory in this must-win battle.

Michael Kwok's Newsletter
