Data as an Asset: DataFi is Pioneering a New Frontier

BlockBeats
22 Jul
Original Article Title: "Data as an Asset: DataFi is Opening a New Blue Ocean"
Original Article Author: anci_hu49074, Biteye Core Contributor

“We are in an era where the world is competitively building the best base model. While computing power and model architecture are important, the true moat is the training data.”

– Sandeep Chinchali, Chief AI Officer at Story

Starting from Scale AI: Exploring the Potential of the AI Data Track

When it comes to the biggest news in AI circles this month, nothing beats Meta flexing its financial muscle, with Mark Zuckerberg recruiting everywhere to assemble a top-tier Meta AI team heavily staffed with Chinese research talent. Leading that team is 28-year-old Alexandr Wang, who built Scale AI from scratch. Scale AI, currently valued at $29 billion, serves not only the U.S. military; fiercely competitive AI giants such as OpenAI, Anthropic, and Meta all rely on its data services. Its core business is supplying large volumes of accurately labeled data.

Why Does Scale AI Stand Out from All the Unicorns?

The reason lies in its early recognition of the importance of data in the AI industry.

Computing power, models, and data are the three pillars of AI models. If we compare a large model to a person, the model is the body, computing power is the food, and data is the knowledge/information.

In the years of explosive development from the first LLMs to today, the industry's focus has shifted from models to computing power. Most models have now settled on the transformer as their framework, with occasional innovations like MoE or MoRe. The major players either build super clusters of their own or sign long-term agreements with powerful cloud providers such as AWS. With basic computing needs settled, the importance of data has gradually come to the fore.

Unlike traditional top-tier B2B big-data companies with a strong secondary-market presence, such as Palantir, Scale AI, as its name suggests, is dedicated to building a solid data foundation for AI models. Its business goes beyond mining existing data, extending to the longer-term business of data generation: it is assembling AI-trainer teams of experts from different fields to supply higher-quality training data for model training.

If you are skeptical of this business, let's first look at how a model is trained.

Model training is divided into two parts: pre-training and fine-tuning.

Pre-training is somewhat like a human infant gradually learning to speak. We feed the model large amounts of text, code, and other information gathered by web crawlers; by learning from this content on its own, the model picks up human language (natural language, in academic terms) and acquires basic communication skills.

Fine-tuning is more like going to school, where there are usually clear rights and wrongs, answers, and directions. Just as schools shape students into different kinds of talent based on their positioning, we use pre-processed, targeted datasets to train the model toward the abilities we want.
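To make the two phases concrete, here is a minimal sketch of the fine-tuning step using a Hugging Face-style stack (a common choice, not one named in this article). The base model and the tiny two-example "textbook" dataset are placeholders invented for illustration.

```python
# Minimal supervised fine-tuning sketch. Assumptions: transformers and
# datasets are installed; "gpt2" stands in for a pre-trained base model.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

base = "gpt2"  # placeholder base model (pre-training already done)
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

# The curated, "textbook-like" part of the data: prompt/answer pairs
# designed to cultivate a specific ability.
rows = [
    {"text": "Q: What is a smart contract?\nA: A program on a blockchain that executes automatically."},
    {"text": "Q: What is data labeling?\nA: Annotating raw data so a model can learn from it."},
]

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, padding="max_length",
                    max_length=64)
    # Causal LM objective: labels mirror the inputs (a real run would
    # also mask pad tokens out of the loss).
    out["labels"] = [list(ids) for ids in out["input_ids"]]
    return out

ds = Dataset.from_list(rows).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ds,
)
trainer.train()  # the "going to school" phase
```

Mechanically, pre-training looks the same; the difference is vastly more raw, uncurated text and no task-specific design.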

By now, you may have already realized that the data we need is also divided into two parts.

· One part of the data needs little processing; it just needs to be abundant. It usually comes from large UGC platforms such as Reddit, Twitter, and GitHub, from public literature databases, and from enterprises' private databases.

· The other part, like a professional textbook, requires careful design and selection to cultivate specific strengths in the model. This calls for work such as data cleaning, filtering, labeling, and human feedback (a sketch of such a pipeline follows this list).
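To make the second bullet concrete, the toy sketch below runs a few raw scraped strings through cleaning and filtering and then attaches an annotator's label. The heuristics and the label name are assumptions made up for this example, not any vendor's actual pipeline.

```python
# Toy cleaning/filtering/labeling pipeline; rules are illustrative only.
import re

raw_corpus = [
    "Buy cheap followers now!!! http://spam.example",
    "A smart contract is a program stored on a blockchain.",
    "asdfgh qwerty",  # low-quality noise
    "Zk-SNARKs let you prove a statement without revealing the witness.",
]

def clean(text: str) -> str:
    text = re.sub(r"http\S+", "", text)        # strip URLs
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace

def keep(text: str) -> bool:
    if len(text.split()) < 4:   # too short to be informative
        return False
    if text.count("!") > 2:     # crude spam heuristic
        return False
    return True

# Human (or peer-validated) labeling would happen at this step; here we
# attach a placeholder topic label to each surviving example.
dataset = [
    {"text": t, "label": "crypto/ai"}  # label assigned by an annotator
    for t in (clean(x) for x in raw_corpus)
    if keep(t)
]
print(dataset)  # two curated, labeled examples survive
```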

These two kinds of data form the backbone of the AI Data track. Do not underestimate these seemingly low-tech datasets: the prevailing view is that as the compute advantage promised by scaling laws fades, data will become the most important pillar by which large-model vendors sustain a competitive edge.

As model capability improves further, increasingly refined and specialized training data will become the decisive variable of model capability. If we compare model training to the cultivation of a martial-arts master, then high-quality datasets are the supreme martial-arts manuals (to complete the metaphor: computing power is the elixir, and the model itself is innate aptitude).

Viewed over the long term, AI Data is also a track that snowballs: as early work accumulates, data assets compound, becoming more valuable with age.

Web3 DataFi: The Chosen Soil for AI Data

Compared with Scale AI's remote labeling workforce of hundreds of thousands in the Philippines, Venezuela, and elsewhere, Web3 has natural advantages in the AI data field, and a new term has emerged for it: DataFi.

Ideally, the advantages of Web3 DataFi are as follows:

1. Data sovereignty, security, and privacy guaranteed by smart contracts

With existing public data close to being mined out, tapping non-public and even private data is an important direction for expanding data sources. This poses a major trust decision: accept a centralized company's paper-contract buyout and hand over your data, or take the blockchain route, keeping the IP of your data while knowing clearly through smart contracts who is using it, when, and for what purpose.

At the same time, for sensitive information, techniques such as zk-SNARKs and TEEs can ensure that private data is processed only by trusted machines and never leaked.
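To give a rough intuition for "knowing who uses your data, when, and why, without exposing the data itself," here is a toy Python sketch of a hash commitment plus a usage-grant log. It is a conceptual illustration only, not how any project in this article implements it; the field names and the in-memory ledger are invented for the example.

```python
# Toy illustration: commit to data with a hash so its use can be
# audited without revealing it. Real systems use smart contracts plus
# zk-SNARKs/TEEs; everything here is invented for illustration.
import hashlib, json, time

def commit(data: bytes) -> str:
    """Publish only the hash; the raw data never leaves the owner."""
    return hashlib.sha256(data).hexdigest()

ledger = []  # stand-in for an on-chain event log

def grant_usage(data_hash: str, user: str, purpose: str) -> None:
    # On a real chain, a smart contract would write this record and
    # make it publicly verifiable: who, when, and for what purpose.
    ledger.append({
        "data": data_hash,
        "user": user,
        "purpose": purpose,
        "time": int(time.time()),
    })

my_data = b"browsing history, social posts, ..."
h = commit(my_data)
grant_usage(h, user="model-trainer-01", purpose="LLM fine-tuning")
print(json.dumps(ledger, indent=2))
```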

2. A natural geographic-arbitrage advantage: a free, distributed architecture attracts the most suitable labor

Perhaps it is time to challenge traditional labor relations. Instead of hunting for cheap labor around the world as Scale AI does, why not leverage blockchain's distributed nature and let labor anywhere in the world contribute data under open, transparent incentives secured by smart contracts?

For labor-intensive work such as data labeling and model evaluation, the Web3 DataFi approach also yields more participant diversity than a centralized data factory, which matters in the long run for avoiding data bias.

3. Clear incentives and settlement advantages of blockchain

How do we avoid tragedies like the "Jiangnan Leather Factory" (the Chinese meme about a boss who absconds with unpaid wages)? The natural answer is to replace the dark side of human nature with smart contracts that encode a clear incentive system.

And in a context where deglobalization seems inevitable, how can low-cost geographic arbitrage continue? Setting up companies around the world is clearly harder than it used to be, so why not bypass the old world's barriers and embrace on-chain settlement?

4. Building a more efficient, open "one-stop" data market

The "middleman markup" is a perennial pain for both supply and demand sides. Rather than having a centralized data company act as a middleman, it's better to create a platform on the chain, similar to Alibaba's open market, enabling more transparent and efficient matching of data supply and demand.

As the on-chain AI ecosystem develops, on-chain demand for data will grow more vigorous, segmented, and diverse. Only a decentralized market can absorb this demand efficiently and convert it into ecosystem prosperity.

For ordinary retail users, DataFi is also the most accessible kind of decentralized AI project to participate in.

Although AI tools have lowered the learning threshold somewhat, and the original intention of decentralized AI is precisely to break the giants' monopoly on the AI business, it must be admitted that many current projects are not friendly to retail participants without a technical background: joining a decentralized computing network often requires expensive upfront hardware, and the technical barriers of model marketplaces easily scare off ordinary participants.

By contrast, DataFi is one of the few opportunities in the AI revolution that ordinary users can actually seize. In Web3 you don't need to sign a contract with a data sweatshop; a simple wallet login is enough to take part in all kinds of light tasks: contributing data, labeling or evaluating model outputs by intuition, doing simple creation with AI tools, participating in data trading, and so on. For seasoned on-chain users, the difficulty is close to zero.

Web3 DataFi Potential Projects

Where the money flows, the direction follows. In the Web2 world, Scale AI received a $14.3 billion investment from Meta and Palantir's stock rose more than fivefold in a year; in Web3 fundraising, DataFi's performance has been just as striking. Below is a brief introduction to these projects.

Sahara AI, @SaharaLabsAI, raised $49M

Sahara AI's ultimate goal is to build decentralized AI super-infrastructure and a trading marketplace, with the AI Data sector as its first pilot. The Beta of its Data Services Platform (DSP) launches on July 22; users can earn token rewards by contributing data, taking on data-labeling tasks, and more.

Link: app.saharaai.com

Yupp, @yupp_ai, raised $33M

Yupp is an AI model feedback platform that collects user feedback on model outputs. The main task at present: users compare different models' outputs to the same prompt and choose the one they think is better. Completing tasks earns Yupp points, which can in turn be exchanged for USDC and other fiat-backed stablecoins.

Link: https://yupp.ai/

Vana, @vana, raised $23M

Vana focuses on turning users' personal data (social media activity, browsing history, and so on) into monetizable digital assets. Users can authorize uploading their personal data to the Data Liquidity Pool (DLP) of the corresponding DataDAO, where it is aggregated for tasks such as AI model training, and receive token rewards in return.

Link: https://www.vana.org/collectives

Chainbase, @ChainbaseHQ, raised $16.5M

Chainbase's business revolves around on-chain data, currently covering more than 200 blockchains and turning on-chain activity into structured, verifiable, monetizable data assets for dApp development. It acquires this data mainly through multi-chain indexing and processes it with its Manuscript system and its Theia AI model; for now, there are few ways for ordinary users to participate.

Sapien, @JoinSapien, raised $15.5M

Sapien aims to convert human knowledge into high-quality AI training data at scale. Anyone can do data-annotation work on the platform, with quality enforced through peer validation; users are also encouraged to build long-term reputations, or to stake as a commitment, in order to earn greater rewards (a generic sketch of this validation pattern follows this entry).

Link: https://earn.sapien.io/#hiw
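Peer validation of this kind is often implemented as stake-weighted voting. The sketch below shows only the generic pattern; it is not Sapien's actual mechanism, and every name and number in it is invented.

```python
# Generic stake-weighted peer validation of one annotation.
# Hypothetical validators: name -> staked amount.
validators = {"alice": 100.0, "bob": 40.0, "carol": 60.0}

# Each validator reviews the annotation and votes valid/invalid.
votes = {"alice": True, "bob": False, "carol": True}

def stake_weighted_verdict(votes: dict, stakes: dict,
                           threshold: float = 0.66) -> bool:
    """Accept the annotation if >= threshold of stake votes 'valid'."""
    total = sum(stakes[v] for v in votes)
    in_favor = sum(stakes[v] for v, ok in votes.items() if ok)
    return in_favor / total >= threshold

accepted = stake_weighted_verdict(votes, validators)
print("annotation accepted:", accepted)  # 160/200 = 0.8 -> True

# Validators on the losing side may forfeit a slice of stake
# ("slashing"), which is what makes long-term honesty pay.
if accepted:
    for v, ok in votes.items():
        if not ok:
            validators[v] *= 0.95  # toy 5% slash
print(validators)
```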

Prisma X, @PrismaXai, raised $11M

Prisma X envisions an open coordination layer for robots in which physical-world data collection is the key. The project is still early; judging from its recently released whitepaper, participation may include investing in robot data collection and teleoperating robots to gather data, among other methods. There is currently a whitepaper-based quiz through which participants can earn points.

Link: https://app.prismax.ai/whitepaper

Masa, @getmasafi, raised $8.9M

Masa is one of the flagship subnet projects in the Bittensor ecosystem, currently operating Subnet 42 (Data) and Subnet 59 (Agent). The Data Subnet aims to provide real-time access to data, obtained mainly by miners using TEE hardware to crawl real-time data from X/Twitter. For the average user, participation is both difficult and costly.

Irys, @irys_xyz, raised $8.7M

Irys focuses on programmable data storage and computation, aiming to provide efficient, low-cost solutions for AI, dApps, and other data-intensive applications. Opportunities for ordinary users to contribute data are currently limited, but there are several activities to join during the present testnet phase.

Link: https://bitomokx.irys.xyz/

ORO, @getoro_xyz, raised $6M

ORO's goal is to let ordinary people participate in AI data contribution. Supported methods include: 1. linking personal accounts to contribute personal data, covering social media, health, e-commerce, and financial accounts; 2. completing data tasks. The testnet is live and open for participation.

Link: app.getoro.xyz

Gata, @Gata_xyz, raised $4M

Positioned as a decentralized data layer, Gata has launched three products one can participate in: 1. Data Agent: a set of AI agents that run automatically to process data whenever the user opens the web page; 2. All-in-One Chat: a model-evaluation reward mechanism similar to Yupp's; 3. GPT-to-Earn: a browser plugin that collects users' conversation data on ChatGPT.

Links: https://app.gata.xyz/dataAgent

https://chromewebstore.google.com/detail/hhibbomloleicghkgmldapmghagagfao?utm_source=item-share-cb

How to Evaluate These Projects Currently?

At present, these projects generally have low barriers to entry. But once a platform accumulates users and ecosystem stickiness, its advantages compound quickly. So in the early stage, projects should concentrate on incentives and user experience; only by attracting enough users can they win in the lucrative data business.

As labor-intensive projects, these data platforms must also think about how to manage their workforce while guaranteeing the quality of data output. After all, a chronic problem of many Web3 projects is that most platform users are mercenary airdrop farmers: chasing short-term gains, they readily sacrifice quality. If they become a platform's main user base, bad money drives out good, quality contributors leave, and data quality deteriorates until no buyer will pay for it. Projects such as Sahara and Sapien already emphasize data quality and are working to build long-term, healthy relationships with the labor on their platforms.

Furthermore, lack of transparency is another chronic problem of today's on-chain projects. Admittedly, the blockchain trilemma has forced many projects to take a path of "centralization first, decentralization later" in their early stages. But more and more on-chain projects now come across as "Web2 projects wearing Web3 skin": publicly traceable on-chain data is scarce, and even their roadmaps barely commit to openness and transparency. Such opacity is undoubtedly harmful to Web3 DataFi's long-term health, and we hope more projects stay true to their original intention and speed up their move toward openness and transparency.

Lastly, DataFi's path to mass adoption has two parts: attracting enough toC participants, who form the backbone of the data collection/generation effort and the consumer side of the AI economy, closing the ecosystem loop; and winning recognition from today's mainstream toB players, whose budgets are, in the short term, the main source of big data deals. Here we have already seen good progress from Sahara AI, Vana, and others.

Conclusion

What is certain is this: in the long run, DataFi uses human intelligence to nurture machine intelligence, while using smart contracts as the covenant that guarantees human labor is rewarded, so that humans ultimately share in the returns of machine intelligence.

If you are anxious about the uncertainties of the AI era, and if you still hold onto your blockchain ideals through crypto's ups and downs, then following the capital giants into DataFi is a timely and sensible move.


This article is a contributed submission and does not represent the views of BlockBeats.

Welcome to join the official BlockBeats community:

Telegram Subscription Group: https://t.me/theblockbeats

Telegram Discussion Group: https://t.me/BlockBeats_App

Official Twitter Account: https://twitter.com/BlockBeatsAsia

