IOSG OFR 13th Panel Recap: "Data Alchemy: Transforming AI"

IOSG
Sep 29, 2024


Question 1: Why do you think AI and Crypto work well together, and how will they benefit each other?

NEAR: I feel it has already been shown that centralized AI is a problem. If we look at the current actors like OpenAI, Google, and Meta, we don't really know what they're doing behind closed doors; you're using a black box. It becomes obvious that you need a trustless way of doing transactions between agents, or even of interacting with these agents. You need proofs of what's happening in the background, and you need to be able to trace everything. So it becomes clear that you need some layer for this, which could be blockchain, and, I'd guess, will be blockchain in the future.

And that's why it makes a lot of sense for us to use blockchain to leverage AI for the long term, because if we stick with centralized AI and have no way to prove what it's doing, things are going to end up pretty badly.

Chris: I think that also ties into what Mina has been doing: proof of privacy.

Mina: Exactly. We’re focusing on how to mitigate the risks of AI while also harnessing its benefits. On one hand, there’s a need for tools to prove key facts, such as verifying human identity online to combat bots, or confirming the authenticity of media to ensure it hasn’t been generated by AI. This helps address the growing concerns about misinformation and synthetic content.

On the other hand, there’s the potential to verify the outputs of AI models and place them on-chain as autonomous agents that can interact with humans, decentralized applications (dApps), and even other AIs. With advancements like zero-knowledge proofs and new technologies on the horizon, we’re getting closer to creating AI agents that can operate transparently and securely within decentralized ecosystems.

Vana: You're right, decentralized AI often comes with ideological arguments, but product adoption typically depends on practical advantages. The key question is: what new possibilities does decentralization bring to AI that weren't possible before? A great example is Vana's DataDAO. When the Reddit DataDAO reached 100,000 users, Reddit tried to shut the project down, claiming it violated API terms. However, because Vana's system is decentralized and uses non-custodial data (where users control their own data), the project stayed online.

This is a tangible benefit of decentralization — it allows projects to operate independently, without relying on centralized intermediaries that can impose restrictions. Instead of purely ideological arguments, this demonstrates how decentralization enables more resilient and user-controlled systems, opening up new opportunities that weren’t possible in traditional, centralized models.

OpenLayer: In our case, we are focusing on the data layer, particularly using decentralized methods to collect and serve data. When discussing AI, it’s important to consider the three core components: models, computation, and data. Among these, we believe that data will see the strongest demand and traction, especially as Web2 companies exhaust their public datasets. For example, many AI leaders, like the CEO of Anthropic, have noted that we are nearing the limit of publicly available data for training.

This increasing demand is driving interest in two new types of datasets: synthetic data and private user data. While synthetic data offers the benefit of being limitless and machine-generated, it requires costly human validation and its effectiveness has been questioned by recent research. On the other hand, private user data — data behind login pages or within walled Web2 gardens — holds tremendous potential. Unlocking this data could open up new use cases and dramatically enhance AI capabilities.

This is where our focus lies: leveraging decentralized systems to enable access to valuable datasets, positioning us as contributors to AI innovation from a data infrastructure perspective.

Question 2: I'm a big fan of your DataDAO initiative. It involves multiple parties: data owners, contributors, and creators. It seems very hard to incentivize all of them. How do you decentralize everything so that users own their data, while other parties can still utilize it and everyone is properly incentivized?

Vana: Each DataDAO operates with its own dataset and a unique token, which governs its tokenomics. For example, the Reddit DataDAO uses RDAC as its token, while the LinkedIn DataDAO has its own as well. While each DataDAO has full control over its tokenomics, we provide guidance to help shape their approach.

One dynamic we observed with the Reddit DataDAO is that initially, crypto-native users connected their Reddit data, engaging with the token. As speculation around the dataset’s token grew, non-crypto users started contributing their data, realizing they could earn significant amounts, like $300-$400, for their contributions, particularly those with high karma.

This model led to a high-quality dataset, which has now been used to train the first user-owned AI model. Although this model specializes in “shitposting,” it highlights the full data flow: users contributing data and receiving proportional rewards. The challenge lies in ensuring that contributions are properly rewarded based on their actual value to the AI model. For instance, a Twitter account with AI-like generated tweets wouldn’t provide as much value as more substantial contributions. Vana’s proof of contribution model helps determine the worth of each input, ensuring fair rewards across different data sources.
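
As a rough sketch of how proportional, contribution-weighted rewards could work, here is a minimal Python illustration. The scoring values and reward pool are hypothetical, and this is not Vana's actual proof-of-contribution implementation:

```python
# Hypothetical sketch of proportional, contribution-weighted rewards.
# The scoring function and reward pool are illustrative, not Vana's
# actual proof-of-contribution mechanism.

def distribute_rewards(contributions: dict[str, float], reward_pool: float) -> dict[str, float]:
    """Split a fixed token pool in proportion to each contributor's score."""
    total_score = sum(contributions.values())
    if total_score == 0:
        return {user: 0.0 for user in contributions}
    return {user: reward_pool * score / total_score
            for user, score in contributions.items()}

# Example: scores could come from a quality model that discounts
# low-value data (e.g., AI-generated tweets) and boosts high-karma posts.
scores = {"alice": 92.0, "bob": 15.0, "carol": 43.0}
print(distribute_rewards(scores, reward_pool=10_000.0))
```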

Question 3: The next one is for Mina. I checked your website yesterday, and it mentioned proof of privacy. That's quite interesting to me, since AI is eating everything right now and we constantly use AI to help us find answers. It touches on a big question in AI, transparency: how can we know the answer, and how can we see the logic behind it? There's a term called "puzzle reasoning," which I think also touches on being able to verify everything.

Mina: Zero-knowledge proofs (ZKPs) offer valuable capabilities in verifying the origins and integrity of generative AI outputs. While they can’t fully explain why an AI produced specific tokens, they can confirm where the output came from. For example, you can verify that a set of tokens was generated by a specific model in a given context, ensuring that the expected model was used behind the scenes when hitting an API. This level of proof is useful for confirming authenticity but falls short of offering a detailed explanation of the reasoning behind the AI’s output.

Another promising application of ZKPs is enabling AI models to process private data without exposing it. For instance, an AI could take private inputs from a company, produce an output, and provide proof that the data was used, all while keeping the sensitive information hidden. Although this doesn’t fully address the need for interpretability (i.e., explaining why a model generated a specific result), it does allow for a more secure interaction between private data and AI, ensuring data privacy while still benefiting from AI capabilities.
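
To make the idea concrete, here is a minimal Python sketch of the commitment at the heart of verifiable inference: binding an output to a specific model and input. A plain hash is used as a simplified stand-in; a real system would replace the check with a zero-knowledge proof so the verifier learns nothing about private inputs. The model identifier and messages are hypothetical:

```python
import hashlib

# Simplified stand-in for verifiable inference: bind an output to a
# specific model and input. A real deployment would replace the hash
# check with a zero-knowledge proof so the verifier learns nothing
# about the private input.

def commit(model_weights_digest: str, input_text: str, output_text: str) -> str:
    """Commitment that a given model/input pair produced this output."""
    payload = f"{model_weights_digest}|{input_text}|{output_text}".encode()
    return hashlib.sha256(payload).hexdigest()

# The API provider publishes the commitment alongside the response...
MODEL_DIGEST = "sha256-of-expected-model-weights"  # hypothetical identifier
c = commit(MODEL_DIGEST, "what is 2+2?", "4")

# ...and anyone holding the same values can check which model was used.
assert c == commit(MODEL_DIGEST, "what is 2+2?", "4")
```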

Question 4: The next one is for OpenLayer. Your project features two key concepts: one is an open network, and the other is a trustless process. How do you approach these, and what moats will you build in your AI work around these two concepts?

OpenLayer: Trustlessness is a crucial principle, especially when handling user data, and maintaining privacy while providing data to consumers is challenging but essential. I can share a few approaches we use to address this.

First, we use secure connections to extract data from Web2 sites like LinkedIn. When a user accesses a page, we initiate a few-second process that involves setting up a client-side notary. This notary, along with the server, notarizes the content being transmitted over the TLS session. This allows us to export specific user data, such as first- and second-degree connections, without exposing the entirety of the user's data to consumers.

In addition, we implement zero-knowledge proofs to further protect privacy. For example, while users can share proofs of their income, number of connections, or work experience, they do so without revealing sensitive details. This combination of notary-based verification and zero-knowledge proofs helps ensure user privacy while building a robust data layer for consumers.
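
The TLS notarization itself is too involved for a short example, but the selective-disclosure step can be sketched with salted hash commitments. The field names and flow below are hypothetical; a production system would pair this with notarized TLS transcripts and zero-knowledge proofs:

```python
import hashlib
import secrets

# Hypothetical sketch of selective disclosure: commit to every field of
# a notarized profile, then reveal only the fields the user chooses.

def commit_fields(fields: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Return (public commitments, private salts) for each profile field."""
    salts = {k: secrets.token_hex(16) for k in fields}
    commitments = {
        k: hashlib.sha256((salts[k] + v).encode()).hexdigest()
        for k, v in fields.items()
    }
    return commitments, salts

def verify_field(commitment: str, salt: str, value: str) -> bool:
    """Check one revealed field against its published commitment."""
    return hashlib.sha256((salt + value).encode()).hexdigest() == commitment

profile = {"connections": "512", "employer": "Acme Corp", "income": "90000"}
commitments, salts = commit_fields(profile)

# The user reveals only their connection count; income stays hidden.
assert verify_field(commitments["connections"], salts["connections"], "512")
```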

Question 5: The next question is for NEAR. We all know that AI is one of NEAR's key strategic focuses right now, and I heard Illia say that you're working on a full technical stack, from models and data through to applications, to onboard developer projects. I'd like to hear about the program: how can AI projects join your ecosystem to build AI together?

NEAR: At Near AI, we initially focused on dataset crowdsourcing, which has been running successfully since 2021. The platform has gathered millions of data points contributed by the community, with each contribution peer-reviewed to ensure quality. Contributors are paid fairly, and if a contribution is incorrect, it gets flagged, maintaining the integrity of the data. Much of this data is now open source, with the remainder being prepared for release.

Our next focus is on building a decentralized agentic framework, where users can create agents. To evaluate these agents, we are introducing benchmark crowdsourcing. A benchmark, in simple terms, is a set of tasks given to a model to assess its performance. While some benchmarks exist today, like MMLU, they are either too general or too few in number. We aim to simplify the process of creating specialized benchmarks, which will multiply the available benchmarks and enrich model evaluation.
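
In concrete terms, a benchmark can be as simple as a task list plus a scoring loop. The sketch below is a generic illustration of that structure, not NEAR's actual benchmark format:

```python
# Generic illustration of what a benchmark is: a list of tasks with
# expected answers, scored against a model's outputs. The task schema
# and toy "model" are hypothetical.

from typing import Callable

BENCHMARK = [
    {"prompt": "Capital of France?", "expected": "Paris"},
    {"prompt": "2 + 2 = ?", "expected": "4"},
]

def evaluate(model: Callable[[str], str], tasks: list[dict]) -> float:
    """Return the fraction of tasks the model answers exactly right."""
    correct = sum(model(t["prompt"]).strip() == t["expected"] for t in tasks)
    return correct / len(tasks)

# A trivial stand-in model for demonstration purposes.
toy_model = lambda prompt: "Paris" if "France" in prompt else "4"
print(evaluate(toy_model, BENCHMARK))  # 1.0
```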

Our goal is to provide a complete infrastructure from data collection, through decentralized model training, to benchmarking. This will allow companies and researchers alike to conduct cutting-edge public AI research openly and transparently, ensuring that innovation isn’t locked behind closed doors as we’ve seen with proprietary models like OpenAI’s latest release. If we don’t make open-source AI research accessible, it risks being dominated by private entities, so our mission is to keep it open and available for all.

Question 6: With AI, we've talked about open source and we've talked about ownership, but we haven't said much about public goods yet. I think it's worth all of us discussing this topic. How do you want to provide universal, accessible capabilities for the community and the open-source ecosystem, and turn them into public goods used by as many communities and users as possible? And is it even possible to create such public goods given the unique features of crypto? This is open to any of you.

Vana: It's very expensive to train these models; it can cost $10 million to $100 million today, and in a few years I think it's going to cost a billion to $10 billion, from both a compute and a data perspective. So our view at Vana is that you actually need to keep the model weights and the data private in order to be able to monetize them, and then you have something that is collectively owned. But I'm not sure that necessarily falls under the definition of a public good; I'd be curious to hear what other people think.

Mina: I think it's tricky. What is nice is that a lot of the tooling Mina has developed ends up being a public good and open source, like a lot of the ZK tooling happening around ML right now. But it is tricky, as you're saying: if you want to be able to monetize something, does it have to be private? In that case you could use ZK to prove, for example, that an output is coming from the model even though the model is private, and have it be collectively owned in some way. I like that idea, but I don't know if I have more to say on the actual public-good nature of the weights. I think it's important, though.

OpenLayer: You’re right to focus on making your product widely accessible and highly utilized by end users, beyond just integrating token economies. One idea that I’m personally excited about is using blockchain technology to combat deepfakes in today’s AI landscape. As AI continues to advance, the line between authentic and artificial content becomes increasingly blurred. This creates a growing demand for verifying the authenticity and originality of digital content. Deepfakes, for instance, have already shown their potential to cause harm, such as bypassing KYC/AML checks or generating fake adult content using someone’s image.

Several approaches have emerged to address this, including using cryptographic algorithms to hash video files and store these hashes on-chain. This allows anyone to later verify the authenticity of a video by comparing the current hash with the one stored on the blockchain. However, I think a more seamless solution lies in leveraging secure enclaves — hardware-based security features — during content creation. For example, the originality of a photo or video could be hashed and recorded directly when captured, without requiring any additional steps from the user. This would simplify the verification process, making it user-friendly and more widely adoptable.
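
The hash-on-chain approach is straightforward to sketch. In the minimal Python example below, the on-chain registry is mocked with a dictionary; a real system would write the digest to a smart contract at capture time, ideally from a secure enclave so the user does nothing extra:

```python
import hashlib

# Sketch of the hash-based authenticity check described above. The
# "on-chain" registry is mocked with a dict standing in for chain state.

def sha256_file(path: str) -> str:
    """Stream a file through SHA-256 so large videos fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

onchain_registry: dict[str, str] = {}  # content_id -> digest (mocked)

def register(content_id: str, path: str) -> None:
    """Record the content's digest at capture time."""
    onchain_registry[content_id] = sha256_file(path)

def is_authentic(content_id: str, path: str) -> bool:
    """A file is authentic iff its current hash matches the recorded one."""
    return onchain_registry.get(content_id) == sha256_file(path)

# Example with a throwaway file:
import os, tempfile
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"original footage")
register("video-42", f.name)
print(is_authentic("video-42", f.name))  # True
os.unlink(f.name)
```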

I believe this approach can become highly beneficial for both individual users and companies, as the demand for verifying authenticity grows in various sectors. Implementing blockchain as a safeguard for digital content integrity not only serves as a public good but also offers a practical way to attract more users by addressing a real and pressing issue in the digital world.

NEAR: You're right to emphasize the value of decentralizing dataset creation, making it more transparent and equitable for contributors. By leveraging blockchain for dataset crowdsourcing, we can remove intermediaries and ensure fair compensation, unlike current models where large vendors retain significant margins and pay contributors less than they deserve. This decentralized approach allows anyone in the community to contribute, get paid fairly, and create a more transparent and accountable system.

In this model, data contributions are peer-reviewed for quality, and blockchain enables accountability through mechanisms like slashing for low-quality or inaccurate data. This not only improves the quality of datasets but also encourages more participation by fairly compensating contributors based on the value they provide, rather than just the quantity of data they submit.
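
A minimal sketch of what stake-and-slash accountability for data contributions could look like follows; the stake sizes, slash fraction, and review signal are hypothetical parameters, not NEAR's actual mechanism:

```python
# Minimal sketch of stake-and-slash accountability for data
# contributions. All parameters here are hypothetical.

from dataclasses import dataclass

SLASH_FRACTION = 0.5  # hypothetical penalty for a failed peer review

@dataclass
class Contribution:
    contributor: str
    stake: float
    passed_review: bool

def settle(c: Contribution, reward: float) -> float:
    """Pay the reward on a passing review; slash stake on a failing one."""
    if c.passed_review:
        return c.stake + reward
    return c.stake * (1 - SLASH_FRACTION)

print(settle(Contribution("alice", stake=100.0, passed_review=True), reward=20.0))  # 120.0
print(settle(Contribution("bob", stake=100.0, passed_review=False), reward=20.0))   # 50.0
```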

Even if some companies want to keep their datasets private or for profit, they can still use blockchain for creation and storage while benefiting from the transparency and fairness of the process. This approach balances public good with the commercial interests of businesses, allowing for broader participation and competition in data markets. By opening up the process, you encourage more contributions from the community while offering companies the flexibility to monetize their datasets if needed.
