Knowledge Sharing Incentives post AI

January 11, 2025 at 03:22 PM

Note: This is not a blog, it's a semi-private digital garden with mostly first drafts that are often co-written with an LLM. Unless I shared this link with you directly, you might be missing important context or reading an outdated perspective.


Incentives for Knowledge Sharing in the Service of AI and Alignment

Background

My career has been deeply intertwined with the intersection of knowledge, search, and incentives. I’ve worked on search at Mozilla and Facebook, explored knowledge extraction and generation at Quora, and delved into the complexities of motivating people to contribute to a shared pool of internet knowledge. At my startup, we tackled search for clinical contexts—building a tool for doctors to find critical information—and later pivoted into a mobile web browser with robust search extensions. Most recently, I worked at ScaleAI, focusing on reference data and fine-tuning models for alignment.

This journey has given me a front-row seat to observe how knowledge-sharing ecosystems evolve and how people are incentivized (or disincentivized) to contribute to them.

The Current Landscape of Data and AI

Language models today operate by effectively crawling and compressing the vast expanse of the internet. Labs like OpenAI, Anthropic, and others are beginning to license data directly from publishers, shifting away from purely scraping methods. A critical question arises: will individuals and organizations continue to keep their data openly available, especially as they come to recognize its inherent value and the diminishing returns of ad-based monetization?

Additionally, we must address the role of preference data. This data—user feedback that helps align models with specific values—acts as a refinement layer on top of the internet’s massive knowledge repository. Understanding the incentives behind sharing both source and preference data becomes crucial as the field advances.
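Concretely, preference data of this kind is usually collected as pairwise comparisons: a prompt plus a preferred and a rejected response, which a lab can then use for reward modeling or direct preference optimization. A minimal sketch of such a record (field names and contents are illustrative, not from any specific dataset):

```python
# A toy pairwise preference record, the common shape used for alignment
# fine-tuning (e.g., RLHF reward modeling or DPO). Purely illustrative.
preference_example = {
    "prompt": "Explain photosynthesis to a ten-year-old.",
    "chosen": "Plants use sunlight to turn water and air into their food...",
    "rejected": "Photosynthesis comprises the C3 and C4 biochemical pathways...",
}

# Many such records, aggregated across raters, form the "refinement layer"
# described above: they nudge the base model toward the preferred style.
required_fields = {"prompt", "chosen", "rejected"}
assert set(preference_example) == required_fields
```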

Established Incentive Structures for Knowledge Sharing

  1. Academic and Peer-Reviewed Content

    • Academia has long been a bastion of open knowledge sharing, supported by funding structures that allow researchers to publish with minimal commercial bias. This remains one of the most reliable sources for high-quality data.
    • For example, Microsoft’s “Textbooks Are All You Need” demonstrates the power of high-quality, curated datasets, enabling smaller models to achieve impressive results with far less training data.
    • However, academia often neglects entire categories of valuable knowledge—from fiction and spirituality to practical areas like fitness or nuanced health advice.
  2. Commercial and Open Publishing Models

    • The internet’s early promise of democratized knowledge has given way to more guarded approaches as organizations recognize the value of their data.
    • Validated source data is becoming increasingly important, especially in critical domains like healthcare and law, where generative model output isn’t reliable enough without trustworthy inputs.
    • Revenue-sharing models (e.g., Luma.FYI) and direct licensing agreements offer pathways to incentivize continued publishing.
  3. Social and Community-Driven Contributions

    • Platforms like Reddit and Quora have relied on social motivations for user-generated content (UGC). Contributors share knowledge in exchange for recognition, community engagement, and status (e.g., karma points or upvotes).
    • Experiments with cryptocurrency and other token-based systems have added layers of potential monetization but haven’t fundamentally shifted the model.
    • Wikipedia remains a standout example of pro-social motivations driving high-quality, open contributions, with no direct monetary incentives.

Emerging Opportunities and Challenges

  1. Human-AI Collaboration for Data Generation

    • Conversations between humans and AI can generate high-quality data with minimal friction, especially when experts are involved.
    • Companies like ScaleAI have explored cash-based incentives for structured data generation, while UGC platforms rely on intrinsic motivations.
  2. Specialized Knowledge and Sensitivity

    • Certain fields, such as clinical information or niche technical domains, demand validated and transparent data. The value of these datasets may drive new business models where contributors are compensated for their expertise.
  3. Evolving Monetization Models

    • The decreasing viability of ad-supported content raises the question of how knowledge-sharing platforms will sustain themselves. Licensing agreements with AI labs and direct payment systems for contributors could become more prominent.
    • The risk is that essential knowledge becomes gated or fragmented, reducing its accessibility for future innovation.
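To make the direct-payment idea concrete, here is a toy sketch (all numbers and names hypothetical) of how a revenue-share platform might split a single licensing payment among contributors in proportion to how often their content was used:

```python
def allocate_revenue(total_payment: float, usage_counts: dict) -> dict:
    """Split a licensing payment among contributors pro rata by usage.

    usage_counts maps contributor -> number of times their content was
    used (e.g., retrieved, cited, or included in a licensed corpus).
    This is an illustrative mechanism, not any platform's actual formula.
    """
    total_usage = sum(usage_counts.values())
    if total_usage == 0:
        return {contributor: 0.0 for contributor in usage_counts}
    return {
        contributor: total_payment * count / total_usage
        for contributor, count in usage_counts.items()
    }

# Hypothetical example: a $10,000 licensing deal split across three authors
# whose content was used 50, 30, and 20 times respectively.
payouts = allocate_revenue(10_000, {"alice": 50, "bob": 30, "carol": 20})
# → {"alice": 5000.0, "bob": 3000.0, "carol": 2000.0}
```

The open design question, of course, is what "usage" should mean when content is compressed into model weights rather than retrieved per query.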

The Path Forward

The incentives for knowledge sharing in the age of AI will need to evolve alongside the models they serve.

Incentives shape ecosystems. As we navigate this new era of knowledge sharing, we have the opportunity to create structures that benefit individuals, organizations, and society at large. The challenge lies in balancing these interests to ensure that the knowledge powering AI remains robust, diverse, and aligned with human values.

Raw

Right now, language models essentially crawl and compress the whole internet. Big labs are beginning to license data from publishers. What will be interesting to see play out is whether people keep this data openly available once they realize both its value and the fact that it can no longer be effectively monetized by advertising under this model. The other bucket of data we need to talk about at some point, of course, is preference data. This data takes the giant compressed blob of everything on the internet and nudges it toward data that aligns with the company's values, where the company here is the language-model lab. The interesting problem to think about is: what are the incentives for people to share data?
The first thing to consider is academics, textbooks, and publication. That is a cocooned environment where society has decided to carve off some money to let people think and publish with minimal bias. That data is probably going to be the best source, and we're seeing it in Microsoft's Phi models, for example, which are now able to go pretty far by learning from a little data. "Textbooks Are All You Need" was one of the first papers in this area that I remember being interesting. But there are a lot of topics of value that academics are not serious about. Fiction, of course, is a big category, but even things at the nebulous end, like spirituality, come to mind. Certain practical takes on topics like fitness are another category worth considering. Health and clinical information is a very sensitive topic as well.
The overall point is that a lot of information that used to be put on the internet will no longer have a good reason to be put there. One pathway is that some motivations remain to publish source data openly or to sell it directly. The more critical the field, the more users will care that answers rest on validated source data rather than raw generative-model output. Another thing that comes to mind is that people like Luma.FYI are trying to build essentially revenue-share-based models, similar to what Quora did.
A couple of interesting things to consider: humans, in conversation with an AI or with other humans, tend to generate really good data with very little friction, especially if they're experts. Companies like ScaleAI, on one end of the spectrum, try to motivate them with cash. On the other end, UGC engines like Reddit or Quora motivate people largely through social incentives. I believe Reddit has experimented with cryptocurrency in the past, and they have their karma points, but largely they keep the lion's share of the money, both through licensing to the labs and through advertising. There's also the very pro-social Wikipedia and that family of products as an example.
So, yeah, those are my initial thoughts on this. I keep adding stuff as I think of it and eventually process it into a blog post.