OECD-EU Online Workshop on Regulation of Artificial Intelligence
I listened to two panels: "Competition in the market for foundation models” and “The role of data in development models.” The participants were Thibault Schrepel, Associate Professor of Law, Vrijie Universiteit, Netherlands; Shin-Shin Hua, Assistant Director, Digital Market Unit, Competition and Markets Authority, UK; Lawrence Moroney, Head of AI Advocacy at Google, USA; Anton Korinek, Professor University of Virginia; Yann LeCun, VP and Chief AI Scientist, Meta, USA; Pilar del Castillo, Member of European Parliament and rapporteur for the EU Data Act, Belgium; Bertin Martens, Visiting fellow, Bruegel Belgium; Francesca Rossi, IBM Fellow and AI Ethics Global Leader, USA; Clara Neppel, Senior Director, IEEE.
Cristina Volpin, Competition Policy Expert, EU, and Jerry Sheehan, Director STI Department, OECD, moderated these panels.
Roughly here are the key themes, though the participants did not agree on the order of importance. I quote from the new report, AI Index 2024, throughout this blog.
1. Talent: Talent is being poached by Big Tech; according to datasets maintained by Epoch AI, of the 51 machine learning models introduced in 2023, academia contributed less than 30%. AI job postings declined by 1.6% in 2023 in the USA (by the way, a recent trip to India also anecdotally confirmed this). In 2022, 70.7% of AI PhDs joined the industry rather than academia. While AI is embraced by companies (and replacing people), many debates exist. AI still cannot exceed humans in complex cognitive tasks – raising questions about autonomous AI agents and when to delegate decision making. While LLMs are not accurate, Chat GPT has an error rate of 19.5% (or hallucinates); the fundamental issue is the quality of data, where it is sourced, and the need to understand explicit data does not equal implicit data (next AI challenge with significant ethical issues). Lawrence spoke about knowledge transfer – he said, “Big tech has the curse of knowledge – we know how to do it, but we don't know how to share the learning easily.”
2. Compute Power & Cost: This is expensive. It was estimated that OpenAI’s GPT-4 training cost was $78-100 million, and the compute for Google’s Gemini Ultra cost $191 million. At the same time, while the cost of computing was decreasing for GenAI, it was important to realize that frontier tech would always require large investments. Multimodal foundational models (text, images, voice) are reaching 50 billion petaFLOPs. Having access to data is not enough; you need resources to use the data, such as training, tuning, fine-tuning, etc., which are expensive. The recent advancement in foundational models (think deep fakes) also highlights the gap in our ability to manage them. Take, for example, the Samsung mobile phone that appears to take high-definition images of the moon. The Samsung camera lens is not as advanced but uses AI to enhance the pictures.
3. Data: Epoch AI estimates, at the current rate of data consumption for training, that by 2024, we will deplete our stock of high-quality language; within the next two decades, we will finish the low-quality language data, and by 2030-2040 use-up image data. There has been a trend to recommend synthetic data, but research in text and images finds that these models collapse over time. These findings highlight the importance of human interpretation and insights – AI is not yet a thinking machine. Further, panelists were divided on whether it was about access to big data or the quality and trustworthiness of data (clean data). There needed to be a shift from gathering data for training to curating, cleaning, or preparing data (which required more investment).
4. Fine Tuning, red teaming, cybersecurity: LeCun highlighted that all models will have an inherent bias. He stated, “We cannot remove bias; we can only control it by having competition, so you have different types of bias.” The fine-tuning of models helps reduce bias (Google’s search algorithm updates several times a day). Finetuning needs human interventions but is leading to the exploitation of human labor to fund the cost of AI. AI agents trained on human data may also raise questions about their impact on society (jobs, well-being, human brain development, ethics, etc.). While red teaming allows testing for the robustness of models before release, it is expensive and time-consuming. The more we end up AI-fying our society, the more money organizations need to spend on fine-tuning, red teaming, and cybersecurity. At this point, the ability to do this requires huge funds that only BigTech can access. In the AI Index report, out of 5 measures surveyed, companies, on average, only implemented 1.94.
5. Economies of Scale: According to Stanford’s new AI Index Report, 149 new large language models were released in 2023, and two-thirds were open-source. Yann LeCun mentioned how, in India, they are using Llama to translate the 22 official languages (via a project by Infosys Founder). However, closed-model LLMs outperformed open ones, with a median performance advantage of 24.2%. Francesca (from IBM) stated that open source is a spectrum. What was really open? – data (which now companies are reluctant to do due to lawsuits), operating systems (LINUX), model or libraries (HuggingFace, even TensorFlow just provides a free library collection of workflows with intuitive, high-level API to several of Google’s proprietary models), model weights (Llama 2), public API (ChatGPT), hosting (GitHub - free hosting and access to open-source projects) or the results of the tests on the data (for greater transparency). This understanding of the spectrum of opensource is an AI literacy problem. As a panelist, Bertin had a different interpretation of data and copyright. He felt there was a need to distinguish copyright training (input) and copyright on output. If governments wanted to accrue the economic benefit of copyright, we might need to look at copyright like patents. Daniel Rock’s study of TensorFlow launch finds an approximate market value increase of $11 million per 1 % increase in AI skills for AI-using firms.
6. Value Chain and Power: Despite the introduction of GitHub Copilot powered by Chat GPT (which puts increasing concentration of power with Microsoft – see figure below on pricing), there was an increase in projects from 4.0 million in 2022 to 12.2 million in 2023. This is interesting as GitHub is a volunteer open source community (no payment) with over 100 million developers. However, once Andreessen Horowitz invested US$100 million in 2012, it was just a question of time before the commercialization aspect came about. Microsoft acquired GitHub in 2018. This phenomenon highlights that AI models (especially platform models) are expensive and must eventually be commercialized (unless governments get involved and subsidize the cost). In a recent development, UAE’s G42, a government entity, received an investment of US$ 1.5 billion from Microsoft, with the blessings of both the UAE and USA governments via an Intergovernmental Assurance Agreement (IGAA).
Another critical point of consideration in the value chain, besides the human and society, is the environment. For example, it is not just carbon emissions but environmental degradation (sourcing and end-of-life) and even water consumption that needs to be considered and truthfully reported (not just offset numbers). Also, as Francisca noted, consumer-facing tech companies had more access to data than others and were often proprietary about the data collected. On another note, the data brokerage business is thriving, though it would be considered quasi-legal from a privacy point of view. Francesca highlighted that data access was necessary for foundation models, not just for training and fine-tuning but for specific solutions downstream where other models are built on top of foundational models.
7. Regulations & AI Risks: Reported AI risks are growing according to the AI Incident Database (32% increase from 2022). While there are plenty of regulations and policies – the reality is that this is a complex place. Defining and assessing something as simple as privacy is daunting, as highlighted by the AI Index 2024 report. Often, the cost of the regulatory burden may be only something big tech (with big bucks can do), which may impact smaller firms. An example was given with Microsoft, which began with three employees and now has more than 1000 working on technical documentation. Further jurisdiction is an issue when AI is clearly global. As one participant asked, what happens if the AI foundational model is trained in a country with lots of access to data and no regulatory oversight and then deployed in the EU? As AI enters into not just an ethical but moral space, this will have implications for society (see research by Stanford). The question is whether policymakers can keep up with the rapid frontier developments in the AI space, especially when AI can be created to self-learn, as was the case of a program called Voyager for Minecraft. The fact that AI is dual-use – can be used for good or bad- is also a challenge when we don't have basic societal digital skills or AI literacy skills. Some points raised for policymakers:
· Need for a commons infrastructure for data and computing and open-source foundational models. We also need a knowledge common (what works, does not work, etc.).
· AI regulations not based on hype (like the current GenAI Hype)
· AI regulations do not disadvantage smaller players who do not have access to money and legal resources Big Tech does.
· Need to incentivize research for competition and to create new models (to reduce bias).
· Policymakers to make available clean datasets
· Ensure practices do not restrict competitiveness
· Manage the global legal tangles for AI, copyright, crimes, and competition laws
8. Geopolitics: This is clearly an issue. In 2023, the USA outpaced the EU and China in terms of AI models (63: 21: 15). As for AI investments, the USA was ~ 8.7 times China, the next highest investor, though China outpaced the USA in terms of patents (61.1% of global AI patents). For example, the US government spends ~$3 billion per year on AI, according to Govini. This space is complex – think of Mistral (France), which heavily lobbied the EU for the AI Act and ended up in bed with Microsoft. Regarding AI skills penetration, India leads the USA, though talent concertation lies with smaller countries – leading with Israel, Singapore, and South Korea, for example. A new entry is UAE, which is attracting AI migratory talent.
Comments