The Hidden Carbon Footprint of AI Model Training: A New Perspective on Emissions Accounting

Deep News
05/08

Calculating the carbon footprint of large model training can seem increasingly outdated. With the rise of reasoning models and AI agents, market attention has shifted to inference-side costs, and "training costs" are treated as a story of the past, even though the environmental externalities of inference remain unclear. Interestingly, in the U.S., some researchers still study and disclose such data, but the White House shows little concern; in China, policy stresses the importance of the issue, yet few studies or disclosures on the carbon footprint of large models exist.

Against this trend, the Allen Institute for AI (AI2), a leader in the Western open-source model community, has published a data-rich paper arguing that the carbon footprint of model training is not obsolete. Rather, traditional carbon accounting methods, centered on "pre-training" and focused on "final training runs," have fallen behind evolving training paradigms.

Indeed, Anthropic co-founder and CEO Dario Amodei recently estimated that the industry's current compute spending is roughly evenly split between training and inference. This makes sense: with fierce competition at the model frontier, spending too much on inference could starve future R&D, while spending too much on training may not generate sufficient revenue. Consider Google's recent natural gas power purchase agreements, Microsoft's potential abandonment of its "hourly clean energy matching" goal, and Anthropic taking over xAI's Colossus 1 gas-powered cluster. If roughly half of these giants' compute, and hence roughly half of the associated carbon emissions, goes to model training, then disclosing even minor training details could allow outsiders to estimate their AI carbon footprints.

Moreover, the concept of "model training" itself is becoming blurred. Previously, large model vendors discussing their models' "carbon footprint" focused solely on the last successful full training run during pre-training, known as the "final training run." This excludes interrupted runs and extensive early experimental explorations. Some models are even scrapped before release, never known to the market.

In fact, the last such "limited disclosure" dates back two years to Meta's release of Llama 3. At that time, AI was still in the "instruction model" era, with training primarily focused on pre-training and fine-tuning. It wasn't until six months later that o1, known for "slow thinking," was released, establishing a new paradigm for reinforcement learning scaling laws and fundamentally altering training cost structures.

Current state-of-the-art model training involves not only conventional pre-training but also substantial compute for mid-training on curated data, long-context extension, large-scale synthetic data generation, supervised fine-tuning (SFT), preference optimization (e.g., DPO), and reinforcement learning (RL). The share of compute devoted to pre-training is shrinking. Each training phase has its own "final runs" and its own early exploratory development, which obscures the environmental externalities of most model training.

At the end of last year, the Allen Institute open-sourced the Olmo 3 family, including 7-billion and 32-billion parameter versions, each with instruction-following (Instruct) and reasoning (Think) variants. Trained on H100 clusters, they consumed 8.34 million GPU hours, with synthetic data generation additionally using AMD chips. Only 18% of GPU hours were spent on "final runs," a proportion declining over time.

This aligns with Epoch AI's March research, which estimated that "final training runs" at OpenAI, MiniMax, and Zhipu AI account for roughly 10% to 20% of total R&D expenditure, with OpenAI the lowest of the three at just 9.6%. This suggests that the true expense lies not in the last successful training run but in the lengthy, failure-prone exploration that precedes it.

More notably, the energy consumption of post-training for reasoning models far exceeds that of traditional instruction models. The paper shows post-training energy use for reasoning models is about 17 times that of instruction models, primarily due to "rollouts" in reinforcement learning. In a sense, this process resembles large-scale inference deployment, indicating that post-training is becoming "inference-like," while inference itself increasingly resembles part of training.

In the data center the Allen Institute used to train the Olmo 3 family, GPUs accounted for 57.5% of total IT infrastructure energy consumption, the power usage effectiveness (PUE) was 1.2, and the local grid carbon intensity (CI) was 0.332. Calculations show the "final runs" emitted 647 metric tons of CO₂ equivalent (tCO₂eq), while early exploratory development emitted 2,757 tons. In addition, synthetic data generation, which is separate from model "training," emitted 675 tons, and the amortized embodied emissions from manufacturing the cluster hardware amounted to 172 tons.
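How these data center parameters relate to an emissions figure can be sketched roughly as follows. This is a minimal illustration, not the paper's confirmed method: it assumes operational emissions scale as GPU energy divided by the GPU share of IT energy, times PUE, times grid carbon intensity, and that the CI of 0.332 is expressed in kg CO₂eq per kWh (the unit is not given in the text). The function names and the 1,000,000 kWh input are hypothetical.

```python
def operational_emissions_tco2eq(gpu_energy_kwh, gpu_share=0.575,
                                 pue=1.2, ci_kg_per_kwh=0.332):
    """Estimate operational emissions in metric tons of CO2 equivalent.

    Assumed chain (not confirmed as the paper's exact method):
    GPU energy -> total IT energy -> facility energy -> emissions.
    """
    it_energy_kwh = gpu_energy_kwh / gpu_share      # GPUs are only part of the IT load
    facility_energy_kwh = it_energy_kwh * pue       # add cooling and other overhead
    return facility_energy_kwh * ci_kg_per_kwh / 1000.0  # kg -> metric tons


def implied_gpu_energy_kwh(emissions_tco2eq, gpu_share=0.575,
                           pue=1.2, ci_kg_per_kwh=0.332):
    """Invert the same chain: GPU energy implied by a given emissions figure."""
    facility_energy_kwh = emissions_tco2eq * 1000.0 / ci_kg_per_kwh
    return facility_energy_kwh / pue * gpu_share


if __name__ == "__main__":
    # Hypothetical input: 1,000,000 kWh measured at the GPUs.
    print(f"{operational_emissions_tco2eq(1_000_000):.1f} tCO2eq")
```

Under these assumptions, each kWh measured at the GPU is scaled up by 1/0.575 and then by 1.2 before the grid factor is applied, which is why GPU-only energy readings understate the footprint.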

In total, training this family of multi-billion-parameter models emitted 4,251 tCO₂eq. For comparison, Meta's own estimates put emissions for Llama-3-8B and Llama-3-70B at 390 and 1,900 tons, respectively. Google's 2024 environmental report disclosed an ambition-based total carbon footprint (Scope 1, 2, and 3 combined) of 11.5 million tons.
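As a quick check on the arithmetic, the four reported components can be summed and expressed as shares of the total. The figures are the ones quoted in this article; the dictionary keys and variable names are illustrative only.

```python
# Reported emissions components for the Olmo 3 family, in tCO2eq.
emissions_tco2eq = {
    "final_runs": 647,
    "exploratory_development": 2757,
    "synthetic_data_generation": 675,
    "embodied_hardware": 172,
}

total_tco2eq = sum(emissions_tco2eq.values())

# Share of the total attributable to each component.
shares = {name: value / total_tco2eq for name, value in emissions_tco2eq.items()}

if __name__ == "__main__":
    print(f"total: {total_tco2eq} tCO2eq")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

The components add up to the 4,251 tCO₂eq total, with early exploratory development as the largest single item.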

However, given that cutting-edge models now scale to trillions of parameters and are iterated ever more frequently, the environmental cost of model training remains significant. The paper also separately analyzes water consumption, covering both on-site data center cooling (closed-loop cooling was used here, with near-zero consumption, although evaporative cooling towers in other data centers consume significant amounts) and the water evaporated or consumed in generating the electricity.

The complete training process for this model family consumed 15,887 tons (or 15,887 kiloliters) of water, equivalent to about 140 years of water use for an average American individual. If reasoning models already show such consumption, models optimized for AI agents in post-training could introduce more actions, observations, and reasoning steps, potentially increasing consumption by several orders of magnitude. Future developments like recursive iterative training and automated optimization frameworks could further amplify this trend.
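The "140 years" comparison implies a per-person water use figure that can be backed out with simple division. The sketch below only rearranges the two numbers quoted above; the per-person values it derives are implied by the article's comparison, not independently sourced, and the variable names are illustrative.

```python
total_water_kl = 15_887        # reported water consumption, kiloliters
equivalent_person_years = 140  # the article's comparison figure

# Implied annual and daily water use per person under that comparison.
implied_kl_per_person_year = total_water_kl / equivalent_person_years
implied_liters_per_person_day = implied_kl_per_person_year * 1000 / 365

if __name__ == "__main__":
    print(f"{implied_kl_per_person_year:.1f} kL per person per year")
    print(f"{implied_liters_per_person_day:.0f} L per person per day")
```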

This implies the boundary between "training" and "inference" is blurring, with model training increasingly resembling a continuously operating industrial system. Therefore, the paper calls for the industry to disclose not only pre-training costs but also post-training costs, and to report not just "final runs" but at least an additional "multiplier" for early exploratory development stages.
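One way to read the proposed "multiplier" is as the ratio of total training emissions (final runs plus exploratory development) to final-run emissions alone. That definition is an assumption made here for illustration; the article does not spell out the paper's exact formula. Using the Olmo 3 figures quoted earlier:

```python
def development_multiplier(final_runs_tco2eq, exploratory_tco2eq):
    """Ratio of (final runs + exploratory development) to final runs alone.

    Assumed definition for illustration; the paper's exact formula
    is not given in this article.
    """
    return (final_runs_tco2eq + exploratory_tco2eq) / final_runs_tco2eq


# Olmo 3 figures quoted in this article, in tCO2eq.
olmo3_multiplier = development_multiplier(647, 2757)

if __name__ == "__main__":
    print(f"multiplier: {olmo3_multiplier:.2f}x")
```

Under this reading, a vendor that discloses only a final-run figure would need to scale it by a factor of this kind to approximate the full training footprint.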

Perhaps Chinese model vendors should heed this call. While the U.S., from government to tech companies, seems to be gradually abandoning serious carbon neutrality commitments, China has more model vendors still competing in pre-training. However, its overall chip and computing infrastructure energy efficiency lags behind the U.S. Additionally, although China has abundant electricity, green energy resources are unevenly distributed in time and space. As China remains committed to carbon neutrality, the AI carbon bill remains a pressing issue.
