
An employee stacks newspapers at a printing plant in Toronto. Photo by Brent Lewin/Bloomberg
AI foundation models train extensively on Canadian journalism but rarely attribute sources when reproducing information learned from news articles, according to research released March 18, the National Post reported. The findings intensify publisher demands for compensation, as media organizations argue that AI companies profit from content they neither created nor licensed.
The study analyzed training datasets used by major AI models and found substantial Canadian news content from the CBC, the Globe and Mail, the National Post, and other outlets in data scraped from the internet without publisher permission or payment. When researchers tested whether AI systems cite sources appropriately, models frequently reproduced facts, analysis, and reporting from news articles while providing vague or nonexistent attribution, preventing readers from identifying the original journalism.
Publishers Argue AI Companies Owe Compensation for Training Data
Canadian media organizations have escalated demands that AI companies negotiate licensing agreements compensating publishers for content used in model training and reproduced in AI outputs. Publishers argue that foundation models derive commercial value from journalism investments in reporting, fact-checking, and editorial judgment while offering nothing in return and potentially reducing traffic to news sites by answering questions directly rather than linking to original sources.
The dispute mirrors conflicts between AI companies and publishers globally: OpenAI has negotiated deals worth tens of millions of dollars annually with the Associated Press, Axel Springer, and News Corp while resisting broader industry demands for systematic compensation. Publishers emphasize that journalism requires substantial investment in reporters, editors, legal review, and infrastructure, which AI companies exploit for free while building billion-dollar businesses.
AI companies counter that training on publicly available web content constitutes fair use, comparing model training to search engines indexing content or humans reading articles to learn information. They argue that AI outputs transform training data into new content rather than simply copying journalism, and that requiring licensing for every piece of training data would make AI development economically impossible and legally unprecedented.
Attribution Failures Compound Publisher Concerns
Beyond the training data dispute, the study's finding that AI models rarely attribute sources properly raises a separate concern: AI systems misleading users about where information originates. When ChatGPT or Claude answers a question using facts from specific news articles, the failure to cite sources prevents readers from evaluating the original reporting's credibility, understanding its full context, or recognizing when the AI has synthesized multiple, potentially conflicting sources.
This attribution problem matters particularly for breaking news, investigative journalism, and complex policy analysis where nuance, sourcing methodology, and reporter expertise significantly affect information reliability. AI models trained on journalism can reproduce conclusions without conveying the reporting rigor supporting them, potentially spreading misinformation if models hallucinate details while presenting outputs with the confidence that accurate journalism would justify.
Publishers also emphasize that proper attribution drives traffic to news sites, generating advertising revenue and subscriptions supporting journalism production. When AI answers questions directly without linking to sources, it captures value that would otherwise flow to publishers while undermining business models funding the reporting AI systems exploit.
Regulatory and Legal Responses Emerging
The study's release coincides with Canadian government consideration of regulations requiring AI companies to negotiate fair compensation with publishers, similar to Australia's News Media Bargaining Code, which forced Google and Facebook to pay news organizations. Publishers advocate for legislation establishing mandatory licensing frameworks rather than relying on voluntary deals that AI companies can avoid by excluding specific publishers from training data.
Legal challenges are also proceeding in multiple jurisdictions, with publishers suing AI companies for copyright infringement and arguing that training on copyrighted content without permission violates intellectual property law regardless of whether outputs directly copy the training data. These cases will establish precedents determining whether AI training constitutes fair use or requires licensing like other commercial uses of content.
How the conflict resolves will significantly affect both AI development costs and journalism sustainability. If courts or regulators require AI companies to license training data, foundation model development becomes substantially more expensive while creating new revenue streams supporting news production. If AI companies prevail on fair use, publishers lose potential compensation while facing ongoing competitive pressure from AI systems that answer questions using journalism content.