Apple has just published a technical document detailing the models developed for Apple Intelligence, its range of generative AI features that will arrive publicly on iOS, macOS and iPadOS in the coming months. The Cupertino firm takes the opportunity to respond to accusations about its training methods, reaffirming that it has not used any private user data.
A mix of public and licensed data
To train its Apple Foundation Models (AFM), Apple says it used a mix of data from three main sources:
- Licensed content obtained from publishers. According to the New York Times, Apple opened negotiations in late 2023 on multi-year agreements worth at least $50 million with NBC News, Condé Nast and IAC for the right to use their archives.
- Carefully curated public datasets, with licenses that allow training AI models. Apple claims to have filtered this data to remove any sensitive information.
- Public information collected by its Applebot crawler on the web.
Apple stresses that no private user data was included in this mix. The firm was singled out in July for using a dataset called “The Pile,” which contains YouTube subtitles collected without the creators’ consent. It said at the time that the models trained on that data would not power its future AI features.
Open-source code and math in the mix
The AFM models were also trained on open-source code hosted on GitHub (Swift, Python, C…). This is a contentious practice, as many repositories do not permit such use in their terms. Apple says it filtered the code to keep only repositories under the most permissive licenses, such as MIT, ISC or Apache.
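Purely as an illustration (this is not Apple’s actual pipeline, and the repository names below are made up), an allowlist filter on SPDX license identifiers might look like this:

```python
# Hypothetical sketch: keep only repositories whose SPDX license
# identifier appears on a permissive allowlist (MIT, ISC, Apache-2.0).
PERMISSIVE_LICENSES = {"MIT", "ISC", "Apache-2.0"}

repos = [
    {"name": "example/swift-networking", "license": "MIT"},
    {"name": "example/copyleft-parser", "license": "GPL-3.0-only"},
    {"name": "example/py-utils", "license": "Apache-2.0"},
]

# Repositories under copyleft or unknown licenses are dropped.
kept = [repo for repo in repos if repo["license"] in PERMISSIVE_LICENSES]
print([repo["name"] for repo in kept])
# ['example/swift-networking', 'example/py-utils']
```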
To strengthen its models’ math capabilities, Apple specifically included problems and answers from web pages, forums, blogs, tutorials and seminars on the subject. “High-quality” public datasets were also used for fine-tuning, in order to iron out unwanted behaviors.
In total, the AFM training dataset comes to about 6.3 trillion tokens, compared with 15 trillion for Meta’s flagship model, Llama 3.1 405B. Apple says it also used human feedback and synthetic data at every step to better align the models with user needs and with its responsible AI principles.
Gray areas persist despite the show of transparency
While the document is meant to be transparent, it is sparing with details, presumably to avoid legal exposure. Apple does let websites block their data from being collected by its crawler, but that doesn’t help individual creators protect their works when they are hosted on third-party sites that decline to opt out.
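For site owners, that opt-out takes the form of a standard robots.txt rule. Apple documents a dedicated Applebot-Extended user agent for this purpose: blocking it keeps a site’s content out of AI training without removing it from search. A minimal example:

```
# robots.txt — opt out of Apple's AI training while staying searchable
User-agent: Applebot-Extended
Disallow: /

# Regular Applebot crawling (Siri and Spotlight features) remains allowed
User-agent: Applebot
Allow: /
```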
Upcoming legal battles will have to settle what counts as acceptable practice when training generative AI, with some companies invoking “fair use” to justify exploiting public data. In the meantime, Apple is trying to position itself as an ethical player by striking paid partnerships with news agencies and other media outlets.
Whether that will be enough to allay all concerns about privacy and intellectual property remains to be seen. But the firm intends to capitalize on its image as a privacy champion: with iOS 18.1 and macOS 15.1, users will be able to access a detailed report on how their requests were handled by Apple Intelligence, including whether they were processed on the device or in Apple’s secure cloud (Private Cloud Compute).