The University of Michigan has allegedly sold 85 hours of audio recordings from various academic settings, including lectures, interviews, office hours, study groups, and student presentations, to third parties for the purpose of training artificial intelligence. The school has also sold a dataset of 829 student papers to help fine-tune large language models (LLMs).

It is unclear whether those included in the data consented to having their audio and texts used in such a manner. However, a sample dataset downloaded by The Daily Beast included a recording of a lecture from 1999, making it highly unlikely that the speakers knew their data would be used to train future generative AI models.

AI engineer Susan Zhang took to X to post a screenshot showing what looks to be an advertisement from Catalyst Research Alliance, a firm selling the UM data, that she recently received on LinkedIn. The sender wrote that they were “reaching out because, based on your profile, you may be working with” LLMs.

“I wanted to let you know that the University of Michigan is licensing academic speech data and student papers that could be very useful for training or tuning LLMs,” the user wrote.

“So I guess this is a thing now,” Zhang said. “Universities running ads to resell students data for training LLMs.”

UM and Catalyst Research Alliance did not respond when reached for comment.

The cost of licensing depends on whether customers want just the audio recordings or the papers as well. The price goes as high as $25,000 for both datasets.

“The University of Michigan has recorded 65 speech events from a wide range of academic settings, including lectures, discussion sections, interviews, office hours, study groups, seminars and student presentations,” Catalyst Research Alliance said on its website. “Speakers represent broad demographics, including male and female and native and non-native English speakers from a wide variety of academic disciplines.”

The sample dataset included an audio lecture titled “Graduate Cellular Biotechnology Lecture” dated Feb. 1, 1999. In it, the unidentified lecturer speaks for roughly an hour and a half. The dataset also included a .txt file of a paper titled “The Democratic Inadequacies of the European Union.”

If true, the licensing deal is just another example of how personal data is being packaged and sold to help fuel emerging technologies such as generative AI. Even students whose work is completely unrelated to AI and LLMs can find their voices and writing being used to train such models.

“The whole thing feels deeply unethical,” Charles Logan, a learning sciences PhD candidate at Northwestern University, told The Daily Beast. Logan saw Zhang’s post on X and also commented on the situation, decrying it as the “logical progression of data capitalism.”

“When students are in a class or attending office hours there’s a trust implicit in that relationship,” Logan said. “They’re there to learn.”

He added that even if students consent to being part of these datasets, “there are still ways that they’re leaky.” He continued: “Private companies are monetizing student intellectual property and conversations that, if you’re in office hours or study groups, are deeply personal.”

That said, there is some room for doubt. Some experts are even skeptical that an educational institution like UM would do such a thing.

“My first reaction is one of skepticism,” Vincent Conitzer, an AI ethics researcher at Carnegie Mellon University, told The Daily Beast. “Also, even taking this message mostly at face value, I suppose it may just all be based on recordings and papers that are anyway in the public domain.”

He added that “it seems odd to me to imagine the university at the highest levels standing behind something like what this message is suggesting.”