How Generative AI Makes Content Moderation Both Harder and Easier
By Numa Dhamani and Maggie Engler
Numa Dhamani (LinkedIn, Twitter) is an engineer and researcher working at the intersection of technology and society. She is a natural language processing expert with domain expertise in influence operations, security, and privacy. Numa has developed machine learning systems for Fortune 500 companies and social media platforms, as well as for start-ups and nonprofits. She has advised companies and organizations, served as the Principal Investigator on the United States Department of Defense’s research programs, and contributed to multiple international peer-reviewed journals.
Maggie Engler (LinkedIn) is an engineer and researcher currently working on safety for large language models at Inflection AI. She focuses on applying data science and machine learning to abuses in the online ecosystem, and is a domain expert in cybersecurity and trust and safety. Maggie is a committed educator and communicator, and has taught as an adjunct instructor at the University of Texas at Austin School of Information.
Numa and Maggie co-authored Introduction to Generative AI, to be published by Manning Publications. Introduction to Generative AI illustrates how LLMs could live up to their potential while building awareness of the limitations of these models. It also discusses the broader economic, legal, and ethical issues that surround them, as well as recommendations for responsible development and use, and paths forward.
Earlier this year, the European Digital Media Observatory (EDMO) published a piece asserting that “Generative AI marks the beginning of a new era for disinformation.” Axios claimed that generative AI would be the “next misinformation nightmare,” and Wired, after interviewing several leading disinformation experts, reported that “Generative AI won't just flood the internet with more lies — it may also create convincing disinformation that’s targeted at groups or even individuals.” Meanwhile, Nobel Peace Prize laureate Maria Ressa warned that we are facing a “tech-enabled Armageddon” that has been “turbocharged by the advent of generative artificial intelligence.”
Ever since generative AI exploded into the mainstream, experts and observers alike have grown increasingly concerned about its impact on our information environment. Even more, we’re left wondering what it means for content moderation on social media platforms — a task that is terrible by design, fundamentally broken, and impossible to do well at scale. Content moderation was already an extremely difficult and thankless job, and with generative AI potentially increasing the quantity, quality, and personalization of adversarial content, is it now borderline impossible for social media platforms to moderate content?
First, let’s discuss the concerns about the effects of generative AI on the information landscape. In a study published earlier this year, researchers assessed how large language models (LLMs) will change influence operations, concluding that they will significantly impact the information ecosystem by automating the creation of persuasive, targeted adversarial content at scale while driving down the cost of producing propaganda. The increased quantity and quality of mis- and disinformation stands at the forefront of concerns: given the accessibility of LLMs, highly scalable influence campaigns are now a real prospect. With the number of social media users worldwide climbing to 4.9 billion and the volume of content skyrocketing, it is becoming very difficult to distinguish AI-generated content from human-written content, a task for which basic keyword- or tagging-based moderation tools fall short.
It is also possible that the low barrier to entry expands access to a greater number of actors. You no longer need expertise in crafting long-form propaganda, or even fluency in English (or any particular language, for that matter). Influence operations can be more tailored, more personalized, and ultimately, more effective. Of course, you don’t need generative AI to manipulate emotions or carry out influence operations; motivated adversarial actors have been doing both for years, while platforms have been developing more robust moderation services to tackle these issues at scale. Misinformation and disinformation are certainly not new and have been features of human communication since the dawn of time. What is new is that generative AI gives such actors a distinct advantage, a strategic boost. Nor are these tactics limited to influence operations: motivated actors can use generative AI to produce code for malware or conduct large-scale social engineering campaigns. If history tells us anything, it’s that motivated actors will likely use generative AI in novel and unexpected ways.
On the other hand, the very capabilities that make large language models so effective at generating content also make them powerful text classifiers. AI has long been used for content moderation by social media platforms, which mostly develop their own models internally; smaller companies and startups might leverage third-party vendors to detect unwanted content like spam and abuse. Models designed to detect this content might use metadata such as account and post details, the text of the post alone, or a combination of the two.
For trust and safety enforcement, these models are most commonly supervised, meaning that they learn a prediction task (such as whether or not a post is spam) by processing labeled data (posts that are marked as either spam or not-spam). However, these models have inherent limitations: they often require lots of data to work well, and they might fail to generalize beyond the types of violations seen in the training set. Platforms have historically been poor at moderating in non-English languages largely because of a dearth of available labeled data.
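As a rough illustration of what such a supervised classifier looks like in practice, here is a minimal sketch that trains a spam detector on labeled post text with scikit-learn. The example posts, labels, and model choice are placeholders rather than anything resembling a production moderation pipeline.

```python
# A minimal sketch of a supervised text classifier for moderation,
# trained on labeled posts (the examples here are made up).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled data: 1 = spam, 0 = not spam.
posts = [
    "Win a FREE iPhone now, click this link!!!",
    "Congratulations, you've been selected for a cash prize",
    "Looking forward to the conference next week",
    "Does anyone have notes from yesterday's lecture?",
]
labels = [1, 1, 0, 0]

# TF-IDF features over the post text, fed into a logistic regression.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(posts, labels)

# Score a new post; in practice, metadata features (account age,
# posting frequency, and so on) are often combined with the text.
print(classifier.predict_proba(["Claim your FREE prize today"])[0][1])
```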
What large language models provide is an alternative approach. They are not without their own limitations — LLMs, too, perform better in English than in other languages and carry other biases inherited from their training data — but they could enable platforms to automate content moderation earlier and faster. Consider a hypothetical scenario where a social app has decided to institute a new policy banning incitement of violence. To enforce this policy, developers might build a traditional text classifier trained from scratch, which would likely require the manual curation of thousands of examples and non-examples. Or they might take an open-source LLM, fine-tune it on a handful of examples, and use it to detect the kinds of messages they are looking for. They might then distill that model into a smaller classifier to improve the speed of inference, substantially shortening the most time-consuming part of development.
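To make the second approach concrete, here is a minimal sketch of what fine-tuning an open model on a handful of labeled examples for the hypothetical incitement-of-violence policy might look like, using the Hugging Face transformers library. The model name, example messages, and hyperparameters are assumptions for illustration (a small open encoder model stands in for a larger open-source LLM, and the distillation step is omitted).

```python
# A minimal sketch of fine-tuning an open model as a policy classifier
# (incitement of violence vs. not) on a handful of labeled examples.
# Model name, examples, and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

examples = {
    "text": [
        "Meet me there and we'll make them pay, bring whatever you can swing",
        "This referee is terrible, what a frustrating game",
    ],
    "label": [1, 0],  # 1 = violates the policy, 0 = does not
}

model_name = "distilbert-base-uncased"  # placeholder small open model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the labeled examples into model inputs.
dataset = Dataset.from_dict(examples).map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

# Fine-tune; in practice you would hold out an evaluation set and
# likely distill the result into a smaller, faster classifier.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="policy-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()
```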
Text classification isn't the only area of trust and safety enforcement where people have begun to employ generative AI tools. In an August 2023 blog post, OpenAI outlined how it integrated GPT-4 into its policy development workflow. After the policy is created, "policy experts can create a golden set of data by identifying a small number of examples" and labeling them. GPT-4 then labels the same data based on the policy alone, without seeing the labels assigned by the human experts. "By examining the discrepancies between GPT-4’s judgments and those of a human, the policy experts can ask GPT-4 to come up with reasoning behind its labels, analyze the ambiguity in policy definitions, resolve confusion, and provide further clarification in the policy accordingly." One could also imagine using the LLM to produce edge cases for new golden examples. In the incitement-of-violence case, should messages referring to video games or other fictional situations count? What about real threats disguised by figurative language? After the policy is refined in this way, it is then translated into a classifier as described previously.
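A heavily simplified version of that labeling loop might look like the sketch below, which asks an LLM to label each post against the written policy and explain its reasoning so that its judgments can be compared with the human golden labels. The prompt, model name, policy text, and example posts are assumptions for illustration, not OpenAI's actual implementation.

```python
# A simplified sketch of LLM-assisted policy labeling: the model labels
# content against the written policy so its judgments can be compared
# with a human-labeled golden set. Prompt and model name are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

POLICY = """Content that incites violence is prohibited. This includes
calls to commit or plan acts of physical harm against a person or group."""

def llm_label(post: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {"role": "system",
             "content": f"You are a content policy expert. Policy:\n{POLICY}\n"
                        "Label the post VIOLATES or ALLOWED, then explain your reasoning."},
            {"role": "user", "content": post},
        ],
    )
    return response.choices[0].message.content

# Compare the model's judgments against the human golden labels.
golden_set = [("I'll see you at the protest, stay peaceful everyone", "ALLOWED")]
for post, human_label in golden_set:
    print(post, "| human:", human_label, "| model:", llm_label(post))
```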
In content moderation, generative models such as LLMs don't necessarily change the fundamental mechanics of the ecosystem, but they do provide new capabilities to both would-be attackers and the integrity workers aiming to stop them. Although purveyors of mis- and disinformation can already create convincing synthetic content, they can now do so much more easily and quickly with the help of generative models. It seems inevitable that the future will hold an ever-increasing amount of AI-generated media, along with uncertainty about whether documentary evidence can be trusted at all. Ressa and other observers fear that this lack of consensus could erode trust in the democratic process and lead to other offline harms, though we note that generative models are not required for conspiratorial beliefs to take hold. The same breakthroughs in AI also mean systems that can better understand language and images. The result could well be improved detection systems for specific types of content online, such as civic misinformation or extremist propaganda, even in an environment where the origin of that content, human or AI, is entirely unknown. Ultimately, platforms will need to grapple with actors, behaviors, and content in much the same way as before, but their success will depend on an understanding of the applications and limitations of generative AI.