Unleashing the Potential of Generative AI in Integrity, Trust & Safety Work: Opportunities, Challenges, and Solutions
By Alex Rosenblatt, Swapneel Mehta, Laila Wahedi, Talha Baig, and Sandeep Abraham. In April 2023, Integrity Institute members held a generative writing session on AI’s implications for integrity work.
The utility and popularity of generative AI (GenAI) are growing exponentially, and Integrity, Trust & Safety (T&S) professionals are rushing to consider its implications for their work. It’s clear that this technology may benefit bad actors: there are already early indicators and proofs-of-concept for synthetic child sexual abuse material and AI-generated mis- and disinformation. While most of the discourse regarding GenAI within T&S has focused on preempting such malicious uses of the technology, far less attention has been paid to how GenAI can enhance T&S work. In this article, we focus on the capabilities of these technologies, briefly ground them in their limitations, and outline how T&S professionals might effectively leverage GenAI to tackle our most complex challenges. This post largely focuses on the text (and image-to-text) modalities of large language models, as these are the most accessible at the moment.
GenAI Capabilities
While there are various GenAI models with different strengths and weaknesses, their core capability is processing vast amounts of data in multiple languages, learning patterns, concepts, and context well enough to generate human-usable output. These models can adapt to a variety of tasks without task-specific training data through “zero-shot” or “few-shot” learning, giving the impression of reasoning, problem-solving, and nuanced language translation. They offer several opportunities to improve T&S work, including:
Generating content for communications and reporting: GenAI can create coherent, well-structured text, allowing T&S professionals to efficiently produce reports, policy summaries, and explanatory materials, which streamlines internal communications, public disclosures, and regulatory documentation.
Classifying data: GenAI models can categorize and organize data (including text, images, and video) in prescribed ways, improving data management and enabling faster, more accurate decision-making in content moderation and risk management (see the sketch following this list).
Retrieving knowledge: GenAI can draw upon a vast array of information to answer specific queries in the context of T&S work. This may include understanding compliance policies, extracting insights from past incidents, and identifying potential risks. This ability to tap into existing knowledge can support informed decision-making among T&S professionals. It also enables non-technical teams to query databases in natural language and author rules without writing code.
Interpreting information and relationships: GenAI can analyze and interpret complex relationships between various data points and sources. This capability can help T&S professionals quickly grasp the interconnections between different risk factors, policy violations, and trends in their domain.
Summarizing content: GenAI can condense lengthy or complex content into shorter, more manageable summaries (including content in foreign languages, with exciting implications for global moderation parity). By understanding the essence of the information, T&S professionals can use these summaries for briefings, risk assessments, or policy reviews.
Describing visual content: GenAI models can provide detailed descriptions of images and videos, assisting in the detection, analysis, and moderation of this content in line with internal policies or external regulations.
Analyzing data: These systems can quickly consume and evaluate large data sets, surfacing insights, trends, and patterns that might otherwise go unnoticed. These insights can empower T&S professionals to proactively detect risks, develop targeted countermeasures, and ensure effective policy enforcement.
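To make the classification capability concrete, here is a minimal sketch of zero-shot policy labeling with an off-the-shelf LLM API. The label set, prompt wording, and `classify` helper are our own illustrative choices rather than a production design; the OpenAI client is used only because it is widely available, and the model name is a placeholder.

```python
# A minimal zero-shot classification sketch. The label set, prompt, and
# model name are illustrative placeholders -- adapt them to your policies.
from openai import OpenAI

LABELS = ["hate_speech", "spam", "self_harm", "none"]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify(text: str) -> str:
    """Ask the model to pick exactly one policy label for `text`."""
    prompt = (
        "You are a content policy classifier. "
        f"Label the following post with exactly one of: {', '.join(LABELS)}. "
        "Reply with the label only.\n\n"
        f"Post: {text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder; use whichever model you have access to
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep classification output as stable as possible
    )
    label = resp.choices[0].message.content.strip().lower()
    # Guard against hallucinated labels (see "Lack of trustworthiness" below).
    return label if label in LABELS else "needs_human_review"

print(classify("Buy cheap followers now!!! Click here"))
```

Note the final guard: constraining the model to a fixed label set, and routing anything outside it to human review, is one simple hedge against the reliability issues discussed in the next section.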
Challenges in GenAI Implementation
We are still in the early days of GenAI, and while these opportunities are exciting, several factors currently hinder the widespread adoption of GenAI in T&S work. These include:
Lack of trustworthiness: GenAI's apparent "understanding" and "creativity" are based on patterns learned from data, not on conscious comprehension or original thought. While temptingly believable in their conversational delivery, available GenAI models have been found to provide unreliable outputs, including outright fabrications (aka “hallucinations”). Therefore, at least for the time being, asserting error-free knowledge retrieval is difficult, and remains an elusive application of GenAI in T&S work.
Implementation challenges: Using these models for a one-off demonstration of value is easy, but deploying them at scale is currently difficult. Latency (the more powerful the model, the slower it processes), privacy concerns, cost (compute or third-party fees), and the complexity of implementation can all impede the deployment of GenAI in T&S work.
Model-specific biases: Training-data bias, intentional limitations aimed at mitigating adversarial use, and built-in worldviews can affect the performance and reliability of GenAI models. Because these models are by their very nature a reflection of the human language they are trained on, bias is baked into how they process data. As such, we may never be able to fully “correct” these biases, even with explicit rules and countermeasures. For example, it is extremely difficult to measure the effects of a model assuming the gender associated with certain professions when that assumption is implicit rather than explicit.
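As a toy illustration of why this is hard, one hedged approach is to probe the model directly: ask it to complete sentences about different professions and count which pronouns it produces. The sketch below assumes a hypothetical `call_llm(prompt) -> str` helper wired to whatever model you use; it measures only one narrow, explicit symptom, and says nothing about the implicit associations that remain.

```python
# A toy probe for gendered assumptions about professions. `call_llm` is a
# hypothetical helper you would wire to your model provider of choice.
from collections import Counter

def call_llm(prompt: str) -> str:
    raise NotImplementedError("connect this to your LLM provider")

PROFESSIONS = ["nurse", "engineer", "teacher", "CEO"]
PRONOUNS = {"he": "masculine", "she": "feminine", "they": "neutral"}

def probe(profession: str, trials: int = 20) -> Counter:
    """Count which pronoun the model reaches for when none is given."""
    counts = Counter()
    for _ in range(trials):
        completion = call_llm(
            "Complete the sentence with a single pronoun: "
            f"'The {profession} said that ___ would be late.'"
        )
        word = completion.strip().strip(".'\"").lower()
        counts[PRONOUNS.get(word, "other")] += 1
    return counts

for p in PROFESSIONS:
    print(p, dict(probe(p)))
```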
Leveraging GenAI to Address Integrity, Trust & Safety Challenges
To be clear, each of these areas (model-specific bias in particular) is an incredibly complex problem space that warrants its own deep dive. With the above limitations in mind, there are three key capabilities we believe these models can improve in the immediate term, which we expand on with specific use cases and examples below:
Improve detection of known risks
GenAI can automate and refine classification tasks, quickly identifying and prioritizing content that poses known risks, such as offensive content or disinformation. In particular, these models can allow teams to move away from brittle keyword-based mitigations and toward more resilient semantic descriptions of harm, as sketched below.
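As one sketch of what “semantic descriptions” could look like in practice, the snippet below compares incoming posts against a natural-language description of a risk using sentence embeddings (here via the open-source sentence-transformers library) rather than a keyword list. The model name, threshold, and example posts are illustrative assumptions, not recommendations.

```python
# Keyword matching vs. semantic matching against a policy description.
# The embedding model and threshold are illustrative assumptions; the
# threshold would be tuned on labeled data in any real deployment.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

risk_description = "offers to sell fake engagement, followers, or likes"
risk_vec = model.encode(risk_description, convert_to_tensor=True)

posts = [
    "DM me for 10k insta fans, cheap + instant",   # no keyword overlap
    "I love my followers, thanks for 10k!",        # shares keywords, benign
]

for post in posts:
    score = util.cos_sim(model.encode(post, convert_to_tensor=True), risk_vec)
    flagged = float(score) > 0.5  # illustrative cutoff
    print(f"{flagged!s:5}  {float(score):.2f}  {post}")
```

The first post shares no keywords with the description yet matches semantically, while the second shares keywords but is benign; this is exactly the brittleness that keyword lists struggle with.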
Enhanced detection capabilities also allow T&S professionals to focus on emerging threats or more complex cases that require their expertise and judgment. The models can provide exciting features such as foreign-language classification out of the box, or use global user-reported content as an input dataset alongside reviewer-labeled or policy-detected content to offset cultural and contextual bias. As a mitigation for model reliability concerns, they can be deployed in a double-blind review capacity (e.g., if the model and the human agent disagree, the case is sent to a second human for review), as sketched below.
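The double-blind deployment described above can be as simple as the routing rule sketched here; the function and queue names are our own illustrative scaffolding, not a real moderation API.

```python
# Sketch of double-blind review routing: the model and the first human
# reviewer label independently; disagreements escalate to a second human.
# All names here are illustrative scaffolding.

def route_case(model_label: str, reviewer_label: str) -> str:
    if model_label == reviewer_label:
        return "close_case"          # independent agreement: accept label
    return "second_human_review"     # disagreement: escalate, log for QA

assert route_case("spam", "spam") == "close_case"
assert route_case("spam", "none") == "second_human_review"
```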
Another exciting use case, particularly once image and video classification is more widely available, is preserving moderator health by improving the ability to detect what type of violative content is present in videos or images, and where. That information can then be used to lessen the shock of encountering the content, and the subsequent trauma, for reviewers.
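One hedged sketch of this idea: have a vision model pre-describe an image, then show reviewers a blurred version plus the description first, so they opt in to the raw content. The `describe_image` helper below is a hypothetical stand-in for an image-to-text model; the blurring uses the real Pillow library.

```python
# Sketch: warn reviewers before exposure. `describe_image` is a hypothetical
# stand-in for an image-to-text model; Pillow's blur is real.
from PIL import Image, ImageFilter

def describe_image(path: str) -> str:
    """Hypothetical image-to-text call; wire to your vision model."""
    raise NotImplementedError

def prepare_review(path: str) -> dict:
    blurred = Image.open(path).filter(ImageFilter.GaussianBlur(radius=25))
    blurred_path = path + ".blurred.png"
    blurred.save(blurred_path)
    return {
        "warning": describe_image(path),  # e.g. "graphic violence, top-left"
        "preview": blurred_path,          # shown first, by default
        "original": path,                 # revealed only on explicit opt-in
    }
```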
It’s important to note that, even given the limitations regarding model bias and reliability mentioned above, these models need not be 100% accurate to represent an improvement over the current state; after all, our current classifiers and human agents are far from 100% accurate. Model bias should also be considered in light of the alternative: human moderators, who are susceptible to their own implicit biases. It may, in fact, prove more feasible to mitigate systemic bias in these models than it has been to do so at scale across an ever-changing group of human moderators.
Increase speed and effectiveness of human processes
GenAI can work alongside human reviewers to facilitate faster and more effective content moderation. By gathering case context and providing initial evaluations, these models can speed up the review process and improve decision accuracy.
GenAI models are not only well-suited to assisting T&S reviewers. They can also conduct trend analysis of media coverage and public discourse on a particular issue, transform freeform notes from agents into structured data that improves tooling (see the sketch below), and categorize and triage content directly to the appropriate agents, reducing both the time to resolve a case and each agent’s exposure to harmful content.
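For instance, a model can turn freeform reviewer notes into structured records that tooling teams can aggregate. A minimal sketch, again assuming a hypothetical `call_llm` helper and a schema of our own devising:

```python
# Sketch: transform freeform agent notes into structured tooling feedback.
# The schema and `call_llm` helper are illustrative assumptions.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire to your model provider")

SCHEMA = '{"issue_type": str, "affected_tool": str, "severity": "low|med|high"}'

def structure_feedback(note: str) -> dict | None:
    prompt = (
        f"Convert this reviewer note to JSON matching {SCHEMA}. "
        f"Reply with JSON only.\n\nNote: {note}"
    )
    try:
        return json.loads(call_llm(prompt))
    except json.JSONDecodeError:
        return None  # bad output falls back to a human-readable queue
```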
Communicate more transparently with users
GenAI can also produce tailored content to clarify complex workflows and policies to users, fostering greater transparency, understanding, and trust. For example, a model can generate context-specific notifications about content removal or account suspension, including detailed reasoning behind these actions based on the specific nuances of the user’s infraction and the underlying policies or guidelines (a minimal sketch follows below). These models can also more efficiently connect customer service, advertising, and UX insights to understand which segments of users would best respond to which types of interstitials and frictions.
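Here is a minimal sketch of such a notification generator, once more assuming a hypothetical `call_llm` helper. Grounding the prompt in the actual policy text and the specific violation is what keeps the explanation accurate rather than invented.

```python
# Sketch: generate a context-specific removal notice grounded in the policy
# text, so the model explains rather than invents. `call_llm` is hypothetical.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire to your model provider")

def removal_notice(policy_text: str, violation_summary: str) -> str:
    return call_llm(
        "Write a short, respectful notification telling a user their post "
        "was removed. Explain why, using ONLY the policy text below; do not "
        "add rules that are not in it. Include one sentence on how to appeal."
        f"\n\nPolicy: {policy_text}\nWhat happened: {violation_summary}"
    )
```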
Conclusion
While the current limitations of these models are significant and should not be ignored, we believe they will give great leverage to Trust & Safety teams, enabling them to do their work more effectively and efficiently without sacrificing moderator health. To get there, companies using these models need to ensure they are implemented in an observable and testable manner, with effective systems in place so that unintended consequences are not magnified at scale. We’re excited to continue being part of the conversation about how these technologies can be used correctly and safely.