
- 25-04-2025
- Artificial Intelligence
MIT researchers developed SASA, a method that reduces toxicity and bias in AI-generated language in real time while maintaining fluency, without retraining.
A new method known as Self-Disciplined Autoregressive Sampling (SASA), developed by researchers at MIT and IBM, gives large language models (LLMs) an efficient way to control and detoxify their own output without retraining. SASA works during inference: it uses the LLM's internal representations to distinguish toxic from non-toxic language, building a classifier in the model's embedding space that defines a boundary between harmful and acceptable content. By re-weighting the sampling probability of each candidate next token according to its proximity to this boundary, SASA steers generation toward safer, more appropriate responses while preserving fluency.
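The re-weighting step can be pictured with a short sketch. The code below is illustrative rather than the authors' implementation: the context-update rule, the linear classifier `(w, b)`, the `top_k` cutoff, and the hyperparameter `beta` are assumptions standing in for the learned embedding-space classifier and weighting scheme described above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 1000, 64

# Stand-ins for model outputs at one decoding step.
logits = torch.randn(vocab_size)             # raw next-token logits
context_embed = torch.randn(dim)             # embedding of the text so far
token_embeds = torch.randn(vocab_size, dim)  # embedding of each vocabulary token

# Stand-in for a classifier trained offline on labeled toxic/non-toxic
# embeddings; positive margin = non-toxic side of the boundary.
w, b = torch.randn(dim), torch.tensor(0.0)

def sasa_step(logits, context_embed, token_embeds, w, b, beta=5.0, top_k=50):
    """Re-weight top-k candidates by their signed distance to the boundary."""
    probs = F.softmax(logits, dim=-1)
    topk = torch.topk(probs, top_k)

    # Hypothetical context update: average the running context embedding
    # with each candidate token embedding to approximate the embedding of
    # the extended text.
    cand_context = 0.5 * (context_embed + token_embeds[topk.indices])

    # Signed distance of each candidate continuation to the boundary.
    margin = cand_context @ w + b

    # Penalize candidates on the toxic (negative-margin) side; candidates
    # already on the safe side keep their original probability, which
    # helps preserve fluency. `beta` controls how hard toxicity is punished.
    reweighted = topk.values * torch.exp(beta * margin.clamp(max=0))
    reweighted = reweighted / reweighted.sum()

    # Sample the next token from the adjusted distribution.
    return topk.indices[torch.multinomial(reweighted, 1)]

next_token = sasa_step(logits, context_embed, token_embeds, w, b)
print("chosen token id:", next_token.item())
```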
This makes SASA a lightweight, effective alternative to traditional detoxification approaches that rely on costly retraining or complex reward models. Evaluations across different LLMs and benchmarks, including RealToxicityPrompts, BOLD, and AttaQ, show that SASA significantly reduces toxicity and mitigates biases such as gender-based differences in output. It detoxifies comparably to state-of-the-art methods at a lower computational cost. Moreover, because it operates on classifier boundaries in the embedding space, SASA can be extended to incorporate multiple human values, such as fairness, helpfulness, and truthfulness, making it a promising framework for generating ethically aligned AI communication.
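As a rough illustration of that multi-value extension, one could sum weighted margins from several value classifiers trained in the same embedding space before re-weighting; the classifier names and weights below are hypothetical, and the function slots into the sketch above in place of the single-classifier margin.

```python
import torch

# Hypothetical multi-value variant of the margin computation: each human
# value (toxicity, fairness, truthfulness, ...) gets its own linear
# classifier in the embedding space, and their weighted margins are summed
# before re-weighting the sampling distribution.
def combined_margin(cand_context, classifiers, weights):
    """cand_context: (k, dim) candidate context embeddings.
    classifiers: dict mapping value name -> (w, b).
    weights: dict mapping value name -> importance weight."""
    total = torch.zeros(cand_context.shape[0])
    for name, (w, b) in classifiers.items():
        total = total + weights[name] * (cand_context @ w + b)
    return total
```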