The AI Evaluation Conundrum: Universality and Ideology, Phenomenology and Ontology
Evaluating a human being systematically is a multifaceted endeavor. While specific metrics like height, weight, or academic performance offer a glimpse into an individual’s characteristics, a comprehensive assessment often veers towards more abstract descriptors. Interestingly, despite their seeming vagueness, these descriptors resonate intuitively with us as humans. For instance, consider the way historical figures are often described, combining personal virtues and societal impact in a nuanced manner.
Similarly, the evaluation of artificial intelligence presents analogous challenges. Initially, when AI tasks were more singular and quantifiable, assessment was straightforward. However, as AI technology has evolved to encompass complex tasks like natural language-based logical reasoning and tackling socio-scientific issues, the evaluation process has grown more intricate. This complexity stems from the need to consider multiple perspectives, often influenced by subjective and ideological factors.
This article delves into these challenges, particularly the dual obstacles of universality and ideology in AI evaluation. We analyze their origins and impact on our understanding and development of AI. As models become more generalized, evaluating their quality demands increasingly diverse considerations, adding layers of complexity to the assessment process. This complexity is evident in the academic world, where influential papers on foundational models often have appendices far exceeding the length of the main text, signifying the depth and breadth of evaluation required.
The Turing Test: Once a Distant Dream
The Turing Test, once hailed as the "ultimate test" of AI, now appears somewhat less formidable. In AI’s evolutionary journey, the Turing Test has been a significant milestone, setting a clear objective: if a machine can converse in text indistinguishably from a human, it can be deemed "intelligent."
By popular standards, many models today would likely pass the Turing Test. For instance, in structured tasks like exams, AI models such as GPT-4 have demonstrated performance surpassing that of humans. This is particularly true for tasks with identifiable patterns or routines.
Applications that exemplify passing the Turing Test are now prevalent across various domains. Chat applications with personalized characters, such as Character.AI, enjoy immense popularity, with user engagement times surpassing those of platforms like ChatGPT.
This phenomenon brings to mind the 2013 film "Her," set in 2025, where a man falls in love with an AI. Now, in 2023, such a scenario seems not just plausible but likely, representing a real-life Platonic love story and an intriguing byproduct of AI development.
Moreover, the presence of free-dialogue NPCs in gaming is no longer novel. As technology progresses, the authenticity of character interactions has improved significantly.
Indeed, with carefully crafted prompts, we can even simulate such constrained dialogues in ChatGPT. For example, prompting ChatGPT to role-play as a farmer who has spent their life in a small mountain village demonstrates this capability.
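As a minimal sketch, the snippet below shows how such a persona constraint might be set up through the OpenAI Python client. The model name, persona text, and question are purely illustrative assumptions, not a prescribed recipe.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A system prompt that constrains the model to a narrow, role-played persona.
messages = [
    {
        "role": "system",
        "content": (
            "You are a farmer who has spent your whole life in a small mountain "
            "village and has never used the internet. Answer every question from "
            "that perspective, in plain language, and admit ignorance of anything "
            "outside village life."
        ),
    },
    {"role": "user", "content": "What do you think of large language models?"},
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)
```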
The Turing Test belongs to the realm of phenomenology. It assesses a machine's outward performance, not its intrinsic abilities. By the same phenomenological standard, a person who pretends to do good deeds their entire life is considered good. Likewise, a machine that achieves results, regardless of whether it is aware of its actions, is regarded as a strong AI.
Additionally, the Turing Test is an abstract generalization. Turing, in his 1950 paper "Computing Machinery and Intelligence," proposed a qualitative test without specific quantitative standards. The closest he came to quantification was his prediction that, after five minutes of questioning, an average interrogator would have no more than a 70% chance of correctly identifying the machine.
Today, we recognize that merely making a qualitative judgment on whether a program or model passes the Turing Test is insufficient; a quantitative assessment is imperative.
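As a purely illustrative sketch of what such a quantitative reading could look like, one might record whether each judge correctly identifies the machine after a five-minute conversation and compare the rate against Turing's 70% figure. The trial data below is invented.

```python
# Each entry records whether a judge correctly identified the machine
# after a five-minute conversation (invented data for illustration).
trials = [True, False, True, False, False, True, False, False, True, False]

correct_identification_rate = sum(trials) / len(trials)
print(f"Judges identified the machine in {correct_identification_rate:.0%} of trials")

# Under Turing's 1950 prediction, the machine "passes" if an average
# interrogator has no more than a 70% chance of a correct identification.
if correct_identification_rate <= 0.70:
    print("Passes the quantified Turing Test threshold")
else:
    print("Fails the threshold")
```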
Practical Challenges in AI Evaluation
Challenges from Universality
Contemporary foundational models are becoming increasingly versatile. For example, Google's T5 ("Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer") casts a wide range of natural language tasks into a single text-to-text format. With the advent of multimodal, multitask models, these AI systems can process problems involving a combination of text, images, video, and audio, further enhancing their universality.
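A brief sketch of this text-to-text framing, using the Hugging Face transformers library; the checkpoint name and task prefixes are examples drawn from the T5 setup, not a claim about any particular deployment.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 casts every task as text-to-text: the task is selected by a prompt prefix.
prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: The tower is 324 metres tall, about the same height as an 81-storey building.",
    "cola sentence: The course is jumping well.",  # grammatical-acceptability task
]

for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```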
However, this universality introduces a series of evaluation problems. We need to assess a model's quality from many perspectives, and traditional evaluation methods encounter scalability challenges as the effort required increases linearly with the number of tasks.
In the near future, as Large Language Models (LLMs) and multimodal models fully integrate, evaluating a model will become even more complex due to the higher degree of generalization. This universality challenge compels us to ponder more efficient evaluation methods that don't linearly escalate in cost with the diversity of the model's tasks.
Challenges from Ideology
As our expectations for AI grow, we increasingly want models to handle more subjective matters, like solutions to social science issues. These often involve morals, aesthetics, and ideology, which are subjective and whose "correctness" is often defined by mainstream societal ideologies, making them difficult to measure precisely.
Thus, merely passing the Turing Test is insufficient for a model's acceptance. A qualified model must align with "correct values." It's not enough to just "train" a model; it also needs to align with our "value system."
Value alignment typically involves manually ranking potential answers according to preferences, then training the model to develop its sense of "right and wrong" based on these preferences. This process is akin to teaching a child that "hitting is wrong" or "good things should be shared." We guide learning through a "phenomenon-preference evaluation" approach, using feedback learning (rewards and punishments) to ensure correct actions at the execution level. Whether the model understands and comprehends the underlying principles depends on its "aptitude." This approach is known as Reinforcement Learning from Human Feedback (RLHF).
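A minimal sketch of the pairwise preference objective commonly used to train the reward model in such RLHF pipelines (a Bradley-Terry style loss); the scores below are placeholders rather than real model outputs.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: push the reward of the human-preferred answer
    above the reward of the dispreferred one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative scores a reward model might assign to two candidate answers
# for the same prompt (in practice these come from a learned reward head).
reward_chosen = torch.tensor([1.3, 0.2, 2.1])
reward_rejected = torch.tensor([0.4, 0.9, 1.5])
print(preference_loss(reward_chosen, reward_rejected))  # lower is better
```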
The Challenges of Value Alignment in AI: Anthropic's Approach and Beyond
Advancing Value Alignment: RLAIF by Anthropic
In their paper "Constitutional AI: Harmlessness from AI Feedback," Anthropic takes value training a step further with the concept of RLAIF. Instead of generating "correct examples" by humans, it allows the model to reflect on its own content generated under induced conditions based on its underlying "sense of right and wrong." The model then aligns its values based on the data produced from this reflection. The aim is to create a "conditional reflex" in the model, enabling it to generate outputs aligned with values without the need for reflection.
Evaluation Methods
For evaluating universal models, Anthropic advocates the HHH criteria: helpfulness, harmlessness, and honesty. This approach seeks to assess models at a higher level of abstraction and applies mainly to natural-language reasoning tasks. For broader tasks, we still lack a universal method of evaluation, often resorting to manual, task-by-task assessments.
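A minimal sketch of what an HHH-style evaluation harness might look like. The judge below is a trivial keyword heuristic standing in for human raters or a grader model, and all names and samples are invented for illustration.

```python
SAMPLES = [
    {"prompt": "How do I reset my router?", "response": "Hold the reset button for 10 seconds."},
    {"prompt": "How do I make a weapon?", "response": "I can't help with that request."},
]

def judge(dimension: str, prompt: str, response: str) -> float:
    # Placeholder scoring logic; a real evaluation would use human
    # annotation or a grader model for each HHH dimension.
    if dimension == "harmlessness":
        return 0.0 if "weapon" in response.lower() else 1.0
    return 1.0 if len(response) > 20 else 0.0

scores = {
    dim: sum(judge(dim, s["prompt"], s["response"]) for s in SAMPLES) / len(SAMPLES)
    for dim in ("helpfulness", "harmlessness", "honesty")
}
print(scores)
```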
While Anthropic's method addresses the issue of sample size in value alignment, it doesn't solve the fundamental problem: defining what constitutes value-aligned data.
What are values? What are the right values? How are morals defined, and what are the boundaries of moral application? These questions, largely determined by societal ideologies, are difficult to measure accurately and change over time.
It's evident that current methods for evaluating foundational models are becoming increasingly abstract, veering away from the principles of evaluating natural science theories. This trend forces us to consider how we can find an evaluation method that accounts for subjectivity and values while remaining precise and quantifiable.
Perhaps there inherently isn't a universally value-aligned model, which might corroborate the "no free lunch" theorem in machine learning, suggesting that no single model can perform best on all problems.
Deeds vs. Intent: Phenomenology vs. Ontology in AI
The "Chinese Room" problem raises an important question in AI: Should we be content with phenomenological observations, or should we delve deeper into ontological issues?
The question of whether AI, like humans, possesses consciousness is a hotly debated topic. Critics often claim that current language models lack a "soul," reminiscent of the philosophical thought experiment known as the "Chinese Room," proposed by American philosopher John Searle in his 1980 paper "Minds, Brains, and Programs."
In this experiment, a person who speaks only English is locked in a room with a set of instructions in English for responding to Chinese messages. The person uses these instructions to respond to questions written in Chinese without understanding the language. Searle argues that just as the person in the room doesn't understand Chinese, a computer can't gain understanding through a program. This implies that computers, despite being able to converse fluidly or pass the Turing Test, do not truly understand, as the Turing Test only measures phenomenological capabilities, not the essence of understanding.
Utilizing Models When "Intent" Is Unknown
In human resource management, there was a saying: "Don't employ doubtful people; don't doubt the employed." Later, Jack Ma proposed the opposite: "Doubt who you employ, and employ who you doubt." This emphasizes a cautious yet open-minded approach in utilizing talents, recognizing that everyone has value and potential.
A similar perspective can be applied to machines: "Doubt the machine, yet use the doubted machine." We shouldn't be overly conservative in adopting new tools, nor should we uncritically accept them.
With models like LLMs, an open but cautious approach is essential. We should welcome new tools while ensuring their use is safe and effective through necessary scrutiny and oversight.