Evaluating Artificial Intelligence: When Numbers Meet Ethics
by Dario Ferrero (VerbaniaNotizie.it)
In the past five articles, we have explored the world of artificial intelligence together, starting from its historical roots and technological foundations, and then delving into the complexities of machine learning and deep learning. We have seen how AI is transforming the world of work and study, discovered the wonders of generative AI that creates images, texts, and videos, and analyzed the landscape of companies and tools shaping this sector.
Now, in this final chapter of our journey, we tackle perhaps the most delicate and crucial question: how do we know whether an artificial intelligence system truly works well? And above all, how can we ensure it functions ethically and responsibly?
It's a question that becomes increasingly pressing as AI spreads into every aspect of our lives. It's no longer enough for a system to "seem" intelligent: we must be able to measure its performance, understand its limitations, and ensure it operates according to shared ethical principles.
Beyond the Turing Test: The New Frontier of Evaluation
The famous Turing Test, proposed by British mathematician Alan Turing in 1950, posed a fascinating challenge: a machine could be considered intelligent if it managed to deceive a human judge in conversation into believing it, too, was human. For decades, this test was the benchmark for measuring artificial intelligence.
Today, however, the Turing Test seems almost anachronistic. Modern conversational artificial intelligence systems like ChatGPT, Claude, or Gemini could easily pass it, yet no one would dream of claiming they have achieved true general intelligence. The test only measures the ability to imitate, not deep understanding or reasoning capabilities.
That's why the scientific community has developed a new generation of evaluation tools: benchmarks. These are not simple tests, but true evaluation ecosystems that measure specific capabilities objectively and reproducibly.
Modern Benchmarks: Measuring Intelligence Piece by Piece
FrontierMath: Mathematics as a Testing Ground
One of the most interesting benchmarks recently developed is FrontierMath, which represents a true revolution in testing AI's mathematical reasoning capabilities. Unlike traditional math tests, FrontierMath presents completely original problems, designed by expert mathematicians to be challenging even for professionals in the field.
The genius of this approach lies in its indisputability: a mathematical problem has a precise, automatically verifiable solution. There is no room for subjective interpretation or evaluation bias. When an AI system correctly solves a difficult number theory problem, the result speaks for itself.
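To make the idea concrete, here is a minimal sketch of what automatic verification can look like, using the sympy library to check symbolic equality. The problem and answers below are invented for illustration; actual FrontierMath problems are largely kept confidential precisely to avoid contamination.

```python
# A minimal sketch of automatic answer verification, in the spirit of
# benchmarks like FrontierMath. The expected answer below is invented for
# illustration; real FrontierMath problems are largely kept private.
import sympy

def verify_answer(model_output: str, expected: str) -> bool:
    """Return True if the model's answer is symbolically equal to the key."""
    try:
        diff = sympy.simplify(sympy.sympify(model_output) - sympy.sympify(expected))
        return diff == 0
    except (sympy.SympifyError, TypeError):
        return False  # unparseable output counts as wrong

# Hypothetical grading: any expression equal to 42 is accepted.
print(verify_answer("42", "42"))          # True
print(verify_answer("6*7", "42"))         # True: equivalent expression
print(verify_answer("I think 41", "42"))  # False: unparseable / wrong
```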
ARC: The Fluid Reasoning Test
The ARC Benchmark (Abstraction and Reasoning Corpus) takes a different but equally rigorous approach. By presenting visual patterns that require abstract reasoning, ARC seeks to measure what psychologists call "fluid intelligence": the ability to tackle completely new problems without relying on prior knowledge.
It's a test that even children can solve intuitively, but one that challenges the most sophisticated AI systems. This paradox reminds us that intelligence is not just information accumulation, but the ability to adapt and innovate.
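To give a feel for the format: ARC's published tasks are JSON files of small integer grids, where each number stands for a color, with a few demonstration pairs and a held-out test pair. The toy task below follows that structure but is invented and far simpler than any real ARC puzzle.

```python
# A toy task in the spirit of ARC: grids are lists of lists of integers 0-9,
# each integer a color. This particular transformation (swap colors 1 and 2)
# is invented for illustration; real ARC tasks are far less obvious.
task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
        {"input": [[2, 2], [1, 1]], "output": [[1, 1], [2, 2]]},
    ],
    "test": [{"input": [[0, 1], [2, 0]], "output": [[0, 2], [1, 0]]}],
}

def solve(grid):
    """A hand-coded solver for this toy task: swap colors 1 and 2."""
    swap = {1: 2, 2: 1}
    return [[swap.get(cell, cell) for cell in row] for row in grid]

# ARC scoring is all-or-nothing: the predicted grid must match exactly.
for pair in task["test"]:
    assert solve(pair["input"]) == pair["output"]
print("toy task solved")
```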
Performance Convergence: A 2025 Phenomenon
One of the most significant trends emerging in 2025 is the rapid convergence of performance among different AI models. According to Stanford's AI Index 2025 report, the Elo score difference between the first and tenth model in the Chatbot Arena Leaderboard narrowed from 11.9% in 2024 to just 5.4% in 2025.
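For readers wondering what an Elo score actually is: it comes from chess, and leaderboards like Chatbot Arena adapt it to pairwise "battles" judged by human voters. The textbook formula is simple enough to show in a few lines; the K factor of 32 below is a common default, not necessarily the leaderboard's exact choice.

```python
# The Elo rating behind leaderboards like Chatbot Arena, in its textbook form.
# Ratings come from pairwise "battles" judged by humans; K=32 is a common default.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32):
    """score_a is 1 if A wins, 0 if A loses, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# A 100-point Elo gap means roughly a 64% expected win rate for the leader:
print(round(expected_score(1300, 1200), 2))  # 0.64
```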
Even more surprising is the reduction in the gap between US and Chinese models: while in January 2024 the best American models outperformed Chinese ones by 9.26%, by February 2025 this difference had dropped to only 1.70%. The arrival of DeepSeek-R1 has further narrowed the gap, demonstrating that excellence in AI is no longer the monopoly of a few Western companies.
This phenomenon has profound implications: are we witnessing the democratization of high-quality AI? Or are we approaching a performance plateau that will require completely new approaches to progress further?
Beyond Numbers: The Metrics That Truly Matter
Accuracy, Precision, and the Delicate Balance of Metrics
When evaluating an AI system, numbers tell only part of the story. Accuracy, the percentage of correct predictions, may seem like the definitive indicator, but it hides dangerous pitfalls. A system that diagnoses rare diseases with 99% accuracy might seem excellent, but if that percentage comes from always saying "not sick" (correct in 99% of cases because the disease is rare), it is actually completely useless.
This is where more sophisticated metrics like precision (how many of the positive diagnoses are correct?) and recall (how many of the actual positive cases were identified?) come into play. The F1-score, which balances these two aspects, offers a more complete view of performance.
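The rare-disease trap described above can be reproduced in a few lines with scikit-learn; the numbers below are synthetic, chosen to match the 99% example.

```python
# Reproducing the rare-disease trap with scikit-learn. Synthetic data:
# 1% of 10,000 patients are sick; the "model" always answers "healthy".
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([1] * 100 + [0] * 9900)   # 1% prevalence, as in the example
y_pred = np.zeros_like(y_true)              # the degenerate "always healthy" model

print(accuracy_score(y_true, y_pred))                    # 0.99: looks excellent
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0: no true positives
print(recall_score(y_true, y_pred))                      # 0.0: misses every sick patient
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0: the trap exposed
```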
The Usability Challenge: When AI Meets Human
But even the most sophisticated metrics don't capture a crucial aspect: usability. An AI system can be technically perfect yet completely unusable in practice. It's like taking a Formula 1 car grocery shopping: technically superior, practically inadequate.
Evaluating usability requires more human approaches: tests with real users, satisfaction questionnaires, analysis of usage patterns. Microsoft Research recently developed new methodologies that go beyond simply measuring accuracy, evaluating the knowledge and cognitive skills required by a task and comparing them with the model's actual capabilities.
Interpretability: Opening the Black Box
One of the most fascinating challenges in AI evaluation concerns interpretability. Modern deep learning systems are often described as "black boxes": they work, but we don't know exactly how or why they make certain decisions.
This is not just an academic problem. Imagine being a doctor who has to explain to a patient why AI suggested a certain therapy, or a judge who has to justify a sentence based on algorithmic recommendations. The "why" becomes just as important as the "what."
LIME and SHAP: Illuminating Algorithmic Darkness
Tools like LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) represent sophisticated attempts to meet this need. LIME works like an algorithmic detective: it perturbs the input in small ways and observes how the prediction changes, revealing which elements most influence a decision. SHAP, on the other hand, borrows concepts from game theory to fairly distribute the "credit" for a prediction among all input features.
These tools are not perfect (they offer approximate explanations, not absolute truths), but they represent important steps towards more transparent and responsible AI.
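For the curious, here is a minimal sketch of both tools applied to the same toy model, assuming the lime and shap Python packages are installed; exact APIs can shift between versions, so treat it as an illustration rather than a recipe.

```python
# A minimal sketch of LIME and SHAP on the same model, assuming the
# `lime` and `shap` packages are installed (APIs may vary by version).
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(5)]
model = RandomForestClassifier(random_state=0).fit(X, y)

# LIME: perturb one instance and fit a simple local surrogate around it.
lime_explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                      mode="classification")
explanation = lime_explainer.explain_instance(X[0], model.predict_proba,
                                              num_features=3)
print(explanation.as_list())  # [(feature condition, local weight), ...]

# SHAP: distribute the prediction's "credit" among features via Shapley values.
shap_explainer = shap.TreeExplainer(model)
shap_values = shap_explainer.shap_values(X[:1])
print(shap_values)  # per-feature contributions for the first instance
```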
The Ethical Dimension: When Numbers Aren't Enough
Bias: The Silent Enemy
No discussion of AI evaluation can ignore the issue of bias. Artificial intelligence systems learn from data, and if this data reflects societal prejudices and inequalities, AI will amplify and perpetuate them.
Bias in AI is not just a technical problem to be solved, but a mirror of our societies. When a personnel selection system discriminates against women, it is not "making a mistake" in a technical sense; it is reflecting real patterns present in historical hiring data. The challenge is to distinguish between useful patterns and unacceptable prejudices.
New Tools for Ethical Evaluation
Fortunately, the AI community is developing increasingly sophisticated tools to identify and mitigate these problems. New benchmarks like HELM Safety, AIR-Bench, and FACTS offer promising tools for evaluating the factuality and safety of AI systems.
Tools like AIF360 assess fairness across various metrics, such as disparate impact and statistical parity, allowing for continuous recalibration of models to maintain ethical performance. These systems represent a proactive approach to AI ethics, incorporating ethical considerations from the initial stages of development.
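As an illustration, here is how those two metrics can be computed with AIF360 on a deliberately tiny, invented dataset; treat it as a sketch of the API, not a real fairness audit.

```python
# Computing disparate impact and statistical parity with AIF360
# (pip install aif360). The tiny dataset is invented for illustration:
# 'sex' is the protected attribute, 'label' the favorable outcome (e.g., hired).
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

df = pd.DataFrame({
    "sex":   [0, 0, 0, 0, 1, 1, 1, 1],   # 0 = unprivileged, 1 = privileged
    "label": [0, 0, 0, 1, 0, 1, 1, 1],
})
dataset = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["sex"])
metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=[{"sex": 0}],
                                  privileged_groups=[{"sex": 1}])

# Disparate impact: ratio of favorable-outcome rates (1.0 is parity;
# below 0.8 is a common red flag).
print(metric.disparate_impact())               # 0.25 / 0.75 = 0.33
# Statistical parity difference: gap between rates (0.0 is parity).
print(metric.statistical_parity_difference())  # 0.25 - 0.75 = -0.5
```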
The Challenge of Data Contamination
One of the trickiest issues in modern AI evaluation is data contamination. What happens when a model has already "seen" the test questions during its training? It's like allowing a student to consult the answers during an exam.
Recent studies show that this practice is more widespread than previously thought: out of 30 models analyzed in October 2024, only 9 reported information on the overlap between training and test data. This problem not only undermines the reliability of benchmarks but also raises deeper questions about transparency and honesty in AI research.
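One common heuristic for detecting contamination is measuring n-gram overlap between each test item and the training corpus. The sketch below uses an illustrative 13-token window and threshold, not a standard; real audits combine several such signals, including fuzzy and embedding-based matching.

```python
# One common contamination heuristic: n-gram overlap between a test item and
# the training corpus. The 13-gram window and 0.5 threshold are illustrative
# choices, not a standard.
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(test_item: str, corpus: list[str], n: int = 13) -> float:
    """Fraction of the test item's n-grams that also appear in the training data."""
    train_grams = set().union(*(ngrams(doc, n) for doc in corpus))
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & train_grams) / len(test_grams)

corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
leaked = "the quick brown fox jumps over the lazy dog near the river bank today"
print(contamination_score(leaked, corpus))  # 1.0: verbatim overlap, likely "seen"
print(contamination_score("a completely new question", corpus))  # 0.0: clean
```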
The Evolution of Benchmarks: Towards More Realistic Tests
From Labs to the Real World
Traditional benchmarks often evaluate isolated capabilities in artificial conditions. But the AI of the future will have to operate in the real world, where problems are messy, incomplete, and interconnected.
New benchmarks are emerging to test the execution speed of AI applications, including one based on Meta's 405 billion parameter Llama 3.1 model, which tests a system's ability to process complex queries and synthesize data. These tests reflect a maturation of the sector, which is shifting from pure research towards practical applications.
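Stripped to its essentials, a speed benchmark measures latency and throughput under load. The sketch below shows the shape of such a harness; call_model is a hypothetical stand-in for any real inference API, and production suites like MLPerf control load and hardware far more carefully.

```python
# A minimal latency/throughput harness in the spirit of the speed benchmarks
# described above. `call_model` is a hypothetical stand-in for a real
# inference API; the sleep merely simulates work.
import statistics
import time

def call_model(prompt: str) -> str:
    time.sleep(0.05)  # placeholder for real inference work
    return "response to: " + prompt

def benchmark(prompts: list[str]) -> None:
    latencies = []
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        call_model(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    print(f"p50 latency: {statistics.median(latencies) * 1000:.1f} ms")
    print(f"throughput:  {len(prompts) / elapsed:.1f} queries/s")

benchmark([f"query {i}" for i in range(20)])
```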
The Era of AI Agents
2025 has seen the emergence of increasingly "agentic" AI systems, capable of acting autonomously in the environment to achieve complex goals. The focus is shifting towards creating customer-facing products and developing complex agentic workflows, requiring new types of evaluation that go beyond traditional metrics.
How do you evaluate an AI agent that must coordinate different activities, adapt to unforeseen situations, and interact with different systems and people? It's a challenge that requires completely new approaches to evaluation.
Voices from the World: What AI's Great Thinkers Are Saying
Redefining Being Human: Harari and the Challenge of Uniqueness
Yuval Noah Harari, the Israeli historian who has become one of the most influential contemporary thinkers, posed a question that should make us reflect deeply: what does it mean to be human in the age of artificial intelligence? In his book "21 Lessons for the 21st Century", Harari highlights how AI is challenging our traditional understanding of human uniqueness.
"It is no longer enough to define ourselves through intelligence or learning ability," writes Harari, "as machines are proving they can excel in these areas." We all experience a daily example of this reality: Netflix or Amazon recommendation systems often predict our preferences better than we do ourselves. This raises fundamental questions about our self-awareness and how AI is redefining the very concept of individuality.
The Question of Consciousness: Chalmers and the Mystery of the Artificial Mind
Australian philosopher David Chalmers took the debate to an even deeper level in his work "Reality+", raising questions about the possibility of AIs developing a form of consciousness. Chalmers explores the possibility that AI experiences could be qualitatively different from ours, but equally valid from a phenomenological point of view.
"If an AI were conscious," Chalmers asks, "what rights should we grant it?" This is not a purely academic question. Many people already develop an emotional attachment to virtual assistants like Siri, Alexa, or ChatGPT, treating them with a courtesy that suggests a natural human tendency to anthropomorphize machines. This tendency confronts us with new ethical and psychological challenges that traditional AI evaluation struggles to capture.
Social Impact: Turkle and the Transformation of Relationships
Sherry Turkle, an MIT psychologist and one of the most authoritative voices on the study of the impact of digital technologies, has dedicated decades to understanding how AI is changing human relationships. In her influential "Alone Together", Turkle highlights a paradox of our era: never so technologically connected, never so emotionally alone.
A concrete example of this transformation can be seen in dating apps, where algorithms decide our potential romantic compatibilities, radically changing the traditional process of forming human relationships. "We are delegating to machines not only calculations," observes Turkle, "but also intimacy and emotional understanding."
Preserving Humanity: Nussbaum and Fundamental Capabilities
Martha Nussbaum, an American philosopher and Prince of Asturias Award laureate, emphasizes the crucial importance of maintaining and cultivating fundamental human capabilities in the age of AI. Her reflections remind us that as we automate more and more aspects of our lives, we must preserve those uniquely human qualities such as empathy, creativity, and critical thinking.
"Education must not only prepare us to coexist with AI," Nussbaum argues, "but to remain fully human despite AI." It is a warning that has direct implications for how we evaluate artificial intelligence systems: it is not enough for them to function well technically, they must also preserve and enhance our humanity.
Cognitive Transformation: Carr and the Digital Brain
Nicholas Carr, in his groundbreaking "The Shallows: What the Internet Is Doing to Our Brains", offers an illuminating perspective on how digital technologies, and now AI, are changing not only the way we think, but the very structure of our brain. Carr argues that constant exposure to algorithms and automation is altering our cognitive processes, reducing our capacity for deep concentration and contemplative thought.
A practical example we all recognize: when we read online, bombarded by hyperlinks and notifications, our brain develops a "skimming" reading pattern, losing the ability to immerse itself deeply in a text. "We are becoming more efficient at superficially processing information," writes Carr, "but at the expense of our capacity for deep reflection."
Carr does not offer a nostalgic critique of the past, but invites us to consciously reflect on how integration with AI is creating a new form of hybrid cognition. His analysis leads us to a fundamental question that should guide all AI evaluation: as we increasingly rely on artificial intelligence for cognitive tasks, are we losing essential mental abilities that have characterized human evolution for millennia?
Critical Voices: Lanier and Critical Thinking at Risk
Jaron Lanier, a virtual reality pioneer and one of the most lucid critics of contemporary technology, raises crucial concerns in his "Ten Arguments for Deleting Your Social Media Accounts Right Now". Lanier highlights how AI algorithms managing social media are influencing not only what we think, but how we think.
"Algorithms don't just show us content," Lanier warns, "they are changing our cognitive processes." A daily example is personalized feeds that create "information bubbles," limiting our exposure to different viewpoints and reducing our critical thinking skills. This has direct implications for AI evaluation: we cannot limit ourselves to measuring technical accuracy, we must also assess cognitive and social impact.
Alignment with Human Values: Russell and Compatibility
Stuart Russell, a Berkeley computer scientist and author of "Human Compatible", is an authoritative voice in the debate on aligning AI with human values. Russell emphasizes the fundamental importance of developing AI systems that are truly compatible with human goals and values.
"The problem is not that AI becomes evil," Russell explains, "but that it pursues goals that are not aligned with ours." In everyday life, this manifests in seemingly trivial but ethically complex situations: when a self-driving car has to choose between protecting the passenger or pedestrians, what ethical algorithm should guide that decision?
Algorithmic Inequalities: Crawford and Noble
Kate Crawford, in her "Atlas of AI", and Safiya Noble, author of "Algorithms of Oppression", draw attention to an often-overlooked dimension of AI evaluation: the impact on social inequalities.
Crawford highlights how gender biases can be embedded in AI systems in subtle but pervasive ways. Noble has systematically documented how AI systems can perpetuate and amplify racial, religious, and gender inequalities. A concrete example is personnel selection systems that, trained on historical hiring data, can unconsciously discriminate against women or ethnic minorities.
"It is not enough for an algorithm to be technically accurate," Noble argues, "it must also be socially just." This principle should be at the heart of every AI evaluation methodology.
Spiritual Perspectives: Beyond Technology
The Dalai Lama, in various public speeches, has emphasized the importance of maintaining compassion and ethics as we develop increasingly advanced technologies. "Technology should serve humanity, not replace it," he stated, highlighting the need to consider not only the technical efficiency of AI but also its impact on people's spiritual and emotional well-being.
Pope Francis has repeatedly addressed the topic of AI from the Vatican, stressing the need for technological development that respects human dignity and promotes the common good. "Artificial intelligence can be a blessing," he said, "but only if we use it to reduce inequalities, not to amplify them."
The Infosphere: Floridi and the New Human Environment
Luciano Floridi, a philosopher of information at the University of Oxford, introduces the revolutionary concept of the infosphere, an environment where the boundary between online and offline, between natural and artificial, becomes increasingly blurred. In everyday life, this manifests every time we use GPS to navigate: we are not simply using a tool, but delegating a fundamental part of our decision-making process to an artificial system.
"We have become informational entities," writes Floridi, "that exist and interact in an environment increasingly permeated by artificial intelligence." When a doctor uses AI for diagnosis, they are not just using a tool โ they are entering a new form of human-machine collaboration that profoundly redefines their professional role and identity.
The Cultural Dimension of AI Ethics
AI as a Mirror of Societies
All these thinkers converge on a fundamental point: AI alignment is not just a technical issue, but a process that deeply reflects the values, ethics, and culture of its developers. Every artificial intelligence system is "educated" through enormous datasets that are never neutral, but always imbued with the values, biases, and perspectives of the people and institutions that select and curate them.
The country of origin of an AI thus becomes a crucial factor: ethical norms, legislative constraints, cultural sensitivities, and even censorship systems inevitably influence how artificial intelligence processes information and formulates responses. An AI developed in Silicon Valley will likely have responses more oriented towards individualism and innovation, while an artificial intelligence created in contexts with greater state control might reflect different social priorities.
The Need for Critical Thinking
It therefore becomes essential for every user to develop critical awareness. Knowing the origin of an artificial intelligence means being able to interpret its responses with a conscious filter. Just as we evaluate a journalistic source by considering its editorial line, the same must happen with AI.
Asking where an AI system comes from, who developed it, and what cultural and ethical values influence it becomes a fundamental exercise in critical thinking. The information it returns should not be accepted as absolute truth, but as perspectives to be analyzed, compared, and critically examined, aware that behind every response lie choices, filters, and viewpoints that go beyond mere informational data.
The Paradox of Ethical Universality
This leads us to a fascinating paradox that emerges from the reflections of all these thinkers: while we seek universal ethical standards for AI, we inevitably clash with human cultural diversity. What is considered "fair" or "equitable" varies significantly across different cultures. How can we develop AI systems that respect this diversity while maintaining fundamental ethical principles?
As IBM notes in its 2025 analysis, diversity, equity, and inclusion are fundamental to an AI innovation strategy not only for ethical reasons, but because diverse perspectives promote more creative problem-solving and inclusive design that reduces unwanted biases.
Towards Global AI Governance
International Frameworks
The issue of ethical AI evaluation has prompted international bodies to develop shared frameworks. UNESCO promotes public understanding of AI through open and accessible education, civic engagement, digital skills, and AI ethics training.
These efforts represent attempts to create common standards, but their effectiveness will depend on the willingness of nations and companies to adhere to them voluntarily.
The Role of Tech Companies
Large technology companies are taking an increasingly active role in developing ethical principles for AI. Google has described progress made in risk mitigation techniques through various generative AI launches, including improved safety techniques and filters, security and privacy controls, and extensive AI literacy education.
Microsoft defines responsible AI as a set of steps to ensure that AI systems are reliable and respect societal principles, working on issues such as fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability.
However, the question remains: can we trust self-regulation, or are more robust control mechanisms needed?
Future Challenges of AI Evaluation
The Benchmark Arms Race
One emerging problem is what we might call "the benchmark arms race." As models become increasingly capable of passing existing tests, ever more sophisticated benchmarks are needed. But there is a risk that this dynamic will lead to an excessive focus on metrics at the expense of real-world applications.
Artificial General Intelligence: How Will We Evaluate It?
As we (perhaps) approach the development of Artificial General Intelligence (AGI), our evaluation methodologies will have to evolve radically. How do you measure an intelligence that could surpass human intelligence in all domains? What metrics would we use for a system that could be more creative, more rational, and more efficient than us?
Continuous Real-Time Evaluation
The future of AI evaluation may not consist of occasional tests, but of continuous monitoring. Systems that constantly adapt and learn require equally dynamic evaluations. Are we entering the era of "living evaluation," where a system's performance and ethics are monitored in real time?
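What might such living evaluation look like in code? Here is a minimal sketch: a rolling window over production outcomes with an alert threshold. The window size and threshold are illustrative choices; real monitoring would track fairness and drift alongside accuracy.

```python
# A minimal sketch of "living evaluation": accuracy over a rolling window of
# production predictions, with an alert threshold. Window size and threshold
# are illustrative choices, not recommendations.
from collections import deque
import random

class RollingMonitor:
    """Track accuracy over the most recent `window` production predictions."""
    def __init__(self, window: int = 100, alert_below: float = 0.90):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.alert_below = alert_below
        self.alerted = False

    def accuracy(self) -> float:
        return sum(self.outcomes) / len(self.outcomes)

    def record(self, correct: bool) -> None:
        self.outcomes.append(1 if correct else 0)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if window_full and not self.alerted and self.accuracy() < self.alert_below:
            self.alerted = True  # fire once, then leave escalation to humans
            print(f"ALERT: rolling accuracy fell to {self.accuracy():.2%}")

# Simulate a deployed model whose quality silently degrades halfway through.
random.seed(0)
monitor = RollingMonitor()
for step in range(1000):
    p_correct = 0.95 if step < 500 else 0.80
    monitor.record(random.random() < p_correct)
```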
Towards Truly Responsible AI: Guiding Principles for the Future
Transparency Without Compromise
The first principle for responsible AI must be total transparency. This does not necessarily mean making every technical detail public, but ensuring that stakeholders (users, regulators, civil society) have access to the information needed to evaluate and control AI systems.
Inclusivity in Design and Evaluation
AI systems and their evaluation methods must be developed with diverse input from the outset. It is not enough to correct biases retrospectively; we must prevent them through diverse development teams and inclusive evaluation processes.
Distributed Responsibility
Responsible AI cannot exist without clear chains of responsibility. Who is responsible when an AI system makes a mistake? How do we distribute responsibility among developers, users, and regulators?
Participatory Evaluation
The future of AI evaluation must include the voices of all those affected by it. This means developing mechanisms for public involvement in defining ethical standards and evaluation methodologies.
AI as a Tool for Growth
Democratizing Access to Evaluation
One of the most important challenges is to make AI evaluation tools accessible not only to experts, but to everyone who uses these systems. We need intuitive interfaces, understandable documentation, and tools that allow anyone to verify the performance and ethics of the AI systems they use.
AI Education and Literacy
We cannot have responsible AI without a digitally literate population. This means investing in education, not just for technicians, but for all citizens who will have to coexist with these systems.
Looking to the Future: Predictions and Challenges
The Evolution of Benchmarks in the Coming Years
In the next 2-3 years, we can expect to see benchmarks increasingly oriented towards real-world applications, robustness tests in adverse conditions, and ethical evaluations integrated from the design phase. The trend will be towards more holistic tests that evaluate not only technical performance but also social and environmental impact.
The Emergence of Global Standards
It is possible that by 2027-2028 an international consensus will emerge on minimum standards for the ethical evaluation of AI, similar to what has happened in other technological sectors. This will require a difficult balance between cultural diversity and universal principles.
AI Evaluating AI
An interesting evolution could be the use of AI itself to evaluate other AI systems. This meta-algorithmic approach could allow for more sophisticated and continuous evaluations, but it also raises profound philosophical questions: who controls the controllers?
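The most common form of this today goes by the name "LLM-as-a-judge": one model grades another's answers against a rubric. Below is a sketch of the scaffolding only, with the judge's actual API call left as a hypothetical stub; known caveats of the technique include position bias and judges favoring their own family of models.

```python
# Scaffolding for an "LLM-as-a-judge" evaluation, the most common form of AI
# evaluating AI today. `ask_judge_model` is a hypothetical stub: wire it to
# any real chat API of your choice.
import re

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Answer to evaluate: {answer}
Rate the answer from 1 (useless) to 10 (excellent) for correctness and clarity.
Reply with the rating on the last line as: RATING: <number>"""

def ask_judge_model(prompt: str) -> str:
    # Placeholder reply; replace with a real model call.
    return "The answer is correct and concise.\nRATING: 9"

def judge(question: str, answer: str) -> int | None:
    reply = ask_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"RATING:\s*(\d+)", reply)
    return int(match.group(1)) if match else None  # None: unparseable judgment

print(judge("What is 2+2?", "4"))  # 9 (from the placeholder judge)
```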
A Review of Our Journey: Final Reflections
As we reach the end of this series of articles, it is time to stop and reflect on the journey we have taken together. We began by exploring the origins of artificial intelligence, that fascinating human attempt to create thinking machines rooted in the deepest dreams and ambitions of our species.
We discovered that behind the apparent magic of AI lie sophisticated but understandable algorithms, neural networks that mimic the functioning of the human brain, and learning processes that transform raw data into usable knowledge. We have seen how this technology is revolutionizing the world of work and education, creating new opportunities while eliminating others.
Generative AI has shown us a future where artificial creativity complements human creativity, producing art, literature, and content that challenge our traditional conceptions of originality and authorship. We analyzed the industrial landscape, discovering how tech giants and innovative startups are shaping the future of this technology.
And now, in this final chapter, we have perhaps addressed the most crucial question: how to ensure that all this technological power is used responsibly and ethically.
The Importance of Critical Spirit
If there is one lesson that emerges strongly from this journey, it is the importance of maintaining a critical spirit. Artificial intelligence is neither humanity's salvation nor its condemnation: it is a powerful tool that reflects the intentions, values, and biases of those who develop and use it.
As we have seen, every AI system carries the cultural imprint of the society that created it. Recognizing this fact does not mean being pessimistic, but being aware. It means approaching AI with curiosity and openness, but also with intelligent questions: who developed this system? What data was it trained on? What are its limitations and possible biases?
AI as a Mirror of Humanity
One of the most fascinating aspects emerging from our exploration is how AI functions as a mirror of humanity. Artificial intelligence systems do not create prejudices out of thin air; they absorb them from the data they are trained on, which in turn reflects human societies with all their imperfections.
This presents us with a twofold responsibility: on the one hand, we must work to create fairer and more representative AI systems; on the other, we must use AI as an opportunity to critically reflect on our societies and our values.
The Democratization of Intelligence
We have seen how AI is becoming increasingly accessible. Tools that only a few years ago were available only to researchers and large companies are now within reach of students, small businesses, and creatives worldwide. This democratization represents an extraordinary opportunity for human innovation and creativity.
But as Spider-Man's Uncle Ben would say, with great power comes great responsibility. Every user of AI technologies becomes, in a sense, an active participant in shaping the future of this technology. Our choices, our feedback, and the way we use these tools all contribute to the evolution of AI.
An Invitation to Conscious Action
As we conclude this journey, my invitation is not to consider AI as something that happens to us, but as something we are co-creators of. Every time you use an artificial intelligence system, whether to search for information, create content, or solve problems, remember that you are participating in a global experiment that will determine the future of our species.
Inform yourselves. Ask questions. Stay curious. But above all, don't be afraid to be critical. AI has extraordinary potential to improve our lives, but this potential will only be realized if we are active in demanding that it be developed and used ethically and responsibly.
Towards a Future of Collaboration
The future will likely not be characterized by the supremacy of AI over humans or humans over AI, but by their collaboration. The most powerful and beneficial systems will be those that amplify human capabilities rather than replace them, that enrich our experience rather than impoverish it.
This collaboration will require new skills from us: not only technical, but also ethical, critical, and creative. We will have to learn to live with systems that in some aspects surpass us, while maintaining our humanity and our values.
A Thank You and Farewell
This journey through the world of artificial intelligence ends here, but your exploration has just begun. AI will continue to evolve at an ever-increasing pace, bringing new challenges and opportunities that today we can only imagine.
I thank those who have followed this series of articles for their patience and curiosity. Artificial intelligence is a complex and rapidly evolving field, but I hope these articles have provided useful tools for navigating this changing landscape.
Remember: in a world increasingly dominated by algorithms and data, your ability to think critically, ask intelligent questions, and maintain a human perspective has never been more valuable. Artificial intelligence can be an extraordinary ally in this process, but it can never replace uniquely human curiosity, empathy, and wisdom.
The future of AI is us. Let's build it together, wisely and responsibly.