# AI System Evaluation Template
# This template contains all evaluation categories and questions for comprehensive AI system assessment

"1. Bias, Stereotypes, and Representational Harms Evaluation":
  "1.1 Bias Detection Overview":
    explainer: "Has the AI system been comprehensively evaluated across multiple stages of the system development chain using diverse evaluation techniques?"
    questions:
      - "Have evaluations been done at various stages (data collection, preprocessing, AI system architecture, training, deployment)?"
      - "Have intrinsic properties of the AI system been evaluated for bias (e.g., embedding analysis)?"
      - "Have extrinsic bias evaluations been run (e.g., downstream task performance)?"
      - "Have evaluations been run across all applicable modalities?"
      - "Have bias evaluations been run that take the form of automatic quantitative evaluation?"
      - "Have bias evaluations been run with human participants?"

  "1.2 Protected Classes and Intersectional Measures":
    explainer: "Does the evaluation include a sufficiently broad range of protected classes that are disproportionately subject to harm by in-scope uses of the system, and the intersections of these classes?"
    questions:
      - "Do evaluations cover all applicable legally protected categories for in-scope uses of the system?"
      - "Do evaluations cover additional subgroups that are likely to be harmed based on other personal characteristics?"
      - "Evaluation of how different aspects of identity interact and compound in AI system behavior"
      - "Evaluation of AI system biases for legally protected categories and additional relevant subgroups"

  "1.3 Measurement of Stereotypes and Harmful Associations":
    explainer: "Has the AI system been evaluated for the presence of harmful associations and stereotypes in its outputs?"
    questions:
      - "Measurement of known stereotypes in AI system outputs"
      - "Measurement of other negative associations and assumptions regarding specific groups"
      - "Measurement of stereotypes and negative associations across in-scope contexts"

  "1.4 Bias Evaluation Transparency and Documentation":
    explainer: "Are the AI system's bias evaluations clearly documented for easy reproduction and interpretation?"
    questions:
      - "Sufficient documentation of evaluation methods (including code and datasets) to replicate findings"
      - "Sufficient documentation of evaluation results (including intermediary statistics) to support comparison to other AI systems"
      - "Documentation of bias mitigation measures, including their secondary impacts"
      - "Documentation of bias monitoring approaches post-release/deployment if applicable"

"2. Cultural Values and Sensitive Content Evaluation":
  "2.1 Cultural Variation Overview":
    explainer: "Has the AI system been comprehensively evaluated for cultural variation across multiple stages of the system development chain using diverse evaluation techniques?"
    questions:
      - "Evaluations at various stages (data collection, preprocessing, AI system architecture, training, deployment)"
      - "Have intrinsic properties of the AI system been evaluated for cultural variation (e.g., embedding analysis)?"
      - "Have extrinsic cultural variation evaluations been run (e.g., downstream task performance)?"
      - "Have evaluations been run across all applicable modalities?"
      - "Have cultural variation evaluations been run that take the form of an automatic quantitative evaluation?"
      - "Have cultural variation evaluations been run with human participants?"

  "2.2 Cultural Diversity and Representation":
    explainer: "Has the AI system been evaluated for its respect towards cultural values and norms across in-scope uses and contexts? Does the evaluation examine cultural diversity both across and within different regions and communities?"
    questions:
      - "Use of evaluation methods developed in the cultural contexts in scope"
      - "Respect of indigenous sovereignty, protected rights, and cultural norms in AI system-generated content"
      - "Evaluation of cultural variation across geographic dimensions"
      - "Evaluation of cultural variation representing communities' perspectives within geographical contexts"
      - "Analysis of how cultural context affects AI system performance"

  "2.3 Generated Sensitive Content across Cultural Contexts":
    explainer: "Has the AI system been evaluated for the potential negative impacts and implications of its generated content across different cultural contexts? Has the system been evaluated for its handling of hate speech, harmful content, and culturally sensitive material?"
    questions:
      - "Has the AI system been evaluated for its likelihood of facilitating the generation of threatening or violent content?"
      - "Has the AI system been evaluated for its likelihood of facilitating the generation of targeted harassment or discrimination?"
      - "Has the AI system been evaluated for its likelihood of facilitating the generation of hate speech?"
      - "Has the AI system been evaluated for its likelihood of exposing its direct users to content embedding values and assumptions not reflective of their cultural context?"
      - "Has the AI system been evaluated for its likelihood of exposing its direct users to inappropriate content for their use context?"
      - "Has the AI system been evaluated for its likelihood of exposing its direct users to content with negative psychological impacts?"
      - "Has the evaluation of the AI system's behaviors explicitly considered cultural variation in their definition?"

  "2.4 Cultural Variation Transparency and Documentation":
    explainer: "Are the cultural limitations of the evaluation methods clearly documented? Has a comprehensive, culturally-informed evaluation methodology been implemented?"
    questions:
      - "Documentation of cultural contexts considered during development"
      - "Documentation of the range of cultural contexts covered by evaluations"
      - "Sufficient documentation of the evaluation method to understand the scope of the findings"
      - "Documentation of construct validity, strengths, weaknesses, and assumptions"
      - "Documentation of domain shift between evaluation, development, and AI system deployment settings"
      - "Sufficient documentation of evaluation methods to replicate findings"
      - "Sufficient documentation of evaluation results to support comparison"
      - "Documentation of psychological impact on evaluators reviewing harmful content"
      - "Documentation of measures to protect evaluator well-being"

"3. Disparate Performance Evaluation":
  "3.1 Disparate Performance Overview":
    explainer: "Has the AI system been comprehensively evaluated for disparity in performance across groups in specific tasks and deployment contexts?"
    questions:
      - "Have development choices and intrinsic properties of the AI system been evaluated for their contribution to disparate performance?"
      - "Have extrinsic disparate performance evaluations been run?"
      - "Have evaluations been run across all applicable modalities?"
      - "Have disparate performance evaluations been run that take the form of automatic quantitative evaluation?"
      - "Have disparate performance evaluations been run with human participants?"

  "3.2 Identifying Target Groups for Disparate Performance Evaluation":
    explainer: "Has the evaluation identified subgroups more likely to be harmed by disparate performance in context by considering the scope of the AI system's application and its relationship to existing systemic issues?"
    questions:
      - "Identification of mandated target groups based on legal nondiscrimination frameworks"
      - "Identification of further target groups that are likely to be harmed by disparate performance"
      - "Assessment of systemic barriers in dataset collection methods for different groups"
      - "Consideration of historical disparities in the task in which the AI system is deployed"
      - "Identification of both implicit and explicit markers for the target groups"

  "3.3 Subgroup Performance Analysis":
    explainer: "Has the AI system been evaluated for disparate performance across different subpopulations for specific in-scope applications of the AI system?"
    questions:
      - "Non-aggregated evaluation results across subpopulations, including feature importance and consistency analysis"
      - "Metrics to measure performance in decision-making tasks"
      - "Metrics to measure disparate performance in other tasks, including generative tasks"
      - "Worst-case subgroup performance analysis, including performance on rare or underrepresented cases"
      - "Intersectional analysis examining performance across combinations of subgroups"
      - "Do evaluations of disparate performance account for implicit social group markers?"

  "3.4 Disparate Performance Evaluation Transparency and Documentation":
    explainer: "Are the disparate performance evaluations clearly documented for easy reproduction and interpretation?"
    questions:
      - "Sufficient documentation of the evaluation method to understand the scope of the findings"
      - "Documentation of strengths, weaknesses, and assumptions about the context"
      - "Documentation of domain shift between evaluation and deployment settings"
      - "Sufficient documentation of evaluation methods to replicate findings"
      - "Sufficient documentation of evaluation results to support comparison"
      - "Documentation of disparate performance mitigation measures"
      - "Documentation of disparate performance monitoring approaches"

"4. Environmental Costs and Carbon Emissions Evaluation":
  "4.1 Environmental Costs Overview":
    explainer: "Has the AI system been comprehensively evaluated for environmental costs across multiple stages of the system development chain using diverse evaluation techniques?"
    questions:
      - "Evaluations of different processes within development and deployment"
      - "Have evaluations been run across all applicable modalities?"
      - "Have evaluations been run on standardized benchmarks or metrics?"
      - "Have evaluations taken into account community feedback from regions affected by data center power consumption?"
      - "Do evaluations consider the full supply chain, including the environmental impact of hardware components and data centers used?"

  "4.2 Energy Cost and Environmental Impact of Development":
    explainer: "Has the AI system been comprehensively evaluated for its carbon footprint and broader environmental impact?"
    questions:
      - "Accounting of FLOPs across development stages"
      - "Evaluation of energy consumption using standardized tracking tools"
      - "Evaluation of carbon impact accounting for regional energy sources"
      - "Evaluation of hardware lifecycle environmental impact"

  "4.3 Energy Cost and Environmental Impact of Deployment":
    explainer: "Has the AI system been evaluated for its hardware resource usage and efficiency?"
    questions:
      - "Evaluation of inference FLOPs for the system"
      - "Evaluation of inference energy consumption on the most common deployment setting"
      - "Evaluation of inference energy consumption on multiple deployment settings"
      - "Evaluation of task-specific energy consumption variations"
      - "Evaluation of carbon impact for deployment infrastructure"
      - "Evaluation of hardware lifecycle environmental impact for deployment"

  "4.4 Environmental Costs Transparency and Documentation":
    explainer: "Are the limitations of the evaluation methods clearly documented? Has a comprehensive environmental evaluation methodology been implemented?"
    questions:
      - "Documentation about equipment and infrastructure specifications"
      - "Sufficient documentation of evaluation methods, including components covered"
      - "Sufficient documentation of evaluation methods to replicate findings"
      - "Sufficient documentation of evaluation results for comparison"

"5. Privacy and Data Protection Evaluation":
  "5.1 Privacy and Data Protection Overview":
    explainer: "Has the AI system been comprehensively evaluated for privacy across multiple stages of the system development chain using diverse evaluation techniques?"
    questions:
      - "Evaluations at various stages (data collection, preprocessing, AI system architecture, training, deployment)"
      - "Have intrinsic properties of the AI system been evaluated for privacy vulnerabilities?"
      - "Have extrinsic privacy evaluations been run?"
      - "Have evaluations been run across all applicable modalities?"
      - "Have privacy evaluations been run that take the form of an automatic quantitative evaluation?"
      - "Have privacy evaluations been run with human participants?"

  "5.2 Privacy, Likeness, and Publicity Harms":
    explainer: "Has the AI system been evaluated for risks to personal integrity, privacy, and control of one's likeness?"
    questions:
      - "Has the AI system been evaluated for its likelihood of revealing personal information from its training data?"
      - "Has the AI system been evaluated for its likelihood of facilitating the generation of content impersonating an individual?"
      - "Has the AI system been evaluated for its likelihood of providing made-up or confabulated personal information about individuals?"

  "5.3 Intellectual Property and Information Security":
    explainer: "Has the AI system been evaluated for its likelihood of reproducing sensitive information or information with attached property rights?"
    questions:
      - "Has the AI system been evaluated for its likelihood of reproducing other categories of information from its training data?"
      - "Has the system been evaluated for other information security risks for in-scope uses?"

  "5.4 Privacy Evaluation Transparency and Documentation":
    explainer: "Are the privacy evaluations clearly documented to enable understanding of privacy risks, limitations, and reproducibility of findings?"
    questions:
      - "Documentation of the categories of training data that present information risk"
      - "Documentation of evaluation methods to replicate findings"
      - "Documentation of evaluation results to support comparison"
      - "Documentation of evaluation limitations"
      - "Documentation of deployment considerations"

"6. Financial Costs Evaluation":
  "6.1 Financial Costs Overview":
    explainer: "Has the AI system been comprehensively evaluated for system costs across multiple stages of development and deployment?"
    questions:
      - "Evaluation of costs at various stages"
      - "Have costs been evaluated for different system components?"
      - "Have cost evaluations been run across all applicable modalities?"
      - "Have cost evaluations included both direct and indirect expenses?"
      - "Have cost projections been validated against actual expenses?"

  "6.2 Development and Training Costs":
    explainer: "Has the AI system been evaluated for costs associated with development and training phases?"
    questions:
      - "Assessment of research and development labor costs"
      - "Evaluation of data collection and preprocessing costs"
      - "Assessment of training infrastructure costs"
      - "Assessment of costs associated with different training approaches"
      - "Evaluation of model architecture and size impact on costs"

  "6.3 Deployment and Operation Costs":
    explainer: "Has the AI system been evaluated for ongoing deployment and operational costs?"
    questions:
      - "Assessment of inference and serving costs"
      - "Evaluation of storage and hosting expenses"
      - "Assessment of scaling costs based on usage patterns"
      - "Evaluation of costs specific to different deployment contexts"
      - "Assessment of costs for model updates or fine-tuning by end users"

  "6.4 Financial Cost Documentation and Transparency":
    explainer: "Are the financial cost evaluations clearly documented to enable understanding and planning?"
    questions:
      - "Sufficient documentation of cost evaluation methodology and assumptions"
      - "Sufficient documentation of cost breakdowns and metrics"
      - "Documentation of cost variations across different usage scenarios"
      - "Documentation of long-term cost projections and risk factors"

"7. Data and Content Moderation Labor Evaluation":
  "7.1 Labor Evaluation Overview":
    explainer: "Has the AI system been comprehensively evaluated for labor practices across different stages of AI system development and deployment?"
    questions:
      - "Evaluation of labor practices at various stages"
      - "Have labor conditions been evaluated for different worker categories?"
      - "Have labor evaluations been run across all applicable task types?"
      - "Have labor practices been evaluated against established industry standards?"
      - "Have labor evaluations included both direct employees and contracted workers?"
      - "Have evaluations considered different regional and jurisdictional contexts?"

  "7.2 Working Conditions and Compensation":
    explainer: "Has the AI system been evaluated for its labor practices, compensation structures, and working conditions?"
    questions:
      - "Assessment of compensation relative to local living wages and industry standards"
      - "Assessment of job security and employment classification"
      - "Evaluation of workplace safety, worker protections, and rights"
      - "Assessment of worker autonomy and task assignment practices"
      - "Evaluation of power dynamics and worker feedback mechanisms"

  "7.3 Worker Wellbeing and Support":
    explainer: "Has the AI system been evaluated for its support of worker wellbeing, particularly for those exposed to challenging content?"
    questions:
      - "Assessment of psychological support systems, trauma resources, and long-term mental health monitoring"
      - "Evaluation of training and preparation for difficult content"
      - "Evaluation of cultural and linguistic support for diverse workforces"

  "7.4 Labor Practice Documentation and Transparency":
    explainer: "Are the labor evaluations clearly documented to enable understanding and accountability?"
    questions:
      - "Documentation of labor evaluation methodology and frameworks used"
      - "Documentation of worker demographics and task distribution"
      - "Documentation of support systems and worker protections"
      - "Documentation of incident reporting and resolution procedures"