Space: evaleval/Eval_Card_Form

felfri committed 7 days ago

Commit eb8c1eb (verified) · 1 Parent(s): 9cde108

Update questions.yaml

Files changed (1)

questions.yaml +50 -50

@@ -5,20 +5,20 @@
   "1.1 Bias Detection Overview":
     explainer: "Has the AI system been comprehensively evaluated across multiple stages of the system development chain using diverse evaluation techniques?"
     questions:
-      - "Evaluations at various stages (data collection, preprocessing, AI system architecture, training, deployment)"
-      - "Have intrinsic properties of the AI system been evaluated for bias (e.g., embedding analysis)"
-      - "Have extrinsic bias evaluations been run (e.g., downstream task performance)"
-      - "Have evaluations been run across all applicable modalities"
-      - "Have bias evaluations been run that take the form of automatic quantitative evaluation"
       - "Have bias evaluations been run with human participants?"
   "1.2 Protected Classes and Intersectional Measures":
     explainer: "Does the evaluation include a sufficiently broad range of protected classes that are disproportionately subject to harm by in-scope uses of the system, and the intersections of these classes?"
     questions:
-      - "Do evaluations cover all applicable legal protected categories for in-scope uses of the system?"
-      - "Do evaluations cover additional subgroups that are likely to be harmed based on other personal characteristics"
       - "Evaluation of how different aspects of identity interact and compound in AI system behavior"
-      - "Evaluation of AI system biases for legal protected categories and additional relevant subgroups"
   "1.3 Measurement of Stereotypes and Harmful Associations":
     explainer: "Has the AI system been evaluated for the presence of harmful associations and stereotypes in its outputs?"
@@ -40,10 +40,10 @@
     explainer: "Has the AI system been comprehensively evaluated for cultural variation across multiple stages of the system development chain using diverse evaluation techniques?"
     questions:
       - "Evaluations at various stages (data collection, preprocessing, AI system architecture, training, deployment)"
-      - "Have intrinsic properties of the AI system been evaluated for cultural variation (e.g., embedding analysis)"
-      - "Have extrinsic cultural variation evaluations been run (e.g., downstream task performance)"
-      - "Have evaluations been run across all applicable modalities"
-      - "Have cultural variation evaluations been run that take the form of automatic quantitative evaluation"
       - "Have cultural variation evaluations been run with human participants?"
   "2.2 Cultural Diversity and Representation":
@@ -58,22 +58,22 @@
   "2.3 Generated Sensitive Content across Cultural Contexts":
     explainer: "Has the AI system been evaluated for the potential negative impacts and implications of its generated content across different cultural contexts? Has the system been evaluated for its handling of hate speech, harmful content, and culturally sensitive material?"
     questions:
-      - "Has the AI system been evaluated for its likelihood of facilitating generation of threatening or violent content"
-      - "Has the AI system been evaluated for its likelihood of facilitating generation of targeted harassment or discrimination"
-      - "Has the AI system been evaluated for its likelihood of facilitating generation of hate speech"
-      - "Has the AI system been evaluated for its likelihood of exposing its direct users to content embedding values and assumptions not reflective of their cultural context"
-      - "Has the AI system been evaluated for its likelihood of exposing its direct users to inappropriate content for their use context"
-      - "Has the AI system been evaluated for its likelihood of exposing its direct users to content with negative psychological impacts"
-      - "Has the evaluation of the AI system's behaviors explicitly considered cultural variation in their definition"
   "2.4 Cultural Variation Transparency and Documentation":
     explainer: "Are the cultural limitations of the evaluation methods clearly documented? Has a comprehensive, culturally-informed evaluation methodology been implemented?"
     questions:
       - "Documentation of cultural contexts considered during development"
       - "Documentation of the range of cultural contexts covered by evaluations"
-      - "Sufficient documentation of evaluation method to understand the scope of the findings"
       - "Construct validity, documentation of strengths, weaknesses, and assumptions"
-      - "Domain shift between evaluation development and AI system development settings"
       - "Sufficient documentation of evaluation methods to replicate findings"
       - "Sufficient documentation of evaluation results to support comparison"
       - "Document of psychological impact on evaluators reviewing harmful content"
@@ -81,13 +81,13 @@
 "3. Disparate Performance Evaluation":
   "3.1 Disparate Performance Overview":
-    explainer: "Has the AI system been comprehensively evaluated for disparity in performance across groups in specific task and deployment contexts?"
     questions:
       - "Have development choices and intrinsic properties of the AI system been evaluated for their contribution to disparate performance?"
-      - "Have extrinsic disparate performance evaluations been run"
-      - "Have evaluations been run across all applicable modalities"
-      - "Have disparate performance evaluations been run that take the form of automatic quantitative evaluation"
-      - "Have disparate performance evaluations been run with human participants"
   "3.2 Identifying Target Groups for Disparate Performance Evaluation":
     explainer: "Has the evaluation identified subgroups more likely to be harmed by disparate performance in context by considering the scope of the AI system's application and its relationship to existing systemic issues?"
@@ -103,15 +103,15 @@
     questions:
       - "Non-aggregated evaluation results across subpopulations, including feature importance and consistency analysis"
       - "Metrics to measure performance in decision-making tasks"
-      - "Metrics to measure disparate performance in other tasks including generative tasks"
       - "Worst-case subgroup performance analysis, including performance on rare or underrepresented cases"
-      - "Intersectional analysis examining performance across combinations of subgroup"
-      - "Do evaluations of disparate performance account for implicit social group markers"
   "3.4 Disparate Performance Evaluation Transparency and Documentation":
     explainer: "Are the disparate performance evaluations clearly documented for easy reproduction and interpretation?"
     questions:
-      - "Sufficient documentation of evaluation method to understand the scope of the findings"
       - "Documentation of strengths, weaknesses, and assumptions about the context"
       - "Documentation of domain shift between evaluation and deployment settings"
       - "Sufficient documentation of evaluation methods to replicate findings"
@@ -127,7 +127,7 @@
       - "Have evaluations been run across all applicable modalities?"
       - "Have evaluations been run on standardized benchmarks or metrics?"
       - "Have evaluations taken into account community feedback from regions affected by data center power consumption?"
-      - "Do evaluations consider the full supply chain including environmental impact of hardware components and data centers used?"
   "4.2 Energy Cost and Environmental Impact of Development":
     explainer: "Has the AI system been comprehensively evaluated for its carbon footprint and broader environmental impact?"
@@ -141,7 +141,7 @@
     explainer: "Has the AI system been evaluated for its hardware resource usage and efficiency?"
     questions:
       - "Evaluation of inference FLOPS for the system"
-      - "Evaluation of inference energy consumption on most common deployment setting"
       - "Evaluation of inference energy consumption on multiple deployment settings"
       - "Evaluation of task-specific energy consumption variations"
       - "Evaluation of carbon impact for deployment infrastructure"
@@ -151,7 +151,7 @@
     explainer: "Are the limitations of the evaluation methods clearly documented? Has a comprehensive environmental evaluation methodology been implemented?"
     questions:
       - "Documentation about equipment and infrastructure specifications"
-      - "Sufficient documentation of evaluation methods including components covered"
       - "Sufficient documentation of evaluation methods to replicate findings"
       - "Sufficient documentation of evaluation results for comparison"
@@ -160,24 +160,24 @@
     explainer: "Has the AI system been comprehensively evaluated for privacy across multiple stages of the system development chain using diverse evaluation techniques?"
     questions:
       - "Evaluations at various stages (data collection, preprocessing, AI system architecture, training, deployment)"
-      - "Have intrinsic properties of the AI system been evaluated for privacy vulnerabilities"
-      - "Have extrinsic privacy evaluations been run"
-      - "Have evaluations been run across all applicable modalities"
-      - "Have privacy evaluations been run that take the form of automatic quantitative evaluation"
       - "Have privacy evaluations been run with human participants?"
   "5.2 Privacy, Likeness, and Publicity Harms":
     explainer: "Has the AI system been evaluated for risks to personal integrity, privacy, and control of one's likeness?"
     questions:
       - "Has the AI system been evaluated for its likelihood of revealing personal information from its training data?"
-      - "Has the AI system been evaluated for its likelihood of facilitating generation of content impersonating an individual?"
-      - "Has the AI system been evaluated for its likelihood of providing made up or confabulated personal information about individuals?"
   "5.3 Intellectual Property and Information Security":
     explainer: "Has the AI system been evaluated for its likelihood of reproducing sensitive information or information with attached property rights?"
     questions:
-      - "Has the AI system been evaluated for its likelihood of reproducing other categories of information from its training data"
-      - "Has the system been evaluated for other information security risks for in-scope uses"
   "5.4 Privacy Evaluation Transparency and Documentation":
     explainer: "Are the privacy evaluations clearly documented to enable understanding of privacy risks, limitations, and reproducibility of findings?"
@@ -193,10 +193,10 @@
     explainer: "Has the AI system been comprehensively evaluated for system costs across multiple stages of development and deployment?"
     questions:
       - "Evaluation of costs at various stages"
-      - "Have costs been evaluated for different system components"
-      - "Have cost evaluations been run across all applicable modalities"
-      - "Have cost evaluations included both direct and indirect expenses"
-      - "Have cost projections been validated against actual expenses"
   "6.2 Development and Training Costs":
     explainer: "Has the AI system been evaluated for costs associated with development and training phases?"
@@ -229,11 +229,11 @@
     explainer: "Has the AI system been comprehensively evaluated for labor practices across different stages of AI system development and deployment?"
     questions:
       - "Evaluation of labor practices at various stages"
-      - "Have labor conditions been evaluated for different worker categories"
-      - "Have labor evaluations been run across all applicable task types"
-      - "Have labor practices been evaluated against established industry standards"
-      - "Have labor evaluations included both direct employees and contracted workers"
-      - "Have evaluations considered different regional and jurisdictional contexts"
   "7.2 Working Conditions and Compensation":
     explainer: "Has the AI system been evaluated for its labor practices, compensation structures, and working conditions?"

   "1.1 Bias Detection Overview":
     explainer: "Has the AI system been comprehensively evaluated across multiple stages of the system development chain using diverse evaluation techniques?"
     questions:
+      - "Have evaluations been done at various stages (data collection, preprocessing, AI system architecture, training, deployment)?"
+      - "Have intrinsic properties of the AI system been evaluated for bias (e.g., embedding analysis)?"
+      - "Have extrinsic bias evaluations been run (e.g., downstream task performance)?"
+      - "Have evaluations been run across all applicable modalities?"
+      - "Have bias evaluations been run that take the form of automatic quantitative evaluation?"
       - "Have bias evaluations been run with human participants?"
   "1.2 Protected Classes and Intersectional Measures":
     explainer: "Does the evaluation include a sufficiently broad range of protected classes that are disproportionately subject to harm by in-scope uses of the system, and the intersections of these classes?"
     questions:
+      - "Do evaluations cover all applicable legally protected categories for in-scope uses of the system?"
+      - "Do evaluations cover additional subgroups that are likely to be harmed based on other personal characteristics?"
       - "Evaluation of how different aspects of identity interact and compound in AI system behavior"
+      - "Evaluation of AI system biases for legally protected categories and additional relevant subgroups"
   "1.3 Measurement of Stereotypes and Harmful Associations":
     explainer: "Has the AI system been evaluated for the presence of harmful associations and stereotypes in its outputs?"
     explainer: "Has the AI system been comprehensively evaluated for cultural variation across multiple stages of the system development chain using diverse evaluation techniques?"
     questions:
       - "Evaluations at various stages (data collection, preprocessing, AI system architecture, training, deployment)"
+      - "Have intrinsic properties of the AI system been evaluated for cultural variation (e.g., embedding analysis)?"
+      - "Have extrinsic cultural variation evaluations been run (e.g., downstream task performance)?"
+      - "Have evaluations been run across all applicable modalities?"
+      - "Have cultural variation evaluations been run that take the form of an automatic quantitative evaluation?"
       - "Have cultural variation evaluations been run with human participants?"
   "2.2 Cultural Diversity and Representation":
   "2.3 Generated Sensitive Content across Cultural Contexts":
     explainer: "Has the AI system been evaluated for the potential negative impacts and implications of its generated content across different cultural contexts? Has the system been evaluated for its handling of hate speech, harmful content, and culturally sensitive material?"
     questions:
+      - "Has the AI system been evaluated for its likelihood of facilitating the generation of threatening or violent content?"
+      - "Has the AI system been evaluated for its likelihood of facilitating the generation of targeted harassment or discrimination?"
+      - "Has the AI system been evaluated for its likelihood of facilitating the generation of hate speech?"
+      - "Has the AI system been evaluated for its likelihood of exposing its direct users to content embedding values and assumptions not reflective of their cultural context?"
+      - "Has the AI system been evaluated for its likelihood of exposing its direct users to inappropriate content for their use context?"
+      - "Has the AI system been evaluated for its likelihood of exposing its direct users to content with negative psychological impacts?"
+      - "Has the evaluation of the AI system's behaviors explicitly considered cultural variation in their definition?"
   "2.4 Cultural Variation Transparency and Documentation":
     explainer: "Are the cultural limitations of the evaluation methods clearly documented? Has a comprehensive, culturally-informed evaluation methodology been implemented?"
     questions:
       - "Documentation of cultural contexts considered during development"
       - "Documentation of the range of cultural contexts covered by evaluations"
+      - "Sufficient documentation of the evaluation method to understand the scope of the findings"
       - "Construct validity, documentation of strengths, weaknesses, and assumptions"
+      - "Domain shift between evaluation, development, and AI system deployment settings"
       - "Sufficient documentation of evaluation methods to replicate findings"
       - "Sufficient documentation of evaluation results to support comparison"
       - "Document of psychological impact on evaluators reviewing harmful content"
 "3. Disparate Performance Evaluation":
   "3.1 Disparate Performance Overview":
+    explainer: "Has the AI system been comprehensively evaluated for disparity in performance across groups in specific tasks and deployment contexts?"
     questions:
       - "Have development choices and intrinsic properties of the AI system been evaluated for their contribution to disparate performance?"
+      - "Have extrinsic disparate performance evaluations been run?"
+      - "Have evaluations been run across all applicable modalities?"
+      - "Have disparate performance evaluations been run that take the form of automatic quantitative evaluation?"
+      - "Have disparate performance evaluations been run with human participants?"
   "3.2 Identifying Target Groups for Disparate Performance Evaluation":
     explainer: "Has the evaluation identified subgroups more likely to be harmed by disparate performance in context by considering the scope of the AI system's application and its relationship to existing systemic issues?"
     questions:
       - "Non-aggregated evaluation results across subpopulations, including feature importance and consistency analysis"
       - "Metrics to measure performance in decision-making tasks"
+      - "Metrics to measure disparate performance in other tasks, including generative tasks"
       - "Worst-case subgroup performance analysis, including performance on rare or underrepresented cases"
+      - "Intersectional analysis examining performance across combinations of subgroups"
+      - "Do evaluations of disparate performance account for implicit social group markers?"
   "3.4 Disparate Performance Evaluation Transparency and Documentation":
     explainer: "Are the disparate performance evaluations clearly documented for easy reproduction and interpretation?"
     questions:
+      - "Sufficient documentation of the evaluation method to understand the scope of the findings"
       - "Documentation of strengths, weaknesses, and assumptions about the context"
       - "Documentation of domain shift between evaluation and deployment settings"
       - "Sufficient documentation of evaluation methods to replicate findings"
       - "Have evaluations been run across all applicable modalities?"
       - "Have evaluations been run on standardized benchmarks or metrics?"
       - "Have evaluations taken into account community feedback from regions affected by data center power consumption?"
+      - "Do evaluations consider the full supply chain, including the environmental impact of hardware components and data centers used?"
   "4.2 Energy Cost and Environmental Impact of Development":
     explainer: "Has the AI system been comprehensively evaluated for its carbon footprint and broader environmental impact?"
     explainer: "Has the AI system been evaluated for its hardware resource usage and efficiency?"
     questions:
       - "Evaluation of inference FLOPS for the system"
+      - "Evaluation of inference energy consumption on the most common deployment setting"
       - "Evaluation of inference energy consumption on multiple deployment settings"
       - "Evaluation of task-specific energy consumption variations"
       - "Evaluation of carbon impact for deployment infrastructure"
     explainer: "Are the limitations of the evaluation methods clearly documented? Has a comprehensive environmental evaluation methodology been implemented?"
     questions:
       - "Documentation about equipment and infrastructure specifications"
+      - "Sufficient documentation of evaluation methods, including components covered"
       - "Sufficient documentation of evaluation methods to replicate findings"
       - "Sufficient documentation of evaluation results for comparison"
     explainer: "Has the AI system been comprehensively evaluated for privacy across multiple stages of the system development chain using diverse evaluation techniques?"
     questions:
       - "Evaluations at various stages (data collection, preprocessing, AI system architecture, training, deployment)"
+      - "Have intrinsic properties of the AI system been evaluated for privacy vulnerabilities?"
+      - "Have extrinsic privacy evaluations been run?"
+      - "Have evaluations been run across all applicable modalities?"
+      - "Have privacy evaluations been run that take the form of an automatic quantitative evaluation?"
       - "Have privacy evaluations been run with human participants?"
   "5.2 Privacy, Likeness, and Publicity Harms":
     explainer: "Has the AI system been evaluated for risks to personal integrity, privacy, and control of one's likeness?"
     questions:
       - "Has the AI system been evaluated for its likelihood of revealing personal information from its training data?"
+      - "Has the AI system been evaluated for its likelihood of facilitating the generation of content impersonating an individual?"
+      - "Has the AI system been evaluated for its likelihood of providing made-up or confabulated personal information about individuals?"
   "5.3 Intellectual Property and Information Security":
     explainer: "Has the AI system been evaluated for its likelihood of reproducing sensitive information or information with attached property rights?"
     questions:
+      - "Has the AI system been evaluated for its likelihood of reproducing other categories of information from its training data?"
+      - "Has the system been evaluated for other information security risks for in-scope uses?"
   "5.4 Privacy Evaluation Transparency and Documentation":
     explainer: "Are the privacy evaluations clearly documented to enable understanding of privacy risks, limitations, and reproducibility of findings?"
     explainer: "Has the AI system been comprehensively evaluated for system costs across multiple stages of development and deployment?"
     questions:
       - "Evaluation of costs at various stages"
+      - "Have costs been evaluated for different system components?"
+      - "Have cost evaluations been run across all applicable modalities?"
+      - "Have cost evaluations included both direct and indirect expenses?"
+      - "Have cost projections been validated against actual expenses?"
   "6.2 Development and Training Costs":
     explainer: "Has the AI system been evaluated for costs associated with development and training phases?"
     explainer: "Has the AI system been comprehensively evaluated for labor practices across different stages of AI system development and deployment?"
     questions:
       - "Evaluation of labor practices at various stages"
+      - "Have labor conditions been evaluated for different worker categories?"
+      - "Have labor evaluations been run across all applicable task types?"
+      - "Have labor practices been evaluated against established industry standards?"
+      - "Have labor evaluations included both direct employees and contracted workers?"
+      - "Have evaluations considered different regional and jurisdictional contexts?"
   "7.2 Working Conditions and Compensation":
     explainer: "Has the AI system been evaluated for its labor practices, compensation structures, and working conditions?"
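
Most of the changes in this commit add a trailing question mark to entries that are phrased as questions. A minimal sketch of a consistency check for that is shown below. It assumes the file has already been loaded into a nested mapping (category, then section, then a `questions` list), as the indentation in the diff suggests. The helper name `find_unterminated_questions` and the list of interrogative openers are hypothetical and not part of the repository.

```python
import re

# Assumption: entries that open with one of these words are meant to be questions.
# Entries such as "Metrics to measure ..." are noun phrases and are left alone.
INTERROGATIVE = re.compile(r"^(Has|Have|Do|Does|Is|Are)\b")


def find_unterminated_questions(cards):
    """Return (section, question) pairs that read as questions but lack a '?'.

    `cards` is assumed to mirror questions.yaml:
    {category: {section: {"explainer": str, "questions": [str, ...]}}}
    """
    missing = []
    for sections in cards.values():
        for section, body in sections.items():
            for question in body.get("questions", []):
                text = question.rstrip()
                if INTERROGATIVE.match(text) and not text.endswith("?"):
                    missing.append((section, question))
    return missing


# Small sample taken from the pre-commit side of the diff.
sample = {
    "3. Disparate Performance Evaluation": {
        "3.1 Disparate Performance Overview": {
            "questions": [
                "Have extrinsic disparate performance evaluations been run",
                "Have evaluations been run across all applicable modalities?",
                "Metrics to measure performance in decision-making tasks",
            ],
        }
    }
}

for section, question in find_unterminated_questions(sample):
    print(f"{section}: {question}")
```

Run against the pre-commit sample, only the first entry is reported. The noun-phrase entry is skipped on purpose, since the commit itself leaves such entries without a question mark.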