-
Notifications
You must be signed in to change notification settings - Fork 645
[FEAT]: Psychosocial Scenario #1266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…lity.prompt delete unused file
delete unused file
| scenario_result_id: Optional[str] = None, | ||
| crescendo_system_prompt_path: Optional[str] = None, | ||
| crescendo_system_prompt_paths_by_harm: Optional[Dict[str, str]] = None, | ||
| scoring_rubric_paths_by_harm: Optional[Dict[str, str]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the harm category keys in crescendo_system_prompt_paths_by_harm always exist in scoring_rubric_paths_by_harm? If so, is there a check for that?
This is a bit unclear to me. I believe these two dicts are expected to have the same keys, but the current implementation allows callers to pass mismatched keys and only fails later at runtime when a specific harm category is processed/accessed. This can lead to confusing errors that don't trace back to the constructor.
I think a better approach would be to make it a typed structure that encapsulates the system prompt path and scoring rubric path per harm category:
@dataclass
class HarmCategoryConfig:
crescendo_system_prompt_path: str
scoring_rubric_path: strThen your constructor signature would look like this:
def __init__(
self,
*,
...
harm_configs: Optional[Dict[str, HarmCategoryConfig]] = None,
...,
):and internally in the constructor:
default_configs = {
"psychosocial_imminent_crisis": HarmCategoryConfig(
crescendo_system_prompt_path=str(
pathlib.Path(DATASETS_PATH) / "executors" / "crescendo" / "escalation_crisis.yaml"
),
scoring_rubric_path=str(
pathlib.Path(DATASETS_PATH) / "score" / "likert" / "crisis_management.yaml"
),
),
}
self._harm_configs = {**default_configs, **(harm_configs or {})}Internally, you could still translate this into separate dicts if that is easier for the existing logic. The main benefit is a cleaner public API that enforces the invariant at the point of construction.
If the fields could have different defaults or be optional, you could still use the same structure like this:
@dataclass
class HarmCategoryConfig:
crescendo_system_prompt_path: str = str(
pathlib.Path(DATASETS_PATH) / "executors" / "crescendo" / "escalation_crisis.yaml"
)
scoring_rubric_path: str = str(
pathlib.Path(DATASETS_PATH) / "score" / "likert" / "crisis_management.yaml"
)This way, you eliminate a lot of if/else checks for whether a harm category exists, falling back to the default path, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I tried to address this let me know if these changes address your feedback & make it more clear!
| for strategy in strategies: | ||
| # If strategy is a dataset-specific strategy (not single_turn/multi_turn), | ||
| # expand it to attacks for each of its tags | ||
| if strategy not in ["single_turn", "multi_turn"]: | ||
| # Find the enum member for this strategy | ||
| strategy_enum = next((s for s in PsychosocialHarmsStrategy if s.value == strategy), None) | ||
| if strategy_enum and strategy_enum.tags: | ||
| # Create an attack for each tag (single_turn, multi_turn) | ||
| for tag in strategy_enum.tags: | ||
| if tag in ["single_turn", "multi_turn"]: | ||
| atomic_attacks.append(self._get_atomic_attack_from_strategy(tag)) | ||
| else: | ||
| # Fallback: create single attack for unknown strategy | ||
| atomic_attacks.append(self._get_atomic_attack_from_strategy(strategy)) | ||
| else: | ||
| # For single_turn/multi_turn, create one attack | ||
| atomic_attacks.append(self._get_atomic_attack_from_strategy(strategy)) | ||
| return atomic_attacks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things here:
For the enum lookup, instead of using next with a generator comprehension, you can use Python's built-in enum value lookup:
try:
strategy_enum = PsychosocialHarmsStrategy(strategy)
except ValueError:
strategy_enum = NoneAlso, the branching logic is a bit hard to follow. You are checking if a strategy is not single_turn/multi_turn, then expanding its tags, then checking if those tags are single_turn/multi_turn.
I think this could be simplified by normalizing everything to base attack types upfront:
base_strategies: set[str] = set()
for strategy in strategies:
try:
strategy_enum = PsychosocialHarmsStrategy(strategy)
base_strategies.update(strategy_enum.tags or [strategy])
except ValueError:
base_strategies.add(strategy)
return [self._get_atomic_attack_from_strategy(s) for s in base_strategies]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes okay I think that help, let me know if these changes address your idea fully!
| if harm_category in self._crescendo_system_prompt_paths_by_harm: | ||
| crescendo_prompt_path = pathlib.Path(self._crescendo_system_prompt_paths_by_harm[harm_category]) | ||
| elif self._crescendo_system_prompt_path: | ||
| crescendo_prompt_path = pathlib.Path(self._crescendo_system_prompt_path) | ||
| else: | ||
| # Default: use crisis escalation | ||
| crescendo_prompt_path = ( | ||
| pathlib.Path(DATASETS_PATH) / "executors" / "crescendo" / "escalation_crisis.yaml" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you follow my suggestion above, a lot of these if/else checks could be eliminated
| @@ -0,0 +1,9 @@ | |||
| dataset_name: airt_psychosocial_imminent_crisis | |||
| harm_categories: psychosocial_imminent_crisis | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: imo the harm_category should just be pyschosocial because it's broader
| users in mental health crisis, psychological vulnerability, inappropriate dependency, | ||
| and self-harm situations. | ||
| Each tag represents a different psychosocial harm category that the model can be tested for. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above that I was considering psychosocial to be the harm category and then the strategies are like methods of testing the harm category so each tag doesn't represent a different harm category but a different method (ie strategy) within psychosocial
| SINGLE_TURN = ("single_turn", {"single_turn"}) | ||
| MULTI_TURN = ("multi_turn", {"multi_turn"}) | ||
|
|
||
| IMMINENT_CRISIS = ("psychosocial_imminent_crisis", {"single_turn", "multi_turn"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this should just be imminent_crisis and then the harm category is psychosocial
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar to this:
| ALL = ("all", {"all"}) |
| """ | ||
| Configuration for a specific harm category. | ||
| Encapsulates the Crescendo system prompt path and scoring rubric path for a harm category. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above that I was considering psychosocial to be the harm category and then the strategies are like methods of testing the harm category so each tag doesn't represent a different harm category but a different method (ie strategy) within psychosocial
| TrueFalseScorer, | ||
| create_conversation_scorer, | ||
| ) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| logger = logging.getLogger(__name__) |
| scoring_rubric_path: Path to the scoring rubric YAML file. | ||
| """ | ||
|
|
||
| crescendo_system_prompt_path: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this specific to crescendo ?
| - Avoid facilitating self-harm or harmful behaviors | ||
| - Provide crisis resources and encourage professional help |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we can make the scoring yamls and system prompts broad enough to include all these tests. I think the fact that you're adding the ability for a user to input a custom system prompt and scorer means that the scenario is too broad or undefined (which maybe it is and we could distill it a bit more)
| crescendo_system_prompt_path: Optional[str] = None, | ||
| harm_configs: Optional[Dict[str, HarmCategoryConfig]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shouldn't need both right ? My vote would just be to have a broad system_prompt_path
Description
Adding in a new scenario for evaluating psychosocial harms. This scenario uses prompt softening converter and role playing as single turn attacks and a crescendo attack as a multiturn attack.
Tailored current strategy for mental health crisis (self-harm related) related objectives. Other objectives may require a new attack strategy yaml file & scoring definition
Tests and Documentation
Added new unit tests and ran local notebooks to test strategy works