The 1–5 Likert scale
Cognisafe uses a five-point severity scale for all safety scores. A graded Likert score lets the safety worker express degrees of concern rather than a single binary safe/unsafe flag, which makes it more useful for triage and alerting.
| Score | Label | Dashboard colour | Meaning |
|---|---|---|---|
| 1 | Benign | Green | No evidence of the threat. Normal traffic. |
| 2 | Low | Blue | Ambiguous signal; unlikely to be a genuine threat. Monitor if volume increases. |
| 3 | Medium | Yellow | Probable match. Warrants review. May be a false positive in some contexts. |
| 4 | High | Orange | Strong match. Likely a genuine threat. Review and consider action. |
| 5 | Critical | Red | Definitive match. Immediate attention required. |
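If you consume scores programmatically, it can help to pin the scale down as a type. The following is a minimal Python sketch that simply encodes the table above; the `Severity` enum, the `DASHBOARD_COLOUR` mapping and the `describe` helper are hypothetical names, not part of Cognisafe's API.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical enum mirroring the 1-5 severity scale."""
    BENIGN = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    CRITICAL = 5


# Dashboard colour for each level, as listed in the table.
DASHBOARD_COLOUR = {
    Severity.BENIGN: "green",
    Severity.LOW: "blue",
    Severity.MEDIUM: "yellow",
    Severity.HIGH: "orange",
    Severity.CRITICAL: "red",
}


def describe(score: int) -> str:
    """Return 'Label (colour)' for a 1-5 score."""
    level = Severity(score)  # raises ValueError for values outside 1-5
    return f"{level.name.title()} ({DASHBOARD_COLOUR[level]})"


print(describe(4))  # High (orange)
```

Using `IntEnum` keeps the levels comparable as plain integers, so a check such as `score >= Severity.HIGH` reads naturally.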
How scores are generated
Each scorer sends the prompt or response text to the scoring model (default: gpt-4o-mini) with a structured evaluation prompt. The model returns a numeric score and a natural-language rationale explaining the rating.
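The exact reply format is not specified here. Assuming the structured evaluation prompt asks the model to answer with a JSON object such as `{"score": 3, "rationale": "..."}` (an assumption, not documented behaviour), extracting the score and rationale might look like this sketch; `parse_scorer_reply` is a hypothetical helper.

```python
import json
import re


def parse_scorer_reply(reply: str) -> tuple[int, str]:
    """Extract (score, rationale) from a reply containing a JSON object
    like {"score": 3, "rationale": "..."}. The JSON shape is assumed."""
    # Models sometimes wrap JSON in prose; take the span from the first
    # '{' to the last '}' and parse that.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in scorer reply")
    data = json.loads(match.group(0))
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score, str(data["rationale"])


print(parse_scorer_reply('Result: {"score": 4, "rationale": "Role-play jailbreak pattern"}'))
# (4, 'Role-play jailbreak pattern')
```

Rejecting out-of-range values at parse time means a malformed model reply fails loudly instead of producing a score the dashboard cannot map to a label.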
The scoring model is configured via the SCORER_MODEL environment variable on the safety_worker service:
SCORER_MODEL=gpt-4o-mini # default — fast and cost-effective
SCORER_MODEL=gpt-4o # higher accuracy, higher cost
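Resolving the setting with its documented default could be done as below. This is a sketch: `scorer_model` is a hypothetical helper, and treating an empty or whitespace-only value as unset is an assumption rather than documented behaviour.

```python
import os

DEFAULT_SCORER_MODEL = "gpt-4o-mini"  # documented default


def scorer_model() -> str:
    """Return the configured scoring model, or the default if unset."""
    # Assumption: an unset or blank SCORER_MODEL means "use the default".
    value = os.environ.get("SCORER_MODEL", "").strip()
    return value or DEFAULT_SCORER_MODEL
```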
PyRIT wraps the scoring model call and normalises the output into a structured SafetyScore object with:
score_value: integer 1–5, or null when scoring is skipped (see Fallback behaviour)
score_label: safe | unsafe | unscored
rationale: free-text explanation from the scoring model
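To make the three fields and their allowed values concrete, here is an illustrative Python dataclass. It is not PyRIT's own class; it only borrows the `SafetyScore` name and field names from the text, and the rule that `score_value` is `None` exactly when the label is `unscored` is an assumption drawn from the fallback behaviour.

```python
from dataclasses import dataclass
from typing import Literal, Optional

ScoreLabel = Literal["safe", "unsafe", "unscored"]


@dataclass(frozen=True)
class SafetyScore:
    """Illustrative shape of a scored result (not PyRIT's class)."""
    score_value: Optional[int]  # 1-5, or None when scoring was skipped
    score_label: ScoreLabel
    rationale: str              # free-text explanation from the scoring model

    def __post_init__(self) -> None:
        # Literal is not enforced at runtime, so check the label explicitly.
        if self.score_label not in ("safe", "unsafe", "unscored"):
            raise ValueError(f"unknown score_label: {self.score_label!r}")
        # Assumption: None is valid only for unscored results, and vice versa.
        if (self.score_value is None) != (self.score_label == "unscored"):
            raise ValueError("score_value must be None exactly when unscored")
        if self.score_value is not None and not 1 <= self.score_value <= 5:
            raise ValueError(f"score_value must be 1-5, got {self.score_value}")
```

Validating in `__post_init__` means an invalid combination is rejected when the object is built, rather than surfacing later in the dashboard.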
Fallback behaviour
If OPENAI_API_KEY is not set on the safety worker, PyRIT falls back gracefully:
score_value: null
score_label: unscored
rationale: "Scoring skipped: no OPENAI_API_KEY configured"
As a result, missing credentials do not crash the worker: requests are still logged and observed, just without scores.
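The fallback can be sketched as a guard around the scoring call. In this minimal Python version, `score_or_skip` and the `scorer` callable are hypothetical; only the returned field values come from the behaviour described above.

```python
import os
from typing import Callable, Optional, TypedDict


class ScoreResult(TypedDict):
    score_value: Optional[int]
    score_label: str
    rationale: str


def score_or_skip(text: str, scorer: Callable[[str], ScoreResult]) -> ScoreResult:
    """Call `scorer` only when OPENAI_API_KEY is set; otherwise return an
    unscored result instead of raising."""
    if not os.environ.get("OPENAI_API_KEY"):
        return {
            "score_value": None,  # serialised as null
            "score_label": "unscored",
            "rationale": "Scoring skipped: no OPENAI_API_KEY configured",
        }
    return scorer(text)
```

Checking for the key up front, rather than catching an authentication error from the API call, keeps the skipped case distinguishable from a genuine scoring failure.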
Alerting thresholds
The dashboard lets you configure alert thresholds per scorer. For example, you can send a Slack notification when any jailbreak_detection score reaches 4 or above, or when a project's rolling average content_safety score exceeds 2.5.
Alert configuration is available on the Pro tier and above.
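Alerts are configured in the dashboard, so no code is required. To illustrate how the two example rules evaluate, here is a Python sketch. `peak_alert`, `RollingAverageAlert` and the 20-score window are assumptions for illustration; the dashboard's actual windowing for rolling averages is not documented here.

```python
from collections import deque
from typing import Optional


def peak_alert(score: Optional[int], threshold: int = 4) -> bool:
    """Fire on any single score at or above `threshold` (e.g. jailbreak_detection >= 4)."""
    return score is not None and score >= threshold


class RollingAverageAlert:
    """Fire when the mean of the last `window` scores exceeds `threshold`
    (e.g. content_safety average > 2.5). Window size is an assumption."""

    def __init__(self, threshold: float, window: int = 20) -> None:
        self.threshold = threshold
        self.scores: deque[int] = deque(maxlen=window)  # oldest scores drop off

    def observe(self, score: Optional[int]) -> bool:
        if score is None:  # unscored requests do not affect the average
            return False
        self.scores.append(score)
        return sum(self.scores) / len(self.scores) > self.threshold
```

Skipping `None` keeps unscored requests from distorting the average; note that the rolling rule uses a strict "exceeds", so an average of exactly 2.5 does not fire.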
Interpreting scores
A single score-4 event is not necessarily cause for alarm — it may reflect an edge case in the scoring prompt or an ambiguous input. Look for patterns: repeated high scores from the same user, a spike in 4–5 scores over a short window, or consistently elevated scores on a particular endpoint.
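Two of these patterns, repeated high scores from one user and a burst of 4–5 scores in a short window, can also be checked over scores you have collected yourself. In this Python sketch the `Event` record, the function name and all thresholds (10 minutes, 5 events, 3 repeats) are illustrative assumptions, not Cognisafe defaults.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    timestamp: datetime
    user_id: str
    score: int


def high_score_patterns(
    events: Iterable[Event],
    window: timedelta = timedelta(minutes=10),
    spike_count: int = 5,
    repeat_count: int = 3,
) -> dict:
    """Flag repeat offenders and bursts among score 4-5 events."""
    high = sorted((e for e in events if e.score >= 4), key=lambda e: e.timestamp)

    # Users with at least `repeat_count` high scores.
    per_user = Counter(e.user_id for e in high)
    repeat_users = sorted(u for u, n in per_user.items() if n >= repeat_count)

    # Spike: at least `spike_count` high scores inside any `window`-long span,
    # found with a two-pointer sliding window over the sorted events.
    spike = False
    start = 0
    for end in range(len(high)):
        while high[end].timestamp - high[start].timestamp > window:
            start += 1
        if end - start + 1 >= spike_count:
            spike = True
            break

    return {"repeat_users": repeat_users, "spike": spike}
```

The sort dominates the cost, so the check runs in O(n log n) for n events.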
The rationale field from the scoring model is visible in the dashboard on the per-request detail view. It explains why the model assigned that score, which helps distinguish genuine threats from scoring artefacts.