## Documentation Index
Fetch the complete documentation index at: https://cognisafeltd.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
## The 1–5 Likert scale
Cognisafe uses a five-point severity scale for all safety scores. Likert scoring allows the safety worker to express degrees of concern — not just a binary safe/unsafe flag — which is more useful for triage and alerting.

| Score | Label | Dashboard colour | Meaning |
|---|---|---|---|
| 1 | Benign | Green | No evidence of the threat. Normal traffic. |
| 2 | Low | Blue | Ambiguous signal; unlikely to be a genuine threat. Monitor if volume increases. |
| 3 | Medium | Yellow | Probable match. Warrants review. May be a false positive in some contexts. |
| 4 | High | Orange | Strong match. Likely a genuine threat. Review and consider action. |
| 5 | Critical | Red | Definitive match. Immediate attention required. |
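The mapping above is fixed, so it can be expressed as a small lookup. A minimal sketch in Python — the `SEVERITY` table mirrors the docs, but the helper function name is illustrative and not part of the Cognisafe API:

```python
# Lookup table taken from the severity table above.
SEVERITY = {
    1: ("Benign", "Green"),
    2: ("Low", "Blue"),
    3: ("Medium", "Yellow"),
    4: ("High", "Orange"),
    5: ("Critical", "Red"),
}

def describe_score(score: int) -> str:
    """Return a human-readable summary for a 1-5 Likert score.

    Hypothetical helper, not an actual Cognisafe function.
    """
    if score not in SEVERITY:
        raise ValueError(f"Likert score must be 1-5, got {score!r}")
    label, colour = SEVERITY[score]
    return f"{score} ({label}, shown {colour})"
```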
## How scores are generated
Each scorer sends the prompt or response text to the scoring model (default: `gpt-4o-mini`) with a structured evaluation prompt. The model returns a numeric score and a natural-language rationale explaining the rating.
The scoring model is configured via the `SCORER_MODEL` environment variable on the `safety_worker` service:
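The docs name only the `SCORER_MODEL` variable; the exact deployment format isn't shown here. As one hedged sketch, setting it in the worker's environment might look like:

```shell
# Hypothetical environment setup for the safety_worker service.
# SCORER_MODEL comes from the docs; the deployment style is an assumption.
export SCORER_MODEL=gpt-4o-mini   # default scoring model if unset
export OPENAI_API_KEY=sk-...      # required for scoring; see fallback behaviour
```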
Each scorer returns a `SafetyScore` object with:

- `score_value`: integer 1–5
- `score_label`: `safe` | `unsafe` | `unscored`
- `rationale`: free-text explanation from the scoring model
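The shape described above can be sketched as a dataclass. The field names follow the docs; the dataclass itself is illustrative, not the actual Cognisafe class:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SafetyScore:
    """Sketch of the SafetyScore shape from the docs (illustrative only)."""
    score_value: Optional[int]   # 1-5, or None when scoring was skipped
    score_label: str             # "safe" | "unsafe" | "unscored"
    rationale: str               # free-text explanation from the scoring model

# Example: a critical jailbreak finding.
score = SafetyScore(
    score_value=5,
    score_label="unsafe",
    rationale="Direct jailbreak attempt detected.",
)
```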
## Fallback behaviour
If `OPENAI_API_KEY` is not set on the safety worker, PyRIT falls back gracefully:

- `score_value`: `null`
- `score_label`: `unscored`
- `rationale`: `"Scoring skipped: no OPENAI_API_KEY configured"`
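The fallback path above can be sketched as a guard clause: check for the key before ever calling the model. The function name and dict shape are assumptions, not the actual worker code:

```python
import os

def score_text(text: str) -> dict:
    """Sketch of the fallback: return an 'unscored' result when no key is set."""
    if not os.environ.get("OPENAI_API_KEY"):
        return {
            "score_value": None,  # serialised as null in the API
            "score_label": "unscored",
            "rationale": "Scoring skipped: no OPENAI_API_KEY configured",
        }
    # ...otherwise send `text` to the scoring model and parse its reply.
    raise NotImplementedError("model call omitted in this sketch")
```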
## Alerting thresholds
The dashboard allows you to configure alert thresholds per scorer. For example: send a Slack notification when any `jailbreak_detection` score reaches 4 or above, or when the rolling average `content_safety` score for a project exceeds 2.5.
Alert configuration is available on the Pro tier and above.
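The rolling-average condition above can be sketched with a fixed-size window. This is a hedged illustration of the threshold logic, not the dashboard's implementation; the class and parameter names are assumptions:

```python
from collections import deque

class RollingAlert:
    """Fires when the rolling average of recent scores exceeds a threshold."""

    def __init__(self, window: int, average_above: float):
        self.scores = deque(maxlen=window)  # keeps only the last `window` scores
        self.average_above = average_above

    def observe(self, score: int) -> bool:
        """Record a score; return True when the rolling average trips."""
        self.scores.append(score)
        return sum(self.scores) / len(self.scores) > self.average_above

# Mirror of the example in the docs: alert when the average exceeds 2.5.
alert = RollingAlert(window=3, average_above=2.5)
alert.observe(1)
alert.observe(3)
fired = alert.observe(4)  # rolling average is (1+3+4)/3, which exceeds 2.5
```

The per-event condition ("any score reaches 4 or above") needs no window at all: it is just a comparison on each incoming `score_value`.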
## Interpreting scores
The `rationale` field from the scoring model is visible in the dashboard on the per-request detail view. It explains why the model assigned that score, which helps distinguish genuine threats from scoring artefacts.
