The 1–5 Likert scale

Cognisafe uses a five-point severity scale for all safety scores. Likert scoring allows the safety worker to express degrees of concern — not just a binary safe/unsafe flag — which is more useful for triage and alerting.
| Score | Label    | Dashboard colour | Meaning |
|-------|----------|------------------|---------|
| 1     | Benign   | Green            | No evidence of the threat. Normal traffic. |
| 2     | Low      | Blue             | Ambiguous signal; unlikely to be a genuine threat. Monitor if volume increases. |
| 3     | Medium   | Yellow           | Probable match. Warrants review. May be a false positive in some contexts. |
| 4     | High     | Orange           | Strong match. Likely a genuine threat. Review and consider action. |
| 5     | Critical | Red              | Definitive match. Immediate attention required. |
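
If you consume scores programmatically, the same mapping can be expressed as a small lookup table. This is an illustrative sketch only; the SEVERITY dict and severity_label helper below are hypothetical, not part of any Cognisafe SDK.

```python
# Hypothetical lookup mirroring the severity table above; not part of any Cognisafe SDK.
SEVERITY = {
    1: ("Benign", "green"),
    2: ("Low", "blue"),
    3: ("Medium", "yellow"),
    4: ("High", "orange"),
    5: ("Critical", "red"),
}

def severity_label(score: int) -> str:
    """Map a 1-5 Likert score to its human-readable label."""
    label, _colour = SEVERITY[score]
    return label
```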

How scores are generated

Each scorer sends the prompt or response text to the scoring model (default: gpt-4o-mini) with a structured evaluation prompt. The model returns a numeric score and a natural-language rationale explaining the rating. The scoring model is configured via the SCORER_MODEL environment variable on the safety_worker service:
SCORER_MODEL=gpt-4o-mini   # default — fast and cost-effective
SCORER_MODEL=gpt-4o        # higher accuracy, higher cost
PyRIT wraps the scoring model call and normalises the output into a structured SafetyScore object with:
  • score_value: integer 1–5
  • score_label: safe | unsafe | unscored
  • rationale: free-text explanation from the scoring model
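
As a rough picture of that shape, here is a minimal Python sketch. The field names come from the list above, but the concrete class inside the safety worker is not shown in these docs, so the types and class definition are assumptions.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class SafetyScore:
    """Illustrative shape of the normalised scorer output.

    Field names come from the docs; the exact types and the real class
    definition inside the safety worker are assumptions.
    """
    score_value: Optional[int]  # integer 1-5, or None when scoring is skipped
    score_label: Literal["safe", "unsafe", "unscored"]
    rationale: str              # free-text explanation from the scoring model
```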

Fallback behaviour

If OPENAI_API_KEY is not set on the safety worker, PyRIT falls back gracefully:
  • score_value: null
  • score_label: unscored
  • rationale: "Scoring skipped: no OPENAI_API_KEY configured"
This ensures the worker never crashes due to missing credentials — requests continue to be logged and observed even without scoring.
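
The fallback amounts to a guard around the scoring call. A minimal sketch, reusing the SafetyScore shape from above; score_text and run_scorer are hypothetical names, not the worker's real functions.

```python
import os

def run_scorer(text: str) -> SafetyScore:
    """Placeholder for the PyRIT-wrapped scoring-model call (not documented here)."""
    raise NotImplementedError

def score_text(text: str) -> SafetyScore:
    """Score `text`, degrading to an 'unscored' result when no API key is set."""
    if not os.environ.get("OPENAI_API_KEY"):
        return SafetyScore(
            score_value=None,
            score_label="unscored",
            rationale="Scoring skipped: no OPENAI_API_KEY configured",
        )
    return run_scorer(text)
```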

Alerting thresholds

The dashboard allows you to configure alert thresholds per scorer. For example: send a Slack notification when any jailbreak_detection score reaches 4 or above, or when the rolling average content_safety score for a project exceeds 2.5. Alert configuration is available on the Pro tier and above.
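
The two example rules can be pictured as simple predicates over incoming scores. The sketch below is illustrative only; alert rules are configured in the dashboard rather than in code, and the names here are made up.

```python
from collections import deque

def jailbreak_alert(score: int) -> bool:
    """Example rule: fire when a jailbreak_detection score reaches 4 or above."""
    return score >= 4

class RollingAverageAlert:
    """Example rule: fire when the rolling average content_safety score exceeds 2.5."""

    def __init__(self, window: int = 100, threshold: float = 2.5):
        self.scores = deque(maxlen=window)  # most recent scores, oldest evicted first
        self.threshold = threshold

    def observe(self, score: int) -> bool:
        self.scores.append(score)
        return sum(self.scores) / len(self.scores) > self.threshold
```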

Interpreting scores

A single score-4 event is not necessarily cause for alarm — it may reflect an edge case in the scoring prompt or an ambiguous input. Look for patterns: repeated high scores from the same user, a spike in 4–5 scores over a short window, or consistently elevated scores on a particular endpoint.
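
One way to operationalise "look for patterns" is a per-user sliding window over high scores. A sketch of such a triage heuristic, assuming events arrive as (user_id, score, timestamp); this is not a built-in Cognisafe feature.

```python
import time
from collections import defaultdict, deque

class SpikeDetector:
    """Flag a user when several score >= 4 events land within a short window.

    Illustrative triage heuristic only; not a built-in Cognisafe feature.
    """

    def __init__(self, max_events: int = 3, window_seconds: float = 300.0):
        self.max_events = max_events
        self.window = window_seconds
        self.high_scores = defaultdict(deque)  # user_id -> timestamps of high scores

    def record(self, user_id: str, score: int, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        if score < 4:
            return False
        q = self.high_scores[user_id]
        q.append(now)
        while q and now - q[0] > self.window:  # drop events outside the window
            q.popleft()
        return len(q) >= self.max_events
```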
The rationale field from the scoring model is visible in the dashboard on the per-request detail view. It explains why the model assigned that score, which helps distinguish genuine threats from scoring artefacts.