The 1–5 Likert scale

Cognisafe uses a five-point severity scale for all safety scores. Likert scoring allows the safety worker to express degrees of concern — not just a binary safe/unsafe flag — which is more useful for triage and alerting.
| Score | Label    | Dashboard colour | Meaning |
|-------|----------|------------------|---------|
| 1     | Benign   | Green            | No evidence of the threat. Normal traffic. |
| 2     | Low      | Blue             | Ambiguous signal; unlikely to be a genuine threat. Monitor if volume increases. |
| 3     | Medium   | Yellow           | Probable match. Warrants review. May be a false positive in some contexts. |
| 4     | High     | Orange           | Strong match. Likely a genuine threat. Review and consider action. |
| 5     | Critical | Red              | Definitive match. Immediate attention required. |
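
If you consume scores programmatically, the same mapping can be expressed as a small lookup table. This is an illustrative sketch only; the SEVERITY dict and severity_label helper below are hypothetical, not part of any Cognisafe SDK.

```python
# Hypothetical lookup mirroring the severity table above; not part of any Cognisafe SDK.
SEVERITY = {
    1: ("Benign", "green"),
    2: ("Low", "blue"),
    3: ("Medium", "yellow"),
    4: ("High", "orange"),
    5: ("Critical", "red"),
}

def severity_label(score: int) -> str:
    """Map a 1-5 Likert score to its human-readable label."""
    label, _colour = SEVERITY[score]
    return label
```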

How scores are generated

Each scorer sends the prompt or response text to the scoring model (default: gpt-4o-mini) with a structured evaluation prompt. The model returns a numeric score and a natural-language rationale explaining the rating. The scoring model is configured via the SCORER_MODEL environment variable on the safety_worker service:
SCORER_MODEL=gpt-4o-mini   # default — fast and cost-effective
SCORER_MODEL=gpt-4o        # higher accuracy, higher cost
PyRIT wraps the scoring model call and normalises the output into a structured SafetyScore object with:
  • score_value: integer 1–5
  • score_label: safe | unsafe | unscored
  • rationale: free-text explanation from the scoring model
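
As a rough picture of that shape, here is a minimal Python sketch. The field names come from the list above, but the concrete class inside the safety worker is not shown in these docs, so the types and class definition are assumptions.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class SafetyScore:
    """Illustrative shape of the normalised scorer output.

    Field names come from the docs; the exact types and the real class
    definition inside the safety worker are assumptions.
    """
    score_value: Optional[int]  # integer 1-5, or None when scoring is skipped
    score_label: Literal["safe", "unsafe", "unscored"]
    rationale: str              # free-text explanation from the scoring model
```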

Fallback behaviour

If OPENAI_API_KEY is not set on the safety worker, PyRIT falls back gracefully:
  • score_value: null
  • score_label: unscored
  • rationale: "Scoring skipped: no OPENAI_API_KEY configured"
This ensures the worker never crashes due to missing credentials — requests continue to be logged and observed even without scoring.
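
The fallback amounts to a guard around the scoring call. A minimal sketch, reusing the SafetyScore shape from above; score_text and run_scorer are hypothetical names, not the worker's real functions.

```python
import os

def run_scorer(text: str) -> SafetyScore:
    """Placeholder for the PyRIT-wrapped scoring-model call (not documented here)."""
    raise NotImplementedError

def score_text(text: str) -> SafetyScore:
    """Score `text`, degrading to an 'unscored' result when no API key is set."""
    if not os.environ.get("OPENAI_API_KEY"):
        return SafetyScore(
            score_value=None,
            score_label="unscored",
            rationale="Scoring skipped: no OPENAI_API_KEY configured",
        )
    return run_scorer(text)
```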

Alerting thresholds

The dashboard allows you to configure alert thresholds per scorer. For example: send a Slack notification when any jailbreak_detection score reaches 4 or above, or when the rolling average content_safety score for a project exceeds 2.5. Alert configuration is available on the Pro tier and above.
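
The two example rules can be pictured as simple predicates over incoming scores. The sketch below is illustrative only; alert rules are configured in the dashboard rather than in code, and the names here are made up.

```python
from collections import deque

def jailbreak_alert(score: int) -> bool:
    """Example rule: fire when a jailbreak_detection score reaches 4 or above."""
    return score >= 4

class RollingAverageAlert:
    """Example rule: fire when the rolling average content_safety score exceeds 2.5."""

    def __init__(self, window: int = 100, threshold: float = 2.5):
        self.scores = deque(maxlen=window)  # most recent scores, oldest evicted first
        self.threshold = threshold

    def observe(self, score: int) -> bool:
        self.scores.append(score)
        return sum(self.scores) / len(self.scores) > self.threshold
```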

Interpreting scores

A single score-4 event is not necessarily cause for alarm — it may reflect an edge case in the scoring prompt or an ambiguous input. Look for patterns: repeated high scores from the same user, a spike in 4–5 scores over a short window, or consistently elevated scores on a particular endpoint.
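
One way to operationalise "look for patterns" is a per-user sliding window over high scores. A sketch of such a triage heuristic, assuming events arrive as (user_id, score, timestamp); this is not a built-in Cognisafe feature.

```python
import time
from collections import defaultdict, deque

class SpikeDetector:
    """Flag a user when several score >= 4 events land within a short window.

    Illustrative triage heuristic only; not a built-in Cognisafe feature.
    """

    def __init__(self, max_events: int = 3, window_seconds: float = 300.0):
        self.max_events = max_events
        self.window = window_seconds
        self.high_scores = defaultdict(deque)  # user_id -> timestamps of high scores

    def record(self, user_id: str, score: int, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        if score < 4:
            return False
        q = self.high_scores[user_id]
        q.append(now)
        while q and now - q[0] > self.window:  # drop events outside the window
            q.popleft()
        return len(q) >= self.max_events
```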
The rationale field from the scoring model is visible in the dashboard on the per-request detail view. It explains why the model assigned that score, which helps distinguish genuine threats from scoring artefacts.