8RR8
← Audit an agent

Rating

How a score is calculated

Every clause produces a verdict and a 0–4 ordinal score. Per-regulation averages roll up; the overall score is the mean across in-scope numeric verdicts.

1 — Per-clause rule scoring

Each clause in the regulation YAML pack defines a list of named deterministic rules with weights:

checker:
  deterministic:
    - rule: structured_logging_imported
      weight: 0.3
    - rule: logging_at_tool_call_boundaries
      weight: 0.5
    - rule: logging_persistent_sink
      weight: 0.2

Each named rule (implemented in src/pipeline/checkers/rules.ts) takes the RECON signals + file inventory + worktree and returns a score in [0, 1] plus evidence records.

The clause’s raw score is the weight-normalised sum:

rawScore = Σ(rule.weight × rule.score) / Σ(rule.weight)

2 — Raw → ordinal mapping

Raw scores map to a 0–4 ordinal scale. Same scale across every regulation, so per-clause results are comparable.

Raw rangeOrdinalVerdictMeaning
≥ 0.854passStrong evidence the clause is met
0.65–0.853passAdequate evidence; minor gaps
0.40–0.652partialSome controls present, incomplete
0.15–0.401failInadequate; significant gaps
< 0.150failAbsent; no evidence found

3 — Polarity (the Article 5 trick)

Most clauses are positive: more evidence of the control = better. But prohibition clauses like Article 5 (subliminal techniques, predictive policing solely from profiling, untargeted facial scraping, …) are negative: any evidence of the prohibited practice = violation. The pipeline detects these via the score_mapping.pass_default field on the clause and inverts the ordinal mapping:

This is why a stock LangChain app correctly passes Art 5(1)(a) “subliminal techniques” — the prompt-pattern detector finds no dark-pattern phrasings, so raw → 0 → pass.

4 — Per-regulation + overall aggregation

5 — Special verdicts

6 — Future: LLM-judge fallback

For clauses whose raw score lands in the ambiguous band [0.3, 0.7], V1 will invoke an LLM judge with the clause text plus the relevant code chunks. The judge produces an independent verdict; disagreement with the deterministic result downgrades the verdict to partial with a note. The hook is structurally present in the pipeline; the call is stubbed in V0.