RMF Assistant: Evidence-Grounded AI for Policy Q&A and Control Gap Analysis

RMF work is documentation-heavy by design. Analysts move constantly between control requirements, policy text, implementation evidence, and gap assessments — and the hard part isn't locating text, it's judging whether a policy actually addresses a control's intent and whether the evidence is strong enough to support a conclusion. A fluent chatbot makes this worse, not better, if it generates confident-sounding answers without traceable sources.

So I built an evidence-first assistant instead.

The system uses hybrid retrieval over a combined corpus of NIST 800-53 Rev. 5 controls (ingested via OSCAL) and a synthetic policy pack. Dense embeddings served through Qdrant and BM25 sparse retrieval are fused with reciprocal rank fusion and lightly reranked. On top of retrieval sits an answer-state layer that classifies every response as strong evidence, limited evidence, conflicting evidence, no evidence, retrieval-only fallback, or backend failure. The UI renders that state alongside the answer, so users see the system's confidence posture, not just its output. LLM generation via OpenRouter is optional; retrieval-only is a first-class mode, not a degraded one.
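To make the fusion step and the answer-state layer concrete, here is a minimal sketch, not the repository's code. It assumes the dense and sparse retrievers each return a best-first list of chunk IDs (plain lists stand in for Qdrant and BM25 results). The constant k=60, the function names, the strong_min threshold, and the order in which states are checked are all illustrative assumptions; only the six state names come from the project.

```python
from enum import Enum


class AnswerState(Enum):
    # The six states named in the write-up.
    STRONG_EVIDENCE = "strong evidence"
    LIMITED_EVIDENCE = "limited evidence"
    CONFLICTING_EVIDENCE = "conflicting evidence"
    NO_EVIDENCE = "no evidence"
    RETRIEVAL_ONLY_FALLBACK = "retrieval-only fallback"
    BACKEND_FAILURE = "backend failure"


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse best-first ranked lists of IDs with reciprocal rank fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is a commonly used default, assumed here.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def classify_answer_state(supporting, conflicting, llm_available=True,
                          backend_ok=True, strong_min=2):
    """Map evidence counts to an answer state.

    Hypothetical rules: the threshold and precedence are assumptions
    made for illustration, not the project's actual logic.
    """
    if not backend_ok:
        return AnswerState.BACKEND_FAILURE
    if supporting == 0 and conflicting == 0:
        return AnswerState.NO_EVIDENCE
    if conflicting > 0:
        return AnswerState.CONFLICTING_EVIDENCE
    if not llm_available:
        return AnswerState.RETRIEVAL_ONLY_FALLBACK
    if supporting >= strong_min:
        return AnswerState.STRONG_EVIDENCE
    return AnswerState.LIMITED_EVIDENCE


# Made-up chunk IDs standing in for dense and sparse retriever output.
dense_hits = ["AC-2", "AC-3", "POL-7"]
sparse_hits = ["POL-7", "AC-2", "IA-5"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
state = classify_answer_state(supporting=2, conflicting=0)
```

Chunks that appear in both lists rise to the top of the fused ranking, which is the behavior hybrid retrieval is after: agreement between dense and sparse signals counts for more than a high rank in either one alone.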

Evaluation was the part I cared most about. I built a 40-question golden set covering framework questions, policy questions, policy-versus-control questions (the novel case), and out-of-scope probes to test abstention. Scoring covers context precision, coverage accuracy, abstention quality, and an overall composite.
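As a rough illustration of how scores like these can be computed, the sketch below uses simple set-based definitions. These are assumptions, not the project's actual scoring code: context precision is taken as the fraction of retrieved chunks that appear in the golden set's relevant list, abstention accuracy as the fraction of questions where the abstain decision matched the label, and the composite as an equally weighted mean. All function names and sample IDs are hypothetical.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are labeled relevant (assumed definition)."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(retrieved_ids)


def abstention_accuracy(should_abstain, did_abstain):
    """Fraction of questions where the abstain decision matched the label."""
    pairs = list(zip(should_abstain, did_abstain))
    if not pairs:
        return 0.0
    return sum(1 for expected, actual in pairs if expected == actual) / len(pairs)


def composite_score(metrics, weights=None):
    """Weighted mean of named metrics; equal weights unless given (an assumption)."""
    if weights is None:
        weights = {name: 1.0 for name in metrics}
    total = sum(weights[name] for name in metrics)
    return sum(metrics[name] * weights[name] for name in metrics) / total


# Illustrative values only; none of these come from the golden set.
precision = context_precision(["c1", "c2", "c3", "c4"], ["c1", "c3"])
abstention = abstention_accuracy([True, False, True], [True, False, True])
overall = composite_score({"context_precision": precision, "abstention": abstention})
```

Keeping abstention as its own metric matters for out-of-scope probes: a system that always answers can still post decent precision on in-scope questions while failing the questions where the right move is to decline.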

The clearest quantitative result was retrieval tuning. On the policy-versus-control task — the most demanding category, and the closest match to real assessment work — overall score moved from 0.7758 to 0.8485 between the baseline and tuned configurations, and context precision moved from 0.4274 to 0.6456. Perfect abstention behavior was preserved throughout. The system got better at finding the right evidence, not just better at generating fluent text.
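For scale, the reported numbers work out as follows. This is plain arithmetic on the four figures above; the helper name is illustrative.

```python
def change(before, after):
    """Return (absolute change, relative change in percent)."""
    delta = after - before
    return round(delta, 4), round(delta / before * 100, 1)


# Policy-versus-control task, baseline vs. tuned configuration.
overall_delta, overall_pct = change(0.7758, 0.8485)
precision_delta, precision_pct = change(0.4274, 0.6456)
# overall: +0.0727 absolute, about 9.4% relative
# context precision: +0.2182 absolute, about 51.1% relative
```

The relative gain in context precision is much larger than the gain in the overall score, which fits the reading that the improvement came from retrieval rather than generation.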

I'm careful about what the work does and doesn't show. The policy corpus is synthetic, so it doesn't capture the ambiguity of real artifacts. The truth table labels a focused 12-control subset rather than the full catalog. And I haven't yet run a committed benchmark comparing retrieval-only against retrieval-plus-LLM, so I'm not claiming uplift from generation. Those are real limitations, not hedges.

But the architecture, the evaluation discipline, and the answer-state design transfer cleanly to ATO and cATO workflows, where trustworthy evidence citation matters more than answer fluency. This work sits at the intersection of systems thinking, RAG architecture, and evidence-grounded information retrieval.

View repository on GitHub: https://github.com/jasonmmiller1/UMBC-DATA606-Capstone

Tags

#projects #ai #rag #systems #rmf #capstone