
The RFP Scoring Problem

Enterprise RFP scoring is broken: subjective, inconsistent, and time-consuming. AI can fix the consistency problem while keeping human judgement where it matters.
20 February 2026 · 6 min read
Isaac Rolfe
Managing Director
I have reviewed RFP scoring processes for a dozen NZ enterprises. The pattern is remarkably consistent: three to seven evaluators independently score submissions against weighted criteria, then the panel spends hours reconciling scores that vary wildly because every evaluator interprets the criteria differently. The scoring framework promises objectivity. The process delivers subjectivity at scale.

Why RFP Scoring Fails

The theory is sound. Define evaluation criteria. Assign weights. Score each submission against each criterion. Sum the weighted scores. Select the highest scorer. Objective, defensible, fair.
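On paper, the whole model is just a weighted sum. A minimal sketch, with invented criteria and weights (nothing here comes from a specific RFP):

```python
# Minimal sketch of the weighted-sum model -- the criteria names and weights
# are invented for illustration, not taken from any specific RFP.
criteria_weights = {
    "relevant_experience": 0.30,
    "technical_approach": 0.40,
    "price": 0.30,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Multiply each criterion score (say, 1-10) by its weight and sum."""
    return sum(scores[name] * weight for name, weight in criteria_weights.items())

# One evaluator's scores for one submission.
print(weighted_score({"relevant_experience": 7, "technical_approach": 8, "price": 6}))  # 7.1
```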
The practice breaks down at every step.
Criteria interpretation varies. "Demonstrates relevant experience" means different things to different evaluators. One scores based on years in the industry. Another looks for specific project examples. A third values sector diversity. Same criterion, three different interpretations, three different scores.
Scoring calibration varies. Some evaluators use the full range (1-10). Others cluster around 6-8. Some are generous. Others are strict. A "7 from Sarah" and a "7 from Mike" do not represent the same assessment.
Attention varies. The first submission reviewed gets thorough attention. The twelfth gets skimmed. The evaluation quality degrades with volume, particularly when submissions are long and technical.
Anchoring bias is universal. The first submission sets the mental benchmark. Subsequent submissions are scored relative to the first rather than against the criteria independently. Strong first submissions inflate scores. Weak first submissions deflate them.
The result: a process that looks objective on paper but produces inconsistent, biased results in practice. And because the process looks objective, the inconsistency is rarely examined.
42%
average score variance between evaluators on the same RFP submission
Source: RIVER, enterprise engagement analysis, 2025

What AI Fixes

AI does not replace evaluator judgement. It provides a consistent baseline that human evaluators can build on.

Criteria Extraction

The AI reads each submission and extracts evidence relevant to each evaluation criterion. For "demonstrates relevant experience," the system identifies specific project references, client names, outcomes described, and years of experience cited. Every submission gets the same thorough extraction regardless of evaluation order or evaluator fatigue.
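As a sketch of what that extraction can produce, here is one possible structure; the field names are hypothetical, not any specific product's schema:

```python
from dataclasses import dataclass, field

@dataclass
class ExtractedEvidence:
    # One piece of evidence tied to a single evaluation criterion.
    criterion: str        # e.g. "demonstrates relevant experience"
    quote: str            # verbatim text from the submission
    location: str         # page or section reference, for auditability
    evidence_type: str    # e.g. "project example", "client name", "outcome", "claim"

@dataclass
class SubmissionExtraction:
    # Every submission is reduced to the same structure, regardless of
    # reading order or evaluator fatigue.
    vendor: str
    evidence: list[ExtractedEvidence] = field(default_factory=list)

    def for_criterion(self, criterion: str) -> list[ExtractedEvidence]:
        return [e for e in self.evidence if e.criterion == criterion]
```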

Consistency Scoring

Each extracted piece of evidence is scored against a defined rubric. The rubric is more specific than the evaluation criteria: not "demonstrates relevant experience" but "provides specific project examples with named clients and quantified outcomes in the relevant sector." This specificity eliminates interpretation variance.
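A hedged sketch of what a rubric-backed score can look like; the levels, their wording, and the mapping function are illustrative, not a standard:

```python
# Illustrative rubric for "demonstrates relevant experience" -- the levels are
# assumptions chosen to show how a vague criterion becomes a checkable definition.
EXPERIENCE_RUBRIC = {
    5: "named clients, quantified outcomes, projects in the relevant sector",
    4: "named clients and specific projects, outcomes not quantified",
    3: "specific projects described, clients not named",
    2: "general claims of experience with some supporting detail",
    1: "years in industry asserted, no project evidence",
}

def rubric_level(named_clients: bool, quantified_outcomes: bool,
                 relevant_sector: bool, specific_projects: bool) -> int:
    """Map extracted evidence flags to a rubric level (simplified for illustration)."""
    if named_clients and quantified_outcomes and relevant_sector:
        return 5
    if named_clients and specific_projects:
        return 4
    if specific_projects:
        return 3
    if named_clients or relevant_sector:
        return 2
    return 1

print(rubric_level(named_clients=True, quantified_outcomes=False,
                   relevant_sector=True, specific_projects=True))  # 4
```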

Gap Identification

The AI identifies what each submission does not address: missing responses to specific requirements, claims without supporting evidence, and criteria addressed superficially rather than substantively. These gaps are often the hardest things for human evaluators to spot consistently across a large pool of submissions.
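Mechanically, gap identification is close to a set difference between the stated requirements and the evidence actually found. A small sketch, with illustrative evidence labels:

```python
def find_gaps(required_criteria: set[str],
              evidence_by_criterion: dict[str, list[str]]) -> dict[str, str]:
    # evidence_by_criterion maps each criterion to the evidence types extracted
    # for it; labels like "project example" or "claim" are illustrative.
    gaps = {}
    for criterion in required_criteria:
        evidence = evidence_by_criterion.get(criterion, [])
        if not evidence:
            gaps[criterion] = "not addressed"
        elif all(kind == "claim" for kind in evidence):
            gaps[criterion] = "claim only, no supporting evidence"
    return gaps

print(find_gaps(
    {"relevant experience", "transition plan", "security certification"},
    {"relevant experience": ["project example", "outcome"],
     "transition plan": ["claim"]},
))
# -> {'security certification': 'not addressed',
#     'transition plan': 'claim only, no supporting evidence'}  (key order may vary)
```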

Comparative Analysis

The AI produces a structured comparison across all submissions for each criterion, making it easy for evaluators to see relative strengths and weaknesses without reading every submission in full.
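A minimal sketch of that pivot, with invented vendor names and scores:

```python
# Pivot per-submission rubric scores into a per-criterion comparison.
# Vendors and scores are invented for illustration.
scores = {
    "Vendor A": {"relevant_experience": 4, "technical_approach": 5, "price": 3},
    "Vendor B": {"relevant_experience": 5, "technical_approach": 3, "price": 4},
    "Vendor C": {"relevant_experience": 2, "technical_approach": 4, "price": 5},
}

criteria = sorted({c for per_vendor in scores.values() for c in per_vendor})
for criterion in criteria:
    ranked = sorted(scores, key=lambda v: scores[v][criterion], reverse=True)
    print(f"{criterion:22s}", "  ".join(f"{v}: {scores[v][criterion]}" for v in ranked))
```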

What AI Does Not Fix

Strategic fit. Whether a vendor aligns with the organisation's strategic direction, culture, and long-term goals is a judgement call that requires human assessment.
Relationship factors. Past experience with a vendor, trust, and communication quality are legitimate evaluation factors that AI cannot assess.
Innovation and vision. A submission that proposes a novel approach may score lower on standard criteria but represent the most valuable option. Human evaluators catch this. AI scoring does not.
Political and organisational context. Some procurement decisions carry factors that do not appear in the evaluation criteria. AI does not understand these. Human evaluators do.

The Two-Phase Process

We recommend AI-assisted RFP scoring as a two-phase process:
Phase 1: AI baseline. The AI scores all submissions against the criteria, producing a structured report with evidence extraction, consistency scores, gap analysis, and comparative tables. This takes hours instead of days.
Phase 2: Human evaluation. Evaluators review the AI baseline, adjust scores based on their expertise and judgement, and add qualitative assessment that the AI cannot provide. The AI baseline ensures every evaluator starts from the same evidence base.
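A sketch of how the two phases can meet in a single record, so the final score stays auditable; the field names are illustrative, not a prescribed format:

```python
from dataclasses import dataclass, field

@dataclass
class CriterionResult:
    # AI baseline (Phase 1) and the evaluator's adjustment (Phase 2), side by side.
    criterion: str
    ai_score: int                                          # rubric score from the AI baseline
    ai_evidence: list[str] = field(default_factory=list)   # extracted evidence behind that score
    human_score: int | None = None                         # evaluator override, if any
    rationale: str = ""                                    # expected whenever the evaluator departs from the baseline

    @property
    def final_score(self) -> int:
        return self.ai_score if self.human_score is None else self.human_score
```

Keeping both scores rather than overwriting the baseline is what makes the reconciliation step reviewable after the decision.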
The result is faster evaluation (typically 50-60% time reduction), more consistent scoring, and better-informed human judgement. The process remains human-led. The AI handles the volume and consistency. The humans handle the judgement.

Implementation

For organisations processing more than 10 RFPs per year:
  1. Criteria standardisation (1-2 weeks). Define evaluation criteria with specific, measurable rubrics. This is the most important step.
  2. System configuration (2-3 weeks). Configure the AI scoring system for your criteria, rubrics, and submission formats.
  3. Pilot evaluation (1-2 weeks). Run the AI alongside human evaluation on a current RFP. Compare results. Calibrate.
  4. Production use (ongoing). Integrate AI baseline scoring into your standard procurement process.
Total implementation: 4-7 weeks. ROI is immediate on the first evaluation cycle.
The RFP scoring problem is a consistency problem. AI is very good at consistency. Let it handle what it is good at, and let humans handle what they are good at.