Your enterprise spent $100,000 on AI training. 200 employees attended workshops. 85% rated the training "good" or "excellent." Three months later, AI usage across the organisation is at 12%. The training was popular. It wasn't effective. These are different things, and most organisations can't tell the difference because they never evaluated the outcomes.
What You Need to Know
- Satisfaction scores (Level 1) measure whether people enjoyed the training, not whether they learned or changed behaviour
- Effective evaluation uses Kirkpatrick's four levels: Reaction, Learning, Behaviour, and Results
- Most enterprise AI training programmes only measure Level 1 (satisfaction) and ignore Levels 3 and 4 (behaviour change and business impact)
- Training that doesn't change behaviour is wasted investment, no matter how well-rated it is
12% of training content is applied on the job three months after delivery (Source: Brinkerhoff, 2006).
85% of AI training budgets are evaluated only on participant satisfaction (Source: ATD State of the Industry, 2024).
The Four Levels
Level 1: Reaction
Question: Did participants enjoy the training?
How to measure: Post-training survey. "Rate the training quality." "Was the content relevant?" "Would you recommend it?"
Limitation: High satisfaction doesn't predict behaviour change. People enjoy engaging trainers and well-catered workshops. That doesn't mean they'll use AI on Monday.
Level 2: Learning
Question: Did participants learn the intended content?
How to measure: Pre-test and post-test. Before the training, assess knowledge: "What is RAG?" "When would you use AI for document classification?" After the training, re-assess. The delta is learning.
Limitation: Knowledge doesn't equal capability. People who can explain what AI does can't necessarily use it. And people who score well in a test immediately after training may not retain the knowledge.
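A minimal sketch of the Level 2 computation in Python. The participant names and score scale are illustrative assumptions; the point is that learning is the per-participant delta between pre-test and post-test, not the post-test average alone.

```python
# Level 2: learning is the per-participant delta between pre- and post-test.
# Participant names and scores are hypothetical.

pre_scores = {"alice": 40, "bob": 65, "carol": 55}   # % correct before training
post_scores = {"alice": 75, "bob": 70, "carol": 85}  # % correct after training

deltas = {p: post_scores[p] - pre_scores[p] for p in pre_scores}

# Report the distribution of gains, not just the post-test mean:
# a high post-test mean can hide participants who started high and learned nothing.
avg_gain = sum(deltas.values()) / len(deltas)
print(f"Per-participant gains: {deltas}")
print(f"Average gain: {avg_gain:.1f} percentage points")
```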
Level 3: Behaviour
Question: Did participants change their work behaviour?
How to measure: Observable behaviour change 30, 60, and 90 days after training. Are they using AI tools? How frequently? For which tasks? Are they using them correctly?
This is where most training evaluation stops, if it gets here at all. And it's the most important level for enterprise AI, because AI training that doesn't change daily work behaviour has zero business impact.
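One way to make the 30/60/90-day measures concrete is to derive them from tool usage logs rather than self-report. A sketch under stated assumptions: the `usage_events` list, the training date, and the trained-user set are all hypothetical stand-ins for whatever your tools actually log.

```python
from datetime import date, timedelta

# Hypothetical inputs: who was trained, when, and raw AI-tool usage events
# (user, date of an AI-tool action), e.g. extracted from audit logs.
trained = {"alice", "bob", "carol", "dan"}
training_date = date(2025, 1, 15)
usage_events = [
    ("alice", date(2025, 2, 3)),
    ("bob", date(2025, 2, 20)),
    ("alice", date(2025, 4, 1)),
]

def adoption_rate(day_offset: int, window: int = 14) -> float:
    """Share of trained users active in the `window` days before the checkpoint."""
    checkpoint = training_date + timedelta(days=day_offset)
    start = checkpoint - timedelta(days=window)
    active = {u for u, d in usage_events if u in trained and start <= d <= checkpoint}
    return len(active) / len(trained)

for offset in (30, 60, 90):
    print(f"Day {offset}: {adoption_rate(offset):.0%} of trained users active")
```

The same query, run at each checkpoint, distinguishes early adoption from sustained adoption without depending on anyone's memory of what they did.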
That's not luck, that's deliberate design. When I took online course completion from 40% to 100%, it wasn't because the content was better. It was because the design ensured students actually did the work. Training evaluation should measure the same thing: did people actually change what they do?
Dr Josiah Koh
Education & AI Innovation
Level 4: Results
Question: Did the behaviour change produce business outcomes?
How to measure: Connect behaviour changes (Level 3) to business metrics. If the training was designed to increase AI usage for document classification, measure document processing time, error rates, and throughput, compared against the pre-training baseline.
The challenge: Attribution. Multiple factors affect business outcomes. Isolating the training's contribution requires either control groups (trained vs untrained teams) or statistical methods that account for confounding variables.
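Where a control group exists, a simple difference-in-differences estimate separates the training's contribution from organisation-wide trends. A sketch with hypothetical processing-time figures; a real analysis would add significance testing and controls for confounders.

```python
# Difference-in-differences on a business metric
# (average document processing time, in minutes). All figures are hypothetical.

trained_before, trained_after = 42.0, 31.0   # teams that received training
control_before, control_after = 41.0, 39.0   # comparable untrained teams

trained_change = trained_after - trained_before   # -11.0: training effect AND trend
control_change = control_after - control_before   # -2.0: trend only

training_effect = trained_change - control_change  # -9.0 minutes attributable to training
print(f"Estimated training effect: {training_effect:+.1f} minutes per document")
```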
Evaluation that stops at satisfaction is not evaluation. It's feedback collection. True evaluation asks: "Did this intervention produce the intended change in the intended context for the intended people?" That requires measuring outcomes, not just reactions.
Dr Tania Wolfgramm
Chief Research Officer
Designing for Evaluation
Evaluation should be designed into the training programme, not appended afterward:
Before Training
- Define the intended behaviour change. "After this training, participants will use AI for document classification in their daily workflow." Specific, observable, measurable.
- Establish baselines. Measure current AI usage, task completion times, and error rates before the training (a minimal baseline record is sketched after this list).
- Design the assessment instruments. Pre-test, post-test, 30/60/90-day behaviour surveys, and business metric tracking.
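A lightweight way to make the baseline concrete is to record it in a structured form before the first session, so the 30/60/90-day measures have something to compare against. A sketch only; the field names and figures are assumptions, not a standard schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class EvaluationBaseline:
    """Pre-training snapshot; field names are illustrative, not a standard."""
    team: str
    captured_on: date
    ai_usage_rate: float      # share of team using AI tools at least weekly
    avg_task_minutes: float   # e.g. average document-classification time
    error_rate: float         # share of tasks needing rework
    intended_behaviour: str   # the specific, observable change the training targets

baseline = EvaluationBaseline(
    team="finance-ops",
    captured_on=date(2025, 1, 10),
    ai_usage_rate=0.12,
    avg_task_minutes=42.0,
    error_rate=0.08,
    intended_behaviour="Use AI for document classification in the daily workflow",
)
```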
During Training
- Include practice with real tasks. Training that uses participants' actual work data and workflows is more likely to transfer to the job.
- Build peer support structures. Participants who leave training with a peer support network are more likely to sustain behaviour change.
After Training
- Measure at 30, 60, and 90 days. Behaviour change is not immediate. The 30-day measure catches early adoption. The 90-day measure catches sustained adoption.
- Investigate non-adoption. When trained people don't change behaviour, investigate why. The answer is usually one of: insufficient practice, no manager support, workflow friction, or lack of confidence. Each has a different remedy.
- Report honestly. If the training didn't produce behaviour change, say so. And investigate why. The learning is in the gaps, not in the satisfied participants.
Common Training Design Failures
Teaching AI concepts instead of AI skills. "What is a large language model?" is interesting. "How to use AI to classify your incoming documents" is useful. The second produces behaviour change.
Generic training for specific roles. The finance team and the operations team use AI differently. Generic "Introduction to AI" training doesn't connect to either team's specific workflow.
No follow-up support. Training creates initial motivation. Without follow-up (champions, peer support, protected practice time), motivation decays within weeks.
No manager involvement. If the manager doesn't support, reinforce, and model AI use after training, the team defaults to pre-training behaviour.
AI training that isn't evaluated for behaviour change is an act of faith. Evaluate at all four levels, design the evaluation before the training, and be willing to change your approach based on what the data tells you. The goal isn't to deliver training. It's to change behaviour that produces business outcomes.