Building and Verifying AI Systems with Agentic AI

Why this tutorial · Why now

Generative AI changed who can build AI systems — and who must now verify them.

Generative and agentic AI in the development loop have collapsed the cost of building real-world, multi-user AI systems — but not the cost of understanding or trusting them. With agentic coding tools, a small team — even a single lab — can now stand up a multi-tenant big-data platform that once required a large engineering organization.

Building faster does not make a system easier to see into or to trust. This tutorial teaches a method for closing that gap — build it with agentic AI, visualize it, and use that visualization to test it. A production AI education platform is only the running example; the method is domain-agnostic, and a visualization researcher contributes techniques that make any model legible, not just ours. You take home the method, not a tour of one platform.

Each technical section pairs a methods-first treatment with a worked example from Uedu, a deployed multi-tenant AI tutoring platform built largely through agentic AI development and operated under a single umbrella IRB approval (NTU-REC 202507EM058). Read through a software-engineering lens, every step is a familiar SE practice — requirements, construction, verification, maintenance — applied where AI both builds the system and is the system. The platform is a worked example — not the answer; attendees who prefer to build from scratch take the methodology home.

01

Build it with agentic AI

Stand up a multi-tenant educational big-data platform with spec-driven, agentic AI development: write the spec, let agents implement it. Includes architecture decision flowcharts and explicit checkpoints for where AI-generated code still needs human verification.

02

See inside it

Make AI systems legible with data visualization — CAM-style contribution maps and visual anomaly detection. Glass-box inspection, not just black-box pass/fail.

03

Trust it with testing

A six-layer AI-testing rubric covering code, pipeline, behavior, guardrail, governance, and drift, with a live LLM-as-Judge harness and a cross-model reproducibility schema.

§ At a glance

The method, and the worked example.

The method is three repeating moves. The architecture below is one worked example, not the answer; your own system will differ.

The method · build → visualize → test

From intent to tests · the spec is a continuum

A specification is one artifact at many levels of formality. Translate the intent downward and it becomes executable; at the far end, the spec is the tests — which is why building and verifying can share one source of truth.
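As a minimal sketch of the far end of that continuum, the snippet below writes a requirement as named, executable clauses and runs them against an implementation. The requirement (a tenant only ever sees its own records), the `Record` type, the in-memory `STORE`, and `fetch_records` are hypothetical illustrations and are not taken from the Uedu codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    tenant_id: str
    payload: str


# Hypothetical in-memory store standing in for a multi-tenant database.
STORE = [
    Record("univ-a", "quiz-1"),
    Record("univ-b", "quiz-2"),
    Record("univ-a", "essay-3"),
]


def fetch_records(tenant_id: str) -> list[Record]:
    """Implementation under test: the part an agent would be asked to write."""
    return [r for r in STORE if r.tenant_id == tenant_id]


# The spec at its most formal level: each clause is a named predicate
# over (tenant, returned rows), so the same artifact serves as the test oracle.
SPEC = {
    "isolation": lambda tenant, rows: all(r.tenant_id == tenant for r in rows),
    "completeness": lambda tenant, rows: len(rows)
    == sum(r.tenant_id == tenant for r in STORE),
}


def check_spec(tenant_id: str) -> dict[str, bool]:
    """Run every spec clause against the implementation and report pass/fail."""
    rows = fetch_records(tenant_id)
    return {name: bool(clause(tenant_id, rows)) for name, clause in SPEC.items()}


if __name__ == "__main__":
    print(check_spec("univ-a"))
```

Because the clauses are data, a failing clause names the exact part of the spec the generated code violated, which is one way to decide where human review should start.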

The worked example · a heterogeneous AI platform

Dialogue, behavioral, vision, physiological, environmental, financial, and psychometric streams flow into one multi-tenant platform — which you then visualize and test. The same two lenses apply to whatever you build.

§ Learning objectives

What you take home: the methodology.

You take home the methodology — not our platform. By the end of the three hours you should be able to apply these five methods to your own project, in your own domain.

01 Objective

A repeatable, spec-driven method for agentic-AI development — write the spec, let agents implement it, and reuse the spec as your test oracle — with a checklist of where AI-generated code still needs human verification, transferable to your own platform in any domain.

02 Objective

A layered AI-testing rubric (code, pipeline, behavior, guardrail, governance, drift) and a runnable LLM-as-Judge harness with versioned prompts and cross-model reproducibility — drop it into your own evaluation pipeline.

03 Objective

Visualization methods (CAM-style contribution maps, visual anomaly detection) for inspecting why a model behaves as it does — to apply to your own models, not just the ones shown here.

04 Objective

A workflow that uses visualization as a tool inside AI testing — a reusable pattern for making any AI system both visible and verifiable.

05 Objective

A redacted, IRB-governed data-handling and SOP template you can adapt to your own institution and jurisdiction.
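To make the second objective concrete, here is a hedged sketch of what a versioned LLM-as-Judge call could record so that a score can be reproduced or compared across models. Everything named here is an assumption for illustration: the prompt text, the `JudgeRecord` fields, the 1 to 5 scale, and the `stub_model` function that stands in for a real model client. It is not the harness used in the tutorial.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Callable

# Hypothetical prompt and version label; bump the version whenever the text changes.
JUDGE_PROMPT_VERSION = "v1"
JUDGE_PROMPT = (
    "Score the ANSWER to the QUESTION from 1 (wrong) to 5 (fully correct). "
    "Reply with the integer only.\nQUESTION: {question}\nANSWER: {answer}"
)


@dataclass(frozen=True)
class JudgeRecord:
    judge_model: str      # pinned model identifier
    prompt_version: str   # human-readable version label
    prompt_sha256: str    # hash of the prompt template, detects silent edits
    question: str
    answer: str
    score: int


def judge(
    question: str,
    answer: str,
    call_model: Callable[[str, str], str],
    judge_model: str,
) -> JudgeRecord:
    """Ask the judge model for a score and return a fully attributed record."""
    prompt = JUDGE_PROMPT.format(question=question, answer=answer)
    score = int(call_model(judge_model, prompt).strip())
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return JudgeRecord(
        judge_model=judge_model,
        prompt_version=JUDGE_PROMPT_VERSION,
        prompt_sha256=hashlib.sha256(JUDGE_PROMPT.encode("utf-8")).hexdigest(),
        question=question,
        answer=answer,
        score=score,
    )


def stub_model(model_name: str, prompt: str) -> str:
    """Deterministic stand-in for a real LLM call, used only in this sketch."""
    return "4"


if __name__ == "__main__":
    record = judge("What is 2 + 2?", "4", stub_model, "stub-judge-001")
    print(json.dumps(asdict(record), indent=2))
```

Storing the model identifier and prompt hash next to every score means two runs can be compared only when both match, and a cross-model comparison becomes a matter of swapping `judge_model` while holding the prompt fixed.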

§ Apply it to your own project

Two take-home decision tools.

You leave with the methodology — these two tables help you decide which method to reach for, and audit what you have already covered, on your own system.

Decision matrix · which method for your situation

A

If you need to…	Reach for	Taught in	Watch out
Ship a platform fast with a small team	Agentic AI development plus a human-verification checklist	§2 Build	The checklist is non-negotiable — unreviewed AI-generated code is the failure mode.
See why a model focused where it did	CAM-style contribution visualization	§3 Visualize	Designed for CNNs; transformers and LLMs need attention or attribution analogues.
Find outliers and quality problems in image or sensor data at scale	Visual anomaly detection with incremental dimension reduction	§3 Visualize	Needs a baseline of normal examples; rare-but-valid cases can look anomalous.
Judge whether an LLM answer is actually good	LLM-as-Judge with a versioned prompt	§4 Test	Pin the judge model and prompt version, or scores drift between runs.
Catch cost, usage, or behavioral drift over time	The drift / anomaly layer of the six-layer rubric	§4 Test	Set thresholds from real baselines, not guesses.
Handle sensitive data across jurisdictions	IRB-governed data-handling SOP template	§4 Test	Adapt to local law, for example Taiwan PDPA, Japan APPI, EU GDPR.
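The drift row's advice to set thresholds from real baselines can be sketched as follows. This is one simple, commonly used approach (median plus a multiple of the scaled median absolute deviation), chosen here for illustration; the tutorial does not state which statistic it uses, and the function names, the multiplier `k`, and the sample numbers are assumptions.

```python
import statistics


def fit_threshold(baseline: list[float], k: float = 3.0) -> float:
    """Derive an upper alert threshold from observed baseline values.

    Uses median + k * 1.4826 * MAD. The factor 1.4826 makes the MAD
    comparable to a standard deviation for normally distributed data.
    If every baseline value is identical, MAD is 0 and the threshold
    collapses to the median, so any larger value will be flagged.
    """
    if not baseline:
        raise ValueError("baseline must contain at least one observation")
    med = statistics.median(baseline)
    mad = statistics.median([abs(x - med) for x in baseline])
    return med + k * 1.4826 * mad


def flag_drift(values: list[float], threshold: float) -> list[bool]:
    """Mark each new observation that exceeds the fitted threshold."""
    return [v > threshold for v in values]


if __name__ == "__main__":
    # Hypothetical daily cost figures from a quiet week.
    baseline = [102.0, 98.0, 101.0, 97.0, 100.0, 103.0, 99.0]
    threshold = fit_threshold(baseline)
    print(round(threshold, 4))
    print(flag_drift([101.0, 250.0, 99.5], threshold))
```

A median-based rule is less sensitive to the occasional spike already present in the baseline than a mean and standard deviation would be, which matters when the baseline itself comes from production data.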

Six-layer self-assessment · audit your own system

B

Layer	Ask of your own system	Method
Code	Is the AI-generated code reviewed where it matters?	Human-verification checkpoints (§2)
Pipeline	Do data and model pipelines fail loudly and reproducibly?	Pipeline tests plus a reproducibility schema (§4)
Behavior	Does the model actually answer correctly?	LLM-as-Judge with a versioned prompt (§4)
Guardrail	Are unsafe or out-of-scope outputs blocked?	Guardrail checks and adversarial probes (§4)
Governance	Is sensitive data handled lawfully across jurisdictions?	IRB-governed SOP template (§4)
Drift	Would you notice anomalies or drift after deployment?	Visual anomaly detection on the drift layer (§3 + §4)
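The self-assessment can also be kept as a small machine-readable audit so that gaps are visible in version control. The sketch below assumes a simple convention, not prescribed by the tutorial, in which each layer lists the evidence you have for it (test files, dashboards, documents) and an empty list counts as a gap. The evidence strings are hypothetical.

```python
# The six layers of the rubric, in the order used in the table above.
LAYERS = ("code", "pipeline", "behavior", "guardrail", "governance", "drift")


def audit(evidence: dict[str, list[str]]) -> dict[str, str]:
    """Return 'covered' or 'gap' for every layer, given evidence per layer."""
    unknown = set(evidence) - set(LAYERS)
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")
    return {
        layer: "covered" if evidence.get(layer) else "gap"
        for layer in LAYERS
    }


if __name__ == "__main__":
    # Hypothetical evidence for an example system.
    my_system = {
        "code": ["review checklist for generated code"],
        "behavior": ["judge harness, prompt v1"],
        "drift": [],
    }
    for layer, status in audit(my_system).items():
        print(f"{layer:<11}{status}")
```

Layers that are missing from the input are reported as gaps rather than silently skipped, so the output always has six rows.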

§ Schedule · 180 minutes

Three hours, six segments.

Build with generative AI, see inside the model with visualization, and verify behavior with AI testing — each segment pairs methods with a worked example and a take-home artifact.

1

Opening: build it, visualize it, test it

10 min

Why generative AI has changed who can build AI systems — and who must now verify them.
2

BUILD · Building a multi-user big-data platform with agentic AI (Chang + Li) · hands-on

40 min

Multi-tenant architecture, developed spec-driven — write the spec, let agents implement it, and verify the result against that same spec. Where the agentic AI coding tools help and where they do not; human-verification checkpoints for AI-generated code. Plus real sensor ingestion at scale — a wearable (Garmin BBI/HRV) edge-to-cloud stream as a worked big-data time-series example.
—

Break

10 min
3

VISUALIZE · Data visualization for AI systems (Teng-Yok Lee)

50 min

Visualization across the AI lifecycle — model decisions (CAM-style contribution maps), training dynamics (loss-contribution in-situ visualization), and learned-policy behavior (visual analytics of LSTM control policies) — turning a model and its data into something you can see, and a working instrument for the testing that follows.
—

Break

10 min
4

TEST · Testing AI-built systems at scale, with visualization as a tool (Chang + Li)

40 min

Six-layer rubric. LLM-as-Judge harness with a versioned prompt. The visualization methods from the previous segment, applied to the drift/anomaly layer. A compact data-governance / IRB sub-section.
5

SYNTHESIS · Live demonstration

10 min

Inject a platform usage / cost anomaly, visualize it (the drift/anomaly layer), then score how the system responds with an LLM-as-Judge — see it and test it in one loop.
6

Open problems and Q&A

10 min

Where build, visualize, and test still break — an honest treatment of what is not yet solved.

Hands-on · you do it, not just watch

§2 Build · Write a spec, watch it build

The room proposes (or votes on) a small spec; a presenter builds it live with an agentic tool; together we find where the AI-generated code still needs human verification. A recorded fallback is staged in case live generation misbehaves.

Closing · Apply it to your own project

Take the decision matrix and the six-layer self-assessment from this page and run them against your own system, on the spot — leaving with a filled-in plan, not just notes.

Materials

Slides and materials will be posted on this page after the session.

Get in touch

Questions or want to connect? Email us at [email protected].

§ Software-engineering view

Every step is a software-engineering practice.

This is not "AI instead of software engineering" — it is software engineering, at the intersection of two emerging areas: AI4SE (AI building software) and SE4AI (engineering AI-heavy systems). Every part of the tutorial maps onto a classic SE discipline.

Tutorial element → software-engineering discipline

In this tutorial	Software-engineering discipline
Spec-driven development (SDD)	Requirements engineering · executable specifications
Agentic AI development + human-verification checkpoints	Software construction · AI-assisted development · code review · technical-debt control
Six-layer rubric · LLM-as-Judge · reproducibility	Software testing · verification & validation · quality assurance
Drift / anomaly layer	Software maintenance · evolution · runtime monitoring
Multi-tenant, edge-to-cloud platform	Software architecture · service-oriented & distributed systems
Data & model visualization (Teng-Yok Lee)	Program & model comprehension · debugging tools
IRB-governed data handling / SOP	Software process · compliance & governance

§ CISOSE 2026 federation

How this tutorial maps to eight conferences.

CISOSE 2026 federates eight constituent conferences. Build · See · Trust spans service-oriented systems, AI testing, big data, and explainable AI substantively, with edge and IoT adjacent. Cross-track integration is the structural reason CISOSE is the right venue for this content.

Tutorial coverage by federated conference

●●● core · ●● substantive · ● adjacent

Federated conference	Tutorial segment	Coverage
Service-Oriented Systems Engineering	§2 platform architecture · §5 demo
AI Testing & Quality Assurance	§4 six-layer rubric · LLM-as-Judge
Big Data & Machine Learning	§2 big-data backbone · §3 visualization · §4 testing at scale
Cyber-Intelligence (overall)	§3 data visualization / model inspection
Responsible AI	§4 governance / IRB sub-section
Intelligent Mobile Computing	§2 edge-to-cloud wearable ingestion
Smart Cities & IoT	Sensing data sources
Decentralized Apps / Blockchain	Out of tutorial scope	—

§ Speakers

Three presenters: a platform team and a visualization researcher.

Chia-Kai Chang (National Central University) and Kuei-Hao Li (National Tsing Hua University) build and test a large AI education platform; Teng-Yok Lee (Mitsubishi Electric) contributes the data-visualization methods that power that testing. All three present in person at CISOSE 2026 in Fukuoka.

Lead presenter

ORCID: 0000-0003-2575-2738

Chia-Kai Chang (張家凱)

Assistant Professor, Center for General Education

National Central University, Taiwan

[email protected]

Founder and principal investigator of the Educational Omics Lab. Builds and operates Uedu, a multi-tenant AI tutoring platform deployed across multiple universities and developed largely through agentic AI development. Recent work spans large-scale learning analytics (ACM L@S 2026), educational big-data infrastructure (ICMET 2025), and a short paper in the CISOSE 2026 federation (IEEE BigDataService 2026). Holds an umbrella IRB approval for multimodal educational research.

Leads in this tutorial

§1 Opening; §2 Build · GenAI development; §4 Test · AI testing; §5 Live demo

Co-presenter

ORCID: 0009-0007-3474-8489

Kuei-Hao Li (李奎皓)

Ph.D. Candidate, Interdisciplinary Doctoral Program

National Tsing Hua University, Taiwan

Co-founder of the Uedu platform, with research interests in digital learning, AI-assisted instruction, and agentic AI. Co-presents how the platform was built with agentic AI development tools and how it is tested. Co-author of the team's recent work, including ACM L@S 2026 and ICMET 2025 (Educational Omics Data Lake). Focus: pedagogical design, cross-institutional deployment, and agentic development workflows.

Leads in this tutorial

§2 Build · GenAI development; §4 Test · AI testing; §5 Live demo

Co-presenter

Teng-Yok Lee (李庭育)

Principal Researcher

Mitsubishi Electric, Japan

Principal Researcher at Mitsubishi Electric working on data visualization and visual anomaly detection. PhD in scientific visualization from The Ohio State University, with highly cited work in IEEE TVCG and IEEE PacificVis. Recent work includes IntegralCAM, a method for estimating and visualizing CNN feature contributions (IEEE ICME 2025), and efficient large-scale visual anomaly detection (IEEE AVSS 2025; arXiv 2026), alongside multiple patents on anomaly and object detection. Brings deep visualization and high-performance-computing expertise to the problem of making AI systems and their data legible and inspectable in production.

Leads in this tutorial

§3 Data visualization; §5 Live demo

§ Companion publications

Anchor papers behind this tutorial.

The tutorial draws on, and points to, our team's recent peer-reviewed work. Each is positioned alongside the segment in which it is used as a worked example.

ACM L@S 2026

AI Teaching Assistants at Scale: Cross-Disciplinary Patterns of Adoption and Cognitive Engagement Across Hundreds of University Courses

C.-K. Chang, K.-H. Li

Anchor for §2 platform + §4 testing at scale.

ICMET 2025

Designing an Educational Omics Data Lake: A Multimodal Infrastructure for Technology-Enhanced Learning

C.-K. Chang, K.-H. Li

Anchor for §2 big-data backbone.

IEEE ICME 2025

IntegralCAM: Integral-Based Contribution Estimation and Visualization for Convolutional Neural Networks

T.-Y. Lee

Anchor for §3 — visualizing model decisions.

EuroVis 2021

Loss-Contribution-Based In-Situ Visualization for Neural Network Training

T.-Y. Lee et al.

Anchor for §3 — visualizing training to debug a model.

IEEE PacificVis 2020

DynamicsExplorer: Visual Analytics for Robot Control Tasks Involving Dynamics and LSTM-Based Control Policies

T.-Y. Lee et al.

Anchor for §3 — visual analytics of a learned policy’s behavior.

IEEE AVSS 2025

AAAD: Adaptive Activated Anomaly Detection on Varied Backgrounds

K. Miyamoto, T.-Y. Lee, A. Minezawa

Anchor for §3 anomaly detection → §4 drift layer.

§ Tutorial materials

Materials in preparation.

The slide deck, code references, LLM-as-Judge rubric, and IRB SOP template will be released here once the theme and co-presenter lineup are confirmed. Several of the underlying AI-testing artifacts already live inside the Uedu codebase and are referenced from our other publications.

Tutorial proposal (6-page IEEE format)

Drafted; held privately while the scope is being revised. Inquiries are welcome by email.

Slide deck

In preparation — to be released once the lineup is confirmed.

Code references & LLM-as-Judge rubric

The underlying LLM-as-Judge harness, AI-testing scaffolds, and prompt-versioning schema live inside the Uedu codebase and are referenced from our peer-reviewed publications. A tutorial-specific release will follow.

Redacted IRB SOP template

To be prepared as a tutorial handout. The umbrella IRB approval (NTU-REC 202507EM058) remains active and governs our other published work.

§ Venue

Fukuoka, Japan.

IEEE International Conference on Cyber Intelligence and Software-Oriented Service Engineering (CISOSE 2026). July 27 – 30, 2026, Fukuoka, Japan.

https://cisose.fit.ac.jp/2026/

Conference info

Language

English

Registration

Registration is handled through the CISOSE 2026 main conference website. This tutorial is part of the CISOSE 2026 program.