We teach the method, not a showcase — a production AI education platform is only the worked example. The methodology is yours to take to any field.
Generative and agentic AI in the development loop have collapsed the cost of building real-world, multi-user AI systems — but not the cost of understanding or trusting them. With agentic coding tools, a small team — even a single lab — can now stand up a multi-tenant big-data platform that once required a large engineering organization.
Building faster does not make a system easier to see into or to trust. This tutorial teaches a method for closing that gap — build it with agentic AI, visualize it, and use that visualization to test it. A production AI education platform is only the running example; the method is domain-agnostic, and a visualization researcher contributes techniques that make any model legible, not just ours. You take home the method, not a tour of one platform.
Each technical section pairs a methods-first treatment with a worked example from Uedu, a deployed multi-tenant AI tutoring platform built largely through agentic AI development and operated under a single umbrella IRB approval (NTU-REC 202507EM058). Read through a software-engineering lens, every step is a familiar SE practice — requirements, construction, verification, maintenance — applied where AI both builds the system and is the system. The platform is a worked example — not the answer; attendees who prefer to build from scratch take the methodology home.
Stand up a multi-tenant educational big-data platform with spec-driven, agentic AI development: write the spec, let agents implement it. Includes architecture decision flowcharts and explicit checkpoints for where AI-generated code still needs human verification.
Make AI systems legible with data visualization — CAM-style contribution maps and visual anomaly detection. Glass-box inspection, not just black-box pass/fail.
A six-layer AI-testing rubric covering code, pipeline, behavior, guardrail, governance, and drift, with a live LLM-as-Judge harness and a cross-model reproducibility schema.
The method is three repeating moves. The architecture below is one worked example, not the answer; your own system will differ.
A specification is one artifact at many levels of formality. Translate the intent downward and it becomes executable; at the far end, the spec is the tests — which is why building and verifying can share one source of truth.
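To make "the spec is the tests" concrete, here is a minimal sketch in which each requirement carries both its prose intent and an executable check. Every name in it (`RecordStore`, `Requirement`, `SPEC`, `verify`) is hypothetical and chosen for illustration; it is not taken from the Uedu codebase, and a real spec would sit at several levels of formality rather than one.

```python
from dataclasses import dataclass
from typing import Callable


class RecordStore:
    """Hypothetical toy store standing in for a multi-tenant platform."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, str]] = []  # (tenant_id, payload)

    def write(self, tenant_id: str, payload: str) -> None:
        self._rows.append((tenant_id, payload))

    def read(self, tenant_id: str) -> list[str]:
        # Only rows owned by the requesting tenant are returned.
        return [payload for owner, payload in self._rows if owner == tenant_id]


@dataclass(frozen=True)
class Requirement:
    """One spec entry: prose intent plus the check that makes it executable."""
    req_id: str
    intent: str
    check: Callable[[RecordStore], bool]


def _tenant_isolation(store: RecordStore) -> bool:
    store.write("tenant-a", "a-secret")
    store.write("tenant-b", "b-secret")
    return store.read("tenant-a") == ["a-secret"]


def _unknown_tenant_is_empty(store: RecordStore) -> bool:
    store.write("tenant-a", "a-secret")
    return store.read("tenant-z") == []


SPEC = [
    Requirement("R1", "A tenant reads only its own records.", _tenant_isolation),
    Requirement("R2", "An unknown tenant sees no records.", _unknown_tenant_is_empty),
]


def verify(spec: list[Requirement], make_store: Callable[[], RecordStore]) -> dict[str, bool]:
    """Run every requirement against a fresh store; the spec acts as the oracle."""
    return {req.req_id: req.check(make_store()) for req in spec}


if __name__ == "__main__":
    print(verify(SPEC, RecordStore))
```

The same `SPEC` list can be handed to an agent as the build brief and to `verify` as the acceptance test, which is the sense in which building and verifying share one source of truth.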
Dialogue, behavioral, vision, physiological, environmental, financial, and psychometric streams flow into one multi-tenant platform — which you then visualize and test. The same two lenses apply to whatever you build.
You take home the methodology — not our platform. By the end of the three hours you should be able to apply these five methods to your own project, in your own domain.
A repeatable, spec-driven method for agentic-AI development — write the spec, let agents implement it, and reuse the spec as your test oracle — with a checklist of where AI-generated code still needs human verification, transferable to your own platform in any domain.
A layered AI-testing rubric (code, pipeline, behavior, guardrail, governance, drift) and a runnable LLM-as-Judge harness with versioned prompts and cross-model reproducibility — drop it into your own evaluation pipeline.
Visualization methods (CAM-style contribution maps, visual anomaly detection) for inspecting why a model behaves as it does — to apply to your own models, not just the ones shown here.
A workflow that uses visualization as a tool inside AI testing — a reusable pattern for making any AI system both visible and verifiable.
A redacted, IRB-governed data-handling and SOP template you can adapt to your own institution and jurisdiction.
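For readers new to the CAM-style contribution maps mentioned in the list above, the following is a minimal sketch of the classic class-activation-map formulation: a per-pixel weighted sum of the last convolutional feature maps, min-max normalized for display. It is an illustration only, not IntegralCAM and not the tutorial's implementation. The function name and the toy inputs are assumptions; in practice the feature maps and class weights would come from a trained CNN, and an array library would replace the nested lists.

```python
def class_activation_map(feature_maps, class_weights):
    """Classic CAM: M(x, y) = sum_k w_k * f_k(x, y), then min-max scaled to [0, 1].

    feature_maps:  list of K maps, each a list of rows (height x width floats)
    class_weights: list of K weights for the class being explained
    """
    if len(feature_maps) != len(class_weights):
        raise ValueError("one weight per feature map is required")
    height, width = len(feature_maps[0]), len(feature_maps[0][0])

    # Accumulate the weighted sum over all K feature maps.
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]

    # Normalize so the map can be rendered as a heatmap overlay.
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    if hi == lo:
        return [[0.0] * width for _ in range(height)]
    return [[(value - lo) / (hi - lo) for value in row] for row in cam]


if __name__ == "__main__":
    # Two toy 2x2 feature maps and their (made-up) class weights.
    maps = [
        [[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 2.0]],
    ]
    print(class_activation_map(maps, [2.0, 0.5]))
```

High values mark the regions that contributed most to the chosen class under this formulation. The same caveat as in the decision matrix applies: it presumes a CNN with spatial feature maps.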
You leave with the methodology — these two tables help you decide which method to reach for, and audit what you have already covered, on your own system.
| If you need to… | Reach for | Taught in | Watch out for |
|---|---|---|---|
| Ship a platform fast with a small team | Agentic AI development plus a human-verification checklist | §2 Build | The checklist is non-negotiable — unreviewed AI-generated code is the failure mode. |
| See why a model focused where it did | CAM-style contribution visualization | §3 Visualize | Designed for CNNs; transformers and LLMs need attention or attribution analogues. |
| Find outliers and quality problems in image or sensor data at scale | Visual anomaly detection with incremental dimension reduction | §3 Visualize | Needs a baseline of normal examples; rare-but-valid cases can look anomalous. |
| Judge whether an LLM answer is actually good | LLM-as-Judge with a versioned prompt | §4 Test | Pin the judge model and prompt version, or scores drift between runs. |
| Catch cost, usage, or behavioral drift over time | The drift / anomaly layer of the six-layer rubric | §4 Test | Set thresholds from real baselines, not guesses. |
| Handle sensitive data across jurisdictions | IRB-governed data-handling SOP template | §4 Test | Adapt to local law — Taiwan PIPA, Japan APPI, EU GDPR. |
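The "pin the judge model and prompt version" advice in the LLM-as-Judge row can be sketched as a small configuration record with a fingerprint, so that two scores are compared only when they were produced under identical judge settings. This is an assumption-laden illustration: `JudgeConfig`, `make_record`, `comparable`, and the model identifier used in the demo are hypothetical, and no actual model call is made.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JudgeConfig:
    """Everything that can change a judge score, pinned in one place."""
    judge_model: str          # exact model identifier (hypothetical value in the demo)
    prompt_version: str       # human-readable version label
    prompt_text: str          # full judge prompt
    temperature: float = 0.0

    def fingerprint(self) -> str:
        """Short stable hash over all fields; changes if any field changes."""
        blob = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:12]


def make_record(config: JudgeConfig, item_id: str, score: int) -> dict:
    """Attach the judge settings to every score that is stored."""
    return {
        "item_id": item_id,
        "score": score,
        "judge_model": config.judge_model,
        "prompt_version": config.prompt_version,
        "config_fingerprint": config.fingerprint(),
    }


def comparable(a: dict, b: dict) -> bool:
    """Scores are comparable only if the judge configuration was identical."""
    return a["config_fingerprint"] == b["config_fingerprint"]


if __name__ == "__main__":
    v3 = JudgeConfig("judge-model-2026-01", "v3", "Rate the answer from 1 to 5.")
    v4 = JudgeConfig("judge-model-2026-01", "v4", "Rate correctness from 1 to 5.")
    print(comparable(make_record(v3, "q1", 4), make_record(v4, "q1", 3)))
```

Hashing the full prompt text as well as the version label guards against the case where the prompt is edited but the label is not bumped.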
| Layer | Ask of your own system | Method |
|---|---|---|
| Code | Is the AI-generated code reviewed where it matters? | Human-verification checkpoints (§2) |
| Pipeline | Do data and model pipelines fail loudly and reproducibly? | Pipeline tests plus a reproducibility schema (§4) |
| Behavior | Does the model actually answer correctly? | LLM-as-Judge with a versioned prompt (§4) |
| Guardrail | Are unsafe or out-of-scope outputs blocked? | Guardrail checks and adversarial probes (§4) |
| Governance | Is sensitive data handled lawfully across jurisdictions? | IRB-governed SOP template (§4) |
| Drift | Would you notice anomalies or drift after deployment? | Visual anomaly detection on the drift layer (§3 + §4) |
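For the Drift row, "set thresholds from real baselines, not guesses" can be illustrated with one simple rule: derive the alert threshold from the mean and standard deviation of an observed baseline window. The sketch below is one possible rule among many, not the tutorial's drift layer; the function names, the factor `k=3.0`, and the sample numbers are assumptions, and a mean-plus-k-sigma rule suits roughly symmetric metrics better than heavy-tailed ones.

```python
from statistics import mean, stdev


def baseline_threshold(baseline: list[float], k: float = 3.0) -> float:
    """Upper alert threshold: baseline mean plus k sample standard deviations."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline observations")
    return mean(baseline) + k * stdev(baseline)


def flag_drift(observations: list[float], threshold: float) -> list[int]:
    """Return the indices of observations that exceed the threshold."""
    return [i for i, value in enumerate(observations) if value > threshold]


if __name__ == "__main__":
    # Hypothetical daily usage figures from a period known to be normal.
    baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
    threshold = baseline_threshold(baseline)
    # Hypothetical post-deployment readings.
    print(flag_drift([101.0, 103.0, 110.0, 99.0], threshold))
```

Recomputing the baseline on a schedule, and versioning it alongside the judge configuration, keeps the threshold tied to how the system actually behaves.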
Build with generative AI, see inside the model with visualization, and verify behavior with AI testing — each segment pairs methods with a worked example and a take-home artifact.
The room proposes (or votes on) a small spec; a presenter builds it live with an agentic tool; together we find where the AI-generated code still needs human verification. A recorded fallback is staged in case live generation misbehaves.
Take the decision matrix and the six-layer self-assessment from this page and run them against your own system, on the spot — leaving with a filled-in plan, not just notes.
Slides and materials will be posted on this page after the session.
Questions or want to connect? Email us at [email protected].
This is not "AI instead of software engineering" — it is software engineering, at the intersection of two emerging areas: AI4SE (AI building software) and SE4AI (engineering AI-heavy systems). Every part of the tutorial maps onto a classic SE discipline.
| In this tutorial | Software-engineering discipline |
|---|---|
| Spec-driven development (SDD) | Requirements engineering · executable specifications |
| Agentic AI development + human-verification checkpoints | Software construction · AI-assisted development · code review · technical-debt control |
| Six-layer rubric · LLM-as-Judge · reproducibility | Software testing · verification & validation · quality assurance |
| Drift / anomaly layer | Software maintenance · evolution · runtime monitoring |
| Multi-tenant, edge-to-cloud platform | Software architecture · service-oriented & distributed systems |
| Data & model visualization (Teng-Yok Lee) | Program & model comprehension · debugging tools |
| IRB-governed data handling / SOP | Software process · compliance & governance |
CISOSE 2026 federates eight constituent conferences. Build · See · Trust covers service-oriented systems, AI testing, big data, and explainable AI in depth, with edge computing and IoT as adjacent topics. That cross-track integration is the structural reason CISOSE is the right venue for this content.
| Federated conference | Tutorial segment | Coverage |
|---|---|---|
| Service-Oriented Systems Engineering | §2 platform architecture · §5 demo | |
| AI Testing & Quality Assurance | §4 six-layer rubric · LLM-as-Judge | |
| Big Data & Machine Learning | §2 big-data backbone · §3 visualization · §4 testing at scale | |
| Cyber-Intelligence (overall) | §3 data visualization / model inspection | |
| Responsible AI | §4 governance / IRB sub-section | |
| Intelligent Mobile Computing | §2 edge-to-cloud wearable ingestion | |
| Smart Cities & IoT | Sensing data sources | |
| Decentralized Apps / Blockchain | Out of tutorial scope | — |
Chia-Kai Chang (National Central University) and Kuei-Hao Li (National Tsing Hua University) build and test a large AI education platform; Teng-Yok Lee (Mitsubishi Electric) contributes the data-visualization methods that power that testing. All three present in person at CISOSE 2026 in Fukuoka.
Founder and principal investigator of the Educational Omics Lab. Builds and operates Uedu, a multi-tenant AI tutoring platform deployed across multiple universities and developed largely through agentic AI development. Recent work spans large-scale learning analytics (ACM L@S 2026), educational big-data infrastructure (ICMET 2025), and a short paper in the CISOSE 2026 federation (IEEE BigDataService 2026). Holds an umbrella IRB approval for multimodal educational research.
Co-founder of the Uedu platform, with research interests in digital learning, AI-assisted instruction, and agentic AI. Co-presents how the platform was built with agentic AI development tools and how it is tested. Co-author of the team's recent work, including ACM L@S 2026 and ICMET 2025 (Educational Omics Data Lake). Focus: pedagogical design, cross-institutional deployment, and agentic development workflows.
Principal Researcher at Mitsubishi Electric working on data visualization and visual anomaly detection. PhD in scientific visualization from The Ohio State University, with highly cited work in IEEE TVCG and IEEE PacificVis. Recent work includes IntegralCAM, a method for estimating and visualizing CNN feature contributions (IEEE ICME 2025), and efficient large-scale visual anomaly detection (IEEE AVSS 2025; arXiv 2026), alongside multiple patents on anomaly and object detection. Brings deep visualization and high-performance-computing expertise to the problem of making AI systems and their data legible and inspectable in production.
The tutorial draws on, and points to, our team's recent peer-reviewed work. Each is positioned alongside the segment in which it is used as a worked example.
The slide deck, code references, LLM-as-Judge rubric, and IRB SOP template will be released here once the theme and co-presenter lineup are confirmed. Several of the underlying AI-testing artifacts already live inside the Uedu codebase and are referenced from our other publications.
Drafted; held privately while the scope is being revised. Inquiries are welcome by email.
In preparation — to be released once the lineup is confirmed.
The underlying LLM-as-Judge harness, AI-testing scaffolds, and prompt-versioning schema live inside the Uedu codebase and are referenced from our peer-reviewed publications. A tutorial-specific release will follow.
To be prepared as a tutorial handout. The umbrella IRB approval (NTU-REC 202507EM058) remains active and governs our other published work.
IEEE International Conference on Cyber Intelligence and Software-Oriented Service Engineering (CISOSE 2026). July 27 – 30, 2026, Fukuoka, Japan.
https://cisose.fit.ac.jp/2026/
Registration is handled through the CISOSE 2026 main conference website. This tutorial is part of the CISOSE 2026 program.