PhD Candidate · Iowa State University · AI Reliability Lab

Researching Reliable AI Systems Through Software Engineering, Evaluation, and Model Behavior Analysis.

I am Anwar Hossain Zahid, a computer science PhD candidate advised by Prof. Wei Le. My work builds debugging and evaluation infrastructure for machine learning systems whose behavior shifts across model versions, compilers, GPU platforms, and real-world deployment conditions.

ML Model Debugging · AI Reliability · Software Engineering

Building tools for hallucination detection, numerical instability, cross-GPU analysis, and trustworthy AI evaluation.

About

Research engineering for AI systems that must be trusted.

My work connects program analysis, machine learning, numerical computing, and production software engineering. Before graduate school, I spent four years building systems for government, banking, AI, and mobile platforms.

I study how AI systems fail, drift, hallucinate, and become numerically unstable. My research focuses on building practical tools that expose these failures early: differential model debugging, soft assertions for unstable ML code, cross-vendor GPU numerical testing, and evaluation workflows for LLM behavior.

At Iowa State University, I work in the Program Analysis and AI Lab. At Lawrence Livermore National Laboratory, I worked on floating-point precision analysis across NVIDIA and AMD GPU platforms for high-performance scientific workloads.

Python · PyTorch · TensorFlow · C/C++ · CUDA/HIP · Program Analysis · ML Testing · LLM Evaluation · Docker

6 · Core research tracks across AI reliability, debugging, evaluation, and HPC systems.
4+ · Years of industry software engineering before PhD research.
13 · Previously unknown real-world ML bugs found by Soft Assertion Fuzzer.
GPU · Cross-platform numerical analysis spanning NVIDIA and AMD systems.

Research

Failure analysis for next-generation AI systems.

I work on reliability problems that appear when models, compilers, hardware, datasets, and user contexts interact in ways ordinary tests miss.

01

Hallucination Detection

Evaluation workflows for identifying unsupported, inconsistent, or context-sensitive LLM outputs.

LLMs · Evaluation

02

ML Debugging

Differential methods for comparing model versions, surfacing behavioral regressions, and localizing failure causes.

Model Versions · Testing

03

Numerical Stability

Soft assertions and guided fuzzing to trigger silent floating-point failures in ML applications.

FSE 2025 · Fuzzing

04

AI Reliability

Practical reliability techniques for detecting hidden defects before AI systems reach users.

Trustworthy AI · SE4AI

05

LLM Evaluation

Testing LLM behavior under geographical, social, and contextual variation in high-impact classification tasks.

Social Good · Benchmarks

06

HPC Systems

Numerical reproducibility and portability analysis across heterogeneous GPU architectures and compiler stacks.

CUDA · ROCm · LLNL

Research Projects

Research prototypes for reproducible evidence.

These prototypes make model behavior measurable and support controlled experiments and research workflows for studying AI reliability in practice.

AI RESEARCH MONITORING

Today’s AI

A compact AI news and research tracking concept for following fast-moving model, tooling, and policy changes.

AI Trends · Knowledge Systems

LLM BEHAVIOR EVALUATION

LLM Evaluation Tools

Testing harnesses for social-good classification, geography-aware hate speech detection, and model behavior analysis.

HuggingFace · Transformers

Publications

Selected research output.

Publications and preprints spanning numerical instability detection, LLM evaluation, GPU numerical testing, and software engineering education systems.

FSE 2025 · PACMSE

Automatically Detecting Numerical Instability in Machine Learning Applications via Soft Assertions

Introduces learned soft assertions for triggering and detecting hidden numerical bugs in ML applications.

arXiv:2502.19612

Evaluation of Hate Speech Detection Using Large Language Models and Geographical Contextualization

Studies how geography and context affect LLM-based hate speech detection in social-good settings.

arXiv:2410.09172

Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs

Analyzes cross-vendor numerical differences in GPU computations and their implications for reproducibility.

ICCIT 2020

A Conceptual Design of Virtual Internship System

Proposes a virtual internship platform for benchmarking software development skills across academic and industry settings.

Experience

Academic research grounded in real systems.

My background spans PhD research, national laboratory systems work, and software engineering in government, banking, AI, and device platforms.

PhD Candidate · Iowa State University

Researching ML debugging, AI reliability, numerical instability, and software engineering for AI systems with Prof. Wei Le.

Computing Graduate Intern · Lawrence Livermore National Laboratory

Built tooling for cross-platform floating-point precision analysis across heterogeneous GPU architectures.

Software Engineer · Reve Systems

Worked on ERP, procurement, accounting, and compliance systems for government operations.

Software Engineer · ERA-InfoTech Ltd.

Built FinTech chatbot, face recognition, remittance integration, and RPA systems for banking workflows.

Junior Software Developer · Walton Digi-Tech Industries Limited

Developed Android ROM and feature-phone operating system enhancements for consumer devices.

Contact

Let’s build reliable AI systems.

I am interested in research collaboration, internships, and engineering conversations around ML reliability, LLM evaluation, numerical correctness, AI safety, and systems that connect research ideas with working software.