Neural Defect Predictors: Training Dynamics Under Data Quality Issues

Emmanuel Charleson Dapaah, Philip Makedonski, Jens Grabowski · June 25, 2026

Summary

This research investigates how coupled data-quality issues, such as class imbalance and class overlap, affect the internal training dynamics of neural networks used for software defect prediction. It proposes a controlled study that characterizes these dynamics rather than reporting endpoint performance alone.

Software defect prediction (SDP) supports software quality work by guiding decisions about testing, release risk, and quality monitoring. Defect datasets often suffer from data-quality issues, in particular class imbalance and class overlap, which frequently occur together in real-world data. Previous studies focused on how these issues affect the final performance of defect prediction models; this research instead examines what happens inside the neural network during training. The study aims to understand how coupled data-quality problems show up in training dynamics such as gradients, weights, biases, and error trajectories. In a controlled intervention study on specific datasets, the researchers will train a fixed Multi-Layer Perceptron (MLP) under three conditions: imbalance only, overlap only, and both combined. This design allows training patterns to be characterized in detail using effect sizes, trajectories, and sensitivity analyses.
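To make the study design more concrete, here is a minimal sketch of the idea: train the same small MLP under each condition and log per-epoch training signals. Everything specific in it is an assumption for illustration and not taken from the paper: the synthetic two-feature data (the study uses real datasets), the extra "baseline" condition, the helper names `make_data`, `train_mlp`, and `run_conditions`, the network size, and the choice of logged signals (mean loss, root-mean-square per-sample gradient norm, and output bias).

```python
import math
import random

# Hypothetical condition grid. "baseline" is added here only as a reference point.
CONDITIONS = {
    "baseline":       {"minority_frac": 0.5, "separation": 4.0},
    "imbalance_only": {"minority_frac": 0.1, "separation": 4.0},
    "overlap_only":   {"minority_frac": 0.5, "separation": 1.0},
    "combined":       {"minority_frac": 0.1, "separation": 1.0},
}

def make_data(n, minority_frac, separation, rng):
    """Synthetic binary data: minority_frac sets imbalance, and a smaller
    distance between class centres (separation) produces more overlap."""
    n_pos = round(n * minority_frac)
    data = []
    for i in range(n):
        y = 1 if i < n_pos else 0
        centre = separation / 2 if y == 1 else -separation / 2
        data.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], y))
    rng.shuffle(data)
    return data

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_mlp(data, hidden=4, epochs=30, lr=0.1, seed=0):
    """Train a fixed one-hidden-layer MLP with per-sample SGD and log dynamics."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    history = []
    for epoch in range(epochs):
        loss_sum, grad_sq_sum = 0.0, 0.0
        for x, y in data:
            # Forward pass: tanh hidden layer, sigmoid output.
            h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                 for row, b in zip(W1, b1)]
            p = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
            p = min(max(p, 1e-12), 1 - 1e-12)
            loss_sum += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            # Backward pass: for cross-entropy with a sigmoid output, dL/dz = p - y.
            dz2 = p - y
            gW2 = [dz2 * hi for hi in h]
            dh = [dz2 * w * (1 - hi * hi) for w, hi in zip(W2, h)]  # also the b1 gradient
            gW1 = [[d * xi for xi in x] for d in dh]
            grad_sq_sum += (dz2 ** 2 + sum(g * g for g in gW2)
                            + sum(d * d for d in dh)
                            + sum(g * g for row in gW1 for g in row))
            # SGD update.
            W2 = [w - lr * g for w, g in zip(W2, gW2)]
            b2 -= lr * dz2
            W1 = [[w - lr * g for w, g in zip(row, grow)]
                  for row, grow in zip(W1, gW1)]
            b1 = [b - lr * d for b, d in zip(b1, dh)]
        history.append({
            "epoch": epoch,
            "loss": loss_sum / len(data),
            "grad_norm": math.sqrt(grad_sq_sum / len(data)),
            "output_bias": b2,
        })
    return history

def run_conditions(n=200, seed=42):
    """Same architecture, same seed, same hyperparameters; only the data differs."""
    results = {}
    for name, cfg in CONDITIONS.items():
        data = make_data(n, rng=random.Random(seed), **cfg)
        results[name] = train_mlp(data, seed=seed)
    return results

if __name__ == "__main__":
    for name, hist in run_conditions().items():
        last = hist[-1]
        print(f"{name:15s} loss={last['loss']:.3f} "
              f"grad_norm={last['grad_norm']:.3f} bias={last['output_bias']:.3f}")
```

Holding the architecture, seed, and hyperparameters fixed means any difference between the logged trajectories can be attributed to the data condition. A real replication would use actual defect datasets and a proper training framework instead of this toy loop.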

Why it matters

Understanding how data quality issues impact neural network training dynamics can lead to more robust and reliable software defect prediction models. Professionals can use these insights to diagnose model failures more effectively and develop better strategies for data preprocessing and model training.

How to implement this in your domain

  1. Analyze existing software defect prediction datasets for class imbalance and overlap.
  2. Implement data augmentation or re-sampling techniques to mitigate identified data quality issues.
  3. Monitor internal training dynamics (e.g., gradients, loss curves) of neural defect predictors to detect early signs of instability.
  4. Experiment with different neural network architectures and regularization methods to improve robustness against coupled data issues.
  5. Validate model performance not just on accuracy but also on metrics sensitive to class distribution, like F1-score or AUC.
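Steps 1, 2, and 5 above can be sketched with a few small helpers. This is an illustrative sketch using only the Python standard library, not a method described in the paper. The function names are hypothetical, and the nearest-neighbour disagreement rate is one assumed proxy for class overlap among several possible measures.

```python
import math
import random
from collections import Counter

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def nn_disagreement_rate(features, labels):
    """Fraction of instances whose nearest other instance has a different label.

    A rough overlap proxy (an assumption of this sketch). It is O(n^2), so it
    is only suitable for small datasets.
    """
    disagreements = 0
    for i, xi in enumerate(features):
        nearest = min((j for j in range(len(features)) if j != i),
                      key=lambda j: math.dist(xi, features[j]))
        disagreements += labels[nearest] != labels[i]
    return disagreements / len(features)

def random_oversample(features, labels, rng):
    """Duplicate randomly chosen minority instances until both classes match in size."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    pool = [i for i, y in enumerate(labels) if y == minority]
    extra = [rng.choice(pool) for _ in range(max(counts.values()) - counts[minority])]
    idx = list(range(len(labels))) + extra
    return [features[i] for i in idx], [labels[i] for i in idx]

def accuracy_and_f1(y_true, y_pred, positive=1):
    """Accuracy plus F1 for the positive (defective) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

if __name__ == "__main__":
    # Toy data: four non-defective modules (0) and one defective module (1).
    X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
    y = [0, 0, 0, 0, 1]
    print("imbalance ratio:", imbalance_ratio(y))
    print("overlap proxy:", nn_disagreement_rate(X, y))
    X_bal, y_bal = random_oversample(X, y, random.Random(0))
    print("class counts after oversampling:", Counter(y_bal))
    # A model that always predicts "non-defective" on a 9:1 dataset.
    print("accuracy, F1:", accuracy_and_f1([0] * 9 + [1], [0] * 10))
```

The last line shows why step 5 matters: a predictor that never flags a defect reaches 0.9 accuracy on a 9:1 dataset while its F1-score for the defective class is 0.0. Oversampling should be applied to the training split only, so that evaluation still reflects the original class distribution.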

Who benefits

Software Development, Quality Assurance, Cybersecurity, IT Services

Key takeaways

  • Coupled data-quality issues, such as class imbalance and class overlap, can affect neural network training dynamics in software defect prediction, not just final performance.
  • Monitoring internal training patterns provides deeper insights than just evaluating endpoint performance.
  • The research aims to develop an empirical protocol and taxonomy for understanding these complex interactions.
  • Improved understanding can lead to more robust and reliable defect prediction models.

Original post by Emmanuel Charleson Dapaah, Philip Makedonski, Jens Grabowski

"arXiv:2606.24968v1 Announce Type: new Abstract: Context: Software defect prediction supports maintenance decisions such as testing prioritization, release-risk assessment, and quality monitoring. However, metric-based SDP datasets often contain coupled data-quality issues, especi…"

