Enhancing Data Quality through Adaptive AI Techniques in Scalable Data Engineering Pipelines
Keywords:
Adaptive AI, Data Quality, Data Engineering Pipelines, Machine Learning, Anomaly Detection, Online Learning, Scalability, Automated Data Cleaning
Abstract
Modern data engineering pipelines scale well but often struggle with dynamic, complex data quality (DQ) issues that traditional rule-based systems cannot adequately address. This paper explores the integration of adaptive Artificial Intelligence (AI) techniques, including online learning, anomaly detection ensembles, and meta-learning, into scalable data pipelines to achieve automated, self-improving data quality management. We propose a reference architecture that embeds AI models at critical stages of the pipeline to perform real-time validation, profiling, and correction. The paper details two key adaptive mechanisms: a feedback loop for continuous model refinement and a quality-aware routing system. Preliminary conceptual analysis suggests that such an adaptive approach can significantly reduce false-positive rates, improve data trustworthiness, and lower operational overhead compared with static DQ frameworks, especially in environments with evolving data schemas and semantics.
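To make the two adaptive mechanisms named above more concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a single numeric field, substitutes a simple online z-score detector (Welford running statistics) for the anomaly detection ensemble, and uses hypothetical names throughout: `OnlineZScoreDetector`, `route`, the `step` size, and the route labels `"clean"`, `"review"`, and `"quarantine"` are illustrative choices only.

```python
import math
from dataclasses import dataclass


@dataclass
class OnlineZScoreDetector:
    """Online detector: running mean/variance plus a feedback-adjusted threshold.

    All names and default values here are hypothetical illustrations.
    """
    threshold: float = 3.0  # z-score above which a record is flagged
    step: float = 0.1       # assumed size of each feedback adjustment
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0         # running sum of squared deviations (Welford)

    def score(self, x: float) -> float:
        """Return the absolute z-score of x under the current statistics."""
        if self.n < 2:
            return 0.0  # too little history to judge
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def update(self, x: float) -> None:
        """Fold one accepted value into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def feedback(self, was_false_positive: bool) -> None:
        """Feedback loop: loosen after a false positive, tighten otherwise."""
        if was_false_positive:
            self.threshold += self.step
        else:
            self.threshold = max(1.0, self.threshold - self.step)


def route(detector: OnlineZScoreDetector, x: float) -> str:
    """Quality-aware routing: pick a (hypothetical) destination by score."""
    z = detector.score(x)
    if z > 2 * detector.threshold:
        return "quarantine"  # far outside the learned range
    if z > detector.threshold:
        return "review"      # borderline, send to a human or repair step
    detector.update(x)       # learn only from accepted records
    return "clean"
```

In this sketch a reviewer who marks a flagged record as valid would call `detector.feedback(was_false_positive=True)`, which raises the threshold so that similar records later route to `"clean"`. Learning only from accepted records is one possible design choice; it keeps extreme values from distorting the statistics, at the cost of adapting more slowly to genuine distribution shifts.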