05/13/2025 | Press release | Distributed by Public on 05/13/2025 19:20
Organizations are rushing to take advantage of AI applications in hopes of transforming their operations, products, and customer experiences. According to some estimates, upward of 80% of all global businesses are now using or exploring the use of at least one AI tool in areas ranging from accounting, inventory, and supply chain management to customer service, recruiting, and more.
Regardless of the use case, the data used is critical to the success of any AI application. Issues with data quality and relevance are frequently cited as the most common causes of failure in AI projects. Without good data, efforts to implement AI applications put companies at risk of wasting time and money, as well as organizational motivation and trust. Creating a strong foundation for a new AI tool means focusing first on whether you have enough of the right kind of data for training and development - which is often more important than the AI tool itself.
Today, "AI" is often used as a blanket term that covers many different technologies, from simple algorithms that offer results based on user interaction to complex large language models (LLMs) that can hold coherent conversations. In this article, we'll explore machine learning (ML), a broad branch of AI that uses algorithms to find patterns in data. ML algorithms predict new data based on what they recognize in existing data. ML is used for speech recognition, recommendation systems, fraud detection, image processing, medical imaging, and more.
Data quality is key to training ML models. Data errors and uncertainty can propagate from the point of measurement to the training dataset, leading to poor results. The first step in vetting data quality is to assess data completeness. For example, if patient information in a database is required to include quantities of medication prescribed, any record missing quantities is considered incomplete. Likewise, data should be validated, meaning it should conform to data or business rules, such as format, allowable data types, and numerical ranges. Incorrect data values - like the age of physical assets derived from survey records or values derived from medical testing - can degrade model performance, potentially resulting in incorrect predictions with serious consequences.
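As a rough illustration of the completeness and validation checks described above, the sketch below flags records that are missing required fields or that break simple type and range rules. The field names, ranges, and sample records are hypothetical and chosen only for this example; real rules would come from the project's own data and business requirements.

```python
from typing import Any

# Hypothetical rules for illustration only.
REQUIRED_FIELDS = ("patient_id", "medication", "quantity")
NUMERIC_RANGES = {"quantity": (1, 1000), "age": (0, 120)}


def check_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    # Validity: numeric fields must be numbers within the allowed range.
    for field, (low, high) in NUMERIC_RANGES.items():
        value = record.get(field)
        if value is None or value == "":
            continue  # absence is already reported by the completeness check
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field} is not numeric")
        elif not low <= value <= high:
            problems.append(f"{field} out of range")
    return problems


# Made-up sample records, not real data.
records = [
    {"patient_id": "p1", "medication": "drug_a", "quantity": 30, "age": 54},
    {"patient_id": "p2", "medication": "drug_b", "quantity": None, "age": 61},
    {"patient_id": "p3", "medication": "drug_c", "quantity": 20, "age": 430},
]
report = {r["patient_id"]: check_record(r) for r in records}
print(report)
```

Reporting problems per record, rather than silently dropping rows, keeps the decision about how to handle incomplete or invalid data with the people who understand the process that produced it.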
In many cases, data should also be timely, meaning it is current or was all collected within the same time frame. Conditions change over time, and outdated data can skew AI model results, leading to poor decisions and business outcomes. Data consistency is also key: all representations of a particular item across multiple data stores should ideally match. For example, if information about physical assets is stored in inspection and maintenance records as well as in a separate system that documents repairs, the key details need to match so records can be joined on their overlapping fields.
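The consistency and timeliness checks described above can be sketched as follows. The two data stores, their field names, the asset IDs, and the five-year staleness threshold are all assumptions made up for this example, not details from any particular system.

```python
from datetime import date, timedelta

# Hypothetical extracts from two systems describing the same assets.
inspections = {
    "A-100": {"install_year": 1998, "material": "steel", "last_updated": date(2025, 3, 1)},
    "A-200": {"install_year": 2005, "material": "pvc", "last_updated": date(2019, 6, 15)},
}
repairs = {
    "A-100": {"install_year": 1998, "material": "steel"},
    "A-200": {"install_year": 2007, "material": "pvc"},
    "A-300": {"install_year": 2011, "material": "copper"},
}
SHARED_FIELDS = ("install_year", "material")


def find_mismatches(left, right, fields):
    """For keys present in both stores, list the shared fields that differ."""
    mismatches = {}
    for key in sorted(left.keys() & right.keys()):
        differing = [f for f in fields if left[key].get(f) != right[key].get(f)]
        if differing:
            mismatches[key] = differing
    return mismatches


def find_stale(store, as_of, max_age_days):
    """Return keys whose last update is older than the allowed age."""
    cutoff = as_of - timedelta(days=max_age_days)
    return sorted(k for k, rec in store.items() if rec["last_updated"] < cutoff)


mismatches = find_mismatches(inspections, repairs, SHARED_FIELDS)
# Keys that appear in only one store cannot be joined at all.
unmatched = sorted(inspections.keys() ^ repairs.keys())
stale = find_stale(inspections, as_of=date(2025, 5, 13), max_age_days=5 * 365)
print(mismatches, unmatched, stale)
```

Running checks like these before joining the stores shows which records disagree, which have no counterpart, and which may be too old to trust.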
Every project is unique, and data quality is ultimately defined by how it will be used and the business goals. In many cases, the best way to improve data quality is to improve data collection processes moving forward. New data can be collected to supplement the existing dataset if appropriate - using defined and accurate sources to help ensure results remain valid and usable. If existing data does not meet all standards for the project, process knowledge and domain expertise can be used to ensure ML can take advantage of existing data.
There are also many technical ways to improve and prepare datasets for use in ML training. Data processing serves to clean, harmonize, or otherwise transform data to engineer features appropriate for ML algorithms. For example, images used for training may first need to be centered, cropped, filtered, or converted to monochrome. Information can also be transformed to reduce unnecessary data that could cause issues during machine learning. Processing data in this manner is critically important and should be well planned and executed.
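To make the image example concrete, here is a minimal sketch of two of the steps mentioned above: converting to monochrome and cropping around the center. It represents an image as nested lists of (r, g, b) tuples to stay dependency-free, which is an assumption for illustration; a real pipeline would more likely rely on an imaging library. The luminance weights are the widely used ITU-R BT.601 values, picked here as one reasonable option.

```python
def to_grayscale(image):
    """Convert rows of (r, g, b) tuples to rows of integer luminance values."""
    # ITU-R BT.601 luma weights, a common choice for monochrome conversion.
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in image
    ]


def center_crop(image, height, width):
    """Return the height x width region centered in the image."""
    rows, cols = len(image), len(image[0])
    if height > rows or width > cols:
        raise ValueError("crop size exceeds image size")
    top = (rows - height) // 2
    left = (cols - width) // 2
    return [row[left:left + width] for row in image[top:top + height]]


# A made-up 4x4 test image: white 2x2 center on a black border.
BLACK, WHITE = (0, 0, 0), (255, 255, 255)
image = [
    [BLACK, BLACK, BLACK, BLACK],
    [BLACK, WHITE, WHITE, BLACK],
    [BLACK, WHITE, WHITE, BLACK],
    [BLACK, BLACK, BLACK, BLACK],
]
prepared = center_crop(to_grayscale(image), height=2, width=2)
print(prepared)
```

Applying the same fixed sequence of steps to every image, in training and later in production, helps keep the model's inputs comparable.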
When preparing to deploy AI, it's wise to take a measured approach. Begin with the fundamental business question AI is meant to answer and ensure you have the data to answer it before looking for an AI tool. In some cases, AI might not be the best tool for the job, and there may be other methods that will work just as well. For example, engineering models, rule-based systems, improved data management processes, and even human expertise could deliver the answers you need without the time and energy required to train a large-scale AI model. Whichever approach you choose, however, it's crucial to ensure that the data you use is accurate, readable, and abundant.