23 October 2024 | 4 minutes of reading time
In most machine learning (ML) use cases, whether the task is supervised or unsupervised, the input data must be structured in a specific way to enable effective model training. Typically, each record in an input table represents a single entity, whether it's a customer, user, or product, and contains various features (columns) that describe that entity in detail. When one-to-many relationships exist (e.g., a customer with multiple transactions), these must be aggregated to fit within a single record.
This structured approach ensures that the model can analyze each entity individually and use the enriched information to make accurate predictions or groupings.
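As a minimal sketch using pandas, with made-up column names and values, collapsing a one-to-many transactions table into one enriched record per customer might look like this:

```python
import pandas as pd

# Hypothetical one-to-many table: several transactions per customer.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount":      [20.0, 35.5, 12.0, 8.5, 40.0, 99.9],
    "date":        pd.to_datetime([
        "2024-01-05", "2024-03-12", "2024-02-01",
        "2024-02-20", "2024-04-02", "2024-01-30",
    ]),
})

# Collapse to one enriched record per customer: counts, totals, recency.
features = transactions.groupby("customer_id").agg(
    n_transactions=("amount", "count"),
    total_spent=("amount", "sum"),
    avg_amount=("amount", "mean"),
    last_purchase=("date", "max"),
).reset_index()

print(features)  # one row per customer_id, ready to join with other features
```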
For most models to function effectively, structuring your data into one record per entity is essential. Think of use cases such as churn prediction, response modeling, or customer segmentation.
In each of these examples, the single record per entity proves very valuable to the model's success.
While this approach works for most cases, there are exceptions, such as time-series forecasting. In use cases like sales forecasting, weather prediction, or stock market analysis, the input data is not reduced to a single row per subject. Instead, each row represents a specific point in time, and the model learns from the sequential data rather than from aggregated columns.
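By contrast, a forecasting input keeps one row per time step and derives sequential features such as lags and rolling averages. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical daily sales series: one row per point in time, not per entity.
sales = pd.DataFrame({
    "date":  pd.date_range("2024-01-01", periods=10, freq="D"),
    "sales": [120, 135, 128, 150, 160, 155, 170, 165, 180, 175],
}).set_index("date")

# Instead of aggregating away the time dimension, derive sequential features
# such as lags and rolling means that a forecasting model can learn from.
sales["sales_lag_1"] = sales["sales"].shift(1)
sales["sales_lag_7"] = sales["sales"].shift(7)
sales["rolling_mean_3"] = sales["sales"].rolling(3).mean()

print(sales.tail())
```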
Over time, we’ve developed a structured approach to preparing input tables for ML models. This method helps us systematically consider which features should be included, ensuring that all relevant aspects of the entity are captured. We always start by selecting the unique identifier for each entity (e.g., customer ID, user ID) and, in supervised learning cases, the target variable we aim to predict (e.g., churn yes/no, response yes/no). We then categorize the remaining features into five distinct groups, one of which is optional.
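As a minimal sketch of how such a table comes together (the feature-group names and values below are illustrative placeholders, not the actual groups from our approach), we start from the identifier and the target and join each block of grouped features onto it:

```python
import pandas as pd

# Hypothetical building blocks; the group names are placeholders.
base = pd.DataFrame({"customer_id": [1, 2, 3], "churn": [0, 1, 0]})               # ID + target
demographics = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 51, 29]})      # one feature group
behaviour = pd.DataFrame({"customer_id": [1, 2, 3], "logins_90d": [12, 3, 40]})   # another feature group

# Start from the unique identifier and the target, then join each feature group.
input_table = (
    base
    .merge(demographics, on="customer_id", how="left")
    .merge(behaviour, on="customer_id", how="left")
)

print(input_table)  # one row per customer: ID, target, and grouped features
```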
This structured approach helps ensure that all relevant aspects of the entity’s relationship with the company and its products are captured. Aggregation methods may vary depending on the use case, and we adjust the level of aggregation (e.g., week, month, year) based on initial results.
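For illustration, adjusting the aggregation level in pandas can be as simple as changing the frequency passed to a time-based grouper; the data and column names below are made up:

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [20.0, 35.5, 15.0, 12.0, 40.0],
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-11",
                            "2024-01-08", "2024-03-02"]),
})

# Monthly spend per customer; switch freq to "W" (weekly) or "YS" (yearly)
# to change the aggregation level and compare model results.
monthly = (
    transactions
    .groupby(["customer_id", pd.Grouper(key="date", freq="MS")])["amount"]
    .sum()
    .unstack(fill_value=0)  # one row per customer, one column per month
)

print(monthly)
```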
A common question when preparing data for machine learning is how many features (columns) should be included in your input table. Aggregation often leads to a broad table with many columns, and managing this balance is key. A commonly suggested guideline is to aim for a maximum of around 10-15 columns for every 1,000 records. In practice, however, the complexity of the problem and the nature of the data can lead to much broader tables.
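Purely as an illustration of that rule of thumb (not a hard limit), a quick back-of-the-envelope check might look like this:

```python
def max_features_rule_of_thumb(n_rows: int, features_per_1000: int = 15) -> int:
    """Rough upper bound on the number of columns for a given row count,
    following the ~10-15 features per 1,000 records guideline."""
    return max(1, round(n_rows / 1000 * features_per_1000))

print(max_features_rule_of_thumb(5_000))   # -> 75
print(max_features_rule_of_thumb(50_000))  # -> 750
```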
The “Curse of Dimensionality” is an important concept to consider. When the number of features becomes too large relative to the amount of data, model performance can degrade due to overfitting. This occurs when the model learns noise in the data rather than the underlying patterns. To mitigate this, techniques such as Principal Component Analysis (PCA) or feature selection methods can help reduce the number of columns without losing valuable information.
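As a sketch of both options using scikit-learn on synthetic data (the shapes, thresholds, and variable names are chosen purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic wide table: 1,000 rows, 80 correlated columns driven by 10 factors.
rng = np.random.default_rng(42)
latent = rng.normal(size=(1000, 10))
X = latent @ rng.normal(size=(10, 80)) + 0.1 * rng.normal(size=(1000, 80))
y = (latent[:, 0] > 0).astype(int)  # toy binary target

# Option 1: project onto fewer components that retain 95% of the variance.
X_pca = PCA(n_components=0.95).fit_transform(X)

# Option 2: keep only the 15 columns most associated with the target.
X_selected = SelectKBest(score_func=f_classif, k=15).fit_transform(X, y)

print(X.shape, X_pca.shape, X_selected.shape)
```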
For a deeper understanding, you can explore Wikipedia’s entries on the Curse of Dimensionality and Overfitting, which provide more context on managing wide tables in machine learning.
For most machine learning projects, creating a well-structured input table is key to successful model training. Whether you're building a supervised model for churn or response prediction, or an unsupervised model for customer segmentation, structuring your data so that each entity is represented by a single enriched record is essential.
Our structured approach to input tables, with clear groupings of features, ensures that all relevant aspects of the entity are captured. This method allows us to prepare data that is both comprehensive and flexible, ready to be fed into machine learning models. And while aggregation techniques and the number of columns may vary based on the use case, the thoughtful preparation of input data often makes all the difference in the model's final performance.
As always, balancing the number of features with the amount of data is crucial, and experimenting with feature selection and aggregation methods is often necessary to achieve optimal results.
We provide custom solutions tailored to your organization at a great price. No huge projects with months of lead time; we deliver in weeks.