Structuring Input Tables for Machine Learning Models: A Practical Approach

23 October 2024 | 4 minutes of reading time

In most machine learning (ML) use cases, regardless of whether the task is supervised or unsupervised, the input data must be structured in a specific way to enable effective model training. Typically, each record in an input table represents a single entity, whether it’s a customer, user, or product, and contains various features (columns) that describe the entity in detail. When there are one-to-many relationships (e.g., a customer with multiple transactions), these must be aggregated to fit within a single record.

This structured approach ensures that the model can analyze each entity individually and use the enriched information to make accurate predictions or groupings.
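
To make that aggregation step concrete, here is a minimal sketch using pandas; the table and column names (customer_id, amount, date) are hypothetical, but the pattern of collapsing many transaction rows into one enriched record per customer is the general idea:

```python
import pandas as pd

# Hypothetical one-to-many data: several transactions per customer.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [20.0, 35.5, 12.0, 80.0, 5.0, 49.9],
    "date": pd.to_datetime([
        "2024-01-05", "2024-03-12", "2024-02-01",
        "2024-02-20", "2024-04-02", "2024-03-30",
    ]),
})

# Aggregate to one record per customer: counts, totals, and recency.
customer_features = (
    transactions
    .groupby("customer_id")
    .agg(
        n_transactions=("amount", "size"),
        total_spend=("amount", "sum"),
        avg_spend=("amount", "mean"),
        last_purchase=("date", "max"),
    )
    .reset_index()
)

print(customer_features)
```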

Common Use Cases: When One Record Per Entity is Necessary

In many use cases, structuring your data as one record per entity is essential for the model to function effectively. Let’s look at a few examples:

  • Churn Prediction (Supervised):
    In churn prediction, each row represents a customer (or a subscription). Features may include customer tenure, engagement metrics, product usage, and interactions like the number of support tickets or emails received. Aggregating these data points into a single row per customer enables models like decision trees, random forests, or support vector machines (SVMs) to predict the likelihood of customer churn.
  • Response Models (Supervised):
    Response models are used to predict whether a customer will respond to a marketing campaign, typically with a binary outcome (yes/no). Features might include historical response behavior, the number of emails received in the last week, and purchase frequency. The model analyzes these aggregated features per customer to predict who is most likely to respond, enabling more targeted marketing efforts.
  • Customer Segmentation (Unsupervised):
    In customer segmentation, the goal is to group customers based on shared behaviors, demographics, or purchasing patterns. Each row represents a customer enriched with features like demographics, purchase history, and engagement metrics. Clustering algorithms such as k-means are then used to identify meaningful customer segments.

In each of these examples, the single enriched record per entity is central to the model’s success.
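
To make the churn case concrete, below is a minimal sketch (not a production setup) that trains a random forest on a one-row-per-customer table; the feature names and values are invented purely for illustration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical input table: one enriched record per customer.
customers = pd.DataFrame({
    "tenure_months":       [3, 26, 14, 2, 40, 8, 31, 5],
    "support_tickets":     [4, 0, 1, 6, 0, 3, 1, 5],
    "emails_last_week":    [0, 2, 1, 0, 3, 1, 2, 0],
    "purchases_last_year": [1, 12, 6, 0, 20, 2, 9, 1],
    "churned":             [1, 0, 0, 1, 0, 1, 0, 1],  # target variable
})

X = customers.drop(columns="churned")
y = customers["churned"]

# Hold out a few rows to check that the model generalises beyond the training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Churn probabilities:", model.predict_proba(X_test)[:, 1])
```

The same table shape, with a different target column, serves a response model; for segmentation, the target column is simply dropped and a clustering algorithm such as k-means is fitted instead.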

While this approach works for most cases, there are exceptions, most notably time-series forecasting. In use cases like sales forecasting, weather prediction, or stock market analysis, the input data is not reduced to a single row per subject. Instead, each row represents a specific point in time, and the model learns from the sequential data rather than from aggregated columns.
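
For contrast, here is a minimal sketch of that time-series shape, with invented daily sales figures: one row per date, enriched with simple lag and rolling-window features rather than collapsed into a single record:

```python
import pandas as pd

# Hypothetical daily sales: one row per date, not one row per entity.
sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "units_sold": [12, 15, 14, 20, 22, 19, 25, 27, 24, 30],
})

# Sequential features: yesterday's sales and a trailing 7-day average.
sales["lag_1"] = sales["units_sold"].shift(1)
sales["rolling_mean_7"] = sales["units_sold"].rolling(window=7).mean()

print(sales)
```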

Structuring Input Tables for Machine Learning: Our Approach

Over time, we’ve developed a structured approach to preparing input tables for ML models. This method helps us systematically consider the features that should be included, ensuring that all relevant aspects of the entity are captured. We always start by selecting the unique identifier for each entity (e.g., customer ID, user ID) and, in supervised learning cases, the target variable we aim to predict (e.g., churn yes/no, response yes/no). We then categorize the features into five distinct groups, one of which is optional; a sketch of how these groups come together follows the list:

  1. Group 1: Entity-Specific Variables
    These features describe the entity itself. For customers, this could include demographic details such as age, gender, or country. These are static features that hold regardless of the entity’s relationship with the company or product.
  2. Group 2: Relationship Variables
    This group focuses on the relationship between the entity and the company. For customers, features might include their account creation date, loyalty tier (bronze/silver/gold), or assigned account manager.
  3. Group 3: Interaction Data
    This group covers all interactions with the company that are not considered ‘usage’ of a particular product; the focus is on the contact between the customer and the company. For instance, how many emails has a customer received in the last week, last month, or last six months? How many purchases or support tickets have they logged in the past year? Aggregating this data helps capture the intensity and recency of interactions, which are crucial for many predictive models.
  4. Group 4: Product-Related Data (Hard Transactions)
    This category focuses on concrete transactions or product-related data. For customers, this might include whether they have made a purchase in the last six months or whether they hold an active subscription. Unlike interactions, these are hard transactions that reflect the customer’s concrete commercial activity with the company.
  5. Group 5 (optional, for SaaS): Product Usage Data
    In the case of SaaS products, this group captures how the customer interacts with the software. For example, has the customer logged in within the last seven days? How many times have they logged in over the past month? These features help measure engagement with the product itself.
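
As a rough sketch of how these groups come together (all table and column names below are hypothetical), the snippet merges entity, relationship, interaction, transaction, and usage features into one enriched record per customer:

```python
import pandas as pd

# Group 1: entity-specific variables (static demographics).
entities = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 52, 41],
    "country": ["NL", "DE", "NL"],
})

# Group 2: relationship variables.
relationship = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "loyalty_tier": ["silver", "gold", "bronze"],
    "account_age_days": [412, 1530, 88],
})

# Group 3: interaction data, aggregated per customer.
emails = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "sent_at": pd.to_datetime([
        "2024-10-01", "2024-10-15", "2024-09-20",
        "2024-10-10", "2024-10-12", "2024-10-20",
    ]),
})
emails_per_customer = (
    emails.groupby("customer_id").size().rename("emails_last_month").reset_index()
)

# Group 4: product-related data (hard transactions).
orders = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "order_total": [99.0, 45.0, 120.0],
})
orders_per_customer = (
    orders.groupby("customer_id")
    .agg(n_orders=("order_total", "size"), total_spend=("order_total", "sum"))
    .reset_index()
)

# Group 5 (optional, SaaS): product usage data.
usage = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "logins_last_30d": [14, 2, 0],
})

# Merge everything onto one enriched record per customer.
input_table = (
    entities
    .merge(relationship, on="customer_id", how="left")
    .merge(emails_per_customer, on="customer_id", how="left")
    .merge(orders_per_customer, on="customer_id", how="left")
    .merge(usage, on="customer_id", how="left")
    .fillna({"emails_last_month": 0, "n_orders": 0, "total_spend": 0.0})
)

print(input_table)
```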


This structured approach helps ensure that all relevant aspects of the entity’s relationship with the company and its products are captured. Aggregation methods may vary depending on the use case, and we adjust the level of aggregation (e.g., week, month, year) based on initial results.
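
One way to experiment with the aggregation level is to compute the same metric over several look-back windows and let the model, or a later feature-selection step, reveal which window carries the signal. A minimal sketch, again with hypothetical data:

```python
import pandas as pd

# Hypothetical email interactions, one row per email sent.
emails = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "sent_at": pd.to_datetime([
        "2024-10-20", "2024-09-28", "2024-05-02",
        "2024-10-01", "2024-04-15", "2024-08-08",
    ]),
})

reference_date = pd.Timestamp("2024-10-23")
windows = {"7d": 7, "30d": 30, "180d": 180}

# Count emails per customer within each look-back window.
features = pd.DataFrame({"customer_id": emails["customer_id"].unique()})
for label, days in windows.items():
    cutoff = reference_date - pd.Timedelta(days=days)
    counts = (
        emails[emails["sent_at"] >= cutoff]
        .groupby("customer_id")
        .size()
        .rename(f"emails_last_{label}")
        .reset_index()
    )
    features = features.merge(counts, on="customer_id", how="left").fillna(0)

print(features)
```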

Rule of Thumb: Number of Records vs. Number of Columns

A common question when preparing data for machine learning is how many features (columns) should be included in the input table. Aggregation often leads to a broad table with many columns, and managing this balance is key. A commonly suggested guideline is to aim for a maximum of around 10-15 columns for every 1,000 records. In practice, however, the complexity of the problem and the nature of the data can lead to much broader tables.

The “Curse of Dimensionality” is an important concept to consider here. When the number of features becomes too large relative to the amount of data, model performance can degrade due to overfitting: the model learns noise in the data rather than the underlying patterns. To mitigate this, techniques such as Principal Component Analysis (PCA) or feature selection methods can reduce the number of columns while retaining most of the valuable information.
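
As an illustration of both options, the sketch below uses scikit-learn on randomly generated data (purely illustrative): PCA keeps enough components to explain 95% of the variance, while a univariate selector keeps only the columns most related to the target:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# A deliberately wide table: 200 records, 50 columns.
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)

# Option 1: PCA keeps enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print("PCA reduced to", X_pca.shape[1], "components")

# Option 2: univariate feature selection keeps the 10 columns most related to the target.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print("SelectKBest kept", X_selected.shape[1], "columns")
```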

For a deeper understanding, you can explore Wikipedia’s entries on the Curse of Dimensionality and Overfitting, which provide more context on managing wide tables in machine learning.

Final Thoughts

For most machine learning projects, creating a well-structured input table is key to successful model training. Whether you're building a supervised model for churn or response prediction, or an unsupervised model for customer segmentation, structuring your data so that each entity is represented by a single enriched record is essential.

Our structured approach to input tables, with clear groupings of features, ensures that all relevant aspects of the entity are captured. This method allows us to prepare data that is both comprehensive and flexible, ready to be fed into machine learning models. And while aggregation techniques and the number of columns may vary based on the use case, the thoughtful preparation of input data often makes all the difference in the model's final performance.

As always, balancing the number of features with the amount of data is crucial, and experimenting with feature selection and aggregation methods is often necessary to achieve optimal results.

Want to activate your data too?

We provide custom solutions tailored to your organization at a great price. No huge projects with months of lead time, we deliver in weeks.

Contact us

Or call us on +31 594 855 888