Training a machine learning model is like baking a cake: the quality of the end result depends on the ingredients you put in. If your input data is poor, your predictions will be too. But with the right ingredients – in this case, carefully selected input features – you can create a model that’s both accurate and powerful. This is where feature engineering comes in. It’s the process of exploring, creating, and selecting the most relevant and useful features to use in your model. And just like a chef experimenting with different spices and flavors, the process of feature engineering is iterative and tailored to the problem at hand. In this guide, we’ll walk you through a step-by-step process using Python and Scikit-learn to create a strong set of features for a regression problem. By the end, you’ll have the skills to tackle any feature engineering challenge that comes your way.
The remainder of this article proceeds as follows: We begin with a brief intro to feature engineering and describe valuable techniques. We then turn to the hands-on part, in which we develop a regression model for car sales. We apply various techniques that show how to handle outliers and missing values, perform correlation analysis, and discover and manipulate features. You will also find information about common challenges and helpful sklearn functions. Finally, we will compare our regression model to a baseline model that uses the original dataset.
What is Feature Engineering?
Feature engineering is the process of using domain knowledge of the data to create features (variables) that make machine learning algorithms work. This is an important step in the machine learning pipeline because the choice of good features can greatly affect the performance of the model. The goal is to identify features, tweak them, and select the most promising ones to form a smaller feature subset. We can break this process down into several action items.
Data scientists can easily spend 70% to 80% of their time on feature engineering. The time is well spent, as changes to the input data have a direct impact on model performance. The process is often iterative and requires revisiting the various tasks as your understanding of the data and the problem evolves. Knowing the common techniques and their associated challenges helps you do feature engineering well.
Core Tasks
The goal of feature engineering is to create a set of features that are representative of the underlying data and that can be used by the machine learning algorithm to make accurate predictions. Several tasks are commonly performed as part of the feature engineering process, including:
- Data discovery: To solve real-world problems with analytics, it is crucial to understand the data. Once you have gathered your data, describing and visualizing the data are means to familiarize yourself with it and develop a general feel for the data.
- Data structuring: The data needs to be structured into a unified and usable format. Variables may have a wrong datatype, or the data is distributed across different data frames and must first be merged. In these cases, we first need to bring the data together and into the right shape.
- Data cleansing: Besides being structured, data needs to be cleaned. Records may be redundant or contaminated with errors and missing values that can hinder our model from learning effectively. The same goes for outliers that can distort statistics.
- Data transformation: We can increase the predictive power of our input features by transforming them. Activities may include applying mathematical functions, removing specific data, or grouping variables into bins. Or we create entirely new features out of several existing ones.
- Feature selection: Of the many available variables, only some contain valuable information. By sorting out less relevant variables and selecting the most promising features, we can create models that are less complex and yield better results.
Exploratory Feature Engineering Toolset
Exploratory analysis for identifying and assessing relevant features draws on several tools:
- Data Cleansing
- Descriptive statistics
- Univariate Analysis
- Bi-variate Analysis
- Multivariate Analysis
Data Cleansing
Datasets used for teaching are often remarkably clean, without any errors or missing values. However, it is important to recognize that most real-world data has quality issues. Some reasons for data quality issues are:
- Standardization issues, because the data was recorded by different people, sensor types, etc.
- Sensor or system outages can lead to gaps in the data or create erroneous data points.
- Human errors
An important part of feature engineering is to inspect the data and ensure its quality before use. This is what we understand as “data cleansing.” It includes several tasks that aim to improve the data quality, remove erroneous data points and bring the data into a more useful form.
- Cleaning errors, missing values, and other issues.
- Handling possible imbalanced data
- Removing obvious outliers
- Standardization, e.g., of dates or addresses
Accomplishing these tasks requires a good understanding of the data. We, therefore, carry out data cleansing activities closely intertwined with other exploratory tasks, e.g., univariate and bivariate data analysis. Also, remember that visualizations can aid in the process, as they can greatly enhance your ability to analyze and understand the data.
Descriptive Statistics
One of the first steps in familiarizing oneself with a new dataset is to use descriptive statistics. Descriptive statistics help understand the data and how the sample represents the real-world population. We can use several statistical measures to analyze and describe a dataset, including the following:
- Measures of Central Tendency represent a typical value of the data.
- The mean: The average; add together all values in the sample and divide the sum by the number of samples.
- The median: The value that lies in the middle of the sorted sample values.
- The mode: The most frequently occurring value in a sample (useful for categorical variables).
- Measures of Variability tell us something about the spread of the data.
- Range: The difference between the minimum and maximum value
- Variance: The average of the squared differences from the mean.
- Standard Deviation: The square root of the variance.
- Measures of Frequency inform us how often we can expect a value to be present in the data, e.g., value counts.
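For a quick illustration, here is a minimal pandas sketch (using a small made-up sample, not the car dataset used later) that computes these measures:

import pandas as pd

# small made-up sample to illustrate the measures
s = pd.Series([3, 5, 5, 8, 12, 14, 21])

# measures of central tendency
print('mean:    ', s.mean())           # sum of all values divided by the number of samples
print('median:  ', s.median())         # middle value of the sorted sample
print('mode:    ', s.mode().iloc[0])   # most frequent value

# measures of variability
print('range:   ', s.max() - s.min())
print('variance:', s.var())            # pandas uses the sample variance (ddof=1) by default
print('std:     ', s.std())

# measure of frequency
print(s.value_counts())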
Univariate Analysis
As “uni” suggests, the univariate analysis focuses on a single variable. Rather than examining the relationships between the variables, univariate analysis employs descriptive statistics and visualizations to understand individual columns better.
Which illustrations and measures we use depends on the type of the variable.
Categorical variables (incl. binary)
- Descriptive measures include counts in percent and absolute values
- Visualizations include pie charts, bar charts (count plots)
Continuous variables
- Descriptive measures include min, max, median, mean, variance, standard deviation, and quantiles.
- Visualizations include box plots, line plots, and histograms.
Bi-variate Analysis
Bi-variate (two-variate) analysis is a kind of statistical analysis that focuses on the relationship between two variables, for example, between a feature column and the target variable. In the case of machine learning projects, bivariate analysis can help to identify features that are potentially predictive of the label or the regression target.
Model performance will benefit from strong linear dependencies. In addition, we are also interested in examining the relationships among the features used to train the model. Different types of relations exist that can be examined using various plots and statistical measures:
Numerical/Numerical
Both variables have numerical values. We can illustrate their relation using lineplots or dot plots. We can examine such relations with correlation analysis.
The ideal feature subset contains features that are not correlated with each other but are heavily correlated with the target variable. We can use dimensionality reduction to reduce a dataset with many features to a lower-dimensional space in which the remaining features are less correlated.
Traditional correlation analysis (e.g., Pearson) cannot consider non-linear relations. We can identify such a relation manually by visualizing the data, for example, using line plots. Once we denote a non-linear relation, we could try to apply mathematical transformations to one of the variables to make their relation more linear.
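To make this more concrete, here is a minimal sketch on synthetic data (not the car dataset) showing how a log transformation can linearize a monotonic, non-linear relation, which the Pearson coefficient then reflects more strongly; Spearman correlation, which only assumes a monotonic relation, is shown for comparison:

import numpy as np
import pandas as pd

# synthetic example: y grows exponentially with x, plus some noise
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 500)
y = np.exp(x) * rng.normal(1, 0.05, 500)
demo = pd.DataFrame({'x': x, 'y': y})

# Pearson captures linear dependence, Spearman captures monotonic dependence
print('Pearson (raw):      ', demo['x'].corr(demo['y'], method='pearson'))
print('Spearman (raw):     ', demo['x'].corr(demo['y'], method='spearman'))

# a log transform linearizes the relation, so the Pearson coefficient increases
demo['y_log'] = np.log(demo['y'])
print('Pearson (log of y): ', demo['x'].corr(demo['y_log'], method='pearson'))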
For pairwise analysis, we must understand which variables we deal with. We can differentiate between three categories:
- Numerical/Categorical
- Numerical/Numerical
- Categorical/Categorical
Numerical/Categorical
Plots that visualize the relationship between a categorical and a numerical variable include barplots and lineplots.
Especially helpful are histograms (count plots). They can highlight differences in the distribution of the numerical variable for different categories.
A specific subcase is a numerical/date relation. Such relations are typically visualized using line plots. In addition, we want to look out for linear or non-linear dependencies.
Categorical/Categorical
The relation between two categorical variables can be studied using various plots, including density plots, histograms, and bar plots.
For example, with car types (attributes: sedan and coupe) and colors (characteristics: red, blue, yellow), we can use a barplot to see if sedans are more often red than coupes. Differences in the distribution of characteristics can be a starting point for attempts to manipulate the features and improve model performance.
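As a small sketch of this idea (the column names and values below are made up for illustration and are not taken from the dataset used later), a crosstab makes such distribution differences easy to see:

import pandas as pd

# made-up example data for the car type / color illustration above
demo = pd.DataFrame({
    'body_type': ['sedan', 'sedan', 'coupe', 'sedan', 'coupe', 'coupe', 'sedan'],
    'color':     ['red',   'blue',  'red',   'red',   'yellow', 'blue', 'blue'],
})

# absolute counts of each combination
print(pd.crosstab(demo['body_type'], demo['color']))

# row-wise shares: the color distribution within each body type
ct = pd.crosstab(demo['body_type'], demo['color'], normalize='index')
print(ct)

# stacked bar chart of the color distribution per body type
ct.plot(kind='bar', stacked=True, figsize=(6, 3))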
Multivariate Analysis
Multivariate analysis encompasses the simultaneous analysis of more than two variables. The approach can uncover multi-dimensional dependencies and is often used in advanced feature engineering. For example, you may find that two variables are weakly correlated with the target variable, but when combined, their relation intensifies. So you might try to create a new feature that uses the two variables as input. Plots that can visualize relations between several variables include dot plots and violin plots.
In addition, multivariate analysis refers to techniques to reduce the dimensionality of a dataset. For example, principal component analysis (PCA) or factor analysis can condense the information in a data set into a smaller number of synthetic features.
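As a minimal sketch (shown here on a small built-in scikit-learn dataset rather than the car data), PCA can be applied as follows:

from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# example with a small built-in dataset; replace X with your own numeric feature matrix
X, _ = load_diabetes(return_X_y=True)

# PCA is sensitive to scale, so standardize the features first
X_scaled = StandardScaler().fit_transform(X)

# condense the information into a few synthetic, uncorrelated components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (n_samples, 3)
print(pca.explained_variance_ratio_)  # share of variance captured by each component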
Now that we have a good understanding of the available feature engineering techniques, we can start the practical part and apply them.
Feature Engineering for Car Price Regression with Python and Scikit-learn
The value of a car on the market depends on various factors. The distance a vehicle has traveled and its year of manufacture are obvious ones. But beyond that, we can use many other factors to train a machine learning model that predicts selling prices on the used car market. In the following hands-on Python tutorial, we will create such a model using a dataset of used car characteristics. For marketing, it is crucial to understand which car characteristics determine the price of a vehicle. Our goal is to model the car price from the available independent variables and to build a model that performs well on a small but powerful input subset.
Exploring and creating features varies between different application domains. For example, feature engineering in computer vision will differ greatly from feature engineering for regression or classification models or NLP models. So the example provided in this article is just for regression models.
We follow an exploratory process that includes the following steps:
- Loading the data
- Cleaning the data
- Univariate analysis
- Bivariate analysis
- Selecting features
- Data preparation
- Model training
- Measuring performance
Finally, we compare the performance of our model, which was trained on a minimal set of features, to a model that uses the original data.
The Python code is available in the relataly GitHub repository.
Prerequisites
Before you proceed, ensure that you have set up your Python environment (3.8 or higher) and the required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment.
Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:
- pandas
- NumPy
- matplotlib
- Seaborn
- Scikit-learn
You can install packages using console commands:
- pip install <package name>
- conda install <package name> (if you are using the Anaconda package manager)
About the Dataset
In this tutorial, we will be working with a dataset containing listings for 111,763 used cars. The data includes 13 variables, including the dependent target variable:
- prod_year: The year of production
- maker: The manufacturer’s name
- model: The car edition
- trim: Different versions of the model
- body_type: The body style of a vehicle
- transmission_type: The way the power is brought to the wheels
- state: The state in which the car is auctioned
- condition: The condition of the cars
- odometer: The distance the car has traveled since it was manufactured
- exterior_color: Exterior color
- interior_color: Interior color
- sale_price (target variable): The price at which the car was sold
- sale_date: The date on which the car was sold
The dataset is available for download from Kaggle.com, but you can execute the code below and load the data from the relataly GitHub repository.
Step #1 Load the Data
We begin by importing the necessary libraries and downloading the dataset from the relataly GitHub repository. Next, we read the dataset into a pandas DataFrame. Our regression target is the selling price (the ‘sellingprice’ column in the raw data, which we will rename to ‘sale_price’); we will store its name in a variable later on. The .head() function displays the first records of our DataFrame.
# Tested with Python 3.8.8, Matplotlib 3.5, Scikit-learn 0.24.1, Seaborn 0.11.1, numpy 1.19.5
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white', {'axes.spines.right': False, 'axes.spines.top': False})
from pandas.api.types import is_string_dtype, is_numeric_dtype
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_val_score, train_test_split, ShuffleSplit
from sklearn.inspection import permutation_importance

# Original Data Source:
# https://www.kaggle.com/datasets/tunguz/used-car-auction-prices

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/flo7up/relataly_data/main/car_prices2/car_prices.csv")
df.head(3)
   prod_year   maker    model                   trim body_type transmission_type state  condition  odometer exterior_color interior  sellingprice        date
0       2015     Kia  Sorento                     LX       SUV         automatic    ca        5.0   16639.0          white    black         21500  2014-12-16
1       2015  Nissan   Altima                  2.5 S     Sedan         automatic    ca        1.0    5554.0           gray    black         10900  2014-12-30
2       2014    Audi       A6  3.0T Prestige quattro     Sedan         automatic    ca        4.8   14414.0          black    black         49750  2014-12-16
We now have a DataFrame with 13 columns, including the dependent target variable we want to predict.
Step #2 Data Cleansing
Now that we have loaded the data, we begin with the exploratory analysis. First, we will put it into shape.
2.1 Check Names and Datatypes
If the names in a dataset are not self-explanatory, it is easy to get confused. Therefore, we will rename some of the columns and give them clearer names. There is no default naming convention, but striving for consistency, simplicity, and understandability is generally a good idea.
The following code line renames some of the columns.
# rename some columns for consistency
df.rename(columns={'exterior_color': 'ext_color', 'interior': 'int_color', 'sellingprice': 'sale_price'}, inplace=True)
df.head(1)
   prod_year maker    model trim body_type transmission_type state  condition  odometer ext_color int_color  sale_price        date
0       2015   Kia  Sorento   LX       SUV         automatic    ca        5.0   16639.0     white     black       21500  2014-12-16
Next, we will check and remove possible duplicates.
# check and remove duplicates
print(len(df))
df = df.drop_duplicates()
print(len(df))
OUT: 111763, 111763
There were no duplicates in the data, which is good.
# check datatypes
df.dtypes
prod_year              int64
maker                 object
model                 object
trim                  object
body_type             object
transmission_type     object
state                 object
condition            float64
odometer             float64
ext_color             object
int_color             object
sale_price             int64
date                  object
dtype: object
We compare the datatypes to the first records we printed in the previous section. Be aware that categorical variables (e.g., of type “string”) are shown as “objects.” The data types look as expected.
Finally, we define our target variable’s name, “sale_price.” The target variable will be our regression target, and we will use its name often.
# consistently define the target variable
target_name = 'sale_price'
2.2 Checking Missing Values
Some machine learning algorithms are sensitive to missing values. Handling missing values is therefore a crucial step in exploratory feature engineering.
Let’s first gain an overview of null values. With a larger DataFrame, it would be inefficient to review all the rows and columns individually for missing values. Instead, we use the sum function and visualize the results to get a quick overview of missing data in the DataFrame.
# check for missing values
null_df = pd.DataFrame(df.isna().sum(), columns=['null_values']).sort_values(['null_values'], ascending=False)
fig = plt.subplots(figsize=(16, 6))
ax = sns.barplot(data=null_df, x='null_values', y=null_df.index, color='royalblue')
pct_values = [' {:g}'.format(elm) + ' ({:.1%})'.format(elm/len(df)) for elm in list(null_df['null_values'])]
ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)
ax.set_title('Overview of missing values')
The bar chart shows that there are several variables with missing values. Variables with many missing values can negatively affect model performance, which is why we should try to treat them.
2.3 Overview of Techniques for Handling Missing Values
There are various ways to handle missing data. The most common options to handle missing values are:
- Custom substitution value: Sometimes, the information that a value is missing can be important information to a predictive model. We can substitute missing values with a placeholder value such as “missing” or “unknown.” The approach works particularly well for variables with many missing values.
- Statistical filling: We can fill in a statistically chosen measure, such as the mean or median for numeric variables, or the mode for categorical variables.
- Replace using Probabilistic PCA: PCA uses a linear approximation function that tries to reconstruct the missing values from the data.
- Remove entire rows: Sometimes it is crucial to ensure that we only use data we know is correct. In those cases, we can drop an entire row if it contains a missing value. This also solves the problem but comes at the cost of losing potentially important information – especially if the dataset is small.
- Remove the entire column: Another way of resolving missing values. This is typically the last resort, as we lose an entire feature.
How we handle missing values can dramatically affect our prediction results. To find the ideal method, it is often necessary to experiment with different techniques. Sometimes, the information that a value is missing can also be important. This occurs when the missing values are not randomly distributed in the data and show a pattern. In such a case, you should create an additional feature that states whether values are missing.
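To make these options more tangible, here is a minimal sketch on a small made-up frame (the column names only mimic the car dataset); it combines a missing-value indicator feature with statistical filling and a custom substitution value:

import numpy as np
import pandas as pd

# small made-up frame with missing values
demo = pd.DataFrame({
    'odometer':  [16639.0, np.nan, 14414.0, np.nan, 11398.0],
    'body_type': ['SUV', 'Sedan', None, 'Sedan', 'SUV'],
})

# keep the information that a value was missing as an extra binary feature
demo['odometer_missing'] = demo['odometer'].isna().astype(int)

# statistical filling: median for the numeric column
demo['odometer'] = demo['odometer'].fillna(demo['odometer'].median())

# custom substitution value for the categorical column
demo['body_type'] = demo['body_type'].fillna('Unknown')

print(demo)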
2.4 Handle Missing Values
In this example, we will use the median value to fill in the missing values of our numeric variables and the mode to replace the missing values of categorical variables. When we check again, we can see that odometer and condition have no more missing values.
# fill missing values of numeric columns with the median
for col_name in df.columns:
    if (is_numeric_dtype(df[col_name])) and (df[col_name].isna().sum() > 0):
        df[col_name].fillna(df[col_name].median(), inplace=True)

# alternatively you could also drop the columns with missing values using .drop(columns=['engine_capacity'])
print(df.isna().sum())
prod_year                0
maker                 2078
model                 2096
trim                  2157
body_type             2641
transmission_type    13135
state                    0
condition                0
odometer                 0
ext_color              173
int_color              173
sale_price               0
date                     0
dtype: int64
Next, we handle the missing values of transmission_type by filling them with the mode.
# check the distribution of missing values for transmission type
print(df['transmission_type'].value_counts())

# fill values with the mode
df['transmission_type'].fillna(df['transmission_type'].mode()[0], inplace=True)
print(df['transmission_type'].isna().sum())
automatic    108198
manual         3565
Name: transmission_type, dtype: int64
0
We could handle body_type analogously to transmission_type and fill the missing values with the mode, i.e., the value that appears most often in the data. The mode of body_type is “Sedan.” However, this value is not that prevalent, as more than half of the cars have other body types, e.g., “SUV.” Therefore, we instead replace the missing values with “Unknown.”
# check the distribution of missing values for body type
print(df['body_type'].value_counts())

# fill values with 'Unknown'
df['body_type'].fillna("Unknown", inplace=True)
print(df['body_type'].isna().sum())
Sedan                 39955
SUV                   23836
sedan                  8377
suv                    4934
Hatchback              4241
                      ...
cts-v coupe               2
Ram Van                   1
Transit Van               1
CTS Wagon                 1
beetle convertible        1
Name: body_type, Length: 74, dtype: int64
0
Now we have handled most of the missing values in our data. However, a few variables still contain a small number of missing values. We will make things easy and simply drop all remaining records with missing values. Considering that we have more than 100k records and only a small share of them are affected, we can afford to do this without fear of a severe impact on model performance.
# remove all other records with missing values
df.dropna(inplace=True)
print(df.isna().sum())
prod_year            0
maker                0
model                0
trim                 0
body_type            0
transmission_type    0
state                0
condition            0
odometer             0
ext_color            0
int_color            0
sale_price           0
date                 0
dtype: int64
Finally, we check again for missing values and see that everything has been filled. Now, we have a cleansed dataset with 13 columns.
2.5 Save a Copy of the Cleaned Data
Before exploring the features, let’s make a copy of the cleaned data. We will later use this “full” dataset to compare the performance of our model with a baseline model.
# Create a copy of the dataset with all features for comparison reasons
df_all = df.copy()
Step #3 Getting started with Statistical Univariate Analysis
Now it’s time to analyze the data and explore potentially useful features for our subset. Although the process follows a linear flow in this example, in practice you may notice that you must go back and forth between the different steps of the feature exploration and engineering process.
First, we will look at the variance of the features in the initial dataset. Machine learning models can only learn from variables that have adequate variance. So, low-variance features are often candidates to exclude from the feature subset.
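Besides inspecting variance manually as we do below, scikit-learn offers VarianceThreshold as a programmatic check. The following sketch is an optional alternative (the helper function and threshold are illustrative, not part of the original tutorial); it scales the numeric columns to a common range first so their variances are comparable:

import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

def low_variance_columns(df_numeric: pd.DataFrame, threshold: float = 0.01) -> list:
    """Return the numeric columns whose scaled variance falls below the threshold."""
    scaled = MinMaxScaler().fit_transform(df_numeric)  # scale to [0, 1] for comparable variances
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(scaled)
    return list(df_numeric.columns[~selector.get_support()])

# usage with the tutorial's DataFrame: pass only the numeric columns
# print(low_variance_columns(df.select_dtypes('number')))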
We use the .describe() method to display univariate descriptive statistics about the numerical columns in our dataset.
# show statistics for numeric variables
print(df.columns)
df.describe()
Next, we check the categorical variables. All variables seem to have a good variance. We can measure the variance with statistical measures or observe it manually using bar charts and scatterplots.
We can use histplots to visualize the distributions of the numeric variables. The example below shows the histplot for our target variable sale_price.
# Explore the variance of the target variable
variable_name = 'sale_price'
fig, ax = plt.subplots(figsize=(14,5))
sns.histplot(data=df[[variable_name]].dropna(), ax=ax, color='royalblue', kde=True)
ax.get_legend().remove()
ax.set_title(variable_name + ' Distribution')
ax.set_xlim(0, df[variable_name].quantile(0.99))
The histplot shows that sale prices are skewed to the right: there are many cheap cars and fewer expensive ones, which makes sense.
Next, we create box plots that illustrate the variance of the numeric variables.
# 3.2 Illustrate the Variance of Numeric Variables
f_list_numeric = [x for x in df.columns if (is_numeric_dtype(df[x]) and df[x].nunique() > 2)]
f_list_numeric

# box plot design
PROPS = {
    'boxprops': {'facecolor': 'none', 'edgecolor': 'royalblue'},
    'medianprops': {'color': 'coral'},
    'whiskerprops': {'color': 'royalblue'},
    'capprops': {'color': 'royalblue'}
}
sns.set_style('ticks', {'axes.edgecolor': 'grey', 'xtick.color': '0', 'ytick.color': '0'})

# Adjust plotsize based on the number of features
ncols = 1
nrows = math.ceil(len(f_list_numeric) / ncols)
fig, axs = plt.subplots(nrows, ncols, figsize=(14, nrows*1))
for i, ax in enumerate(fig.axes):
    if i < len(f_list_numeric):
        column_name = f_list_numeric[i]
        sns.boxplot(data=df[column_name], orient="h", ax=ax, color='royalblue', flierprops={"marker": "o"}, **PROPS)
        ax.set(yticklabels=[column_name])
fig.tight_layout()
We can observe two things: First, the variance of transmission type is low, as most cars have an automatic transmission. So transmission_type is the first variable that we exclude from our feature subset.
# Drop features with low variety
df = df.drop(columns=['transmission_type'])
df.head(2)
   prod_year   maker    model   trim body_type state  condition  odometer ext_color int_color  sale_price        date
0       2015     Kia  Sorento     LX       SUV    ca        5.0   16639.0     white     black       21500  2014-12-16
1       2015  Nissan   Altima  2.5 S     Sedan    ca        1.0    5554.0      gray     black       10900  2014-12-30
Second, int_color and ext_color have many categorical values. By grouping some of these values that hardly ever occur, we can help the model to focus on the most relevant patterns. However, before we do that, we need to take a closer look at how the target variable differs between the categories.
Step #4 Bi-variate Analysis
Now that we have a general understanding of our dataset’s individual variables, let’s look at pairwise dependencies. We are particularly interested in the relationship between the features and the target variable. Our goal is to keep features whose dependence on the target variable shows some pattern – linear or non-linear. On the other hand, we want to exclude features whose relationship with the target variable looks arbitrary.
Visualizations have to take the datatypes of our variables into account. To illustrate the relation between categorical features and the target, we create boxplots and kdeplots. For numeric (continuous) features, we use scatterplots.
4.1 Analyzing the Relation between Features and the Target Variable
We begin by taking a closer look at the int_color and ext_color. We use kdeplots to highlight the distribution of prices depending on different colors.
def make_kdeplot(column_name):
    fig, ax = plt.subplots(figsize=(20,8))
    sns.kdeplot(data=df, hue=column_name, x=target_name, ax=ax, linewidth=2)
    ax.tick_params(axis="x", rotation=90, labelsize=10, length=0)
    ax.set_title(column_name)
    ax.set_xlim(0, df[target_name].quantile(0.99))
    plt.show()

make_kdeplot('ext_color')
make_kdeplot('int_color')
In both cases, a few colors are prevalent and account for most observations. Moreover, distributions of the car price differ for these prevalent colors. These differences look promising as they may help our model to differentiate cheaper cars from more expensive ones. To simplify things, we group the colors that hardly occur into a color category called “other.”
# Binning features
df['int_color'] = [x if x in ['black', 'gray', 'white', 'silver', 'blue', 'red'] else 'other' for x in df['int_color']]
df['ext_color'] = [x if x in ['black', 'gray', 'white', 'silver', 'blue', 'red'] else 'other' for x in df['ext_color']]
Next, we create plots for all remaining features.
# Visualizing distributions
f_list = [x for x in df.columns if ((is_numeric_dtype(df[x])) and x != target_name) or (df[x].nunique() < 50)]
f_list_len = len(f_list)
print(f'features: {f_list_len}')

# Adjust plotsize based on the number of features
ncols = 1
nrows = math.ceil(f_list_len / ncols)
fig, axs = plt.subplots(nrows, ncols, figsize=(18, nrows*5))
for i, ax in enumerate(fig.axes):
    if i < f_list_len:
        column_name = f_list[i]
        print(column_name)
        # If a numeric variable has more than 100 unique values, draw a scatterplot; else draw a boxplot
        if df[column_name].nunique() > 100 and is_numeric_dtype(df[column_name]):
            # Draw a scatterplot of the variable vs. the target
            sns.scatterplot(data=df, y=target_name, x=column_name, ax=ax)
        else:
            # Draw a vertical boxplot (or violinplot) grouped by the categorical variable
            myorder = df.groupby(by=[column_name])[target_name].median().sort_values().index
            sns.boxplot(data=df, x=column_name, y=target_name, ax=ax, order=myorder)
            # sns.violinplot(data=df, x=column_name, y=target_name, ax=ax, order=myorder)
        ax.tick_params(axis="x", rotation=90, labelsize=10, length=0)
        ax.set_title(column_name)
fig.tight_layout()
Again, for categorical variables, we want to see differences in the distribution of the target across the categories. Based on the boxplots’ medians and quantiles, we can see that prod_year, int_color, and condition show adequate variation. The scatterplot for the odometer value also looks good. So we want to keep these features. In contrast, the differences across the categories of “state” and “ext_color” are rather weak. Therefore, we exclude these variables from our subset.
# drop columns with low variance
df.drop(columns=['state', 'ext_color'], inplace=True)
Finally, if you want to take a more detailed look at the numeric features, you can use jointplots. These are scatterplots with additional information about the distributions. The example below shows the jointplot for the odometer value vs price.
# detailed univariate and bivariate analysis of 'odometer' using a jointplot
def make_jointplot(feature_name):
    p = sns.jointplot(data=df, y=feature_name, x=target_name, height=6, ratio=6, kind='reg', joint_kws={'line_kws': {'color': 'coral'}})
    p.fig.suptitle(feature_name + ' Distribution')
    p.ax_joint.collections[0].set_alpha(0.3)
    p.ax_joint.set_ylim(df[feature_name].min(), df[feature_name].max())
    p.fig.tight_layout()
    p.fig.subplots_adjust(top=0.95)

make_jointplot('odometer')

# Alternatively you can use hex_binning
# def make_joint_hexplot(feature_name):
#     p = sns.jointplot(data=df, y=feature_name, x=target_name, height=10, ratio=1, kind="hex")
#     p.ax_joint.set_ylim(0, df[feature_name].quantile(0.999))
#     p.ax_joint.set_xlim(0, df[target_name].quantile(0.999))
#     p.fig.suptitle(feature_name + ' Distribution')
Here is another example of a jointplot for the variable ‘condition.’
# detailed univariate and bivariate analysis of 'condition' using a jointplot
make_jointplot('condition')
The graphs show an approximately linear relationship between the sale price and both the condition and the odometer value.
4.2 Correlation Matrix
Correlation analysis is a technique to quantify the dependency between numeric features and a target variable. Different ways exist to calculate the correlation coefficient. For example, we can use Pearson correlation (linear relation), Kendall correlation (ordinal association), or Spearman (monotonic dependence).
The example below uses Pearson correlation, which concentrates on the linear relationship between two variables. The Pearson correlation score lies between -1 and 1. General interpretations of the absolute value of the correlation coefficient are:
- .00-.19 “very weak”
- .20-.39 “weak”
- .40-.59 “moderate”
- .60-.79 “strong”
- .80-1.0 “very strong”
More information on the Pearson correlation can be found here and in this article on the correlation between covid-19 and the stock market.
We will calculate a correlation matrix that provides the correlation coefficient for all features in our subset, incl. sale_price.
# 4.2 Correlation Matrix
# a correlation heatmap allows us to identify highly correlated explanatory variables and reduce collinearity
plt.figure(figsize=(9, 8))
plt.yticks(rotation=0)
correlation = df.corr()
ax = sns.heatmap(correlation, cmap='GnBu', square=True, linewidths=.1, cbar_kws={"shrink": .82}, annot=True, fmt='.1', annot_kws={"size": 10})
sns.set(font_scale=0.8)
for f in ax.texts:
    f.set_text(f.get_text())
All our remaining numeric features strongly correlate with price (positive or negative). However, this is not all that matters. Ideally, we want to have features that have a low correlation with each other. We can see that prod_year and condition are moderately correlated (coefficient: 0.5). Because prod_year is more correlated with price (coefficient: 0.6) than condition (coefficient: 0.5), we drop the condition variable.
df.drop(columns='condition', inplace=True)
Step #5 Data Preprocessing
Now our subset contains the following variables:
- prod_year
- maker
- model
- trim
- body_type
- odometer
- int_color
- date
- sale_price
Next, we prepare the data for use as input to train a regression model. Before training, we need to make a few final preparations. For example, we use a label encoder to replace the string values of the categorical variables with numeric values.
# encode categorical variables
def encode_categorical_variables(df):
    # create a list of categorical variables that we want to encode
    categorical_list = [x for x in df.columns if is_string_dtype(df[x])]
    # apply the encoding to the categorical variables
    # because the apply() function has no inplace argument, we use the following syntax to transform the df
    df[categorical_list] = df[categorical_list].apply(LabelEncoder().fit_transform)
    return df

df_final_subset = encode_categorical_variables(df)
df_all_ = encode_categorical_variables(df_all)

# create a copy of the dataframe but without the target variable
df_without_target = df.drop(columns=[target_name])
df_final_subset.head()
   prod_year  maker  model  trim  body_type  odometer  int_color  sale_price  date
0       2015     23    594   794         31   16639.0          0       21500     8
1       2015     34     59    98         32    5554.0          0       10900    17
2       2014      2     46   180         32   14414.0          0       49750     8
3       2015     34     59    98         32   11398.0          0       14100    13
4       2015      7    325   789         32   14538.0          0        7200   158
Step #6 Splitting the Data and Training the Model
To ensure that our regression model does not know the target variable, we separate car price (y) from features (x). Last, we split the data into separate datasets for training and testing. The result is four different data sets: x_train, y_train, x_test, and y_test.
Once the split function has prepared the datasets, we train the regression model. Our model uses the Random Forest algorithm from the scikit-learn package. As a so-called ensemble model, the Random Forest is a robust machine learning algorithm: it combines the predictions of multiple independent estimators.
The Random Forest algorithm has a wide range of hyperparameters. While we could optimize our model further by testing various configurations (hyperparameter tuning), this is not the focus of this article. Therefore, we will use the default hyperparameters for our model as defined by scikit-learn. Please visit one of my recent articles on hyperparameter tuning, if you want to learn more about this topic.
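For completeness, a minimal tuning sketch with GridSearchCV could look like the following (the parameter grid is purely illustrative and not tuned for this dataset; the fit call is commented out because it is computationally expensive and uses the training split created in the next step):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# illustrative grid only; sensible ranges depend on your data and compute budget
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
)
# search.fit(X_train_sub, y_train_sub)   # training split from step #6
# print(search.best_params_)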
For comparison, we train two models: one model with our subset of selected features, and a second model that uses all features, cleansed but without any further manipulation.
We use shuffled cross-validation (cv=5) to evaluate our model’s performance on different data folds.
def splitting(df, name):
    # separate the labels from the training data
    X = df.drop(columns=[target_name])
    y = df[target_name]  # prediction label
    # split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
    # print the shapes: (rows, features) for X and (rows,) for y
    print(name + '')
    print('train: ', X_train.shape, y_train.shape)
    print('test: ', X_test.shape, y_test.shape)
    return X, y, X_train, X_test, y_train, y_test

# train the model
def train_model(X, y, X_train, y_train):
    estimator = RandomForestRegressor()
    cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
    scores = cross_val_score(estimator, X, y, cv=cv)
    estimator.fit(X_train, y_train)
    return scores, estimator

# train the model with the subset of selected features
X_sub, y_sub, X_train_sub, X_test_sub, y_train_sub, y_test_sub = splitting(df_final_subset, 'subset')
scores_sub, estimator_sub = train_model(X_sub, y_sub, X_train_sub, y_train_sub)

# train the model with all features
X_all, y_all, X_train_all, X_test_all, y_train_all, y_test_all = splitting(df_all_, 'fullset')
scores_all, estimator_all = train_model(X_all, y_all, X_train_all, y_train_all)
subset
train:  (76592, 8) (76592,)
test:  (32826, 8) (32826,)
Step #7 Comparing Regression Models
Finally, we want to see how the model performs and how its performance compares against the model that uses all variables.
7.1 Model Scoring
We use different regression metrics to measure the performance. Then we create a barplot that compares the performance scores across the different validation folds (due to cross-validation).
# 7.1 Model Scoring
def create_metrics(scores, estimator, X_test, y_test, col_name):
    scores_df = pd.DataFrame({col_name: scores})

    # predict on the test set
    y_pred = estimator.predict(X_test)
    y_df = pd.DataFrame(y_test)
    y_df['PredictedPrice'] = y_pred

    # Mean Absolute Error (MAE)
    MAE = mean_absolute_error(y_test, y_pred)
    print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))

    # Mean Absolute Percentage Error (MAPE)
    MAPE = mean_absolute_percentage_error(y_test, y_pred)
    print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')

    # calculate the feature importance scores
    r = permutation_importance(estimator, X_test, y_test, n_repeats=30, random_state=0)
    data_im = pd.DataFrame(r.importances_mean, columns=['feature_permuation_score'])
    data_im['feature_names'] = X_test.columns
    data_im = data_im.sort_values('feature_permuation_score', ascending=False)
    return scores_df, data_im

scores_df_sub, data_im_sub = create_metrics(scores_sub, estimator_sub, X_test_sub, y_test_sub, 'subset')
scores_df_all, data_im_all = create_metrics(scores_all, estimator_all, X_test_all, y_test_all, 'fullset')
scores_df = pd.concat([scores_df_sub, scores_df_all], axis=1)

# visualize how the two models have performed in each fold
fig, ax = plt.subplots(figsize=(10, 6))
scores_df.plot(y=["subset", "fullset"], kind="bar", ax=ax)
ax.set_title('Cross validation scores')
ax.set(ylim=(0, 1))
ax.tick_params(axis="x", rotation=0, labelsize=10, length=0)
Mean Absolute Error (MAE): 1643.39
Mean Absolute Percentage Error (MAPE): 24.36 %
Mean Absolute Error (MAE): 1813.78
Mean Absolute Percentage Error (MAPE): 25.23 %
The subset model achieves a mean absolute percentage error of around 24%, which is not bad. More importantly, our model performs better than the model that uses all features. In addition, the subset model is less complex, as it only uses eight features instead of 12. So it is easier to understand and less costly to train.
7.2 Feature Permutation Importance Scores
Next, we calculate feature importance scores. In this way, we can determine which features contribute the most to the predictive power of our model. Feature importance scores are a useful tool in the feature engineering process, as they provide insights into how the features in our subset contribute to the overall performance of our predictive model. Features with low importance scores can be eliminated from the subset or replaced with other features.
Again we will compare our subset model to the model that uses all available features from the initial dataset.
# compare the feature importance scores of the subset model to the fullset model
fig, axs = plt.subplots(1, 2, figsize=(20, 8))
sns.barplot(data=data_im_sub, y='feature_names', x="feature_permuation_score", ax=axs[0])
axs[0].set_title("Feature importance scores of the subset model")
sns.barplot(data=data_im_all, y='feature_names', x="feature_permuation_score", ax=axs[1])
axs[1].set_title("Feature importance scores of the fullset model")
In the subset model, most features are relevant to the model’s performance. Only date and int_color do not seem to have a significant impact. For the full set model, five out of 12 features hardly contribute to the model performance (date, int_color, ext_color, state, transmission_type).
Once you have a strong subset of features, you can automate the feature selection process using different techniques, e.g., forward or backward selection. Automated feature selection techniques will test different model variants with varying feature combinations to determine the best input dataset. This step is often done at the end of the feature engineering process. However, this is something for another article.
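As a hedged sketch of what such an automated step could look like, scikit-learn (0.24 and later) provides SequentialFeatureSelector for forward and backward selection; the parameters below are illustrative, and the fit call is commented out because it retrains the model many times and can take a while on this dataset:

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector

# forward selection: greedily add the feature that improves the cross-validation score the most
sfs = SequentialFeatureSelector(
    RandomForestRegressor(random_state=0),
    n_features_to_select=5,            # illustrative target size
    direction='forward',               # or 'backward'
    scoring='neg_mean_absolute_error',
    cv=3,
    n_jobs=-1,
)
# sfs.fit(X_train_sub, y_train_sub)                     # training data from step #6
# print(list(X_train_sub.columns[sfs.get_support()]))   # names of the selected features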
Conclusions
That’s it for now! This tutorial has presented an exploratory approach to feature exploration, engineering, and selection. You have gained an overview of tools and graphs that are useful in identifying and preparing features. The second part was a hands-on Python tutorial. We followed an exploratory feature engineering process to build a regression model for car prices. We used various techniques to discover and select features and build a compact, powerful feature subset. These techniques include data cleansing, descriptive statistics, and univariate and bivariate analysis (incl. correlation). We also used techniques for feature manipulation, including binning. Finally, we compared our subset model to one that uses all available data.
If you take away one learning from this article, remember that in machine learning, less is often more: classic machine learning models trained on carefully curated feature subsets often outperform models that use all available information.
I hope this article was helpful. I am always trying to improve and learn from my audience. So, if you have any questions or suggestions, please write them in the comments.
Sources and Further Reading
- Zheng and Casari (2018) Feature Engineering for Machine Learning
- David Forsyth (2019) Applied Machine Learning Springer
- Chip Huyen (2022) Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications
The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.
Stock-market prediction is a typical regression problem. To learn more about feature engineering for stock-market prediction, check out this article on multivariate stock-market forecasting.