Companies are always looking for better ways to understand their customer base so they can acquire new customers more effectively. Instead of targeting random people with marketing campaigns, it is much more effective to focus on those who are more likely to respond positively. This project is an example of such a situation!
In this project, I analyzed demographics data for customers of a mail-order sales company in Germany (provided by Arvato Financial Services), comparing it against demographic information for the general population. This involves customer segmentation and behaviour prediction. I had two main goals:
1) Identify and compare clusters of people from the general population that match the customer base using Unsupervised Learning. For this goal, I used two data files, azdias and customers. The former contained demographic data on a large sample of the general population of Germany, while the latter contained similar data, but only for customers of Arvato Financial Services.
2) Predict who has a high probability of becoming a customer using Supervised Machine Learning. For this goal, I used two other datasets (train and test), similar to the previous ones. These contained data on people who were targeted by a marketing campaign. One of these datasets had an additional variable indicating whether or not the person became a customer after the campaign, which served as the response variable for training the ML algorithms.
The Jupyter Notebook containing all the analyses and additional information can be found on my GitHub.
All datasets used in this project share the same set of variables (although some have a few additional columns). More specifically, each has at least 366 variables related to different demographic information, and most of them are numeric (only 6 are not). The range of the numeric variables (their maximum minus their minimum) was mostly small (see below; a few columns were left out of this plot). While the azdias dataset (general population) had 891,221 rows, the customers dataset had 191,652 rows, each row representing an individual. The train and test datasets used for supervised ML had 42,962 and 42,833 rows, respectively.
Several steps were necessary to clean the data:
First, several values were encoded as numbers (e.g. -1 or 0) but in fact represented unknown information. I had to identify these values and label them as null. Below are histograms of the proportion of null values in rows and columns of the azdias and customers datasets after correcting this issue.
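This step can be sketched with a toy example; the column names and the per-column unknown codes below are purely illustrative (in the project they came from the accompanying data dictionary):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the demographic data; column names are hypothetical
df = pd.DataFrame({
    "AGE_GROUP": [1, 2, -1, 3],
    "INCOME_CLASS": [0, 4, 2, -1],
})

# Codes that mean "unknown" for each column (illustrative values)
unknown_codes = {"AGE_GROUP": [-1], "INCOME_CLASS": [-1, 0]}

# Replace each column's unknown codes with proper nulls
for col, codes in unknown_codes.items():
    df[col] = df[col].replace(codes, np.nan)

# Null share per column, the quantity shown in the histograms
null_share_per_column = df.isnull().mean()
```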
Second, I removed rows in which more than 15% of the values were null.
Third, I applied a similar filter to columns, removing those in which more than 75% of the values were null.
Lastly, I imputed the remaining null values using the mode of each column.
I applied the same process to all datasets, reducing the number of columns to 354. The resulting numbers of rows were 737,288 (azdias), 134,246 (customers) and 33,837 (train). The only exception was the test dataset, for which I did not exclude rows above the null threshold because I wanted to keep all rows (to predict the response to the marketing campaign for every individual).
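The cleaning steps above can be condensed into a small helper, shown here on toy data (the thresholds match the ones described; the `drop_rows` flag covers the test-set exception):

```python
import numpy as np
import pandas as pd

def clean(df, row_null_max=0.15, col_null_max=0.75, drop_rows=True):
    """Drop rows/columns with too many nulls, then impute with the mode."""
    if drop_rows:
        # Keep rows with at most 15% null values
        df = df[df.isnull().mean(axis=1) <= row_null_max]
    # Keep columns with at most 75% null values
    df = df.loc[:, df.isnull().mean() <= col_null_max]
    # Impute remaining nulls with each column's mode
    return df.fillna(df.mode().iloc[0])

# Toy 5x8 frame of ones with nulls planted to trigger each filter
toy = pd.DataFrame(np.ones((5, 8)), columns=list("abcdefgh"))
toy.loc[0, ["a", "b"]] = np.nan   # row 0: 25% null -> row dropped
toy.loc[1:4, "h"] = np.nan        # column h: all null after row filter -> dropped
cleaned = clean(toy)
```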
Reduction in data dimensions
After cleaning the data, each dataset still contained 354 columns. Thus, before applying clustering algorithms, it was important to reduce the dimensionality of the data using Principal Component Analysis (PCA). I settled on 210 components after inspecting the curves below, which still explained 93% of the variance in the original data.
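A minimal sketch of this component-selection step, using random data in place of the cleaned azdias matrix (which had 354 columns in the project):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the cleaned demographic matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))

# Standardize, then fit a full PCA to inspect the variance curve
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Cumulative explained variance; pick the smallest count reaching 93%
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumvar, 0.93) + 1)

# Re-project the data onto the chosen number of components
X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)
```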
Unsupervised learning: clustering
I used the KMeans algorithm to cluster the data. To decide how many clusters there should be, I used the elbow and gap methods (below), choosing 11 in the end.
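The elbow curve can be computed as below; this uses synthetic blob data in place of the PCA-reduced matrix, and a smaller range of k than the project explored:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with known cluster structure (stand-in for the reduced matrix)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Elbow method: within-cluster sum of squares (inertia) for each k;
# the "elbow" is the k where the decrease starts to flatten
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    for k in range(1, 9)
}
```

Once k is chosen, the fitted model's `predict` can assign both azdias and customers rows to clusters, so the cluster proportions of the two populations can be compared.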
From this, I compared clusters between the general population and customers. The greatest customer overrepresentation occurred in cluster 1 and the greatest underrepresentation occurred in cluster 4.
Supervised learning: predictions
After cleaning the training dataset the same way as the first two datasets (used for the clustering part), I saw that the response variable (whether the person became a customer after the campaign) was very imbalanced (see below). Therefore, I used ROC/AUC to evaluate classifiers, as it is less problematic in such situations than many other evaluation metrics (e.g. precision, F1-score).
First, I fitted three different classifiers with their default parameters (logistic regression, random forest, and gradient boosting). I used GridSearchCV with stratified 5-fold cross-validation, meaning that each fit produced five ROC/AUC scores. The algorithm with the best performance under default parameters was the gradient boosting classifier.
I then tried to fine-tune the gradient boosting classifier a little more. However, I only managed to evaluate a few parameter settings because of the processing power involved. I varied two parameters: min_samples_leaf (1 and 2) and n_estimators (100 and 150). The best setting was min_samples_leaf = 2 with n_estimators = 150. Yet, note that the best ROC/AUC score for this model (0.768) is only marginally greater than with the default parameters (0.766).
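The tuning setup can be sketched as follows; the grid matches the one described above, but the data here is a synthetic imbalanced problem standing in for the campaign-response dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Imbalanced toy problem (~10% positive class) standing in for the real data
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# The two parameters varied in the project
param_grid = {"min_samples_leaf": [1, 2], "n_estimators": [100, 150]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # robust to class imbalance
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

best_params, best_auc = search.best_params_, search.best_score_
```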
This project was a great example of how data can drive solutions for companies, giving them more precise information for business decisions.
Personally, this was an outstanding opportunity to understand different elements of the machine learning process, from data cleaning challenges to evaluation metrics and algorithms. Of course, further steps could be taken to improve the predictions, such as more specific data cleaning, other algorithms, and better parameter tuning. However, there is always a trade-off between the quality of a project and the time spent on it, so it is a matter of finding the balance between the two.
Sydney is the largest city in Australia and a popular tourist destination, with majestic beaches that attract visitors from all over the world. Here I explore the characteristics of Airbnb listings in Sydney and the importance of being close to a beach, using a 2018-2019 (pre-pandemic) dataset collected by insideairbnb (available here). The analysis code can be seen here.
Which neighborhoods are close to the beach?
If you are thinking about visiting Sydney, you may decide where to stay based on a map. But being close to the water does not mean being close to a good beach, as Sydney is also surrounded by cliffs, harbors, and less desirable beaches that may not satisfy your beach needs.
To classify neighborhoods by their proximity to a beach, I simply used the proportion of listings in each neighborhood that used the word "beach" (more than 50% classified as coast, less than 50% classified as inner). Checking the map of Sydney, coupled with a little knowledge of where the best beaches are located, confirmed this was a reliable method. I also classified the city of Sydney (downtown) as an independent category (center), given its importance.
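A minimal sketch of this classification rule; the neighborhood names and listing titles below are made up, and which text field is searched for "beach" is an assumption:

```python
import pandas as pd

# Hypothetical sample of listings (real data comes from insideairbnb)
listings = pd.DataFrame({
    "neighbourhood": ["Manly", "Manly", "Newtown", "Newtown", "Newtown"],
    "name": ["Beach flat", "Near the beach", "Cosy room",
             "Quiet studio", "Beach-themed loft"],
})

# Flag listings whose title mentions "beach" (case-insensitive)
has_beach = listings["name"].str.contains("beach", case=False)

# Share of "beach" listings per neighborhood
beach_share = has_beach.groupby(listings["neighbourhood"]).mean()

# More than 50% -> "coast", otherwise "inner"
# (the city of Sydney itself was labelled "center" separately)
category = beach_share.apply(lambda p: "coast" if p > 0.5 else "inner")
```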
Center listings have fewer bedrooms and accommodate fewer people than listings from other neighborhoods, but there is little difference between coast and inner listings (on average).
Does price change according to neighborhood category?
Being in a neighborhood close to a beach appears to increase the price charged per day for listings with many bedrooms or that are able to accommodate many people. Surprisingly, center listings are not much more expensive, on average, than inner ones.
Is there a difference between summer and winter availability depending on neighborhood category?
Contrary to expectations, winter availability was lower than summer availability for all neighborhood categories. Yet, it is important to note that listings in coast neighborhoods showed almost no difference.
Conclusion: DOES IT MATTER being in a neighborhood close to the beach?
From a tourist's perspective, renting a place in a coast neighborhood may be harder in summer (due to lower availability) and may cost more. This also means that, if you are thinking of buying property in Sydney to rent on Airbnb, being near a beach may be more profitable.