Data Science Bootcamp Capstone Project: Predicting Open Rates and Click-to-Open Rates for Newsletters using Natural Language Processing

Feb 1, 2022

This project was completed as part of the General Assembly Data Science Immersive bootcamp.

The Problem

Email marketing is used to promote a business, whether that be a product or service. It is also a key tool in strengthening the company brand and the relationship between the business and the user. Email campaigns can be time-consuming to plan and construct, but the right messaging can make a big impact.

At first, email was just a system for internal communication to help with company productivity and workflow; today, we send and receive over 319 billion emails a day and it is predicted that number will reach 376 billion by 2025.

I would like to know whether, by looking at the subject line of a newsletter, we can find the keywords that give the business its desired outcome: a higher open rate or a higher click-to-open rate.

In this scenario, the business is an international genre-specific music news portal.


  • Open-rate (OR) : the rate of opened emails out of total delivered
  • Click-to-open rate (CTOR) : the share of people who open an email and also click a link in the newsletter, calculated as total unique clicks divided by total unique opens. It is a strong indicator of the effectiveness of the campaign
  • Click-rate (CR) : Total number of clicks divided by number of emails delivered
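To make the three definitions above concrete, here is a minimal sketch of how they could be computed. The function names and the campaign numbers are hypothetical and are not taken from the project data.

```python
def open_rate(opens: int, delivered: int) -> float:
    """OR: opened emails out of total delivered."""
    return opens / delivered if delivered else 0.0


def click_to_open_rate(unique_clicks: int, unique_opens: int) -> float:
    """CTOR: total unique clicks divided by total unique opens."""
    return unique_clicks / unique_opens if unique_opens else 0.0


def click_rate(clicks: int, delivered: int) -> float:
    """CR: total clicks divided by number of emails delivered."""
    return clicks / delivered if delivered else 0.0


# Hypothetical campaign: 10,000 delivered, 1,500 opens, 300 clicks
delivered, opens, clicks = 10_000, 1_500, 300
print(f"OR   = {open_rate(opens, delivered):.1%}")            # OR   = 15.0%
print(f"CTOR = {click_to_open_rate(clicks, opens):.1%}")      # CTOR = 20.0%
print(f"CR   = {click_rate(clicks, delivered):.1%}")          # CR   = 3.0%
```

Note how the same 300 clicks look small as a CR but healthy as a CTOR, because the CTOR only considers readers who opened the email.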

My objectives for this project are to:

  • Predict high OR and CTOR
  • Identify keywords and therefore user segmentation & behaviour
  • Provide insights to the business to aid in reaching their email marketing goals

OR and CTOR are important metrics in email marketing because they tell the business how engaged its users are with the content. My hypothesis is that users will be more engaged with the newsletter if it covers a specific topic or artist.

From a user's point of view, you may be receiving several newsletters a day. The decision to open an email often comes from glancing at the subject of the newsletter, or perhaps even just the first few words that are visible, depending on the setup of your email client. Those words will be the deciding factor in whether or not you open an email.

Similarly, once the email is opened, it may come down to the title of the linked article or site that will determine whether or not it is clicked.

The keywords we spot are a determining factor for a good OR and CTOR.

There are of course other factors at play such as demographics, position of the content within the email, time of day, and more, but those are beyond the scope of this project, as that information was not available in the current data set.

The Data

The data acquisition for this project was straightforward. Without a problem statement in mind, I used my network to find a business that was willing to share data for the purposes of this capstone project. I was kindly given access to a few marketing and social media dashboards and chose to work with the newsletter data.

From here, the acquisition was a simple case of loading the email history and exporting it to CSV.

The data was relatively clean: there were no missing values, but there were a few things to correct. All work was done in Python in Jupyter Notebooks. Here is a summary of the steps taken; if you’d like more detail on these, please head over to the README file on GitHub:

  • Rates changed to numeric
  • New feature for OR created
  • Date column reformatted into datetime and day, month, and year extracted
  • Remove outliers
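The cleaning steps listed above could look roughly like the following pandas sketch. This is not the notebook code: the column names and values are made up, and the 1.5 * IQR rule for outliers is an assumed choice, since the method used is not described here.

```python
import pandas as pd

# Hypothetical export: these column names and values are assumptions for illustration
raw = pd.DataFrame({
    "send_date": ["2019-03-01", "2019-03-08", "2019-03-15",
                  "2019-03-22", "2019-03-29", "2019-04-05"],
    "delivered": [1000, 1000, 1000, 1000, 1000, 1000],
    "unique_opens": [100, 110, 120, 130, 120, 900],
    "ctor": ["21.5%", "19.0%", "22.3%", "20.1%", "18.7%", "20.0%"],
})

df = raw.copy()

# Rates changed to numeric: strip the percent sign and convert to a fraction
df["ctor"] = pd.to_numeric(df["ctor"].str.rstrip("%")) / 100

# New feature for OR: opened emails out of total delivered
df["open_rate"] = df["unique_opens"] / df["delivered"]

# Date column to datetime, then extract day, month, and year
df["send_date"] = pd.to_datetime(df["send_date"])
df["year"] = df["send_date"].dt.year
df["month"] = df["send_date"].dt.month
df["day"] = df["send_date"].dt.day_name()

# Remove outliers with the 1.5 * IQR rule (an assumed choice of method)
q1, q3 = df["open_rate"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["open_rate"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[in_range].reset_index(drop=True)

print(clean[["send_date", "day", "open_rate", "ctor"]])
```

In this toy data the last row, with an open rate of 0.90, falls outside the IQR fence and is dropped, leaving five rows.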

I will be reformatting the subject column of the data for modelling. As this is done as part of a model pipeline, we do not need to create these columns in advance, but I still perform this step separately to explore the frequency of words. This is done with a stemmer and a CountVectorizer.

After tallying up the frequency of words, I produced a WordCloud to visualize the results. Here the biggest words indicate a higher frequency:

Before moving on to modelling, I explore some of the trends within the data, including the sending behavior per year, month, and day. We can see in the graphs below that the company launched in 2016 but has since been relatively consistent in its sending behavior, and that the majority of its emails are sent on a Friday.

Finally, I look at how this newsletter compares to the global benchmarks for email marketing. I do not include the benchmark for 2021 as that year is not yet complete; as a reminder, our 2016 data also covers an incomplete year.

A high click-to-open rate indicates that the content resonates strongly with your audience, specifically the ones that open their emails.

Unfortunately, this newsletter is underperforming when it comes to initial open rates, but the ones that do open are loving the content and very click happy!

Source: 2015–2019 Email Benchmark Data

Source: 2020 Email Benchmark Data

The Model

In the image above we see that our OR and CTOR sit within quite a small range. I therefore decide to make this a classification problem by creating a flag for a high or low rate. This is done by first finding the mean rate; anything above the mean is flagged as a high rate (1), and anything below it as a low rate (0).
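The high/low flag described above can be sketched in a couple of lines of pandas. The rates here are invented, and the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical open rates for five newsletters
rates = pd.Series([0.12, 0.15, 0.10, 0.18, 0.14], name="open_rate")

threshold = rates.mean()                      # 0.138 for this toy data
high_or = (rates > threshold).astype(int)     # 1 = high rate, 0 = low rate

print(f"threshold = {threshold:.3f}")
print(high_or.tolist())                       # [0, 1, 0, 1, 1]
```

One consequence of using the mean as the cut-off is that the two classes are not guaranteed to be the same size, which is worth checking before modelling.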

It is worth noting that this dataset is small, so we expect some variance issues to come up. To try to mitigate this from the beginning, we decide not to split our data into a training and test set but instead to use cross-validation, which splits the data into segments and, for each one, trains the model on the remaining segments and scores it on the held-out segment. The cross-validation will score our model using “roc_auc”, which measures our model’s ability to distinguish between the two classes we have created (1 or high, 0 or low).

The modelling is done using a pipeline that includes a ColumnTransformer reformatting the newsletter subject as previously described with the CountVectorizer, and also a OneHotEncoder for the month variable that we decide to include. This creates a column for each month and adds a 1 in the column if the newsletter was sent in that month.
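A minimal sketch of such a pipeline, scored with cross-validation and "roc_auc", might look like the following. The toy subject lines, labels, column names, and the choice of three folds are all assumptions made for the example, not the project's actual settings.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: invented subjects, send months, and high/low CTOR flags
data = pd.DataFrame({
    "subject": [
        "resumo da semana", "slipknot anuncia turnê",
        "retrospectiva do ano", "mustaine fala sobre novo disco",
        "resumo das notícias", "nervosa lança single",
        "retrospectiva da década", "lennon ganha tributo",
        "resumo do mês", "slipknot confirma show",
        "resumo da quinzena", "nervosa anuncia turnê",
    ],
    "month": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "high_ctor": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

preprocess = ColumnTransformer([
    # A plain string selects the column as 1-D text, which CountVectorizer expects
    ("subject_words", CountVectorizer(), "subject"),
    # A list selects a 2-D frame; months unseen during training are ignored
    ("month_onehot", OneHotEncoder(handle_unknown="ignore"), ["month"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep both classes present in every held-out segment
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(
    model, data[["subject", "month"]], data["high_ctor"],
    cv=cv, scoring="roc_auc",
)
print(f"roc_auc: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Because the vectorizer and encoder sit inside the pipeline, they are refitted on the training segments of each fold, so no vocabulary leaks in from the held-out segment. Capping the vocabulary, as in the later run with a maximum of 50 word features, would correspond to CountVectorizer's `max_features` parameter.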

Note that day was not included in the modelling since we know from the EDA that most (66%) emails are sent on a Friday.

I run the following models first with all features (month, and subject) and then run all models two more times reducing the features each time, first without month, and second with a maximum of 50 word features:

  • LogisticRegression
  • DecisionTreeClassifier
  • RandomForestClassifier
  • KNN Classifier
  • Naive Bayes Bernoulli
  • Naive Bayes Multinomial
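To show how a comparison across these six models could be organised, here is a hedged sketch that loops over their scikit-learn counterparts and records the mean and standard deviation of the cross-validated "roc_auc". The feature matrix is synthetic count data standing in for the word counts, and default hyperparameters are assumed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for word counts: 60 newsletters x 20 words
rng = np.random.default_rng(0)
y = np.tile([0, 1], 30)                  # balanced high/low flags
X = rng.poisson(0.3, size=(60, 20))
X[:, 0] += y                             # word 0 appears more often in the high class

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "KNN Classifier": KNeighborsClassifier(),
    "Naive Bayes Bernoulli": BernoulliNB(),
    "Naive Bayes Multinomial": MultinomialNB(),
}

results = {}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    results[name] = (scores.mean(), scores.std())

# A high mean with a low standard deviation is the combination to look for
for name, (mean, std) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:<26} mean={mean:.3f}  std={std:.3f}")
```

Keeping the standard deviation next to the mean makes the variance problem visible at a glance, rather than hiding it behind a single score.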

Here are the results. As expected with a small dataset, we see high variance in the scoring. Too many features and not enough data create a risk of overfitting, where the model is too sensitive to the data it was trained on, resulting in a wide range of scores when the model is evaluated on unseen data (high variance). Normally this can be addressed with feature reduction; however, that was not the case here, so it is likely we will need more data to reduce that variance.

CTOR Modelling

OR Modelling

The Results

To determine which model is best, I look for the least variance and the highest score. Remember that the scoring is “roc_auc”, which measures our model’s ability to distinguish between the two classes we have created (1 or high, 0 or low). Looking at both of these criteria, I determine that the best models are Logistic Regression for the CTOR and Random Forest for the OR, both with all features included.

By reviewing the features of these models, I select a few words that appear to have a strong influence on the OR or CTOR. These are:

  • resumo
  • retrospectiva
  • slipknot
  • nervosa
  • mustaine
  • lennon

In order to check the robustness of our model, we take a closer look at the OR and CTOR of the newsletters in which these words appear. We do this by running a t-test and can conclude that these keywords successfully segment our users based on their open and click behavior.
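A keyword check of this kind could be sketched with SciPy as follows. The CTOR values are invented, and Welch's unequal-variance version of the t-test and the 0.05 threshold are assumptions, since the exact variant used is not stated here.

```python
from scipy import stats

# Hypothetical CTOR values for newsletters with and without a given keyword
ctor_with_keyword = [0.10, 0.11, 0.09, 0.12, 0.10]
ctor_without_keyword = [0.20, 0.22, 0.19, 0.21, 0.23, 0.20]

# Welch's t-test: does not assume the two groups share the same variance
result = stats.ttest_ind(ctor_with_keyword, ctor_without_keyword, equal_var=False)

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
if result.pvalue < 0.05:
    print("The difference in mean CTOR between the groups is statistically significant.")
else:
    print("No significant difference detected.")
```

With only a handful of newsletters containing a given keyword, the test has limited power, so a non-significant result would not prove the keyword has no effect.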

The features pulled from our two top models tell us a story. Both indicate “resumo” (summary) and “retrospectiva” (retrospective) as important features; however, with the Logistic Regression we can infer that the relationship there is negative. Upon closer inspection we find that these keywords have a positive influence on OR but a negative influence on CTOR, indicating that readers have no interest in clicking through to the website. For example, a user may open an email interested in reading a recap of the news but, upon opening it, realise they have already read all of the articles and not click further.

We also investigate some artist names and find further segmentation. In all instances the effect on OR is the opposite of the effect on CTOR. For example, it could be that only fans of that artist bother opening the email and, once it is opened, they are keen to click through and read the latest news.


Limitations

In all cases we were unable to remove the variance in the modelling. After trying various feature reduction methods, we can conclude that the issue is the size of the dataset.

Aside from the size of the dataset, the biggest issue faced here is the language barrier, given that the data is in Portuguese. It is likely that more words could be added to the stop-word list; however, given my limited knowledge of the language, this was not done.

A better understanding of email marketing metrics would also be helpful.


The problem we set out to address was how to increase clicks to the website from the newsletter. As we did not model CR, given its minuscule range, we focussed on OR and CTOR.

In real terms, the recommendation will need to be based on what the future business strategy will be and what to focus on. Having looked at where the business sits across email marketing benchmarks, perhaps a focus on improving the OR could be a good recommendation since the CTOR is overperforming. In this case, although our results indicate that newsletters featuring “resumo” and “retrospectiva” negatively impact the CTOR, they do have a positive impact on OR.

To summarize, if the business wants more people to open their emails then a focus on news recaps may be helpful. Alternatively, if they want more people to engage with the content after opening then a focus on breaking news on artists may be helpful.

The caveat remains that these results come from a small dataset with high variance and although we have tested for robustness of the results, there is still room for improvement.

Further work

This model can be revisited in the future when more data is available; however, given the frequency of the newsletter, we would be looking at a wait of a few years to increase the dataset substantially.

The newsletter dashboard does hold more data per email sent, some of which can be pulled individually from each email and some of which can be compiled manually (although this would be a lengthy job). This data includes details for each newsletter or “campaign”, including but not limited to: email content, user domain data, geographical open data, device, browser, and specifics on which links within the email were clicked.

We could improve on the current models by optimizing the hyperparameters with a grid search or by combining the models with Voting or Stacking. The latter would reduce interpretability, however.
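As a sketch of what the hyperparameter search could look like, the example below tunes the regularisation strength of a Logistic Regression with scikit-learn's GridSearchCV, keeping "roc_auc" as the score. The data is synthetic and the grid of C values is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for word counts: 60 newsletters x 20 words
rng = np.random.default_rng(0)
y = np.tile([0, 1], 30)
X = rng.poisson(0.3, size=(60, 20))
X[:, 0] += y                              # make one word informative

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # smaller C = stronger regularisation
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)

print("best C:", search.best_params_["C"])
print(f"best mean roc_auc: {search.best_score_:.3f}")
```

On a dataset this small, the best parameters found are themselves subject to high variance, so a grid search is unlikely to fix the underlying issue on its own.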

An improvement can be made on the keywords to better recognize artist & band names, particularly those with two words e.g. Dave Mustaine or Alice Cooper.

We can also look into improving our understanding of what artists are “trending”.

Key Learnings & Challenges

  • This project allowed me to research and construct WordClouds
  • It also allowed me to refine some visualisation skills, specifically when visualising the rates with benchmarks and comparing the model scoring
  • Using a t-test to check for outliers, user-segmentation and model robustness was a key learning opportunity for me
  • My personal challenges are mentioned under limitations as well: a language barrier, and a limited understanding of email marketing strategy


I would love to hear any feedback or questions you may have as I continue my learning journey. Feel free to leave a comment here or contact me on LinkedIn.