How to send mass mail without crashing your website

Sep 18, 2024 | Peter Denison

mail
edm
performance

We often field queries from disappointed clients who’ve just sent an email campaign, only to find that their website immediately started displaying errors or timing out.

People are often surprised by this ('We only sent a small campaign to 2k recipients!'), but server load during an EDM (electronic direct mail) campaign can easily be hundreds of times higher than the baseline. This is because most of the traffic from a mail campaign arrives in the first few minutes after it is sent.

The problem

This spike in traffic, and the resulting server load, comes from two sources: organic traffic (human recipients clicking links in the email) and click bots. Most click bots are email security products that follow every link in an email to check the target for malicious content. Most mass mail providers also rewrite each link to add tracking information, so every email sent can produce its own set of unique bot clicks. With that in mind, it's easy to see how one email matters. Sending an email with 5 links to 2000 recipients, 20% of whom use security software that opens every link, can generate 2000 page requests within a minute (400 scanned mailboxes × 5 links) and cause serious problems at the hosting end. It gets worse once the site starts struggling: click bots can retry requests until they get a non-error response, so errors generate even more traffic.
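The arithmetic above is worth running against your own numbers. Below is a minimal Python sketch. It assumes that each scanned mailbox fetches every link once per attempt. The `attempts_per_link` parameter is a hypothetical knob for scanners that retry, and real retry behaviour varies between security products.

```python
def bot_requests(recipients: int, links_per_email: int,
                 scanned_fraction: float, attempts_per_link: int = 1) -> int:
    """Estimate link-scanner requests triggered by a single send.

    Assumes each scanned mailbox fetches every (uniquely rewritten) link
    once per attempt. attempts_per_link is a hypothetical retry factor.
    """
    scanned_mailboxes = round(recipients * scanned_fraction)
    return scanned_mailboxes * links_per_email * attempts_per_link


# Figures from the paragraph above: 2000 recipients, 5 links, 20% scanned.
print(bot_requests(2000, 5, 0.20))
# Hypothetical: each scanner makes 3 attempts per link because the site is erroring.
print(bot_requests(2000, 5, 0.20, attempts_per_link=3))
```

The first call reproduces the 2000 requests from the example. The second shows how retries against a failing site can triple that load.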

Most hosting solutions are set up to avoid overload in the first place, rather than to recover gracefully from massive overload. When a server is overloaded, requests queue up. The server can then run at 100% CPU with exhausted RAM for significantly longer than the original traffic spike, and is sometimes so overloaded that it needs a manual restart. Prevention is better than cure, but a modern hosting solution shouldn't be unavailable for 20 minutes after a brief, unexpected spike in traffic. If you're looking for hosting that recovers quickly from unexpected issues, have a look at our self-healing Kubernetes solutions. Kubernetes-based hosting has significant advantages for handling EDM traffic: it scales quickly to meet unexpected demand, and it recovers gracefully from overload.
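Why does overload outlast the spike? A toy fluid-queue model in Python makes the mechanism visible. The capacity and arrival rates are hypothetical. The model also deliberately ignores timeouts, retries and memory exhaustion, all of which make real incidents worse.

```python
def overload_seconds(capacity: float, spike_rate: float,
                     spike_seconds: int, baseline_rate: float) -> int:
    """Seconds from the start of a spike until the request backlog is cleared.

    Requests arrive at spike_rate per second for spike_seconds, then at
    baseline_rate. The server completes at most `capacity` requests per
    second; the excess waits in a queue.
    """
    if baseline_rate >= capacity:
        raise ValueError("baseline load alone meets or exceeds capacity")
    backlog = 0.0
    t = 0
    while True:
        arrivals = spike_rate if t < spike_seconds else baseline_rate
        # Whatever the server can't finish this second carries over.
        backlog = max(0.0, backlog + arrivals - capacity)
        t += 1
        if t >= spike_seconds and backlog == 0.0:
            return t


# Hypothetical: capacity 20 req/s, a 60 req/s spike lasting one minute, baseline 2 req/s.
print(overload_seconds(20, 60, 60, 2))
```

With these numbers, 2400 requests are queued by the end of the minute. They drain at only 18 per second, so a one-minute spike turns into 194 seconds of saturation.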

Regardless of the hosting solution, however, it's unrealistic to expect a website to handle any spike in traffic you throw at it. Normal browsing traffic is spread across the day, so the ratio between baseline and EDM traffic can be stark, and it really shouldn't be underestimated.
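To put a number on that ratio, here is a small Python sketch comparing a burst with the day's average request rate. The daily figure is hypothetical, and the burst reuses the 2000-request example above. Averaging over 24 hours understates daytime traffic, so this overstates the ratio compared with your busiest normal hour.

```python
def peak_to_baseline(daily_requests: int, burst_requests: int,
                     burst_minutes: float) -> float:
    """Ratio of the burst request rate to the average per-minute rate over a day."""
    baseline_per_minute = daily_requests / (24 * 60)
    burst_per_minute = burst_requests / burst_minutes
    return burst_per_minute / baseline_per_minute


# Hypothetical site serving 10,000 requests a day, then 2000 bot requests in one minute.
ratio = peak_to_baseline(10_000, 2_000, 1)
print(f"burst is ~{ratio:.0f}x the average rate")
```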

Since EDMs are important business tools, we recommend approaching the problem from both ends: increasing capacity and reducing load spikes.

The Solution, Part 1: Reducing Load Spikes

Caching the landing pages that an email links to is critical. If you have a caching layer like Varnish in front of your application (and if you don't, you definitely should), make sure that requests for the links in your emails don't reach the backend. If your email platform is turning

www.mysite.com/special-deal

into

www.mysite.com/special-deal?track=something-unique

then Varnish needs to be configured to ignore that specific query parameter when it looks up the cache. The request then returns a cache hit instead of going to the backend. Your developer should be able to implement this by modifying the VCL within an hour.
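Varnish expresses this in VCL, where it is typically done in `vcl_recv` by rewriting `req.url`, but the underlying logic is easy to show in Python. The sketch below drops tracking parameters so that every uniquely tagged link maps to one cached page. The parameter names are assumptions: `track` comes from the example above, and the `utm_*` prefix is a common convention. Check which parameters your mail platform actually appends, and only strip ones your backend doesn't need.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters: "track" from the example, plus any utm_* parameter.
TRACKING_PARAMS = {"track"}
TRACKING_PREFIXES = ("utm_",)


def cache_key(url: str) -> str:
    """Return the URL with tracking parameters removed, for use as a cache key."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_PARAMS and not name.startswith(TRACKING_PREFIXES)
    ]
    # Sort what's left so ?a=1&b=2 and ?b=2&a=1 share one cache entry.
    kept.sort()
    # Fragments are never sent to the server, so they are dropped.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(cache_key("https://www.mysite.com/special-deal?track=something-unique"))
print(cache_key("https://www.mysite.com/special-deal?page=2&utm_source=edm&track=abc"))
```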

Spreading the load over as long a period as practical is also very important. We can't speak to your unique business case: if your campaign centres on a product that is only available for a 10-minute window, this may not apply to you. However, the majority of email campaigns can and should be delivered over a longer period than they usually are. Stagger the send as much as you can.

We really can't stress this enough. Say the business requirement is to send 2000 emails between 10 am and 12 pm. There is a huge difference between sending all 2000 at 10 am and sending 84 every five minutes from 10 am (2000 emails spread across 24 five-minute slots). Again, the business case needs to be considered. If you find that conversion is double in the morning compared to the afternoon, then that's when mail should be sent, but perhaps the campaign can be sent over two or three days.
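A minimal Python sketch of that schedule follows, assuming equal-sized batches at a fixed interval. The function name and signature are illustrative only and do not correspond to any mail platform's feature.

```python
import math
from datetime import datetime, timedelta


def batch_schedule(total: int, start: datetime, end: datetime,
                   interval: timedelta) -> list[tuple[datetime, int]]:
    """Split `total` sends into batches of at most ceil(total / slots), one per interval."""
    slots = int((end - start) / interval)
    if slots < 1:
        raise ValueError("window is shorter than one interval")
    per_batch = math.ceil(total / slots)
    schedule = []
    remaining = total
    for i in range(slots):
        if remaining <= 0:
            break
        size = min(per_batch, remaining)
        schedule.append((start + i * interval, size))
        remaining -= size
    return schedule


# 2000 emails between 10 am and 12 pm, one batch every five minutes.
plan = batch_schedule(2000, datetime(2024, 9, 18, 10, 0),
                      datetime(2024, 9, 18, 12, 0), timedelta(minutes=5))
for when, size in plan[:3]:
    print(when.strftime("%H:%M"), size)
print(len(plan), "batches; the last holds", plan[-1][1])
```

Rounding up to 84 per batch means the final 11:55 batch carries the remaining 68 emails.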

Most mass mail platforms allow you to stagger sends natively, e.g.:

https://mailchimp.com/help/schedule-batch-delivery/

https://help.klaviyo.com/hc/en-us/articles/360050216012

If your mass mail provider doesn’t have a feature to stagger mail, you can still split your list into several smaller lists that are then sent at different times.
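If you do have to split a list by hand, a short Python sketch like this one will divide an exported recipient CSV into near-equal parts. It assumes a CSV with a single header row. Check the column layout and the import format your platform expects.

```python
import csv
import io


def split_recipients(csv_text: str, parts: int) -> list[str]:
    """Split a recipient CSV (with a header row) into `parts` near-equal CSVs.

    Each output keeps the header so it can be imported as a separate list.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    size, extra = divmod(len(rows), parts)
    outputs = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one additional row each.
        end = start + size + (1 if i < extra else 0)
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows[start:end])
        outputs.append(buf.getvalue())
        start = end
    return outputs


# Hypothetical list of 10 recipients split into 3 sends.
sample = "email\n" + "".join(f"user{i}@example.com\n" for i in range(10))
chunks = split_recipients(sample, 3)
print([chunk.count("\n") - 1 for chunk in chunks])  # recipients per chunk
```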

The Solution, Part 2: Increasing Server Capacity

If your hosting plan has fixed resources, the calculation is relatively simple: you need enough memory and CPU to handle the highest peak load during a mailout, ideally with some headroom. You need to be able to see your resource usage graphs across several mail sends, and then size the plan accordingly:

Figure: A stacked graph of CPU usage over time, showing the variable resource usage caused by mass mail campaigns. Observe current load during EDMs to gauge required resource limits.
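Turning those graphs into a plan size can be sketched in a few lines of Python. The peak percentages, the core count and the 30% headroom default are hypothetical, and the same approach applies to memory. The sketch builds in one caveat. If a graph flat-lines at 100% during a send, the server was saturated and true demand is unknown, so that peak can't be used for sizing.

```python
import math


def required_cores(peak_cpu_percent_by_send: list[float], current_cores: int,
                   headroom: float = 0.3) -> int:
    """Size a fixed plan from the worst observed EDM peak, plus headroom.

    peak_cpu_percent_by_send holds the peak CPU utilisation (0-100, as a share
    of the current plan's total CPU) observed during each past send.
    """
    worst = max(peak_cpu_percent_by_send)
    if worst >= 100:
        # A saturated server only tells you that demand exceeded capacity.
        raise ValueError("peak hit 100%; measured usage understates real demand")
    cores_at_peak = current_cores * worst / 100
    return math.ceil(cores_at_peak * (1 + headroom))


# Hypothetical peaks from three past sends on a 4-core plan.
print(required_cores([55, 70, 85], current_cores=4))
```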

Ideally your hosting plan can auto-scale, keeping costs low during normal, baseline traffic, and then allocating additional resources when traffic ramps up.

Scaling isn’t, unfortunately, always as simple as taking a functioning website and saying “I want to be able to host traffic volumes 1000 times bigger than the baseline”:

- Speed of scaling is critical. If traffic spikes sharply and scaling relies on new instances (e.g. EC2) booting before they're added to a scaling group, there's a good chance your existing instances will become overloaded before help arrives, leading to knock-on effects.
- Setting up scaling strategies isn't as simple as it sounds, and they need to be tuned to keep the site up and performing during real-world traffic events. Do you scale solely on CPU usage? Memory? How aggressively should the application scale up, and then back down?
- There are likely to be other bottlenecks in addition to CPU and memory:
  - Scaling databases is non-trivial, and depending on the application design, database locks can be an issue.
  - Horizontal scaling doesn't scale shared disk IO.
  - Network capacity needs to be sufficient to handle the spike in traffic.
- Applications need to be tested with either real-world or synthetic traffic to determine the rate and volume of traffic they can actually serve.
- Scaling still needs limits, both on absolute cost per minute and on permissible monthly spend. Paying 100 times the normal monthly hosting cost to serve aggressive bot traffic isn't acceptable, for example, so cost alerts need to be set up and monitored.
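One way to turn those cost limits into hard numbers, sketched in Python, uses two spending limits. A spend-rate cap (expressed per hour here) sets a replica ceiling, and the monthly budget sets a burst allowance. All prices are hypothetical. Real bills also include more than compute, such as bandwidth, load balancers and databases, so this complements cost alerts rather than replacing them.

```python
import math

HOURS_PER_MONTH = 730  # 8760 hours a year / 12


def scaling_limits(baseline_replicas: int, cost_per_replica_hour: float,
                   max_spend_per_hour: float, monthly_budget: float) -> tuple[int, float]:
    """Return (max_replicas, burst_replica_hours) implied by two spending limits.

    max_replicas: the most replicas that fit under the hourly spend cap.
    burst_replica_hours: replica-hours above baseline that the monthly budget
    can absorb after paying for the baseline all month.
    """
    baseline_cost = baseline_replicas * cost_per_replica_hour * HOURS_PER_MONTH
    if baseline_cost > monthly_budget:
        raise ValueError("baseline alone exceeds the monthly budget")
    max_replicas = math.floor(max_spend_per_hour / cost_per_replica_hour)
    burst_replica_hours = (monthly_budget - baseline_cost) / cost_per_replica_hour
    return max_replicas, burst_replica_hours


# Hypothetical: 2 baseline replicas at $0.25/hour, a $5/hour cap, a $400 monthly budget.
print(scaling_limits(2, 0.25, 5.0, 400.0))
```

With these numbers the autoscaler's maximum would be 20 replicas. A month's bursts could use about 140 extra replica-hours, for example 18 extra replicas for just under 8 hours.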

With a properly set up scaling solution, however, it is possible to both reduce costs at baseline and serve heavy traffic during busy periods. Kubernetes hosting is ideal for this, for several reasons:

- Scaling is fast: new pods come up in seconds, rather than minutes.
- Scaling is built in, and scaling strategies can be based on CPU usage, RAM and a host of other metrics.
- The components of the stack that actually need to scale can be scaled granularly.
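For reference, the core of the Kubernetes Horizontal Pod Autoscaler's documented algorithm is a ratio: desired replicas = ceil(current replicas × current metric / target metric). No scaling happens when that ratio is within a tolerance (0.1 by default) of 1. The Python sketch below is a simplified illustration of that rule, not a reimplementation of the controller. It leaves out the stabilisation window, scaling policies and the special handling of pods that aren't ready yet.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale in proportion to metric / target, then clamp."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: don't scale
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 3 pods averaging 180% of their CPU request against a 60% target: triple the pods.
print(desired_replicas(3, 180, 60, min_replicas=2, max_replicas=20))
# 62% against a 60% target is within the default tolerance, so nothing changes.
print(desired_replicas(3, 62, 60, min_replicas=2, max_replicas=20))
```

The `max_replicas` clamp is where a cost ceiling like the one discussed earlier would be applied.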

Please get in touch if you would like a demonstration of the speed and resilience with which K8s can scale in response to EDM traffic.

Summary—the critical points

The following steps should be taken to ensure that your site stays up during a mass mail send:

- Use Varnish, and configure your VCL so that requests for EDM links are served from cache instead of reaching your backend.
- Stagger the send as much as possible within your business constraints.
- Once the above is done, size your hosting for the peak. On fixed resources, make your server resources bigger than your largest EDM peak. On a scalable solution, ensure that the scaling resource limit is higher than the largest EDM peak, and that the scaling methodology brings up the required components quickly enough to meet traffic demand.
