Beginner's guide to high performance caching with SSL

Dec 17, 2012

Note: This is the first of a two-part post. For implementation details, see the follow-up post.

Here at Zivtech we're always looking for ways to speed up a site. Drupal is a great tool and massively powerful, but with that power comes performance trade-offs. The most noticeable trade-off is site speed. Each page load on a Drupal site that isn't cached is expensive from the point of view of the server. Generating all the data necessary to perform a single page load requires checking permissions, user settings, and block settings, just to name a few core systems. All this base functionality code that must run (commonly called the Drupal bootstrap) adds a tremendous amount of overhead to your site. This is one reason Drupal has many caches of data. The idea is to store as much of the commonly used data as possible so that the server doesn't have to do all the work on every page load. This helps but there are better tools out there.

Varnish

Varnish emerged as a favorite caching technology in the Drupal world mainly because it is optimized for HTTP. Varnish is its own open source project. It is a caching reverse-proxy that allows tremendous flexibility and reconfigurability. Even so, it can be set up rather simply. If the term "caching reverse-proxy" makes your eyes glaze, you're not alone, but the concept is very simple and worth understanding. The caching part is somewhat obvious: Varnish retains data that is commonly requested so that it can be retrieved quickly. As a reverse-proxy, Varnish acts as a middle-man. Varnish itself is a server but it is not an application server. This means that Varnish can't host a Drupal site, but if it's been given some HTML it can serve that to a requesting client. Varnish can also pass requests to an application server that does host a Drupal site.

I think it's worth explaining how a reverse-proxy is different from a “normal” proxy, but I'm going to lay some groundwork with analogies first. In the postscript we'll come back to this difference.

To understand the work Varnish does, imagine the following scenario. You've been very busy at work and you've never had a great memory. Clients keep calling you with questions. To answer each question, you've got to look up the answer in your files. Soon, client questions have piled up and there are a bunch of people waiting on hold for your answers. To ease your workload, you station a secretary in front of your office to handle incoming calls. When the secretary first begins working for you he might not have all the answers to the incoming calls. He'll bug you a lot in these early days. But he has a terrific memory and he remembers everything you tell him. Each time a similar call comes in, he can respond on your behalf and leave you to handle more important matters.

In this scenario, the decision maker (the person in the office) is the HTTP/PHP server, which for most of us is Apache. So Varnish sits in front of Apache and bugs it a lot at first. But Varnish caches each page response and, on subsequent requests, just serves that page from the cache. This means that if your site has mostly static pages for anonymous users (and most do), then once most of the pages have been visited the site will be served almost entirely from cache, which will make it feel almost as fast as one of those old static HTML file sites from the '90s. Remember those?
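The secretary's behavior can be sketched in a few lines of Python. This is only a toy model of the idea, not how Varnish is implemented or configured: the `CachingProxy` class and the `slow_backend` function are made-up names for illustration, and a real cache also has to deal with expiry, cookies, and logged-in users.

```python
from typing import Callable, Dict, List


class CachingProxy:
    """Toy stand-in for the secretary (Varnish): remembers answers by path."""

    def __init__(self, backend: Callable[[str], str]) -> None:
        self.backend = backend
        self.cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> str:
        # Known answer: respond immediately without bothering the backend.
        if path in self.cache:
            self.hits += 1
            return self.cache[path]
        # Unknown answer: ask the backend once and remember the reply.
        self.misses += 1
        body = self.backend(path)
        self.cache[path] = body
        return body


backend_calls: List[str] = []


def slow_backend(path: str) -> str:
    """Hypothetical stand-in for Apache running the Drupal bootstrap."""
    backend_calls.append(path)
    return f"<html>page for {path}</html>"


proxy = CachingProxy(slow_backend)
for path in ["/about", "/about", "/contact", "/about"]:
    proxy.get(path)

print(f"hits={proxy.hits} misses={proxy.misses} backend calls={len(backend_calls)}")
```

Four requests arrive, but the backend is only asked twice, once for each distinct page. Every repeat visit is answered from memory.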

So the takeaway so far should be "Varnish is great and I should be using it." The problem comes when you want to use Varnish on a site that is secured by SSL.

Security is a feature

We're fans of securing sites with SSL whenever we can. At the very least we think it's important to secure logins and administration pages, but as far as I am concerned we really should be using SSL everywhere. There is little overhead for enabling SSL everywhere. We're talking about a one to two percent performance difference on most sites. The benefits are great though, and as your site grows and features are added you can be comfortable knowing that there is a safe channel between you and your server. With SSL everywhere, your passwords, your data, and your site sessions will be safe even if you're surfing from an open WiFi point such as an airport, coffee shop, or train. Even anonymous file downloads can benefit from the protection of an SSL certificate. Imagine downloading a new version of Adobe Flash Player only to find that the file was actually a Trojan served by a hacker who happened to inject some data into your requests to Adobe. Sound far-fetched? It's not.

Takeaway #2: "If a site is worth protecting with a password it's probably worth the $100 a year to add an SSL cert."

Sadly, Varnish doesn't handle SSL termination. This means that Varnish can't decrypt the encrypted requests that come from a secured connection. Let's return to our secretary example. Imagine that the encryption provided by the SSL cert via HTTPS was like a foreign language (this isn't how encryption works, but the difference between HTTP and HTTPS is analogous to two different languages). Suppose many of the calls you are used to receiving in the office are from French speakers. That's wonderful, but you just got this secretary with a great memory who doesn't speak any French. So now he's just calling you every minute and bugging you again. It's even worse if you want your office to serve French speakers only and so you just refuse to take calls that aren't in French. Now your secretary is totally useless. So what should you do?

Pound to the rescue

Pound is another reverse-proxy tool that happens to handle SSL termination. Pound is the kind of software that I love. It is open source, very stable and it does only a few things, but it does those few things very well.

In our office example Pound is the solution to your secretary's inability to speak French. Here Pound is like a perfectly bilingual translator. He's not great at too much else but he's a great fit for your office setup. With the addition of your new translator, phone calls come in and the translator answers them. He in turn translates the French requests to English and passes the request on to your secretary. Your secretary responds immediately (in English) if he knows the answer, otherwise he'll ask you and then remember it for next time. The secretary responds to the translator who then gives the information back to the caller in French. Callers never know that behind the scenes there was a bunch of English communication and they don't know who actually had the answer to the question, they just know that they received quick responses in their native language.

This is the final setup. Pound sits at the front of the server taking incoming secure Web requests. Pound handles the encryption and decryption of traffic and hands requests off to Varnish. Varnish responds immediately if it has a cached version of the page. If it does not, it asks Apache for the page. This starts the Drupal bootstrap and runs thousands of lines of PHP code along with dozens of MySQL (or other compatible database) queries. The resulting page is handed back to Varnish (which will remember it for next time) and Varnish hands it off to Pound. Pound encrypts the data and sends it to the user, whose browser then decrypts it. Voilà, secure pages served up (on average) significantly faster than any normal Drupal site.
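The request flow just described can be traced with a small Python sketch. Everything here is an assumption made for illustration: the `pound`, `varnish`, and `apache` functions are toy stand-ins named after the real tools, and base64 is used purely as a placeholder for the "foreign language" of HTTPS. It provides no security and is not how SSL works.

```python
import base64
from typing import Dict, List

trace: List[str] = []        # records which layer handled each step
_cache: Dict[str, str] = {}  # the toy Varnish cache, keyed by path


def apache(path: str) -> str:
    """Hypothetical stand-in for Apache running the Drupal bootstrap."""
    trace.append("apache")
    return f"<html>{path}</html>"


def varnish(path: str) -> str:
    """Answer from cache when possible; otherwise ask Apache and remember."""
    trace.append("varnish")
    if path not in _cache:
        _cache[path] = apache(path)
    return _cache[path]


def pound(encoded_request: bytes) -> bytes:
    """Translate the request inward and the response outward.

    base64 is only a placeholder for SSL here; it is NOT encryption.
    """
    trace.append("pound")
    path = base64.b64decode(encoded_request).decode("utf-8")
    body = varnish(path)
    return base64.b64encode(body.encode("utf-8"))


def browser_get(path: str) -> str:
    """The caller only ever speaks the 'encoded' language to Pound."""
    reply = pound(base64.b64encode(path.encode("utf-8")))
    return base64.b64decode(reply).decode("utf-8")


first = browser_get("/node/1")
second = browser_get("/node/1")
print(trace)
```

On the first request the trace passes through all three layers. On the second it stops at Varnish, so Apache (and the Drupal bootstrap behind it) is never touched, yet the browser receives the same page both times.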

In a later post we'll describe how to actually wire all this up but I thought it would be a good start to lay some foundational knowledge so that we have a (more) clear picture of how these technologies work together to make the dirty business of handling Web requests faster.

The Pound-Varnish-Apache setup may not be for every site. Budget, maintenance, and lack of knowledge may rule it out on some projects. Even so, almost any Drupal site could benefit from such a setup, and I would hope that this knowledge will help you make informed decisions on your next project.

Post Script: Proxy vs Reverse-Proxy

Now that we have a nice example involving an office setup and a secretary, I can extend that to make clear the difference between a reverse-proxy and a “normal” proxy which is generally called a forward-proxy. The distinction resides in who has the secretary (the server or the client).

The Reverse-Proxy
If you were to call the office that we described in our examples you would not know who actually had the answer to your question. You'd only be speaking to a single individual, but that person may actually have contacted three or four others to get you the information you requested. You'll never know who actually answered your question. Since the office you called has a public number (is a server of information) this is a reverse proxy. Note that the company you called does not discriminate among incoming calls. They'll answer the phone for anyone but handle the flow of data in the office in a way that is hidden from you. A reverse proxy is a network gateway to any number of other servers or services. Generally a reverse-proxy is available to anyone on the wider network but has a restricted set of back end machines.

The Forward-Proxy
Forward proxies happen on the client side. Imagine that you've hired a personal secretary. Your personal secretary actually has a number of clients, none of whom have public phone numbers. When you need information from the outside world, you call your secretary and ask her to retrieve it for you. She then calls out to the company you want information from and gets back to you with the response. The company never knows exactly which client received the information. It could have been any one of the secretary's many clients. Note that not just anyone can call the personal secretary and ask her to get them information. She won't respond to people who aren't her clients. In a network, a forward proxy allows a known set of users to connect and access a wider network through it. Importantly, connecting to a forward proxy is not open to everyone on the network. It's similar to a home router that you might use to allow all your computers and devices to connect to the Internet via a single IP address.
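The distinction can be reduced to who is restricted on each side. The following Python sketch is a simplified model under that assumption. The class names, the `outside_server` function, and the client names are all hypothetical, and real proxies involve far more than an allow-list.

```python
from typing import Callable, Dict, Set


class ReverseProxy:
    """Open to any caller; forwards only to its own fixed set of back ends."""

    def __init__(self, backends: Dict[str, Callable[[str], str]]) -> None:
        self.backends = backends

    def handle(self, client: str, service: str, request: str) -> str:
        # The client is deliberately not checked: anyone may call the office.
        if service not in self.backends:
            raise LookupError(f"no back end for {service}")
        return self.backends[service](request)


class ForwardProxy:
    """Open only to known clients; contacts any outside server for them."""

    def __init__(self, allowed_clients: Set[str]) -> None:
        self.allowed_clients = allowed_clients

    def handle(self, client: str, server: Callable[[str, str], str],
               request: str) -> str:
        if client not in self.allowed_clients:
            raise PermissionError(f"{client} is not a client of this proxy")
        # The outside server sees the proxy as the caller, not the client.
        return server("forward-proxy", request)


def outside_server(caller: str, request: str) -> str:
    """Hypothetical public server; it only sees who called it directly."""
    return f"answer to {request} for {caller}"


office = ReverseProxy({"web": lambda req: f"page for {req}"})
print(office.handle("anyone", "web", "/about"))

secretary = ForwardProxy({"alice", "bob"})
print(secretary.handle("alice", outside_server, "/news"))

try:
    secretary.handle("mallory", outside_server, "/news")
except PermissionError as err:
    print("refused:", err)
```

The reverse proxy never looks at who is calling but can only reach its own back ends. The forward proxy checks the caller first and then reaches whichever server it is pointed at, and that server never learns which client asked.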