Lately I’ve been ramping up my ability to test the runtime impact of changes to WordPress and someone asked me to share more specifically what methodology I’m using. I’m not content with my setup, but in this post I’d like to share what it is and what I’ve learned so far in getting here.
If you’re up for a long read, then follow along and I’ll share the approach I’ve been taking lately to investigate some performance claims in the upcoming WordPress 6.4 release.
Test Hardware
I’ve been comparing runtime performance on two separate devices: my old work laptop, and donated time on a server in a research compute cluster.
The Laptop
- 8-Core i9 5.0 GHz processor
- 64 GB RAM
- 500 GB embedded SSD
- Running in a room with a ceiling fan blowing, on a stand that permits airflow around and underneath the laptop, with no obstructions around the input or output vents. I use MacsFanControl to pin the internal cooling fans at full speed. This effectively prevents thermal throttling in the workloads that I have seen while testing WordPress.
- Nothing else is running other than the normal array of background tasks that macOS runs.
The Server
- 4-Core 2.0 GHz EPYC processor
- 16 GB RAM
- Storage is unknown to me
- Running in an actively cooled rack in an actively cooled data center.
- CPU Cores are dedicated to the host and are not supposed to be shared with any others. Running Ubuntu and not running any notable background processes.
Because these tests run for so long, I log in to each machine via SSH and open my testing environment with tmux. If I ever skip opening a tmux session I usually regret it when some computer unexpectedly reboots or my network connection drops.
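For reference, this is roughly the invocation I use; the session name here is arbitrary and just an example.
# Start a named tmux session, or re-attach to it if it already exists
# after an unexpected disconnect. The session name is just an example.
tmux new-session -A -s perf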
Test Plan and Setup
Measuring the wp-total metric
I’m relying on the Server-Timing HTTP header that the Performance Lab and Felix Arntz’ server timing output buffer plugins provide. To be honest, I’ve never questioned why I’m using this or what role each plugin plays. When I began testing I was pointed to the wpp-research tooling and its instructions to activate these plugins, and discovered that the Server-Timing header was clear and easy to work with and should have relatively little impact on the running code (vs. far more intrusive methods of profiling, which can substantially skew the picture of where code is spending its time). I don’t even know exactly what the wp-total metric comprises, but I am measuring when it goes up and when it goes down.
The trouble I have with the wpp-research tools is that they only report aggregate statistics and don’t show the distribution of the results. Thankfully, it’s easy enough to extract the HTTP header and collect the individual data points manually.
~$ curl http://localhost/ -D - 2>/dev/null | grep 'Server-Timing'
Server-Timing: wp-load-alloptions-query;dur=0.98, wp-before-template;dur=38.87, wp-template;dur=42.7, wp-total;dur=81.58
For my experiments I run curl against the test site and append each of these lines to a file associated with the code under test, as in the sketch below.
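Something like this, where the output filename is just a placeholder for whichever branch or commit is under test:
# Append one Server-Timing sample to the results file for the code under test.
curl http://localhost/ -D - 2>/dev/null | grep 'Server-Timing' >> perf-my-test-branch.txt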
Running native code
I’m terribly confused by the layers of abstraction and indirection involved in a number of the Docker, wp-env, npm scripts, and other systems for running WordPress. For some reason they always seem to mess with my files, break things, and stand in the way of the debugging I’m trying to do.
Someone showed me my favorite method of developing WordPress long ago, and I use it not only for my daily work but also in these tests: I run MySQL locally and PHP natively from the WordPress directory.
~$ cd wordpress-develop/src
~$ php -S 0.0.0.0:80
It’s not convenient for all kinds of testing, but there’s not even a webserver like nginx here. We’re eliminating almost everything but the bare minimum, which seems sufficient so far for the general-purpose performance testing going on. For example, I’m not depending on multi-site installs or verifying that permalinks work. Besides, I want to test the PHP code running WordPress, not nginx or Docker’s filesystem performance.
This simple environment is quick and easy to start, stop, and reset.
Resetting state
Since WordPress performs caching and runs its own background jobs, I want to ensure that repeatedly making requests against a site doesn’t mislead the results. More importantly, some patches I’ve been testing change the way in which computation is cached. I want to make sure that the remnants of one test group in the database don’t bleed into another and bias those results.
To mitigate this risk I run my tests in batches and at the start of each batch I perform a database restore to a known pristine state.
- Start by running DROP DATABASE wordpress in MySQL to eliminate the database I use for my test site.
- Recreate the database and grant privileges on it for the wpuser@localhost user configured in the test’s wp-config.php (a sketch of these first two steps follows this list).
- Run through the site installation at http://localhost/wp-admin.
- Install and activate the two required performance testing plugins and deactivate the options in the Performance Lab plugin.
- Ensure the appropriate theme is installed and activated.
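For the first two steps, this is a minimal sketch of the commands I have in mind, assuming the wpuser/wpuser credentials that appear elsewhere in this post; adjust them to match your own wp-config.php.
# Drop and recreate the test database, then make sure the WordPress user can reach it.
# The database name and credentials here are assumptions matching the rest of this post.
sudo mysql -u root -e "DROP DATABASE IF EXISTS wordpress; CREATE DATABASE wordpress;"
sudo mysql -u root -e "CREATE USER IF NOT EXISTS 'wpuser'@'localhost' IDENTIFIED BY 'wpuser'; GRANT ALL PRIVILEGES ON wordpress.* TO 'wpuser'@'localhost'; FLUSH PRIVILEGES;"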
At this point the site is fresh. I take care not to visit the front end of the site. If there is any caching going on I don’t want it to happen before I take a snapshot of the site. I want that caching to appear in the test results.
sudo mysqldump -u root wordpress > fresh-install.sql
This command dumps the wordpress database into a file, and we can use that file at the start of each testing batch. The file drops all of the existing tables and replaces them with the original values.
mysql -u wpuser --password=wpuser wordpress < fresh-install.sql
Disable WP-Cron
Finally, because the WP-Cron system runs its own background tasks at intervals, I want to disable it so that those tasks don’t impact the test results. It could be the case that some change in WordPress mistakenly overloads the WP-Cron system to the point that it affects every request. By disabling the cron tasks my benchmarks will miss that, but I’m counting on other methods to uncover unintentional problems like that, and it’s easy enough to leave cron enabled for a run to check whether this is happening. Regardless, for these tests I don’t want external factors introducing non-determinism into the results.
// in wp-config.php
define( 'DISABLE_WP_CRON', true );
Never repeat a URL
This is probably unnecessary given that there are no web servers in between curl and php, but I take what the wpp-research tools do and add a random query argument to the page URL. To ensure it’s unique I use the date in seconds and append the nanoseconds reading as well, since the seconds value stays the same for an entire second and multiple requests may occur within that same second.
curl http://localhost/?rnd=$(date +%s-%N)
Batch requests
When I started running these tests I noticed that there were unexpected times when my computer or test server would run faster or slower and I didn’t know why. This had a major impact on one test group without affecting the others. These periods lasted for minutes or longer, though, so I started interleaving test runs. Normally I would prefer to randomly select which version of the code to run, but given that there are costs involved in switching branches, resetting the database, and doing other test setup, I tried to find a reasonable batch size over which to amortize that cost while spreading the bias across all groups equally. I have been using batches of 500 requests.
A script runner
Since I gave up years ago on BASH and run fish, this is a fish script.
When testing git revisions this script is easier because there’s no need to switch directories; I’ll use the specific commit SHA as the branch or group name. It also doesn’t require having a separate directory for each WordPress site, or copying the plugins into each of those directories.
# benchmark.fish
# 100 batches of 500 requests means each group runs 50,000 times or until the script is killed.
for batch_group in (seq 1 100)
    for branch in wp-6-4-rc1 wp632 some-other-test-branch
        cd ~/perf-6-4-rc1/$branch/wordpress

        # WordPress updates the database across major versions, so it may be
        # required to create a different database export for each one, and for each
        # theme you want to test against.
        if [ "wp632" = "$branch" ]
            mysql -u wpuser --password=wpuser wordpress < ~/perf-6-4-rc1/fresh-tt1-632.sql
        else
            mysql -u wpuser --password=wpuser wordpress < ~/perf-6-4-rc1/fresh-tt1.sql
        end

        php -S 0.0.0.0:80 &
        set -lx server_pid $last_pid

        for n in (seq 1 500)
            curl 'http://localhost?rnd='(date +%s-%N) -D - | grep 'Server-Timing' >> ~/perf-6-4-rc1/perf-$branch.txt
        end

        kill $server_pid
    end
end
Next I run this script and let it go for hours. At the beginning, though, I verify that the results look right and I’ll run a request in the browser to confirm that the site is rendering properly and isn’t doing something like flashing an error message. Typically I’ll have a second pane in tmux reporting on the progress of the benchmark.
watch -n 5 wc -l 'perf-*.txt'
This reports on how many samples of each test group have been collected. Every so often, when I have enough from each group, I will preview the results and move into data collection.
Data collection
I run the tests and curl on the same remote computer (to eliminate network delay as a source of bias), but I perform the data analysis locally. This typically involves a quick call to grab the files, extracting the measurements in node (for no specific reason other than it was convenient when I started), and then analyzing in R.
Copying files to my machine
There’s nothing special here. One interesting thing I discovered while on a quite terrible connection, though, is that scp seems to continue sending data if the file it’s transferring is still being written to. While copying the file for the currently-running test I would see the download exceed 100%, and I think that because the file operations were append-only, as long as scp hadn’t passed the end of the file it would keep transferring whatever new data extended it.
scp 'user@remote:~/wp-perf-6-4/perf-*.txt' .
Extracting the measurement
There are multiple values by default in the Server-Timing header, but I’ve focused on wp-total. Each line looks like the following:
Server-Timing: wp-load-alloptions-query;dur=0.98, wp-before-template;dur=38.87, wp-template;dur=42.7, wp-total;dur=81.58
Since these lines are all reliably formatted this way, I reached for over-simplified text mashing to parse them. My goal for the extractor function is to turn these separate files of HTTP headers into a single CSV file that R can read in as a data frame. Because of that, the transformation is a bit unexpected. While we might usually think about making a separate column for each branch or test group, here each row contains a loop index as a sample sequence number, the value of the wp-total measurement, and the descriptive branch name duplicated on every row. The above example would turn into something like this, which makes it easier for me to work with inside of R.
n, value, branch
0, 618.13, "rc1"
…
1423, 81.58, "rc1"
…
1423, 85.13, "wp632"
1425, 96.11, "wp632"
…
0, 534.09, "some-other-test-branch"
…
Like most research-related code, my extraction script is embarrassing. It’s unclear, unstructured, and ad hoc. It’s the result of exploring data and figuring out what I need before writing the code to do it, and it works well enough while being small enough that I could rewrite it more robustly should the need arise. It started as individual lines in the node shell that I got tired of repeating manually.
import fs from 'fs';

// Pull the wp-total duration out of every line of a results file.
const read = infile =>
    fs.readFileSync( infile, 'utf8' )
        .trim()
        .split('\n')
        .map(l => parseFloat(l.split('wp-total;dur=')[1].trim()));

// Combine the samples from every group into a single CSV that R can read as a data frame.
const collect = ( branches, outfile ) =>
    fs.writeFileSync(
        outfile,
        'n,value,branch\n' +
        branches.map(
            ([name, sha]) =>
                read( `perf-${sha}.txt` )
                    .map((v, i) => `${i},${v},${name}`)
                    .join('\n')
        ).join('\n')
    );

collect(
    [
        // These are the descriptive name and input filename pairs.
        // Sometimes the values on the right are git SHAs or branch names.
        ["wp632", "wp632"],
        ["rc1", "rc1"],
        ["no-block-pattern-cache", "no-block-pattern-cache"],
        ["no-block-template-folders", "no-block-template-folders"],
        ["no-file-exists", "no-file-exists"]
    ],
    'perfData-rc1-improvements-compare-tt1.csv'
);
Analyze the data in R
As with all statistical research work, the most important and most challenging part is interpreting the results of a study. Statistics can tell us if our theories are unsupported by evidence, but they can’t actually tell us that the evidence supports our claims, only that it might be consistent with what we’d expect if our claims were true.
View the raw data
Therefore, the most important step in analyzing these profile results is to plot the raw data and look for any oddities we didn’t anticipate.
library(ggplot2)
data <- read.csv( "./perfData-some-test-run.csv" )
ggplot(data, aes(x=n, y=value, fill=branch)) + geom_point(aes(color=branch))
Let’s examine a few of these oddities and reinforce how important this initial period of discovery can be to ensure that we don’t draw conclusions from accidental bias or testing methodology failures.
This first example highlights what I mentioned above about what led me to start batching results. In this case I ran 3500 samples from each branch, sampling all 3500 times for the first test group, then all of them for the next test group, and then finally all of them for the last test group. Although each test group experiences periods of unexpectedly fast response times, the red group appears to experience it more.
Changing to batching makes it clear in the raw data graph that although we still see the nondeterministic effects of the server environment, those effects impact each test group approximately equally. It’s precisely because the server environment changes on a timescale much slower than each individual test that the moment at which a test runs has an outsized impact on its result.
A second example illustrates another reason that viewing the raw data is so important. During one experimental run there was a period in which requests completed in one tenth of the time that they normally did, and that’s strikingly obvious here. These fast requests were clearly wrong, but I don’t know why because I didn’t also save the HTML responses with each sample. I assumed the site was possibly returning an error page for an unexpected reason, or that it was perhaps related to WP-Cron – this was before I started disabling WP-Cron.
For data like this I generally throw it out. I may restart the experiment or I may cut that data out of my analysis. R makes this easy to do, specifically with the dplyr library, which can be used anywhere a data frame is used.
> library(dplyr)
> ggplot(filter(data, value > 200), …) + geom_point(…)
> sd(filter(data, branch == 'rc1')$value)
[1] 19.97622
It is sometimes useful to compare the data in different ways. For one experiment I was curious about what I saw, so I plotted the samples after sorting them by value. The ggplot2 library also provides easy faceting to show data next to each other instead of on top of each other. It’s a wonderful library that makes for quick interactive plotting with high-quality plots.
> ggplot(data, …) + geom_point(…) + scale_y_log10() + facet_wrap( branch ~ . )
There’s also facet_grid() for more directly arranging plots in rows and columns.
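For example, something along these lines should stack one row of the plot per test group, assuming the same data frame as above:
> ggplot(data, aes(x=n, y=value)) + geom_point(aes(color=branch)) + scale_y_log10() + facet_grid(branch ~ .)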
These plots show a clear bimodal distribution of requests. Most requests are fast, taking less than 100ms to complete, but a steady stream of requests takes around 500ms. In this experiment, those slow requests were the very first requests to hit the server after switching branches and resetting the database. Exploring the data like this led me to understand that there are at least two kinds of requests, and I’ve been calling them primed and un-primed requests. I don’t know why the first request takes roughly ten times as long to respond, but it’s evident that something is causing it.
There’s another interesting spike in this data for the rc1 and wp632 plots that doesn’t appear in the other test groups. I didn’t have an explanation for it, but it’s worth pointing out that when we explore raw data and something stands out, it doesn’t always mean that the testing environment has introduced bias. It could be that there was some transient server spike that slowed down the requests, or it could be that something triggered a behavior that isn’t triggered in the other test groups, perhaps the caching or recaching of some asset or computation.
With everything I’ve learned about WordPress performance in the past few weeks, I have grown skeptical of benchmark results that are sampled across short periods of time. Viewing the raw data like this has demonstrated on multiple occasions that abnormalities sometimes occur on timescales of hours, and unless the testing has captured enough data to span multiple hours, there’s a real risk that such an abnormality unduly influences the results.
Compare distributions with statistical methods
I’ll admit that I take a lazy approach at this stage because I find the ggstatsplot library too convenient. It’s lazy because I’m leaning on numbers I don’t fully understand, but the ggbetweenstats() function does a number of things really well, so I use it.
- It generates violin plots of each test group so that the shape and distribution of each group is evident when comparing aggregate measures like the average or median. Highly skewed or multi-modal distributions can render those aggregate measures useless: as they say, with one hand on fire and the other frozen, on average one’s hands are comfortable.
- It performs statistical tests between the groups and reports the p level at which those results are significant. It provides a selectable set of central measures and tests based on the characteristics of the data. For example, in these tests I instruct it not to assume that the variance between the groups is the same, because I don’t believe that should be the case since different code could lead to wildly different variance outcomes.
- It plots the raw data in a way that illustrates outliers and provides convenient at-a-glance comparison between test groups. I often rely on the raw dots to judge if I think the statistical results are dubious; for example, I may see a cluster forming and want to run the test longer and collect more samples, or try to account for why the cluster appeared.
The work we did early in identifying the two kinds of requests is important now because we don’t want that clearly distinct set of slow requests to interfere with the results of the fast ones. It’s possible that they take a different code path with different characteristics, and in fact the evidence here suggests this to be the case.
library(ggstatsplot)
ggbetweenstats(data, x=branch, y=value, var.equal=FALSE)
This plot demonstrates everything we’ve just discussed. Notice how the violin plots and the dots show that there are clearly two different modes of operation captured in this combined plot and notice how much that upper mode has shifted the mean/average value above the median. Because of this we will want to discard this particular study.
Before we discard it though, we can observe the p value reported between the groups. All three distributions show wide overlap, but they also contain 50,000 samples each. Statistical analysis helps us determine whether it’s likely that there’s a real difference at play or not, but in this case we have a very low-power test because of the small measured difference and large overlap. That p value is a quick estimate of whether the difference we see between the groups in these aggregate values is something we might expect to measure anyway if we ran the same code in each branch and nothing but the normal variation in the time it takes to complete a test caused the difference.
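As a rough cross-check on the numbers the plot reports, a plain Welch two-sample t-test between any two groups produces the same kind of p value; this is only a sketch, assuming the same data frame and branch labels used above.
> # Compare two of the groups directly; t.test uses the Welch (unequal variance) form by default.
> t.test(value ~ branch, data = filter(data, branch %in% c('rc1', 'wp632')))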
For the two branches on the left, we would say that the difference is significant at a p value of 0.0081; for the two branches on the right, at a p value of 0.00013. The reason I drew out this broken study is that it’s the only one I had available with ready comparisons like this. We get to choose what significance level we care about, and the lower the p value we choose, the more demanding we are that random variation shouldn’t be able to account for our measured difference. Typically p = 0.05 and p = 0.001 are chosen for basic studies. At p = 0.001 the data suggests that we do not have evidence to believe that there’s an inherent difference in the measured times between the left two branches, but maybe there is between the two on the right. This doesn’t tell us that the code made the difference; it only suggests that some difference might actually exist, whereas on the left we shouldn’t conclude that the two are different at all.
With many samples and low testing power, a p value can demonstrate that a difference is significant, meaning that it’s unlikely that random variation can account for the difference we’re seeing; but it won’t tell us if the change matters. If there were a 250µs difference between the branches on the right, would anyone notice? In contrast, a high-power test with a massive change in the measured values can be detected more easily with fewer samples. We have to keep this kind of proportionality in mind when analyzing and interpreting our results. Speaking of which, let’s discard these results and look at a split study.
ggbetweenstats(filter(data, value < 200), x=branch, y=value, var.equal=FALSE)
We’ve already examined the structural elements of the plot, but we can observe here that even though the estimated true average response times in these test groups are very close together, the plots show us that the rc1 and wp632 branches skew upward, and it looks like their slower requests are worse than the slower requests served by the three branches on the left.
What’s fascinating is that the three test groups on the left all represent the removal of an optimization that was added to speed up page renders. For these specific requests in this specific experiment on this specific hardware the page not only rendered faster without the optimizations, but the slowest requests without the optimizations took less than 65% of the time taken to deliver the slowest requests with the optimizations.
Were the optimizations defective? Unfortunately we don’t know from this data. Remember the unexplained spikes we saw in the plot of the raw data above? Almost all of this upper tail of the response-time distribution occurred in a single short window of time and then for the most part didn’t reappear. Our results gave us a new question to ask and investigate. We could rerun the experiment and see if that spike reappears. Maybe it would appear in the other branches, but if it only ever appears in rc1 and in wp632 then I might believe that the optimizations carried an unintended consequence that makes one metric I care about worse without making other metrics any better.
We must also always remember to ask if this matters. Do the optimizations stem from doing less on the server, such as avoiding computation, or are they additive, such as implementing a complicated algorithm to more efficiently compute a value, or caching something that took a long time to compute? Will the cost of understanding the optimization, maintaining it, and fixing what it inevitably breaks outweigh the improvement we’re able to measure? Sometimes the answer is yes, sometimes it’s no, and sometimes we simply don’t have enough evidence to draw a conclusion one way or the other.
How about the slow requests?
ggbetweenstats(filter(data, value > 200), x=branch, y=value, var.equal=FALSE)
We have fewer samples from these, but based on the scattering of the data we see an overall estimate of significance at a p = 0.19 level. This isn’t significant enough to support any conclusions whatsoever. If all five of these test groups were running the same identical code and setup, then based on the variation in the collected data, we’d expect to see differences between the groups at least this big one in every five times we run the experiment. At p = 0.001 we only expect random variation to account for the measured difference in one out of every thousand experiments. Any impact we measure here, even with 100 test runs/samples, is probably an accident based on when we ran the test.
Final Thoughts
We can’t repeat it enough: collecting numbers is the easy part. The hard part about statistics and profiling and benchmarking is in interpreting the results.
- Profilers and tools like XDebug are useful for understanding where our code is spending its time in a given request, but they interfere with the runtime and don’t attempt to speak to the impact that a change makes on a given request. They can be useful in finding out why some request is slow, but can have a hard time providing the necessary context to explain the question behind the question. We can discover that we eliminated runtime in one spot in our code only to find it reappear somewhere else. This happened to me recently when I explored one oddity that appeared during profiling. XDebug even reported that after my change the request completed about 12% faster than in trunk, but when I ran realistic benchmarks against the site with and without the change, no such impact could be found. What I think happened is that there’s a particular way that the code in trunk interacts with XDebug that causes it to slow down in accounting work as the code runs, and so what I did was make the code run faster in XDebug without making it run faster in a real production environment.
- When lacking sufficient evidence to make a claim, it’s important that we don’t lean on that lack of evidence as support. Our intuition can play a role, and we may understand that there are situations that are hard to capture in a benchmarking environment where a given change should have an impact. It’s okay to go out on a limb and trust our instincts, but that is a different situation than claiming that the measured impact of our change is meaningful. We might have accidentally hit one of those 10x vs. 1x response times when we ran our tests.
- Because of the ambiguity of lacking evidence, I personally always prefer going faster by doing less over adding more layers to combat the slowdown caused by the introduction of other layers. In another recent example I discovered that WordPress starts iterating over an array to determine if something in that array should trigger another process. Unfortunately, once it discovered that it should trigger that process, it continued iterating over the rest of the array for no reason. Removing overhead like this is a trivial matter, and should always be faster.
- Modern processors and computers can be incredibly complicated and surprising performance-wise. Most of our traditional conceptions of what makes something fast or slow are based on outdated information or theory that doesn’t match reality. The best way to know if a change carries a performance impact is to reproduce as closely as possible the target environment and run the code we want to test. Out-of-order execution, long compute pipelines, layers of caching, and speculative computation can dominate the performance characteristics of our code, and pulling code out of its context can lead to meaningless results. We need to be diligent about testing and measuring and asking if our results say what we think they do.
- Studies like the ones in this post can be just as misleading and devoid-of-context as micro and synthetic benchmarks can be. Almost no WordPress site runs on a powerful server with no load and which delivers the front page of the site over and over and over again and nothing else. The site being tested in this post isn’t even serving out static assets like JavaScript files, CSS, or images. Those different requests on a real server will be fighting for cache, processor, database, and RAM resources, not to mention that most WordPress sites on the internet are running on large and often oversubscribed shared hosting.
- Because my tests run MySQL on the same machine, even though they have eliminated that as a source of test unreliability, they could be missing the fact that on common WordPress installations the database latency dominates page render time. We developers tend to run high-speed, high-quality gear with fast internet connections, and when we examine performance characteristics on that high-end hardware we can miss obvious bottlenecks that our customers experience.
- Finally it’s worth pointing out that almost all of these raw performance comparisons have limited meaning. Most WordPress sites run behind a static page cache on the web server and that single simple approach bypasses everything that runs when rendering a page.
Measuring performance is hard, which is why it demands persistence, inquisitiveness, and an expectation of surprise. The approach I’ve shared in this post represents how I’ve been trying to capture the impact of changes to WordPress code in the 6.4 release on the performance of rendering the home page, mostly with the twentytwentyfour theme, and within this limited scope I have found the approach repeatable and transferable across various devices. This reliability and transferability makes it easier to capture insight that would otherwise be hidden by the testing environment, whether related to some element of time or to the performance characteristics of the hardware on which the tests run.
There are still many ideas I would like to implement to further eliminate the impact of this environmental bias and to identify different clusters in the data, so that we can get a better understanding of what kinds of situations lead to which performance domains. I still have a day job, so I’ll be taking it one step at a time, but I would love to eventually have an automated system with a “pristine” database containing a more complete site, one that runs requests from a browser that downloads a page’s linked resources, runs the tests while other sites on the same machine are being served, runs tests against several different kinds of pages (a blog post, the home page, a contact page, a category list, etc.), and reports the findings on the GitHub PRs or Trac tickets where changes are being proposed. I think that would be a really valuable tool.