Conditional logic in a regular expression – Conditional Capture Group
I have been working on making WP Stagecoach work reliably with sites that are encrypted with TLS (that is, https://), and I wanted a single line of code that could rewrite all the different forms of URL that may be in the database.
Let’s say we want to create a staging site from a live site running on https://example.com with the following local URLs stored in its database:
https://example.com
https://example.com/foo
http://example.com/
http://example.com/bar/
example.com/?p=123
example.com
These are all URLs which may be stored in the database of a TLS-encrypted site.
When WP Stagecoach creates the staging site, it needs to change all of these to be non-encrypted (as currently WP Stagecoach does not support encrypted staging sites):
http://example.wpstagecoach.com
http://example.wpstagecoach.com/foo
http://example.wpstagecoach.com/
http://example.wpstagecoach.com/bar/
example.wpstagecoach.com/?p=123
example.wpstagecoach.com
So I found that a pretty simple regular expression with a capture group got me really close to what I needed to do:
$search = 'example.com';
$replace = 'example.wpstagecoach.com';
foreach ($urls as $key => $url) {
    // '$1' is a backreference to whatever the first capture group matched
    $urls[$key] = preg_replace('#(https?://)?'.$search.'#i', '$1'.$replace, $url);
}
So, some groundwork here. Regular expressions are an incredibly powerful, but often hard to use, way of matching (and replacing) parts of a string. In this case, we are searching for elements of the database that contain “example.com”, and we want to replace that text with “example.wpstagecoach.com”.
preg_replace($pattern_we_want_to_match, $replacement_text, $string_to_check_for_match_in)
The PHP function preg_replace() takes three arguments: the pattern we want to match, the string that replaces whatever is matched, and the string to search. The first argument here looks a little odd: '#(https?://)?'.$search.'#i'. The '#' characters are used as delimiters. Typically a '/' is used, but as we are trying to match “//”, it is cleaner to use a different delimiter than to escape every slash. The 'i' after the closing delimiter makes the match case-insensitive (e.g., Example.com or example.COM).
Looking inside the delimiters, we notice two main elements: “(https?://)?” and $search (which in this case is “example.com”). The latter is pretty self-explanatory: it will attempt to match database entries that have “example.com” in them. The former takes a little more effort.
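One caveat about the “self-explanatory” half: the dot in “example.com” is itself a regular expression metacharacter that matches any single character. The following is a minimal sketch, not part of the original code, that assumes you want the domain matched literally and uses PHP's preg_quote() to escape it. The look-alike hostname is made up for illustration.

```php
<?php
$search = 'example.com';

// Pattern as built in the article: the "." in $search matches any character.
$unquoted = '#(https?://)?' . $search . '#i';

// Assumed variant: preg_quote() escapes regex metacharacters in $search.
// The second argument also escapes our "#" delimiter if it ever appears.
$quoted = '#(https?://)?' . preg_quote($search, '#') . '#i';

// Hypothetical look-alike string that is NOT the real domain.
$lookalike = 'http://exampleXcom/';

$results = [
    'unquoted' => preg_match($unquoted, $lookalike), // 1: matched by accident
    'quoted'   => preg_match($quoted, $lookalike),   // 0: no match
];
print_r($results);
```

For a typical site the difference rarely matters, but quoting the search string is a cheap safeguard when the domain comes from user input.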
There are a couple of question marks in the first element. Each one marks the sub-element just before it as optional. On the inside, the literal character ‘s’ is followed by ‘?’, so “https?://” will match both “http://” and “https://”. The whole group “(https?://)” is also followed by a ‘?’, so while we must match “example.com”, we don’t have to have “http://” or “https://” in front of it for the match to succeed.
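To see the two optional pieces in action, here is a small sketch that runs the same pattern through preg_match() and reports what the group captured. The helper name captured_scheme() is hypothetical, and the domain is hard-coded purely for illustration.

```php
<?php
// Hypothetical helper: returns the scheme captured by group 1,
// or an empty string when the URL has no scheme in front of the domain.
function captured_scheme(string $url): string
{
    // Same pattern as in the article, with the search string inlined.
    if (preg_match('#(https?://)?example.com#i', $url, $match) !== 1) {
        return '(no match)';
    }
    // When the optional group did not take part in the match,
    // $match[1] may be absent, so fall back to an empty string.
    return $match[1] ?? '';
}

$schemes = [];
foreach (['https://example.com/foo', 'http://example.com/', 'example.com/?p=123'] as $url) {
    $schemes[$url] = captured_scheme($url);
}
print_r($schemes);
```

All three URLs match; only the captured text differs: “https://”, “http://”, and an empty string respectively.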
The parentheses in (https?://) mark what is called a capture group, that is, something we can reference later on. In this simpler case, we just prepend whatever the group captured to the replacement string by putting the backreference “$1” in front of “$replace” in the function’s second argument.
However, there is a problem: the capture group will match either “http://” or “https://” and prepend exactly what it matched to “example.wpstagecoach.com”, whereas we want to prepend “http://” whenever either one is present in the database.
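The problem is easy to reproduce. The sketch below runs the simple preg_replace() approach over three of the sample URLs, using the “$1” backreference form; the variable $rewritten is just a name chosen for this illustration.

```php
<?php
$search  = 'example.com';
$replace = 'example.wpstagecoach.com';
$urls    = ['https://example.com/foo', 'http://example.com/bar/', 'example.com'];

$rewritten = [];
foreach ($urls as $url) {
    // '$1' re-inserts whatever scheme the capture group matched, unchanged.
    $rewritten[] = preg_replace('#(https?://)?' . $search . '#i', '$1' . $replace, $url);
}
print_r($rewritten);
// The first entry comes out as https://example.wpstagecoach.com/foo,
// still carrying the "https://" prefix we wanted to downgrade.
```

The domain is rewritten correctly in every case, but the scheme is copied through as-is, which is exactly what we need to avoid.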
So we need a slightly different PHP function, one that lets us call a function to create the replacement text. This PHP function is called “preg_replace_callback()”. It also takes three arguments; the only difference from “preg_replace()” is that the second argument is a function that calculates what the replacement text should be.
preg_replace_callback($pattern_we_want_to_match, function($match) {return $replacement_text;}, $string_to_check_for_match_in)
The first and third arguments are the same as before. The second is where we use the function to build the replacement text based on whether the database entry has “http://” or “https://” in it.
preg_replace_callback(
    '#(https?://)?'.$search.'#i',
    function($match) use($replace) {
        return empty($match[1]) ? $replace : 'http://'.$replace;
    },
    $url
);
The callback receives the “$match” variable, an array describing the match: “$match[0]” holds the entire matched text, and the capture groups follow, numbered from 1 just like the backreferences. We only have one group here, so inside the function we use PHP’s ternary shorthand to check whether “$match[1]” is empty. If it is, the URL had no scheme and we return only “$replace”; if it is not, we return “‘http://’.$replace”. Of note, the “use()” clause is needed if you want the function to have access to variables from the enclosing scope, such as “$replace”.
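Putting it all together, here is a self-contained sketch that runs the callback version over the six sample URLs from the start of this post. The preg_quote() call is my own assumption (it makes the dot in the domain match literally) and can be dropped to mirror the snippet above exactly.

```php
<?php
$search  = 'example.com';
$replace = 'example.wpstagecoach.com';

$urls = [
    'https://example.com',
    'https://example.com/foo',
    'http://example.com/',
    'http://example.com/bar/',
    'example.com/?p=123',
    'example.com',
];

foreach ($urls as $key => $url) {
    $urls[$key] = preg_replace_callback(
        // Assumption: preg_quote() escapes the "." so it only matches a literal dot.
        '#(https?://)?' . preg_quote($search, '#') . '#i',
        function ($match) use ($replace) {
            // empty() is also safe when $match[1] is missing entirely,
            // which can happen when the optional group did not match.
            return empty($match[1]) ? $replace : 'http://' . $replace;
        },
        $url
    );
}

print_r($urls);
```

The resulting array matches the list of staging URLs shown earlier: every entry that had a scheme now starts with “http://”, and the two scheme-less entries stay scheme-less.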