preg_match, UTF-8 and whitespace

Just a quick note, be careful when using the whitespace character \s in preg_match when operating with UTF-8 strings.

Suppose you have a string containing a dagger symbol. When you try to strip all whitespace from the string like this, you will end up with an invalid UTF-8 character:

$ php -r 'echo preg_replace("#\s#", "", "?");' | xxd
0000000: e280

(On a side note: xxd displays all bytes in hexadecimal representation. The resulting string here consists of two bytes e2 and 80)

\s stripped away the a0 byte. I was unaware that this character was included in the whitespace list, but actually it represents the non-breaking space.

So actually use the u (PCRE8) modifier as it will be aware of the a0 “belonging” to the dagger:

$ php -r 'echo preg_replace("#\s#u", "", "?");' | xxd
0000000: e280 a0

By the way, trim() doesn’t strip non-breaking spaces and can therefore safely be used for UTF-8 strings. (If you still want to trim non-breaking spaces with trim, read this comment on PHP.net)

Finally here you can see the ASCII characters matched by \s when using the u modifier.

$ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#", "", chr($i));' | xxd 0000000: 090a 0c0d 2085 a0 $ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#u", "", chr($i));' | xxd 0000000: 090a 0c0d 20

Functions operating just on the ASCII characters (with a byte code below 128) are generally safe, as the multi-byte characters of UTF-8 have a leading bit of one (and are therefore above 128).

Debugging PHP on Mac OS X

[factolex]

I have been using Mac OS X as my primary operating system for a few years now, and only today I have found a very neat way to debug PHP code, like it is common for application code (i.e. stepping through code for debugging purposes).

The solution is a combination of Xdebug and MacGDBp.

macgdbp-debugger

I am using the PHP package by Marc Liyanage almost ever since I have been working on OS X, because it’s far more flexible than the PHP shipped with OS X.

Unfortunately, installing Xdebug the usual pecl install xdebug doesn’t work. But on the internetz you can find a solution to this problem.

Basically you need to download the source tarball and use the magic command CFLAGS='-arch x86_64' ./configure --enable-xdebug for configuring it. (The same works for installing APC by the way)


/usr/local/php5/php.d $ cat 50-extension-xdebug.ini
[xdebug]
zend_extension=/usr/local/php5/lib/php/extensions/no-debug-non-zts-20060613/xdebug.so

xdebug.remote_autostart=on
xdebug.remote_enable=on
xdebug.remote_handler=dbgp
xdebug.remote_mode=req
xdebug.remote_host=localhost
xdebug.remote_port=9000

Now you can use MacGDBp. There is an article on Particletree that describes the interface in a little more detail.

I really enjoy using this method to only fire up this external program, when I want to debug some PHP code, and can continue to use my small editor, so that I don’t have to switch to a huge IDE to accomplish the same.

Posted in PHP