Just a quick note: be careful when using the whitespace character class \s in preg_match (or any other preg_* function) when operating on UTF-8 strings.
Suppose you have a string containing a dagger symbol (†, U+2020, encoded in UTF-8 as the three bytes e2 80 a0). When you try to strip all whitespace from the string like this, you end up with an invalid UTF-8 sequence:
$ php -r 'echo preg_replace("#\s#", "", "†");' | xxd
0000000: e280
(On a side note: xxd displays all bytes in hexadecimal representation. The resulting string here consists of the two bytes e2 and 80.)
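\s stripped away the a0 byte. I was unaware that this byte was included in the whitespace list, but in single-byte encodings such as ISO-8859-1 it represents the non-breaking space. (Whether a0 counts as whitespace without the u modifier seems to depend on the character tables PCRE uses, that is, on the PCRE version and locale, so you may not see this on every system.)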
So use the u (PCRE_UTF8) modifier: it makes PCRE treat the pattern and the subject as UTF-8, so it is aware of the a0 “belonging” to the dagger:
$ php -r 'echo preg_replace("#\s#u", "", "†");' | xxd
0000000: e280 a0
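The same idea can be wrapped in a small helper. This is only a minimal sketch: strip_whitespace_utf8 is a name I made up for illustration, and the dagger is written as raw bytes so the example does not depend on the encoding of the source file.

```php
<?php
// Hypothetical helper (not part of PHP): remove whitespace from a UTF-8
// string without cutting multi-byte characters apart.
function strip_whitespace_utf8(string $s): string
{
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/\s+/u', '', $s);
    if ($result === null) {
        // preg_replace() returns null on error, e.g. for invalid UTF-8 input.
        throw new InvalidArgumentException('preg_replace failed (invalid UTF-8?)');
    }
    return $result;
}

$dagger = "\xE2\x80\xA0"; // U+2020 DAGGER as raw UTF-8 bytes
echo bin2hex(strip_whitespace_utf8("a {$dagger}\tb")), "\n"; // 61e280a062
```

Checking for null matters here, because with the u modifier preg_replace refuses to work on input that is not valid UTF-8 instead of silently returning the string.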
By the way, trim() doesn’t strip non-breaking spaces by default and can therefore safely be used for UTF-8 strings. (If you still want to trim non-breaking spaces with trim, read this comment on PHP.net.)
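If you would rather stay with a regular expression, something along these lines should work. It is a sketch under my own assumptions, not the approach from the PHP.net comment; trim_with_nbsp is a hypothetical name.

```php
<?php
// Hypothetical helper: trim ordinary whitespace and U+00A0 NO-BREAK SPACE
// from both ends of a UTF-8 string.
function trim_with_nbsp(string $s): string
{
    // With the u modifier, \x{00A0} matches the two-byte sequence c2 a0,
    // never a lone a0 byte inside another character.
    $result = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $s);
    if ($result === null) {
        throw new InvalidArgumentException('preg_replace failed (invalid UTF-8?)');
    }
    return $result;
}

$nbsp   = "\xC2\xA0";     // U+00A0 as raw UTF-8 bytes
$dagger = "\xE2\x80\xA0"; // U+2020 as raw UTF-8 bytes
echo bin2hex(trim_with_nbsp("{$nbsp} {$dagger}{$nbsp}")), "\n"; // e280a0
```

The dagger keeps its trailing a0 byte because the pattern works on whole characters, not on bytes.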
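Finally, here you can see which single-byte characters are matched by \s, first without and then with the u modifier. With the u modifier, a byte of 128 or above is not valid UTF-8 on its own, so preg_replace returns NULL for it and nothing is printed; only the ASCII whitespace characters remain.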
$ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#", "", chr($i));' | xxd
0000000: 090a 0c0d 2085 a0
$ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#u", "", chr($i));' | xxd
0000000: 090a 0c0d 20
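The bytes 85 and a0 are missing from the second run because preg_replace gives up on them rather than returning a string. A short check, as far as I can tell, makes this visible:

```php
<?php
// A lone byte >= 0x80 is not valid UTF-8, so with the u modifier
// preg_replace() fails and returns null instead of a string.
$result = preg_replace('#[^\s]#u', '', chr(0xA0));
$error  = preg_last_error();

var_dump($result === null);              // bool(true)
var_dump($error === PREG_BAD_UTF8_ERROR); // bool(true)
```

Echoing null prints nothing, which is why those bytes never reach xxd.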
Functions operating just on ASCII characters (byte values below 128) are generally safe, as every byte of a multi-byte UTF-8 character has its leading bit set (and is therefore 128 or above).
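To illustrate that last point, here is a small sketch; all_bytes_high is a hypothetical helper written for this note, not a PHP function.

```php
<?php
// Hypothetical helper: true if every byte of $s has its high bit set,
// i.e. no byte could be mistaken for an ASCII character.
function all_bytes_high(string $s): bool
{
    foreach (unpack('C*', $s) as $byte) {
        if ($byte < 0x80) {
            return false;
        }
    }
    return true;
}

$dagger = "\xE2\x80\xA0"; // U+2020 as raw UTF-8 bytes
var_dump(all_bytes_high($dagger)); // bool(true)

// Replacing an ASCII character byte-wise therefore cannot hit the middle
// of a multi-byte character.
echo bin2hex(str_replace(' ', '', "a {$dagger} b")), "\n"; // 61e280a062
```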