Title: preg_match, UTF-8 and whitespace
Author: Alex Kirk
Published: October 1, 2011
Last modified: October 3, 2011

---

# preg_match, UTF-8 and whitespace

October 1, 2011

Just a quick note: be careful when using the whitespace character class `\s` in `preg_match`
(or any other `preg_*` function) on UTF-8 strings.

Suppose you have a string containing a dagger symbol (†, encoded in UTF-8 as the bytes `e2 80 a0`).
When you try to strip all whitespace from the string like this, you end up with an invalid UTF-8 sequence:

    $ php -r 'echo preg_replace("#\s#", "", "†");' | xxd
    0000000: e280

(On a side note: `xxd` displays all bytes in hexadecimal representation. The resulting
string here consists of the two bytes `e2` and `80`.)

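`\s` stripped away the `a0` byte. I was unaware that this byte was included
in the whitespace list, but taken on its own (in Latin-1) it represents the **non-breaking space**.
Without the u modifier, PCRE treats the string as a sequence of single bytes, so it cannot
know that `a0` is part of a multi-byte character.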
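
The effect can be reproduced in a small script. This is a minimal sketch, not the exact command from above: whether `\s` alone matches the byte `a0` depends on the PCRE build and locale, so the byte is listed explicitly in the character class here (an assumption made to keep the result predictable). The empty pattern with the u modifier is used only as a cheap validity check.

```php
<?php
// The dagger (U+2020) is encoded in UTF-8 as the three bytes e2 80 a0.
$dagger = "\xE2\x80\xA0";

// Byte-wise matching (no u modifier). 0xA0 is listed explicitly because
// whether \s covers it depends on the PCRE build and locale.
$stripped = preg_replace('/[\s\xA0]/', '', $dagger);

echo bin2hex($dagger), "\n";   // e280a0
echo bin2hex($stripped), "\n"; // e280

// preg_match() with the u modifier returns false for invalid UTF-8 subjects.
$isValid = preg_match('//u', $stripped) === 1;
var_dump($isValid); // bool(false)
```

The last byte of the character is gone, and what remains is no longer valid UTF-8.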

So use the [u (PCRE_UTF8) modifier](http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php),
as it makes the pattern aware of the `a0` “belonging” to the dagger:

    $ php -r 'echo preg_replace("#\s#u", "", "†");' | xxd
    0000000: e280 a0
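
As a script, the same idea might look like the sketch below. The sample strings are made up for illustration. It also shows one thing to watch for: with the u modifier, a subject that is not valid UTF-8 makes `preg_replace` return `NULL` rather than a string.

```php
<?php
// "a †<TAB>b": ASCII whitespace around a multi-byte character.
$text = "a \xE2\x80\xA0\tb";

// With the u modifier the subject is matched character by character,
// so the a0 byte inside the dagger is left alone.
$clean = preg_replace('/\s/u', '', $text);
echo bin2hex($clean), "\n"; // 61e280a062

// Invalid UTF-8 input is rejected as a whole: the result is NULL and
// preg_last_error() reports PREG_BAD_UTF8_ERROR.
$broken = preg_replace('/\s/u', '', "\xE2\x80");
var_dump($broken); // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

Checking the return value for `NULL` is worth the extra line whenever the input might not be clean UTF-8.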

By the way, `trim()` doesn’t strip non-breaking spaces by default and can therefore safely
be used for UTF-8 strings. (If you still want to trim non-breaking spaces with `trim`,
[read this comment on PHP.net](http://php.net/manual/en/function.trim.php#98812).)
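
A short sketch of the `trim()` behaviour, plus one possible way to trim non-breaking spaces with a UTF-8-aware pattern. The helper name `trimUnicodeSpace` is hypothetical and this is not necessarily the approach from the linked comment. Note that the character list passed to `trim()` is a list of bytes, which is why passing the two bytes of a UTF-8 non-breaking space can damage other characters.

```php
<?php
// A UTF-8 non-breaking space (U+00A0) is the two bytes c2 a0.
$padded = "\xC2\xA0hello\xC2\xA0";

// Default trim() only strips ASCII whitespace and NUL, so nothing changes.
echo bin2hex(trim($padded)), "\n"; // c2a068656c6c6fc2a0

// Byte-based character list: the trailing a0 of the dagger gets stripped.
$damaged = trim("x\xE2\x80\xA0", "\xC2\xA0");
echo bin2hex($damaged), "\n"; // 78e280

// Hypothetical helper: trims whitespace and U+00A0 character by character.
// Returns null if the input is not valid UTF-8.
function trimUnicodeSpace(string $s): ?string
{
    return preg_replace('/^[\s\x{A0}]+|[\s\x{A0}]+$/u', '', $s);
}

echo trimUnicodeSpace("\xC2\xA0 hello\xC2\xA0"), "\n"; // hello
```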

Finally, here you can see the single-byte characters matched by `\s`, first without and then with the u modifier:

    $ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#", "", chr($i));' | xxd
    0000000: 090a 0c0d 2085 a0
    $ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#u", "", chr($i));' | xxd
    0000000: 090a 0c0d 20
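
The same survey can be written as a small script. This is a sketch with a hypothetical helper name (`whitespaceBytes`); the exact list it prints depends on the PHP and PCRE version and on the locale, so no output is shown. What does hold in general is that with the u modifier a lone byte of 128 or above is not valid UTF-8, so the match fails for it and it never shows up in the list.

```php
<?php
// Hypothetical helper: returns the hex codes of all single bytes that \s
// matches when the pattern is compiled with the given modifiers.
function whitespaceBytes(string $modifiers): array
{
    $matched = [];
    for ($i = 1; $i < 256; $i++) {
        // preg_match() returns 1 on a match, 0 on no match, and false on
        // error (for example invalid UTF-8 when the u modifier is set).
        if (preg_match('/\s/' . $modifiers, chr($i)) === 1) {
            $matched[] = bin2hex(chr($i));
        }
    }
    return $matched;
}

echo 'without u: ', implode(' ', whitespaceBytes('')), "\n";
echo 'with u:    ', implode(' ', whitespaceBytes('u')), "\n";
```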

Functions that operate only on ASCII characters (bytes with a value below 128) are
generally safe, as every byte of a multi-byte UTF-8 character has its leading bit set to one
(and is therefore 128 or above).
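
To illustrate, the following sketch prints the bytes of two multi-byte characters in hexadecimal and binary, then removes an ASCII character with the byte-oriented `str_replace`. The sample strings are chosen for illustration only.

```php
<?php
// "†é": a three-byte and a two-byte UTF-8 character.
$multibyte = "\xE2\x80\xA0\xC3\xA9";

$allHigh = true;
foreach (str_split($multibyte) as $byte) {
    // Every line shows a leading 1 bit in the binary column.
    printf("%02x %08b\n", ord($byte), ord($byte));
    if (ord($byte) < 0x80) {
        $allHigh = false;
    }
}
var_dump($allHigh); // bool(true)

// An ASCII byte such as the space (0x20) can never occur inside a
// multi-byte sequence, so removing it byte-wise leaves the dagger intact.
$joined = str_replace(' ', '', "a \xE2\x80\xA0 b");
echo bin2hex($joined), "\n"; // 61e280a062
```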

