preg_match, UTF-8 and whitespace

Just a quick note, be careful when using the whitespace character \s in preg_match when operating with UTF-8 strings.

Suppose you have a string containing a dagger symbol. When you try to strip all whitespace from the string like this, you will end up with an invalid UTF-8 character:

$ php -r 'echo preg_replace("#\s#", "", "?");' | xxd
0000000: e280

(On a side note: xxd displays all bytes in hexadecimal representation. The resulting string here consists of two bytes e2 and 80)

\s stripped away the a0 byte. I was unaware that this character was included in the whitespace list, but actually it represents the non-breaking space.

So actually use the u (PCRE8) modifier as it will be aware of the a0 “belonging” to the dagger:

$ php -r 'echo preg_replace("#\s#u", "", "?");' | xxd
0000000: e280 a0

By the way, trim() doesn’t strip non-breaking spaces and can therefore safely be used for UTF-8 strings. (If you still want to trim non-breaking spaces with trim, read this comment on PHP.net)

Finally here you can see the ASCII characters matched by \s when using the u modifier.

$ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#", "", chr($i));' | xxd 0000000: 090a 0c0d 2085 a0 $ php -r '$i = 0; while (++$i < 256) echo preg_replace("#[^\s]#u", "", chr($i));' | xxd 0000000: 090a 0c0d 20

Functions operating just on the ASCII characters (with a byte code below 128) are generally safe, as the multi-byte characters of UTF-8 have a leading bit of one (and are therefore above 128).

Restoring single objects in mongodb

Today I had the need to restore single objects from a mongodb installation. mongodb offers two tools for this mongodump and mongorestore, both of which seem to be designed to only dump and restore whole collections.

So I’ll demonstrate the workflow just to restore a bunch of objects. Maybe it’s a clumsy way, but we’ll improve this over time.

So, we have an existing backup, done with mongodump (for example through a daily, over-night backup). This consists of several .bson files, one for each collection.

  1. Restore a whole collection to a new database: mongorestore -d newdb collection.bson
  2. Open this database: mongo newdb
  3. Find the items you want to restore through a query, for example: db.collection.find({"_id": {"$gte": ObjectId("4da4231c747359d16c370000")}});
  4. Back on the command line again, just dump these lines to a new bson file: mongodump -d newdb -c collection -q '{"_id": {"$gte": ObjectId("4da4231c747359d16c370000")}}'
  5. Now you can finally import just those objects into your existing collection: mongorestore -d realdb collection.bson