Title: PHP and Multibyte
Author: Alex Kirk
Published: April 5, 2005
Last modified: October 4, 2005

---

# PHP and Multibyte

April 5, 2005

ever messed around with umlauts or other non [a-z] letters? it’s quite horrible.

for the german speaking region there are mainly two encoding types: iso8859-1 and
utf-8. the former encodes each letter with one byte by extending old 7-bit ascii
with the byte values 0x80-0xff, which hold, amongst others, the umlauts. utf-8 can
encode every unicode character, which it does by allowing multi-byte characters:
ascii stays at one byte, and everything else becomes a sequence of two to four
bytes, all of them in the range 0x80-0xff (the first byte tells how many bytes
follow). an umlaut takes two bytes in utf-8. there also exist utf-16 with two or
four bytes per char and utf-32 with always four bytes per char.
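to see the byte counts for yourself, here is a minimal sketch. `byte_length` is a hypothetical helper made up for this example, and the umlaut is written as explicit bytes so that the encoding of the script file does not matter:

```php
<?php
// "ü" (U+00FC) as explicit utf-8 bytes
$umlaut = "\xC3\xBC";

// hypothetical helper: byte length of a utf-8 string after converting it to $encoding
function byte_length(string $utf8, string $encoding): int
{
    return strlen(mb_convert_encoding($utf8, $encoding, 'UTF-8'));
}

// prints 1, 2, 2 and 4 bytes for the four encodings
foreach (['ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-32BE'] as $encoding) {
    printf("%-10s %d byte(s)\n", $encoding, byte_length($umlaut, $encoding));
}
```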

so, what’s the problem? with bandnews we have different sources for our data, meaning
that we receive many pages with many different encodings and have to deliver a page
that follows only one encoding. we chose to use utf-8 now, because it can display
a wide range of letters from many other encodings that are not included in
iso8859-1.
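how the conversion is done in bandnews is not shown here, so take this as a rough sketch under assumptions: it uses [mb_convert_encoding](http://php.net/mb_convert_encoding), `to_utf8` is a hypothetical helper, and the source encoding is assumed to be known already (e.g. from the http header of the fetched page).

```php
<?php
// hypothetical helper, not taken from the bandnews code
function to_utf8(string $text, string $sourceEncoding): string
{
    // already utf-8: nothing to convert
    if (strcasecmp($sourceEncoding, 'UTF-8') === 0) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
}

$latin1 = "M\xFCller";                     // "Müller" in iso8859-1, the ü is one byte
$utf8   = to_utf8($latin1, 'ISO-8859-1');  // the ü becomes the two bytes c3 bc
echo bin2hex($utf8), "\n";                 // 4dc3bc6c6c6572
```

if the source encoding is wrong or unknown, the result will be garbage, so it is worth storing the encoding together with each source.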

now it is important that you stop using [strlen](http://php.net/strlen) and [substr](http://php.net/substr),
because both count bytes, not characters. it can easily happen that you split a
utf-8 character into parts, and then you can forget about comparing it to anything.
alternatives are [mb_strlen](http://php.net/mb_strlen)
and [mb_substr](http://php.net/mb_substr) and all the other [mb_*](http://php.net/manual/en/ref.mbstring.php)
functions. well… this does not work out of the box, you need to specify what encoding
is to be expected. this can be done like this:

`mb_internal_encoding("UTF-8");`

all mb_* commands use this encoding if no other is specified.
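a small sketch of the difference, with "Müller" as a made-up example value (written as explicit utf-8 bytes, so it does not depend on the encoding of the script file):

```php
<?php
mb_internal_encoding('UTF-8');

$name = "M\xC3\xBCller";            // "Müller" in utf-8, the ü takes two bytes

$bytes = strlen($name);             // 7, counts bytes
$chars = mb_strlen($name);          // 6, counts characters

$broken = substr($name, 0, 2);      // "M" plus only the first byte of the ü
$intact = mb_substr($name, 0, 2);   // "Mü"

var_dump(mb_check_encoding($broken, 'UTF-8'));  // bool(false)
var_dump(mb_check_encoding($intact, 'UTF-8'));  // bool(true)
```

the string cut by substr is no longer valid utf-8, which is exactly the kind of thing that later shows up as a broken character in the browser.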

still, non-utf-8 code can come through to the browser, e.g. if you receive it from
the database. but there is a chance to get around this quite comfortably:

`mb_http_output("UTF-8"); ob_start("mb_output_handler");`

the output buffer is cleared of wrong characters by the mb_output_handler, which
converts it to the given encoding. it is also easily possible to have the output
converted to iso8859-1, just by specifying it with the [mb_http_output](http://php.net/mb_http_output)
command. a drawback is, though, that no other output filter can be applied, such
as the one for output compression: `ob_start("ob_gzhandler");`
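put together as a complete script, the setup could look roughly like this. it is only a sketch: the echoed text is made up, and it assumes that all strings inside the script are utf-8.

```php
<?php
// assumption: strings inside the script are utf-8
mb_internal_encoding('UTF-8');

// the encoding the browser should receive
mb_http_output('UTF-8');

// from here on, all output passes through mb_output_handler
ob_start('mb_output_handler');

echo "gr\xC3\xBC\xC3\x9Fe aus wien\n";   // "grüße aus wien" as utf-8 bytes

// push the buffer through the handler and send the result
ob_end_flush();
```

both calls have to happen before the first output is sent, otherwise part of the page bypasses the handler.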

the manual states that instead zlib compression should be used, as specified in
the php.ini file or via [ini_set](http://php.net/ini_set):

`ini_set('zlib.output_compression', 'on'); ini_set('zlib.output_handler', 'mb_output_handler'); ob_start();`

note that the output handler for [ob_start](http://php.net/ob_start) has to be empty,
as it is moved to the config option. this sounds great, but i was not able to get it to
work. well, i must admit that i did not put so much time into it because i simply
decided to move the responsibility to apache: [mod_deflate](http://httpd.apache.org/docs-2.0/mod/mod_deflate.html).
you might want to modify the configuration line, as i did:

`AddOutputFilterByType DEFLATE text/html text/plain text/xml text/javascript text/css`

have fun with character encoding. it works after a while, but it’s a lot of trial
and error.

