Seitenanfang

DBD::mysql UTF-8 fallacy

Dieser Post wurde aus meiner alten WordPress-Installation importiert. Sollte es Darstellungsprobleme, falsche Links oder fehlende Bilder geben, bitte einfach hier einen Kommentar hinterlassen. Danke.


Todays world is UTF-8, I'm very sure that most software worldwide is UTF-8 today but mySQL on Perl?

I had a big annoying fight much fun with encodings lately and discovered that all our database connections return bytes instead of UTF-8 chars. Some of them run their tables in ISO-8859-1 encoding for historical reasons but the newer ones use UTF-8 and they should return UTF-8 chars. Our DSN strings look like this:

DBI:mysql:database=asia_japan;host=japan.asia.db.company.com;user=japan;password=asia;mysql_enable_utf8=1
The last argument is a driver setting which should force the internal communication between anything calling DBI and the database connection to use UTF-8. The recent encoding trouble is over now but I still couldn't believe that we don't use UTF-8 where we should (and want) and I tried to confirm the myth:
perl -MDBI -le 'my $dbh = DBI->connect("DBI:mysql:database=asia_japan;host=japan.asia.db.company.com;user=japan;password=asia;mysql_enable_utf8=1"); my $t = $dbh->selectrow_array("SELECT firstname FROM CUSTOMERS WHERE id=137"); print DBI::data_string_desc($t);'

UTF8 off, non-ASCII, 15 characters 15 bytes

Surprised? I fetched a Japanese name (and confirmed it to contain only Japanese chars before) and got "15 bytes" without UTF-8 flag.

Another try but passing the mysql_enable_utf8 flag as additional argument this time. This shouldn't make any difference because both places should be the same:

perl -MDBI -le 'my $dbh = DBI->connect("DBI:mysql:database=asia_japan;host=japan.asia.db.company.com;user=japan;password=asia",undef,undef,{mysql_enable_utf8=>1}); my $t = $dbh->selectrow_array("SELECT firstname FROM CUSTOMERS WHERE id=137"); print DBI::data_string_desc($t);'

UTF8 on, non-ASCII, 5 characters 15 bytes

Here we are: 5 characters with correct UTF8-flag, exactly what I want.

What shall I do? Fixing this is one like within our db_connect wrapper used for all modules, but as everything was running wrong for the last years... what will break when fixing this bug?

Maybe I'll recommend to remove the non-working mysql_enable_utf8 argument from all connection strings, fix the connection wrapper to extract them from further DSNs and pass them to DBI->connect as expected. The next database could be set up with a real, working UTF-8 connection and we'ld see any problems before they go to the live systems.

UPDATE: I missed something (thanks to Andreas for pointing that out) by not testing a connection without any mysql_enable_utf8=1. Here is the result:

perl -MDBI -le 'my $dbh = DBI->connect("DBI:mysql:database=asia_japan;host=japan.asia.db.company.com;user=japan;password=asia"); my $t = $dbh->selectrow_array("SELECT firstname FROM CUSTOMERS WHERE id=137"); print DBI::data_string_desc($t);'

UTF8 off, ASCII, 5 characters 5 bytes

5 Japanese chars in 5 bytes? No, not at all. The mysql_enable_utf8 in the dsn actually switches the connection to UTF-8-mode on the mySQL server side. All non-ISO-8859-1 chars are simply converted to "?" by the server.I strongly suggest to replace the dsn-suffix ";mysql_enable_utf8=1" by a explicit "SET NAMES utf8" call after connecting:

perl -MDBI -le 'my $dbh = DBI->connect("DBI:mysql:database=asia_japan;host=japan.asia.db.company.com;user=japan;password=asia"); $dbh->do("SET NAMES utf8"); my $t = $dbh->selectrow_array("SELECT firstname FROM CUSTOMERS WHERE id=137"); print DBI::data_string_desc($t);'

UTF8 off, non-ASCII, 15 characters 15 bytes

Doesn't really work but does the same as ;mysql_enable_utf8 in the DSN string and is much more clear. A simple database setup update or software change may easily switch your believed-to-work DSN suffix into a real driver parameter - and probably start a huge encoding bug wave throughout your application.
 

15 Kommentare. Schreib was dazu

  1. der admin

    Achtung: Da fehlt etwas wichtiges!
    im DSN tut das durchaus etwas. es schaltet nämlcih die verbindung auf utf8.


    Würde es das nicht tun, bzw. würdest du das ganz rausnehmen, defaultet deine connection auf latin1 und mysql wäre so "freundlich", und konvertiert deinen japanischen namen nach "???", da alles nicht druckbare zeichen in latin1 sind :)


    Das bedeutet, dass mit dieser Methode das automatische anschalten des utf8_flags nicht funktioniert..


    Ist doch ein Bugreport wert?

  2. Genau das richtige Thema für einen Sonntag abend. :)

  3. Genau das richtige Thema für einen Sonntag abend. :)

  4. Genau das richtige Thema für einen Sonntag abend. :)

  5. [Imported from blog]

  6. [Imported from blog]

  7. I think we talked about MySQL hate during the YAPC attendees dinner, and now that you're bring that issue up again, here's my writeup (including a fix) for the problem I was having back then:
    http://domm.plix.at/perl/hatefull_mysql.html

  8. [Imported from blog]

  9. [Imported from blog]

  10. Note, I don't use mysql these days.


    I don't see why you think setting mysql_enable_utf8 in your connection string should be the same as an attribute passed to connect. DBD::mysql documents mysql_enable_utf8 as an "attribute" and points out it must be set in the connect call. Passing an attribute mysql_enable_utf8 in the hash ref last arg of connect effectively calls the DBD::mysql STORE function which tells it to set the Perl UTF8 flag on data coming back from the database and "Additionally, turning on this flag tells MySQL that incoming data should be treated as UTF-8" (although this latter bit is a bit vague). Adding mysql_enable_utf8 to the connect "string" assumes DBD::mysql parses that string and extracts the mysql_enable_utf8 setting which your tests indicate it does not do (and I see no documentation suggesting it should).


    Regarding your "Surprised?" paragraph, I'm not at all surprised. octets returned by mysql are not magically seen as UTF-8; they must be decoded as UTF-8 by DBD::mysql or (which it actually does) it must set the UTF8 flag on them. Without that step all you have is a sequence of bytes and Perl cannot know they are anything other than that.


    The documentation also states "If you turn the flag on after connecting, you will need to issue the command SET NAMES utf8 to get the same effect" so I'd suggest issuing a do with that SQL statement is redundant if you set mysql_enable_utf8 as a connect attribute.

  11. Sebastian

    You're right, I should read the docs first.
    Someone else added the mysql_enable_utf8 flag more than a year ago and I stumbled into the problems last week.
    I expect DSN-Options to either work like they were passed as attribute, trigger errors (invalid option) or be a noop. I can't remember any DBI->connect to different databases call where this wasn't true. It's still looking like a bad UI/API to me even if it's stated more or less clearly in the documentation.

  12. [Imported from blog]

  13. [Imported from blog]

  14. [Imported from blog]

  15. [Imported from blog]

Schreib was dazu

Die folgenden HTML-Tags sind erlaubt:<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>