Peter Vandenabeele
mysqldump for utf8 data
Update:
Upon further research, the real solution seems to be to add the following in the /etc/mysql/my.cnf file:
[client]
default-character-set=utf8
...
[mysqld]
default-character-set=utf8
default-collation = utf8_general_ci
...
and execute a
# /etc/init.d/mysql restart
afterwards.
Restart a mysql client instance and then check these variables:
mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
All is utf-8 now and a mysqldump command now seems to spit out proper utf-8 (without the dirty trick of specifying --default-character-set=latin1).
Before the addition of the utf-8 lines in the my.cnf file, the results of above query was:
mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
I still need to try this on the production system ...
End of Update
Update 2008-06-15 : applied to production system
Applying this to the production system turned out a little harder than expected.
Eventually, I just created a new database (production_utf8) and dumped the data in there (all in utf-8 and with correct default operation of mysqldump and mysql client).
But, moving the running rails instance to recognize the new database as UTF8 was not automatic ....
Turned out I had forgotten to add the line:
encoding:utf8
to the database.yml production file. Adding this line forces the SET NAMES UTF8 when Rails is calling the MySQL database.
End of update of 2008-06-15
Well, this took quite a while ...
As described in MySQL bug report #28969, the mysqldump command (at least version Ver 10.11 Distrib 5.0.45 that is in Ubuntu 7.10) seems to have a weird bug ...
If the tables have the utf8 encoding and one asks to dump in utf8 (which is default in this case, but can be enforced with --default-character-set=utf8) there is a double encoding.
So the "ë" character
(unicode U+00EB that is represented in utf8 with the bytes C3 AB)
is double encoded into C3 83 C2 AB (which eventually looks like ë so a job
becomes "Account manager België").
[What happens is C3 ==> C3 83 and AB ==> C2 AB , which is the correct utf-8
encoding of these fake unicodes]
As the submitter of the bug correctly notes the strange work-around is to use
$ mysqldump database -u user --default-character-set=latin1 -r outfile.sql
to generate the corect dump file (in utf8).
Next step of course is to change the line
/*!40101 SET NAMES latin1 */; [ this latin1 is really fake :-/ ]
into
/*!40101 SET NAMES utf8 */; [ this is now correct ]
and then upon restoring the database, this works :-)
$ mysql -u user backup_production < outfile.sql
This is an old problem finally "worked around". It is unclear to me if the root cause of this actually resolved in the curent version of mysqldump (it seems the bug report does not offer a conclusive answer on that).
Maybe this write-up will help someone who is fighting this same problem.