Since MySQL 5.5, the world can enjoy the amazing world of non latin characters. Time to party! …Or is it?
One may think that just by setting the charset and collation of the database, he is covered.
That may sometimes be right, but not always. Let’s review the steps for resilient character persisting:
Step one: Set the default charset of your database.
ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Note that this will only affect new tables created and not existing ones. For these you must run this on every table individually. No worries, unlike the MODIFY COLUMN command, this command runs instantly, as it only changes the configuration of the table. Note that this does not effect existing Mojibake-d characters.
ALTER TABLE tablename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Most users would stop already by this step, thinking that they got it covered, however once they will insert entries using the MySQL connector JDBC driver (which is the scenario for most use-cases) they might notice that characters are still jumbled.
Step two: Control your client’s character set! In JVM applications, the default charset used, (when non supplied) is operating system dependent. For writing non latin characters (e.g. Mandarin, Japanese, Hebrew, Thai and other “esoteric” languages) you want to set it to be “UTF-8”. On Linux it is ANSI_X3.4-1968. By calling Charset.defaultCharset() during runtime you can determine your operating system’s specific encoding. This can be overridden by supplying the JVM param -Dfile.encoding=UTF8 when starting the application or by adding it to JAVA_TOOL_OPTIONS to avoid adding it every time.
If you do not control this variable, you can add relevant params to your JDBC url and override the JVM defaults, so it looks like this:
jdbc:mysql://host/dbName?useUnicode=true&characterEncoding=UTF-8
Experimenting with MySQL 5.7 and MySQL connector java 5.1.12, upon trying to insert a Cyrillic string, I encountered the following:
- If only JDBC connection on client side is set to write in UTF8 (as shown above) but table written to is not: you will receive an exception for trying to invalid data: Caused by: java.sql.BatchUpdateException: Incorrect string value: ‘\xD0\xA7e\xD0\x91\xD1…’
- If only table is set to UTF-8 (as shown above) but not the JDBC connection on the client, jumbled string will be persisted.
- If neither is set, jumbles string will be persisted.
Lesson learned (for me the hard way) – Only the combination of server and client side using UTF-8 will yield correct encoding. Happy UTF-8ing for all!