Internet Marketing Coding

Coding Techniques for Internet Marketing

How to Convert Japanese Shift-JIS Text to UTF-8

Published: Fri 27th Aug 2010

I searched for a way to convert Japanese Shift-JIS text to UTF-8 but couldn't find a suitable answer, so here is my proposed method.

I first encountered Japanese Shift-JIS when cutting and pasting Japanese text that my wife had typed on a Windows computer. I thought it started out as UTF-8, which is best for the web, but during the cut and paste in the Windows environment it was converted to Shift-JIS.

Shift-JIS is a horrible way of encoding Japanese symbols. It is only used for Japanese, and there are several versions of it. For us PC programmers, the version of interest is Microsoft Code Page 932, which maps every Japanese character supported on a PC to Unicode, i.e. a 16-bit (word) value.

Shift-JIS character codes are either 8 bits or 16 bits long. So in code we need to check whether each byte we process is the starting (lead) byte of a double-byte value or a complete single-byte character.
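Here is a minimal sketch of that byte check in Python. It assumes the Code Page 932 layout, where lead bytes fall in the ranges 0x81-0x9F and 0xE0-0xFC; the names `is_lead_byte` and `split_codes` are made up for this illustration.

```python
def is_lead_byte(b):
    """Return True if b starts a double-byte character (CP932 lead-byte ranges)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_codes(data):
    """Split Shift-JIS bytes into a list of 8-bit or 16-bit character codes."""
    codes = []
    i = 0
    while i < len(data):
        b = data[i]
        if is_lead_byte(b):
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte character at end of input")
            # Combine lead and trail byte into one 16-bit code.
            codes.append((b << 8) | data[i + 1])
            i += 2
        else:
            # ASCII or half-width katakana: a complete single-byte character.
            codes.append(b)
            i += 1
    return codes
```

The bytes have to be scanned in order from the start, because the second byte of a pair can look like ordinary ASCII. For example, 0x96 0x7B is one kanji even though 0x7B on its own is "{".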

Then we can convert this to the 16 bit Unicode value.
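The lookup could be sketched like this. The table below is only a five-entry excerpt that I picked for illustration, not the full Code Page 932 mapping, and `SAMPLE_TABLE` and `to_unicode` are hypothetical names.

```python
# Tiny excerpt of a CP932-to-Unicode table (illustration only; the real
# table has thousands of entries).
SAMPLE_TABLE = {
    0x82A0: 0x3042,  # hiragana "a"
    0x93FA: 0x65E5,  # kanji "day/sun"
    0x967B: 0x672C,  # kanji "book/origin"
    0x8CEA: 0x8A9E,  # kanji "language"
    0xB1: 0xFF71,    # half-width katakana "a" (single byte)
}


def to_unicode(code):
    """Map one 8-bit or 16-bit Shift-JIS code to its 16-bit Unicode value."""
    if code < 0x80:
        # The ASCII range maps to itself in Code Page 932.
        return code
    # Raises KeyError for codes that are missing from this small excerpt.
    return SAMPLE_TABLE[code]
```

A full converter would load the complete table from the published Code Page 932 mapping rather than typing entries by hand.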

UTF-8 is becoming the most popular way to encode text for web pages since it is backwards compatible with ASCII, which I think the majority of text on the web uses, but it can also handle extended character sets such as Kanji, removing the need for many diverse character encoding schemes.

Once we have converted our original Shift-JIS to Unicode, the next step is to convert to UTF-8. This will result in 1 to 3 bytes of data per character for our Japanese text. Most kanji and kana take 3 bytes in UTF-8 compared with 2 in Shift-JIS, so it is actually less efficient in terms of file size, but it makes the web a simpler and more compatible place.

We will convert each 16-bit Unicode value to a 1, 2 or 3 byte sequence, so our algorithm will choose between the three forms depending on the range of integer values that the character code falls into.
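As a rough sketch, the range switch might look like this in Python. The function name `utf8_encode` is my own, and the code only handles values up to 0xFFFF, as assumed in this article (it does not check for surrogate values).

```python
def utf8_encode(cp):
    """Encode one 16-bit Unicode value as 1, 2 or 3 UTF-8 bytes."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only 16-bit values are handled in this sketch")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx (plain ASCII)
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

Japanese characters all sit above 0x800, so in practice they take the 3-byte branch, while any ASCII mixed into the text stays at 1 byte.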

Then we can write the result out as a UTF-8 file.

To summarize: convert Shift-JIS bytes or words to Unicode word values via a lookup table (based on Code Page 932), e.g. an associative array. Then convert the 16-bit value to UTF-8. Or do it in one step if you create a direct character-mapping table.
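If you would rather not build the table yourself, Python ships with a `cp932` codec that contains the Code Page 932 mapping, so the whole pipeline can be sketched in a couple of lines. This is a swapped-in shortcut rather than the hand-rolled method described above, and the function names and file paths are placeholders of my own.

```python
def sjis_to_utf8(data):
    """Decode Code Page 932 bytes to Unicode text, then encode as UTF-8."""
    return data.decode("cp932").encode("utf-8")


def convert_file(src_path, dst_path):
    """Read a Shift-JIS file and write a UTF-8 copy of it."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(sjis_to_utf8(data))
```

For example, the three kanji in b"\x93\xfa\x96\x7b\x8c\xea" are 6 bytes in Shift-JIS and come out as 9 bytes of UTF-8, which is the file-size cost mentioned earlier.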

Useful links: Code Page 932, UTF-8

Disclaimer: I haven't actually written the code to do this yet, just researched how I would do it :-)
In my case I need to convert forum posts to WordPress Blog Posts but it is a low-priority task right now.

If there is an easier way, please share in the comments. I was thinking that there may be a way to cut and paste in a particular way or use a fancy PHP function for multi-byte stuff.

fcr . {{getAge(1291448928)}}
Yes, there is an easier way to convert between sjis and utf8: a tool called iconv. Just run: iconv -f sjis -t utf8 inputfile > outputfile
Andrew . {{getAge(1291494729)}}
@fcr Thanks for the tip! I guess that is on the Linux platform? I have Linux on a bootable CD so I can try it. Also, iconv is available via PHP, so I guess it can be done in PHP this way too. Searching on the web, all I could find before was a guy selling a software tool to do it. The simplest solutions are often the best!
Andrew . {{getAge(1304369835)}}
@Misaki Misaki-san, I hope people are not wanting to convert from utf8 to shift-jis. This would be really bad for the web IMO. There should be no way to convert to shift-jis IMO ;-)
Misaki . {{getAge(1304312928)}}
This is currently the first result for 'convert shift-jis to utf8' or utf8 to shift-jis. Users will want to keep in mind that in certain cases the data has not been interpreted the way they expect. It's becoming less common for a webpage encoded in Shift-JIS to be interpreted and displayed by the application as UTF-8. However, it is still common for Shift-JIS application data or zip files to be interpreted as Western ISO 8859(-15?), which is the equivalent code page on non-Japanese versions of Windows or something. For these to be displayed correctly, the first conversion should be not from UTF-8 to Shift-JIS, but rather from UTF-8 to Western 8859(-15?). The result can then either be rendered by the application as Shift-JIS or be converted from Shift-JIS to UTF-8 (which is a different mapping than when reading the same data as ISO8859-15). When the mojibake is in filenames, convmv can be used, which has similar syntax to iconv. Alternatively, if your text editor is so kind as to provide the option to save in different encodings, save the mojibake as ISO8859-15 and open it in a web browser as Shift-JIS (for those who only need a few things decoded from mojibake).
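To illustrate the round trip described in this comment, here is a small Python sketch. It assumes the text was wrongly decoded as ISO 8859-15 (the comment itself is unsure of the exact code page), and `fix_mojibake` is a made-up helper name.

```python
def fix_mojibake(garbled, wrong="iso8859_15", right="cp932"):
    """Undo a wrong decode: turn the text back into its original bytes
    using the wrong encoding, then decode those bytes as Shift-JIS."""
    return garbled.encode(wrong).decode(right)


# Simulate the problem: Shift-JIS bytes misread as ISO 8859-15.
original = "\u65e5\u672c\u8a9e"  # three kanji
garbled = original.encode("cp932").decode("iso8859_15")
repaired = fix_mojibake(garbled)
```

This only works when the wrong decode kept every byte intact. If the data was misread with a different code page, the `wrong` argument has to be changed to match.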
liang . {{getAge(1302882462)}}
With Python, you can do this easily; here is the script:
Andrew ⇒ liang . {{getAge(1304201442)}}
You can do it easily if you know how ;-)
Mr. Incredible . {{getAge(1309280475)}}
Be careful recommending iconv; it has many problems with switching encoding for the more exotic characters like ? and the circled numbers, which my clients still insist on using.