How to Convert Japanese Shift-JIS Text to UTF-8

August 27th, 2010 No comments

I searched for how to convert Japanese Shift-JIS text to UTF-8 format but couldn’t find a suitable answer, so here is my proposed method.

I first encountered Japanese Shift-JIS when cutting and pasting Japanese text that my wife had typed on a Windows computer. I thought it started out as UTF-8, which is best for the web, but during the cut and paste in the Windows environment it was converted to Shift-JIS format.

Shift-JIS is a horrible way of encoding Japanese symbols. It is only used for Japanese and there are several versions of it. As PC programmers, we are interested in Microsoft Code Page 932, which maps every Japanese character supported on a PC to Unicode, i.e. a 16 bit (word) value.

Shift-JIS character codes are either 8 bit or 16 bit. So in code we need to check whether each byte that we process is the starting byte of a double byte value or not.
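Here is a minimal Python sketch of that byte check. It assumes the Code Page 932 lead byte ranges 0x81 to 0x9F and 0xE0 to 0xFC; the function names is_lead_byte and split_sjis are my own, made up for illustration.

```python
from typing import List


def is_lead_byte(b: int) -> bool:
    """Return True if b starts a double byte Code Page 932 character.

    Assumes the CP932 lead byte ranges 0x81-0x9F and 0xE0-0xFC.
    """
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_sjis(data: bytes) -> List[bytes]:
    """Split raw Shift-JIS bytes into 1 byte and 2 byte character codes."""
    chars = []
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]):
            if i + 1 >= len(data):
                raise ValueError("input ends in the middle of a double byte character")
            chars.append(data[i:i + 2])  # lead byte plus its trail byte
            i += 2
        else:
            chars.append(data[i:i + 1])  # ASCII or half-width katakana
            i += 1
    return chars
```

Note that the sketch only looks at the lead byte. It does not check that the trail byte is valid, which a real converter probably should.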

Then we can convert this to the 16 bit Unicode value.
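As a rough sketch of the lookup, a Python dict can stand in for the mapping table. The five entries below are only a tiny illustrative excerpt (the real Code Page 932 table has thousands of entries), and the names CP932_TO_UNICODE and to_unicode are hypothetical.

```python
# Tiny illustrative excerpt of a Shift-JIS (CP932) to Unicode table.
# A real table would be generated from the full Code Page 932 mapping.
CP932_TO_UNICODE = {
    0x41: 0x0041,    # Latin capital letter A (single byte)
    0xB1: 0xFF71,    # half-width katakana A (single byte)
    0x82A0: 0x3042,  # hiragana A (double byte)
    0x93FA: 0x65E5,  # kanji "sun/day" (double byte)
    0x967B: 0x672C,  # kanji "book/origin" (double byte)
}


def to_unicode(code: int) -> int:
    """Look up the 16 bit Unicode value for one Shift-JIS character code."""
    try:
        return CP932_TO_UNICODE[code]
    except KeyError:
        raise ValueError(f"no mapping for Shift-JIS code 0x{code:X}") from None
```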

UTF-8 is becoming the most popular way to encode text for web pages since it is backwards compatible with ASCII, which I think the majority of text on the web uses, but it can also handle extended character sets such as Kanji, removing the need for many diverse character encoding schemes.

With the original Shift-JIS converted to Unicode, the next step is to convert to UTF-8. This will result in 1 to 3 bytes of data per character: 1 byte for ASCII, and 3 bytes for the kana and kanji that take only 2 bytes in Shift-JIS. So it is actually less efficient in terms of file size but makes the web a simpler and more compatible place.

Each 16 bit Unicode value becomes a 1, 2 or 3 byte sequence, so our algorithm picks the sequence length depending on the range that the character code falls into: up to 0x7F, up to 0x7FF, or up to 0xFFFF.
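The range switch could look something like this in Python. It is a sketch that only handles 16 bit values, as in this post, and the function name utf8_from_code_point is my own.

```python
def utf8_from_code_point(cp: int) -> bytes:
    """Encode one 16 bit Unicode value as a 1, 2 or 3 byte UTF-8 sequence."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("this sketch only handles 16 bit values")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not valid characters")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx (plain ASCII)
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (all kana and kanji land here)
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For example, the value 0x65E5 falls in the last range and comes out as the three bytes E6 97 A5.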

Then we can output a file that should be a conversion to UTF-8 format.

To summarize: convert Shift-JIS bytes or words to Unicode word values via a lookup table (based on Code Page 932), e.g. an associative array. Then convert the 16 bit value to UTF-8. Or do it in 1 step if you create a direct character-mapping table.
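If you don't want to build the table yourself, Python's built-in cp932 codec can stand in for it. This is a different route from the hand-built lookup table described above, not the same method, and the names sjis_to_utf8 and convert_file are hypothetical.

```python
def sjis_to_utf8(data: bytes) -> bytes:
    """Convert Shift-JIS (Code Page 932) bytes to UTF-8 bytes.

    Uses Python's built-in "cp932" codec in place of a hand-built table.
    Raises UnicodeDecodeError if the input is not valid CP932.
    """
    return data.decode("cp932").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a Shift-JIS file and write a UTF-8 copy of it."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(sjis_to_utf8(raw))
```

Reading and writing in binary mode keeps Python from applying any platform default encoding along the way.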

Useful links:
Code Page 932
UTF-8

Disclaimer:
I haven’t actually written the code to do this yet, just researched how I would do it :-)
In my case I need to convert forum posts to WordPress Blog Posts but it is a low-priority task right now.

If there is an easier way, please share in the comments. I was thinking that there may be a way to cut and paste in a particular way or use a fancy PHP function for multi-byte stuff.

Categories: Conversion Tags:

Automated List Building

August 19th, 2010 No comments

Can A Listbuilding System Be TOTALLY Automated?

I know we’re all a bit sick and tired of promises of “automated software” that will make us rich before dinner.

And I’m always the first to be skeptical of these kinds of outrageous claims.

But today I may have to actually eat my words.

Because I just picked up a listbuilding system [...]

Categories: Strategy Tags:

Mod Rewrite Not Working

August 17th, 2010 No comments

I searched for a solution to my problem of Mod Rewrite not working, but the advice was rather obvious, such as making sure you remove the hash (#) symbol from the start of the line in Apache httpd.conf that loads the extension: LoadModule rewrite_module modules/mod_rewrite.so

But what solved it for me was recognizing that the default [...]

Categories: Code Tags:

How to Get Your iPhone 3GS to Connect to WiFi

August 8th, 2010 1 comment

I’ve been using the iPhone 3GS for about 8 months now and one big frustration was getting it to connect to my home WiFi network.

I use an IO Data Airport wireless G unit which I love since it is tiny and didn’t cost much. This works well with my computers but the iPhone struggles to connect.

Searching [...]

Categories: Technique Tags:

Getting A Site Ranked Fast

July 28th, 2010 No comments

Sorry: the subject blog is part of an experiment and the original post no longer applies since the blog is continually changing as the code evolves.

Categories: SEO Tags:

Learn PHP Fast

June 23rd, 2010 6 comments

PHP is the most important server side language to understand these days since it is the one most widely used for building websites.

Once you have mastered the basics of HTML and CSS, it is time to learn PHP.

Although it is easy to learn in my opinion, a problem is that it has a vast amount of [...]

Categories: Learning Tags:

Cool Niche

June 14th, 2010 2 comments

How to find a cool Niche to build a site around.

It’s easy really, just build a site around what interests you.

I was a bit stressed over this since all I do is eat, drink, exercise, spend hours on the computer and watch selected TV downloads.

So the answer, obviously, is to focus on the niches related to [...]

Categories: SEO Tags:

Getting Things Done

June 2nd, 2010 No comments

Maybe you are like me? We are the doers, we simply do things and get things done.

This may start with a to-do list or simply a plan in your head derived from a strong desire to accomplish something.

So we wake up each morning with a plan of action in mind and eager to start work on [...]

Categories: Optimization Tags:

How to Avoid the eBay affiliate re-direct penalty

May 31st, 2010 No comments

The main thing is to use the 301 redirect code which indicates a permanent redirect so that eBay URLs are associated with the landing page content in search engine listings rather than your redirect URL. At least, this is how I understand it.

So in PHP where there is a redirect, use the code as follows:

header("Location: $url", [...]

Categories: Code Tags:

How To Be A Success

May 27th, 2010 No comments

Here are some tips based partly on my own experience and things that I am trying to improve on myself.

Have goals in mind and keep working towards them and never give up unless the goal becomes clearly unrealistic, then adjust to a more manageable goal.

Spend most of your time on productive activities.

Be curious, always wanting to [...]

Categories: Technique Tags: