The Gathering Tech:Server

This is an unofficial website, a collection of material found on the web. For up to date information about the event The Gathering, please go to www.gathering.org!


UTF-8 crash course

- or: what are those odd characters on #tg?

If you just want information on how to make the problem go away in your IRC client, skip the first section.

Computers do not usually deal with characters directly; for practical reasons, characters are stored as numbers instead. Every character (or glyph, if you want) is stored as a single number, like this:

A little test.
65 32 108 105 116 116 108 101 32 116 101 115 116 46
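If you want to check these numbers yourself, a few lines of Python will do. This is only a sketch for illustration; Python is not part of the original text, and the variable names are made up.

```python
# Map each character of the example string to its numeric code.
text = "A little test."
codes = [ord(ch) for ch in text]
print(codes)
# [65, 32, 108, 105, 116, 116, 108, 101, 32, 116, 101, 115, 116, 46]

# Going the other way turns the numbers back into characters.
round_trip = "".join(chr(n) for n in codes)
print(round_trip == text)  # True
```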

There are several ways to map characters to numbers. (Don't worry about what the individual numbers mean.) One widely used system is called ISO 8859-1, or Latin-1. ISO 8859-1 can represent most characters used in Western European languages, like this:

Blåbærsyltetøy
66 108 229 98 230 114 115 121 108 116 101 116 248 121

As you can see, only one byte is used for every character; "B" is number 66, "å" is 229, etc. However, this also means ISO 8859-1 cannot represent more than 256 different characters, so if you wanted to represent, say, the character β (a Greek "beta") in ISO 8859-1, that would simply not be possible.
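The same thing can be tried out in Python, which knows ISO 8859-1 under the codec name "iso-8859-1". A minimal sketch, with illustrative variable names:

```python
word = "Blåbærsyltetøy"

# Every character becomes exactly one byte in ISO 8859-1.
latin1_bytes = list(word.encode("iso-8859-1"))
print(latin1_bytes)
# [66, 108, 229, 98, 230, 114, 115, 121, 108, 116, 101, 116, 248, 121]

# The Greek beta has no number in ISO 8859-1, so encoding it fails.
try:
    "β".encode("iso-8859-1")
    beta_fits = True
except UnicodeEncodeError:
    beta_fits = False
print(beta_fits)  # False
```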

To solve this problem (and a whole lot of others), Unicode was created. Unicode is not limited to 256 characters; it has room for more than a million. On the other hand, this means that not all characters fit into a single byte. An encoding called UTF-8 converts Unicode characters into sequences of bytes: a single byte for the plain English letters, digits and punctuation (the ASCII range), and two to four bytes in a row for everything else. For instance, our string above gets encoded into:

Blåbærsyltetøy
66 108 195 165 98 195 166 114 115 121 108 116 101 116 195 184 121
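Again, this is easy to reproduce in Python. The sketch below (variable names are just for illustration) shows that the three Norwegian letters each take two bytes, so the 14-character word becomes 17 bytes:

```python
word = "Blåbærsyltetøy"

# ASCII letters stay single bytes; å, æ and ø become two bytes each.
utf8_bytes = list(word.encode("utf-8"))
print(utf8_bytes)
# [66, 108, 195, 165, 98, 195, 166, 114, 115, 121, 108, 116, 101, 116, 195, 184, 121]

print(len(word), len(utf8_bytes))  # 14 17

# Unlike ISO 8859-1, UTF-8 has no trouble with the Greek beta.
beta_bytes = list("β".encode("utf-8"))
print(beta_bytes)  # [206, 178]
```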

Now, what happens if this UTF-8 string is sent across IRC, and then interpreted by a client that is set to interpret all characters as ISO 8859-1? Let's have a look:

66 108 195 165 98 195 166 114 115 121 108 116 101 116 195 184 121
BlÃ¥bÃ¦rsyltetÃ¸y

This is the source of the "odd characters" you might have been seeing. It is important to realize that this does not mean UTF-8 is somehow broken; it just means that your IRC client is not set up to handle it. If it properly decoded the string as UTF-8, you would get no such issues. Neither encoding is "right" or "wrong"; however, the world at large is (slowly) moving to Unicode, since it means we can use one character set instead of dozens or even hundreds of different ones.
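To see the mix-up in action, here is a small Python sketch that decodes the same bytes both ways. The variable names are made up for the example; the last step only works when no bytes were lost or altered along the way.

```python
# The bytes a UTF-8 client would send for our example word.
wire_bytes = "Blåbærsyltetøy".encode("utf-8")

# A client that assumes ISO 8859-1 turns every byte into one character.
wrong = wire_bytes.decode("iso-8859-1")
print(wrong)  # BlÃ¥bÃ¦rsyltetÃ¸y

# A client that decodes the bytes as UTF-8 gets the original text back.
right = wire_bytes.decode("utf-8")
print(right)  # Blåbærsyltetøy

# The garbled text can be undone by reversing the wrong step.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired == right)  # True
```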

All widely-used, modern IRC clients support UTF-8 in their latest versions. Enabling it is easy; just see below for the instructions for your client.

mIRC

mIRC has supported UTF-8 since version 6.17; make sure you do not have an old version. To enable it:

  1. Go to Tools > Options > IRC > Messages.
  2. Check the box saying 'UTF-8 display'.

For more information, see the unofficial mIRC Unicode FAQ.

irssi

irssi has supported UTF-8 for ages, but you might not have a UTF-8 terminal. If you want to use multiple different character sets, you probably want the recode support, available since version 0.8.10. To enable recoding:

  1. /set term_charset iso8859-1 (or whatever your terminal is using)
  2. /set recode on
  3. /set recode_out_default_charset iso8859-1 (if you want to)
  4. /set recode_transliterate on
  5. /set recode_autodetect_utf8 on
  6. /recode add #tg utf-8
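Roughly speaking, UTF-8 autodetection can work by trying to decode incoming bytes as UTF-8 first and falling back to another character set when that fails. The Python sketch below illustrates the idea only; it is not irssi's actual code, and decode_incoming and its fallback parameter are hypothetical names.

```python
def decode_incoming(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, else use the fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the sender used the fallback charset.
        return raw.decode(fallback)


# Both kinds of senders end up readable.
from_utf8_user = decode_incoming("Blåbær".encode("utf-8"))
from_latin1_user = decode_incoming("Blåbær".encode("iso-8859-1"))
print(from_utf8_user == from_latin1_user)  # True
```

This works well in practice because ordinary ISO 8859-1 text with accented letters is rarely valid UTF-8 by accident.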

XChat

XChat supports only one character set per IRC network, and has supported UTF-8 since version 2.4. To enable UTF-8 on EFnet, do:

  1. Go to the XChat Server List.
  2. Select EFnet and click Edit.
  3. In the Character Set combo box, select UTF-8.

For more information, see http://xchat.org/encoding/.

BitchX

BitchX is not a modern IRC client. Use something that wasn't outdated five years ago (like irssi, which can be configured to act just like BitchX).


Last modified: 2009-01-11 01:41:27
