Do you ¿ UTF-8? It's easier than you think
By Pete Freitag
Understanding how UTF-8 works is one of those things that most programmers are a little fuzzy on. I know I often have to look up the specifics when dealing with a problem.
One of the most common UTF-8 related issues that I've seen has to do with MySQL's UTF8 encoding, also known as "How do I insert emoji into MySQL?"
The TLDR answer to that question is that you have to use the utf8mb4 (up to 4 bytes) encoding, because MySQL's utf8 encoding won't hold an emoji; it only stores up to 3 bytes per character. But the longer answer is sort of interesting and not as hard to understand as you might think.
So UTF-8 can take 3 or 4 bytes to store?
Encoding a character with UTF-8 may take 1, 2, 3, or 4 bytes (early versions of the spec went up to 6 bytes, but that was later reduced to 4).
What's cool about UTF-8 is that if you are only using basic ASCII characters (i.e., character codes 0-127) then it only uses 1 byte per character. Here's a handy table that shows how many bytes it takes to encode a given character code in UTF-8:
Character Code (decimal) | Bytes Used |
---|---|
0-127 | 1 byte |
128-2047 | 2 bytes |
2048-65535 | 3 bytes |
65536-1114111 | 4 bytes |
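If you'd rather see the table as code, here is a minimal Python sketch. The function name utf8_length is my own invention for illustration, not part of any library. It just mirrors the ranges in the table, then checks the boundary values against Python's built-in encoder.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one code point, per the table above."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a valid Unicode code point")
    if code_point <= 127:
        return 1
    if code_point <= 2047:
        return 2
    if code_point <= 65535:
        return 3
    return 4


# Cross-check the boundaries of each row against Python's own UTF-8 encoder.
for cp in (0, 127, 128, 2047, 2048, 65535, 65536, 1114111):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```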
So hopefully that helps you 😍 UTF-8. BTW that emoji is 1F60D in hex, or 128525 in decimal, which means it takes 4 bytes to store in UTF-8.
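You can check those numbers yourself. This is a small Python sketch using only built-ins (ord and str.encode); the variable names are just for illustration.

```python
emoji = "\U0001F60D"  # the smiling face with heart-eyes emoji

code_point = ord(emoji)          # the Unicode character code
encoded = emoji.encode("utf-8")  # the actual bytes UTF-8 produces

print(hex(code_point))  # 0x1f60d
print(code_point)       # 128525
print(encoded)          # b'\xf0\x9f\x98\x8d'
print(len(encoded))     # 4
```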
So why can't MySQL utf8 store an emoji?
The utf8 encoding in MySQL can only hold UTF-8 characters of up to 3 bytes, while UTF-8 actually supports up to 4 bytes. I don't know why they chose to limit utf8 to 3 bytes, but I will speculate that they probably added support while UTF-8 was still not officially standardized, and assumed that 3 bytes would be plenty.
So to get the real UTF-8 in MySQL you need to use the utf8mb4 encoding, which can store characters that take up to 4 bytes in UTF-8, including emoji.
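If you want to know ahead of time whether a string would fit in a 3-byte utf8 column, a rough check is to look for any character code above 65535. The helper below is a hypothetical sketch (the name fits_in_3_byte_utf8 is mine, and it does not talk to MySQL at all); it only applies the 3-byte limit described above.

```python
def fits_in_3_byte_utf8(text: str) -> bool:
    """Return True if every character needs at most 3 bytes in UTF-8.

    Character codes up to 65535 (0xFFFF) fit in 3 bytes; anything above
    that needs 4 bytes, which is where utf8mb4 comes in.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_3_byte_utf8("caf\u00e9"))            # True
print(fits_in_3_byte_utf8("I \U0001F60D UTF-8"))   # False
```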
UTF-8 != Unicode
I've probably mistakenly used the terms UTF-8 and Unicode interchangeably in the past. It's a common mistake, so let's clarify the difference.
Unicode is a character set standard: it specifies what character a given character code maps to. In Unicode parlance, the character codes are called code points. This is probably just to confuse you. There are many ways to encode a Unicode string into binary, and this is where the different encodings come into play: UTF-8, UTF-16, etc.
UTF-8 is a way of encoding the characters into bytes. If they had decided to use 4 bytes for every character, it would have wasted a lot of space, since the most commonly used characters (at least in English) can be represented using only 1 byte. Although UTF-8 is defined by Unicode and was designed for Unicode, you could invent another character mapping standard and use UTF-8 to store it.
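To make the "same Unicode, different encodings" point concrete, here is a small Python sketch that encodes the same string three ways and counts the bytes. The helper name encoded_sizes is made up for this example. I use the little-endian codec names so that Python doesn't prepend a byte order mark to the UTF-16 and UTF-32 output.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text under a few Unicode encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


print(encoded_sizes("hello"))
# {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}

print(encoded_sizes("\U0001F60D"))
# {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

For plain English text, the fixed 4-bytes-per-character encoding takes four times the space of UTF-8, while for the emoji all three end up at 4 bytes.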
Do you ¿ UTF-8? It's easier than you think was first published on March 04, 2020.