
MySQL utf8 migrations: big hairy queries

I had to migrate some mysql databases to UTF8mb4 last week. You would think this would all have happened a decade ago in a sane world, but it's still surprisingly common thanks to bad defaults in packaged software. Every day there's a new latin1 db and table being made because /etc/my.cnf isn't set up right out of the box. So people google around and find some variant of this article.

I wasn't willing to waste hours on a non-automated solution for the many tables and databases piled up over the years. So of course I wrote a script. The plan was this:

  1. Convert the DB's default encoding & collation
  2. Get the list of tables so we can iterate
  3. Try to convert each table. If that fails, it's usually due to long varchars being indexed.
  4. On failure, drop and re-add the indices on long varchars with a shorter, acceptable prefix length
  5. After that, try converting the table again

This ended up being a 99% solution. A couple of tables had some neat problems I'll cover at the end. So without further ado, the queries themselves.
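
To make the flow concrete, here's a minimal sketch of how the pieces hang together (the fix_indices helper is hypothetical and stands in for snippet 2 below; every real query appears in the following sections):

#!/bin/sh
DB=$1
# Step 1: convert the database's default encoding & collation
mysql -e "ALTER DATABASE $DB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci"
# Steps 2-5: iterate the tables, fixing indices and retrying on failure
for table in $(mysql $DB -ss -e 'show tables')
do
    mysql $DB -ss -e "ALTER TABLE $table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci" && continue
    fix_indices "$table"    # hypothetical helper: drop & re-add the offending indices (snippet 2)
    mysql $DB -ss -e "ALTER TABLE $table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci"
done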

Converting the DB is straightforward.

 fix_db.sh
mysql -e "ALTER DATABASE $DB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci"


We'll assume for these examples that $DB is $1 (the first argument to the script).
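
If you want to sanity-check the result, SHOW CREATE DATABASE echoes the new defaults back (output approximate; the database name is made up):

mysql -e "SHOW CREATE DATABASE $DB"
# CREATE DATABASE `mydb` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */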

Similarly, listing and converting tables was easy:

 fix_tables.sh ~ snippet 1
for table in $(mysql $DB -ss -e 'show tables')
    do
        mysql $DB -ss -e "ALTER TABLE $table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci" &> /dev/null
        STATUS=$?
        if [ $STATUS -ne 0 ]
        then
            # drop & re-add the offending indices (snippet 2), then retry the ALTER TABLE
            ...
        fi
    done


Now we have to do the hard part of dropping and re-adding the indices, because mysql doesn't support ALTER INDEX like postgres.

This means we need to capture the index state as it currently exists and make only the adjustments needed to turn varchar indexes into prefix indices. If you didn't read the article linked at the top: utf8mb4 strings can be 4x as large as latin1 (four bytes per character instead of one), and mysql has a (mostly) hardcoded limit on index key size. You can raise it on InnoDB for most use cases, but these tables were MyISAM, where no such option is available.
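
To make the arithmetic concrete: an indexed latin1 VARCHAR(255) needs at most 255 bytes of key, but the same column in utf8mb4 needs 255 * 4 = 1020 bytes, blowing past MyISAM's 1000-byte key limit. That's also why the queries below flag anything 250 chars or longer (250 * 4 bytes is already at the limit before overhead), and why a prefix of 50 chars (200 bytes) is safe. The failure in step 3 looks something like this (table name made up):

mysql $DB -e "ALTER TABLE mytable CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci"
# ERROR 1071 (42000) at line 1: Specified key was too long; max key length is 1000 bytes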

Here's how I did it:

 fix_tables.sh ~ snippet 2

    # Note: the generated statements contain spaces, so split mysql's output on newlines only
    IFS=$'\n'
    for query in $(mysql -ss -e "select CONCAT('DROP INDEX ', index_name, ' ON ', table_name) AS query FROM information_schema.statistics WHERE table_name = '$table' AND index_schema = '$DB' GROUP BY index_name HAVING group_concat(column_name) REGEXP (select GROUP_CONCAT(cols.column_name SEPARATOR '|') as pattern FROM information_schema.columns AS cols JOIN information_schema.statistics AS indices ON cols.table_schema=indices.index_schema AND cols.table_name=indices.table_name AND cols.column_name=indices.column_name where cols.table_name = '$table' and cols.table_schema = '$DB' AND data_type IN ('varchar','char') AND character_maximum_length >= 250 AND sub_part IS NULL);")
    do
        echo "mysql $DB -e '$query'"
    done

    for query in $(mysql -ss -e "select CONCAT('CREATE ', CASE non_unique WHEN 1 THEN '' ELSE 'UNIQUE ' END, 'INDEX ', index_name, ' ON ', table_name, ' (', group_concat(concat(column_name, COALESCE(CONCAT('(',sub_part,')'), CASE WHEN column_name REGEXP (select GROUP_CONCAT(cols.column_name SEPARATOR '|') as pattern FROM information_schema.columns AS cols JOIN information_schema.statistics AS indices ON cols.table_schema=indices.index_schema AND cols.table_name=indices.table_name AND cols.column_name=indices.column_name where cols.table_name = '$table' and cols.table_schema = '$DB' AND data_type IN ('varchar','char') AND character_maximum_length >= 250 AND sub_part IS NULL) THEN '($PREFIX_LENGTH)' ELSE '' END)) ORDER BY seq_in_index), ') USING ', index_type) AS query FROM information_schema.statistics WHERE table_name = '$table' AND index_schema = '$DB' GROUP BY index_name HAVING group_concat(column_name) REGEXP (select GROUP_CONCAT(cols.column_name SEPARATOR '|') as pattern FROM information_schema.columns AS cols JOIN information_schema.statistics AS indices ON cols.table_schema=indices.index_schema AND cols.table_name=indices.table_name AND cols.column_name=indices.column_name where cols.table_name = '$table' and cols.table_schema = '$DB' AND data_type IN ('varchar','char') AND character_maximum_length >= 250 AND sub_part IS NULL);")
    do
        echo "mysql $DB -e '$query'"
    done


Note that we only echo the mysql statements rather than running them. I send the output to a shell script to execute afterward: if we dropped the indices as we went, the second query (which generates the CREATE statements) would no longer find the indices it's supposed to rebuild.
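
In practice that means something like (file name is my own):

./fix_tables.sh mydb > fix_indices_mydb.sh
sh fix_indices_mydb.sh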

Let's go over the details, since these queries exercise a solid chunk of mysql's important concepts. Read the comments (lines beginning with --) for explanations.

 drop_indices.sql

-- The plan is to output a DROP INDEX query
SELECT CONCAT('DROP INDEX ', index_name, ' ON ', table_name) AS query
-- information_schema.statistics is where the indices actually live
FROM information_schema.statistics
WHERE table_name = '$table'
AND index_schema = '$DB'
-- The indices themselves are stored with an entry per column indexed, so we have to group to make sure the right ones stick together.
GROUP BY index_name
-- We want to exclude any indices which don't have VARCHAR or CHAR cols, as their length is irrelevant to utf8mb4
-- The simplest way is to build a subquery grabbing these, and then scan the columns in the index via regexp
HAVING group_concat(column_name) REGEXP (
    select GROUP_CONCAT(cols.column_name SEPARATOR '|') AS pattern
    -- We need to cross-reference the cols in the index with those in the table
    -- So that we can figure out which ones are varchars/chars
    FROM information_schema.columns AS cols
    JOIN information_schema.statistics AS indices ON cols.table_schema=indices.index_schema
    AND cols.table_name=indices.table_name
    AND cols.column_name=indices.column_name
    WHERE cols.table_name = '$table'
    AND cols.table_schema = '$DB'
    AND data_type IN ('varchar','char') AND character_maximum_length >= 250
    -- sub_part is the prefix length.  In practice, any ones with a sub part already set were just fine, but YMMV.
    AND sub_part IS NULL
);
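
For a hypothetical users table with an unprefixed index on a VARCHAR(255) email column, the script wraps each generated statement in a mysql call, so the emitted line would look like:

mysql $DB -e 'DROP INDEX email_idx ON users'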


Things get a little more complicated with the index add. To aid the reader, I've replaced the subquery used above (it's identical below) with "$subquery".

 make_indices.sql

-- As before, we wish to build a query, but this time to add the index back properly
SELECT CONCAT(
    'CREATE ',
    -- Make sure to handle the UNIQUE indexes
    CASE non_unique WHEN 1 THEN '' ELSE 'UNIQUE ' END,
    'INDEX ',
    index_name,
    ' ON ',
    table_name,
    ' (',
    -- Build the actual column definitions, complete with the (now) prefixed indices
    group_concat(
        concat(
            column_name,
            -- COALESCE is incredibly useful inside aggregate functions: CONCAT returns NULL if any
            -- argument is NULL, and the aggregate then drops that row from the result entirely.
            -- We use COALESCE to make sure that never happens when sub_part IS NULL.
            -- In that case we print either a blank string, or a '($PREFIX_LENGTH)' prefix (50 chars
            -- in my case) when the subquery detects an affected varchar/char col.
            COALESCE(
                CONCAT('(',sub_part,')'),
                CASE WHEN column_name REGEXP ($subquery) THEN '($PREFIX_LENGTH)' ELSE '' END
            )
        )
        -- It's very important to preserve the column order within the index: the leading
        -- columns are scanned before the later ones, so reordering can have a big perf impact
        ORDER BY seq_in_index
    ),
    ') USING ',
    -- Almost always BTREE, but this might be something else if, for example, we're using geospatial indexes
    index_type
) AS query
-- The rest is essentially the same as the DROP statement
FROM information_schema.statistics
WHERE table_name = '$table'
AND index_schema = '$DB'
GROUP BY index_name
HAVING group_concat(column_name) REGEXP ($subquery);
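
Continuing the hypothetical users/email example with $PREFIX_LENGTH set to 50 (and assuming the original index was UNIQUE), the re-add pass would emit:

mysql $DB -e 'CREATE UNIQUE INDEX email_idx ON users (email(50)) USING BTREE'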


As you can see, this uses nearly every trick in the book. The only improvement I could think of would be to turn the subquery into a VIEW, since it's used multiple times. A generalized version of the CREATE-statement generator (without our varchar constraints) would also make a useful VIEW; that's left as an exercise for the reader.
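
For illustration, that subquery as a VIEW might look something like this (an untested sketch; the schema name is made up, and the per-table filtering plus the GROUP_CONCAT into a regexp pattern would move into the consuming query):

-- Columns of type varchar/char, 250+ chars long, that appear in an index with no prefix length
CREATE VIEW tools.long_char_index_cols AS
SELECT cols.table_schema, cols.table_name, cols.column_name
FROM information_schema.columns AS cols
JOIN information_schema.statistics AS indices
ON cols.table_schema = indices.index_schema
AND cols.table_name = indices.table_name
AND cols.column_name = indices.column_name
WHERE cols.data_type IN ('varchar','char')
AND cols.character_maximum_length >= 250
AND indices.sub_part IS NULL;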

About the only major features I didn't use were stored procedures and extensions. This all would have been a great deal simpler had ALTER INDEX been available, as in postgres. In sqlite it would be a great deal more complicated, save for the fact that the problem doesn't arise in the first place: utf8 is the default there.

The only places this didn't work out were tables that crashed on things like SELECTs of particular rows, for whatever reason. The good news was that mysqldump generally still worked for these tables, and you could modify the dump's CREATE TABLE statement to set up the indices properly. There were also a few places with very long varchars that were *not* indexed but made the rows too large after conversion; either shortening them or turning them into LONGTEXT was the proper remedy.
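
That fallback looks roughly like this (a sketch; table and file names are made up):

# Dump the problem table, hand-edit its CREATE TABLE statement (add prefix
# lengths to the indices or shorten the offending varchars), then reload.
mysqldump $DB broken_table > broken_table.sql
$EDITOR broken_table.sql
mysql $DB -e "DROP TABLE broken_table"
mysql $DB < broken_table.sql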
