I had to migrate some MySQL databases to UTF8mb4 last week. You would think this had all happened a decade ago in a sane world, but it's still surprisingly common thanks to bad defaults in packaged software. Every day a new latin1 database or table gets created because /etc/my.cnf isn't set up right out of the box. So people google around and find some variant of this article.
I wasn't willing to waste hours on a non-automated solution for the many tables and databases that had piled up over the years, so naturally I wrote a script. The plan was this:
Converting the DB is straightforward.
fix_db.sh
mysql -e "ALTER DATABASE $DB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci"
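When there are many schemas, the same statement can be generated in a loop and the output reviewed before piping it to mysql. A minimal sketch; the database names here are invented stand-ins:

```shell
# Hypothetical batch version: emit one ALTER DATABASE statement per
# schema name, then review the output and pipe it to mysql to apply.
alter_stmt() {
    echo "ALTER DATABASE \`$1\` CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;"
}

for DB in legacy_app blog forum   # stand-in names; list yours here
do
    alter_stmt "$DB"
done
```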
Similarly, listing and converting tables was easy:
fix_tables.sh ~ snippet 1
for table in $(mysql $DB -ss -e 'show tables')
do
    mysql $DB -ss -e "ALTER TABLE $table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci" &> /dev/null
    STATUS=$?
    if [ $STATUS -ne 0 ]
    then
    ...
This means we need to capture the index state as it currently exists and make only the adjustments needed to turn varchar indexes into prefix indexes. If you didn't read the article linked at the top: utf8mb4 strings can take 4x the bytes of latin1, and MySQL has a (mostly) hardcoded limit on index key size. You can raise it on InnoDB for most use cases, but these tables were MyISAM, where no such option is available.
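The arithmetic behind that limit is worth spelling out. Assuming the stock 1000-byte MyISAM key cap (if your build differs, the cutoff moves with it):

```shell
# utf8mb4 budgets 4 bytes per character when sizing index keys, while
# latin1 budgets 1. MyISAM caps a key at 1000 bytes, so the longest
# fully-indexable CHAR/VARCHAR drops from 1000 to 250 characters --
# which is presumably why the queries treat columns of 250+ chars as
# needing a prefix.
MYISAM_KEY_BYTES=1000
UTF8MB4_BYTES_PER_CHAR=4
MAX_CHARS=$(( MYISAM_KEY_BYTES / UTF8MB4_BYTES_PER_CHAR ))
echo "$MAX_CHARS"
```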
Here's how I did it:
fix_tables.sh ~ snippet 2
# Read line-by-line; a for-loop over $(...) would split each generated query on whitespace.
mysql -ss -e "select CONCAT('DROP INDEX ', index_name, ' ON ', table_name) AS query FROM information_schema.statistics WHERE table_name = '$table' AND index_schema = '$DB' GROUP BY index_name HAVING group_concat(column_name) REGEXP (select GROUP_CONCAT(cols.column_name SEPARATOR '|') as pattern FROM information_schema.columns AS cols JOIN information_schema.statistics AS indices ON cols.table_schema=indices.index_schema AND cols.table_name=indices.table_name AND cols.column_name=indices.column_name where cols.table_name = '$table' and cols.table_schema = '$DB' AND data_type IN ('varchar','char') AND character_maximum_length >= 250 AND sub_part IS NULL);" | while IFS= read -r query
do
    echo "mysql $DB -e '$query'"
done
mysql -ss -e "select CONCAT('CREATE ', CASE non_unique WHEN 1 THEN '' ELSE 'UNIQUE ' END, 'INDEX ', index_name, ' ON ', table_name, ' (', group_concat(concat(column_name, COALESCE(CONCAT('(',sub_part,')'), CASE WHEN column_name REGEXP (select GROUP_CONCAT(cols.column_name SEPARATOR '|') as pattern FROM information_schema.columns AS cols JOIN information_schema.statistics AS indices ON cols.table_schema=indices.index_schema AND cols.table_name=indices.table_name AND cols.column_name=indices.column_name where cols.table_name = '$table' and cols.table_schema = '$DB' AND data_type IN ('varchar','char') AND character_maximum_length >= 250 AND sub_part IS NULL) THEN '($PREFIX_LENGTH)' ELSE '' END)) ORDER BY seq_in_index), ') USING ', index_type) AS query FROM information_schema.statistics WHERE table_name = '$table' AND index_schema = '$DB' GROUP BY index_name HAVING group_concat(column_name) REGEXP (select GROUP_CONCAT(cols.column_name SEPARATOR '|') as pattern FROM information_schema.columns AS cols JOIN information_schema.statistics AS indices ON cols.table_schema=indices.index_schema AND cols.table_name=indices.table_name AND cols.column_name=indices.column_name where cols.table_name = '$table' and cols.table_schema = '$DB' AND data_type IN ('varchar','char') AND character_maximum_length >= 250 AND sub_part IS NULL);" | while IFS= read -r query
do
    echo "mysql $DB -e '$query'"
done
Let's go over the details, since these queries use a solid chunk of the important concepts in MySQL. Read the comments (lines beginning with --) for explanations.
drop_indices.sql
-- The plan is to output a DROP INDEX query
SELECT CONCAT('DROP INDEX ', index_name, ' ON ', table_name) AS query
-- information_schema.statistics is where the indices actually live
FROM information_schema.statistics
WHERE table_name = '$table'
AND index_schema = '$DB'
-- The indices themselves are stored with an entry per column indexed, so we have to group to make sure the right ones stick together.
GROUP BY index_name
-- We want to exclude any indices which don't have VARCHAR or CHAR cols, as their length is irrelevant to utf8mb4
-- The simplest way is to build a subquery grabbing these, and then scan the columns in the index via regexp
HAVING group_concat(column_name) REGEXP (
SELECT GROUP_CONCAT(cols.column_name SEPARATOR '|') AS pattern
-- We need to cross-reference the cols in the index with those in the table
-- So that we can figure out which ones are varchars/chars
FROM information_schema.columns AS cols
JOIN information_schema.statistics AS indices ON cols.table_schema=indices.index_schema
AND cols.table_name=indices.table_name
AND cols.column_name=indices.column_name WHERE cols.table_name = '$table'
AND cols.table_schema = '$DB'
AND data_type IN ('varchar','char') AND character_maximum_length >= 250
-- sub_part is the prefix length. In practice, any ones with a sub part already set were just fine, but YMMV.
AND sub_part IS NULL
);
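The HAVING clause can be puzzling at first, so here is the same matching logic expressed in shell terms. The column names and pattern are made-up examples:

```shell
# The subquery builds an alternation of the table's long varchar/char
# columns; the HAVING clause then matches each index's concatenated
# column list against it. Equivalent illustration with grep:
pattern='name|email'            # the subquery's GROUP_CONCAT(... SEPARATOR '|')
index_cols='name,created_at'    # one index's group_concat(column_name)

if echo "$index_cols" | grep -Eq "$pattern"
then
    echo "index touches a long varchar; rebuild it"
fi
```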
Things get a little more complicated with the index add. To aid the reader I've replaced the subquery used above (it is the same below) with "$subquery".
make_indices.sql
-- As before, we wish to build a query, but this time to add the index back properly
SELECT CONCAT(
'CREATE ',
-- Make sure to handle the UNIQUE indexes
CASE non_unique WHEN 1 THEN '' ELSE 'UNIQUE ' END,
'INDEX ',
index_name,
' ON ',
table_name,
' (',
-- Build the actual column definitions, complete with the (now) prefixed indices
group_concat(
concat(
column_name,
-- COALESCE is incredibly useful inside aggregates: CONCAT returns NULL if any argument is NULL,
-- which would silently drop the column from the list. We use it to make sure that never happens when sub_part IS NULL.
-- In that case, we print either a blank string, or a prefix of $PREFIX_LENGTH (in my case 50 chars) when the subquery detects a varchar/char col.
COALESCE(
CONCAT('(',sub_part,')'),
CASE WHEN column_name REGEXP ($subquery) THEN '($PREFIX_LENGTH)' ELSE '' END
)
)
-- Very important to preserve the column order within the index; leading columns are matched first, so ordering can have a big perf impact
ORDER BY seq_in_index
),
') USING ',
-- Almost always BTREE, but might be something else, e.g. RTREE if we are using geospatial indexes
index_type
) AS query
-- The rest is essentially the same as the DROP statement
FROM information_schema.statistics
WHERE table_name = '$table'
AND index_schema = '$DB'
GROUP BY index_name
HAVING group_concat(column_name) REGEXP ($subquery);
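For a concrete sense of the output, here is roughly what the generated statement looks like for a hypothetical UNIQUE index over (email, created_at) with $PREFIX_LENGTH=50. The table and index names are invented for illustration:

```shell
# The pieces assembled by the CONCAT above, for one made-up index:
# non_unique=0, index_name=idx_email, sub_part NULL on a long varchar.
stmt="CREATE UNIQUE INDEX idx_email ON users (email(50), created_at) USING BTREE"
echo "$stmt"
```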
About the only major features I didn't use were stored procedures and extensions. This all would have been a great deal simpler had ALTER INDEX been available, as it is in Postgres. In SQLite it would have been a great deal more complicated, save for the fact that the problem doesn't arise in the first place, since UTF-8 is the default there.
The only places this didn't work out were tables that crashed on things like SELECTs of particular rows, for whatever reason. The good news was that mysqldump generally still worked on those tables, and you could modify the dump's CREATE TABLE statement to set up the indexes properly. There were also a few places with very long varchars that were *not* indexed but still made the rows too large. Either shortening them or turning them into LONGTEXT was the proper remedy.
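For the crashed-table route, the dump edit can be scripted as well. A sketch, assuming single-column keys and a 50-char prefix; the sed pattern is illustrative, not battle-tested:

```shell
# Rewrite a dump's single-column KEY definitions to use a 50-char
# prefix, e.g.  KEY `idx_email` (`email`)  ->  KEY `idx_email` (`email`(50))
shorten_keys() {
    sed -E 's/KEY (`[^`]+`) \((`[^`]+`)\)/KEY \1 (\2(50))/'
}
# Typical use (requires a running server):
#   mysqldump $DB $table | shorten_keys | mysql $DB
```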