Geoff Worboys wrote:
I was making some changes to a database when I noticed a query like this:
SELECT TEXTBLOB
FROM TABLEA WHERE TEXTBLOB CONTAINING 'SOME TEXT'
was suddenly taking about 4 x longer than before. (A table of almost a million records, so it was quite distinct and very consistent.)
It turns out the difference was in my collation declaration.
The old database had this:
CREATE COLLATION WIN1252_UNICODE FOR WIN1252; CREATE COLLATION NOCASE FOR WIN1252 FROM WIN1252_UNICODE CASE INSENSITIVE;
The new database had this:
CREATE COLLATION WIN1252_UNICODE FOR WIN1252; CREATE COLLATION NOCASE FOR WIN1252 FROM WIN1252_UNICODE CASE INSENSITIVE ACCENT INSENSITIVE;
Rebuilt the new database without "accent insensitive" and the performance matched the old database.
I have been considering migrating to UTF8 (but keep putting it off because I've quite a bit of other work to do before that is possible.), so I also checked the UNICODE_CI_AI collation. But after testing with a UTF8 based build, even UNICODE_CI_AI is between 3 and 4 times slower than just UNICODE_CI. So it seems accent insensitive has a definite and significant cost. Maybe it's obvious to people in the know but it surprises me.