This is the 13th day of my participation in the August More text Challenge. For details, see: August More Text Challenge

The behavior of case-sensitive collation in the ORDER BY collation clause

As you can see, in PostgreSQL, MySQL/MariaDB, SQL Server, and other databases, using case-sensitive Collation in the ORDER BY clause does not cause the results to be sorted BY case.

All three databases behave the same.

You need to specify additional collates to sort the results by upper and lower case alphabetic order, or strictly by Unicode encoded code values/points.

In PostgreSQL, specify order by
collate “ucs_basic”/”C”/”POSIX”;

SQL Server collate with _BIN/_BIN2 abbreviation; MySQL/MariaDB specifies the _bin abbreviation collate. That is, use binary (code value) comparison sort.

The following is an example of ORDER BY ordering in each database strictly based on character encoding values:

  • PostgreSQL
shop=# select * from OrderTest order by letter collate "ucs_basic";
 letter
--------
 A
 B
 a
 b
(4Rows)Copy the code
  • MySQL/MariaDB:
MariaDB [test]> select * from OrderTest order by letter collate latin1_bin;
+--------+
| letter |
+--------+
| A      |
| B      |
| a      |
| b      |
+--------+
4 rows in set (0.002 sec)
Copy the code
  • SQL Server
select * from OrderTest order by letter COLLATE Chinese_PRC_BIN2; -- 或者 COLLATE Chinese_PRC_BIN;
Copy the code

The comparison is case insensitive

Searching in the database, most of the time is required to be case insensitive.

The most efficient way to search text columns case-insensitive is to use a database using case insensitive collation, provided the correct indexes are used.

Characters in PostgreSQL are case insensitive

PostgreSQL is a case-sensitive database with only two character collations by default, just like SQLite.

Also, it is not known that PostgreSQL has a case-insensitive default collation supported.

Here are a few ways PG can be case-insensitive:

1. Nondeterministic Collations

PG supports CREATE COLLATE to CREATE your own collation rules. Please refer to the official document for details.

The indeterminate collation, namely: deterministic = false, is specified when the collation is created.

CREATE COLLATION ndcoll (provider = icu, locale = 'und'.deterministic = false);
Copy the code

2. ILIKE keyword

You can use the keyword ILIKE instead of LIKE to make the match case-insensitive according to the active locale. This is not the SQL standard, but an extension of PostgreSQL.

shop=# select * from OrderTest where letter ilike 'a';
 letter
--------
 A
 a
(2Rows)Copy the code

3. Use the lower or upper function

Queries are case insensitive with the lower or upper functions.

Lower or Upper can also be used in the Order By clause.

Using functions such as lower or Upper will result in inefficient use of the index.

shop=# select * from OrderTest where lower(letter)='a';
 letter
--------
 A
 a
(2Line record) shop=# select * from OrderTest where upper(letter)='A';
 letter
--------
 A
 a
(2Rows)Copy the code

LIKE the query

shop=# select * from OrderTest where lower(letter) like '%a%';
 letter
--------
 A
 a
(2Rows)Copy the code

Lower or Upper functions are supported in MySQL/MariaDB, PostgreSQL, and SQL Server. MySQL/MariaDB also supports lCase and UCase functions.

3. Use CITEXT

The CITEXT extension is a good alternative to Text. Nondeterministic collations are useful for handling case insensitive, accented, and Unicode characters.

The Citext module provides the case-insensitive string type Citext. Essentially, it calls Lower internally when comparing values, but other than that, it behaves almost exactly like text.

[General processing] It is case insensitive by attaching columns as copies of the original data columns

A common way to achieve case-insensitive data in a table is to add a replica column that stores the lower-case (or upper-case) contents of the original data column.

This is a way of storing redundant information: raw data is stored in one column, and standardized data (which may be all lowercase) is stored in one column. You can index the two separately if you want, and usually only the raw data is retrieved or displayed.

On the method of generating redundant columns of standardized data from the original columns

The generic processing mentioned above, which uses extra columns to store the lower-case content of the original data, can be implemented in several ways.

Such as trigger, but need to insert, update trigger, to achieve the calculation of duplicate column data generation.

The most appropriate approach is to use Generated columns, also known as derived or Computed columns, which are described in the next article.

Note: ICU in character encoding — International Components for Unicode: International Components for Unicode