COLLATE影响PostgreSQL/openGauss数据库的排序与索引使用

当创建PostgreSQL数据库时，可以根据业务需要配置不同数据库的collation ， collation影响包括字符串排序顺序、字符分类方法、数值格式、日期和时间格式以及货币格式。此外，您可能还需要配置 LC_COLLATE 和 LC_CTYPE 环境变量。在创建库、创建表、创建索引、查询时也可以指定该collation，不同的collation可能会产生不同的排序，甚至导致无法使用索引。

postgresql支持的字符集
https://www.postgresql.org/docs/15/multibyte.html Server列为YES支持服务端，否则仅支持客户端。

当前字符库支持的LC_COLLATE, LC_CTYPE
LC_COLLATE, LC_CTYPE当不指定时默认的，但要与当煎数据库的字符集兼容，您可以执行以下 SQL 语句从 pg_collation 系统表中查询字符集支持的 LC_COLLATE 和 LC_CTYPE 设置：

[local]:5432 postgres@db1=# select distinct pg_encoding_to_char(collencoding) from pg_collation;
 pg_encoding_to_char
---------------------
 EUC_CN
 LATIN7
 ISO_8859_7
 WIN1251
 ISO_8859_6
 LATIN3
 LATIN6
 LATIN5
 ISO_8859_8
 LATIN9
 LATIN1

 LATIN2
 EUC_TW
 UTF8
 EUC_KR
 KOI8U
 KOI8R
 ISO_8859_5
 WIN1255
 LATIN8
 EUC_JP
(22 rows)

[local]:5432 postgres@db1=# select pg_encoding_to_char(collencoding) as encoding,collname,collcollate,collctype, collencoding from pg_collation where pg_encoding_to_char(collencoding) like '%EUC_CN%';
 encoding |   collname   | collcollate  |  collctype   | collencoding
----------+--------------+--------------+--------------+--------------
 EUC_CN   | zh_CN        | zh_CN        | zh_CN        |            2
 EUC_CN   | zh_CN.gb2312 | zh_CN.gb2312 | zh_CN.gb2312 |            2
 EUC_CN   | zh_SG        | zh_SG        | zh_SG        |            2
 EUC_CN   | zh_SG.gb2312 | zh_SG.gb2312 | zh_SG.gb2312 |            2
(4 rows)

[local]:5432 postgres@db1=# select pg_encoding_to_char(collencoding) as encoding,collname,collcollate,collctype from pg_collation ;
  encoding  |       collname        |      collcollate      |       collctype
------------+-----------------------+-----------------------+-----------------------
            | default               |                       |
            | C                     | C                     | C
            | POSIX                 | POSIX                 | POSIX
 UTF8       | ucs_basic             | C                     | C
 LATIN1     | aa_DJ                 | aa_DJ                 | aa_DJ
 LATIN1     | aa_DJ.iso88591        | aa_DJ.iso88591        | aa_DJ.iso88591
 UTF8       | aa_DJ.utf8            | aa_DJ.utf8            | aa_DJ.utf8
 UTF8       | aa_ER                 | aa_ER                 | aa_ER
 UTF8       | aa_ER.utf8            | aa_ER.utf8            | aa_ER.utf8
...

Note:
ENCODING是null 表示所有字符集都支持.

创建数据库时指定COLLATE

CREATE DATABASE hrdb WITH ENCODING 'UTF8' LC_COLLATE='C.UTF-8' LC_CTYPE='C.UTF-8' TEMPLATE = template0 CONNECTION LIMIT = 200 --IS_TEMPLATE = TRUE;

note:
一但创建不支持alter database修改。建议 LC_COLLATE使用“C”，性能上有些提升.

创建表时指定COLLATE

CREATE TABLE test1 (  
    a text COLLATE "de_DE",  
    b text COLLATE "es_ES",  
    ...  
);

note:
后期可以修改列 collate,如 alter table a alter c1 type text COLLATE “zh_CN”;

创建索引时指定COLLATE

CREATE INDEX testc_name_idx3 ON testc(name COLLATE "C")

Note:
可以在同一列创建不同collate索引。

查询时指定collate

SELECT * FROM t WHERE name >= 'z' COLLATE "C";

create index idxa on a(c1 collate "zh_CN");  
explain select * from a order by c1 collate "zh_CN";

Note:
如果查询使用的collate与索引collate不同，可能无法使用索引.

当前locale

[postgres@oel7db1 ~]$ env|grep LAN
LANG=en_US.UTF-8
[postgres@oel7db1 ~]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

查看数据库的locale

[local]:5432 postgres@db1=# \l
                                 List of databases
   Name    |  Owner   | Encoding  |  Collate   |   Ctype    |   Access privileges
-----------+----------+-----------+------------+------------+-----------------------
 anbob     | postgres | SQL_ASCII | C          | C          |
 db        | postgres | SQL_ASCII | C          | C          |
 db1       | postgres | UTF8      | en_US.UTF8 | en_US.UTF8 |
 postgres  | postgres | SQL_ASCII | C          | C          |
 template0 | postgres | SQL_ASCII | C          | C          | =c/postgres          +
           |          |           |            |            | postgres=CTc/postgres
 template1 | postgres | SQL_ASCII | C          | C          | =c/postgres          +
           |          |           |            |            | postgres=CTc/postgres
(6 rows)

collate对排序的影响

[local]:5432 postgres@db1=# select * from (values ('Alice'),('Tom'),('anbob')) as a(c1) order by c1 ;
  c1
-------
 Alice
 anbob
 Tom
(3 rows)

[local]:5432 postgres@db1=# select * from (values ('Alice'),('Tom'),('anbob')) as a(c1) order by c1 collate "C";
  c1
-------
 Alice
 Tom
 anbob
(3 rows)

[local]:5432 postgres@db1=# select * from (values ('Alice'),('Tom'),('anbob')) as a(c1) order by c1 collate "zh_CN";
  c1
-------
 Alice
 anbob
 Tom
(3 rows)

collate对索引的影响

[local]:5432 postgres@postgres=# CREATE DATABASE "example_db"
postgres-# WITH OWNER "postgres"
postgres-# ENCODING 'UTF8'
postgres-# LC_COLLATE = 'en_US.UTF8'
postgres-# LC_CTYPE = 'en_US.UTF8'
postgres-# TEMPLATE template0;
CREATE DATABASE

create table testc(id int,name varchar(20),addr  varchar(3000),otype char(1));


insert into testc select x,x||'anbob',rpad('x',2000,'x'),'t' from generate_series(1,10000) as x;

[local]:5432 postgres@example_db=# create index on testc(name);
CREATE INDEX
[local]:5432 postgres@example_db=# \d testc
                       Table "public.testc"
 Column |          Type           | Collation | Nullable | Default
--------+-------------------------+-----------+----------+---------
 id     | integer                 |           |          |
 name   | character varying(20)   |           |          |
 addr   | character varying(3000) |           |          |
 otype  | character(1)            |           |          |
Indexes:
    "testc_name_idx" btree (name)

[local]:5432 postgres@example_db=# analyze testc;
ANALYZE
[local]:5432 postgres@example_db=# explain select * from testc where name like '1a%';
                       QUERY PLAN
--------------------------------------------------------
 Seq Scan on testc  (cost=0.00..229.00 rows=1 width=50)
   Filter: ((name)::text ~~ '1a%'::text)
(2 rows)

NOTE:
使用全表扫，即使选择率很好，可以创建xx_pattern_ops索引.

创建xx_pattern_ops索引

[local]:5432 postgres@example_db=# CREATE INDEX testc_name_idx2 ON testc(name varchar_pattern_ops);
CREATE INDEX
[local]:5432 postgres@example_db=# \l testc
                       List of databases
 Name | Owner | Encoding | Collate | Ctype | Access privileges
------+-------+----------+---------+-------+-------------------
(0 rows)

[local]:5432 postgres@example_db=# \d testc
                       Table "public.testc"
 Column |          Type           | Collation | Nullable | Default
--------+-------------------------+-----------+----------+---------
 id     | integer                 |           |          |
 name   | character varying(20)   |           |          |
 addr   | character varying(3000) |           |          |
 otype  | character(1)            |           |          |
Indexes:
    "testc_name_idx" btree (name)
    "testc_name_idx2" btree (name varchar_pattern_ops)

[local]:5432 postgres@example_db=# explain select * from testc where name like '1a%';
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using testc_name_idx2 on testc  (cost=0.29..8.31 rows=1 width=50)
   Index Cond: (((name)::text ~>=~ '1a'::text) AND ((name)::text ~<~ '1b'::text))
   Filter: ((name)::text ~~ '1a%'::text)
(3 rows)

也可以创建COLLATE “C”索引

drop index  testc_name_idx2;
CREATE INDEX testc_name_idx3 ON testc(name COLLATE "C")

[local]:5432 postgres@example_db=# explain select * from testc where name like '1a%';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using testc_name_idx3 on testc  (cost=0.29..8.31 rows=1 width=50)
   Index Cond: (((name)::text >= '1a'::text) AND ((name)::text < '1b'::text))
   Filter: ((name)::text ~~ '1a%'::text)
(3 rows)

xx_pattern_ops and COLLATE “C”不同?

CREATE INDEX ON t(name text/varchar_pattern_ops) 和CREATE INDEX ON t(name COLLATE “C”)都可以使用索引，请注意，如果您希望涉及普通 <、<=、> 或 >= 比较的查询使用索引，您还应该使用默认运算符类创建索引。此类查询不能使用 xxx_pattern_ops 运算符类。（但是，普通的相等比较可以使用这些运算符类。）可以使用不同的运算符类在同一列上创建多个索引。如果您确实使用 C 语言环境，则不需要 xxx_pattern_ops 运算符类，因为具有默认运算符类的索引可用于 C 语言环境中的模式匹配查询。

使用Collate=C的DB

[local]:5432 postgres@example_db=# \l
                                  List of databases
    Name    |  Owner   | Encoding  |  Collate   |   Ctype    |   Access privileges
------------+----------+-----------+------------+------------+-----------------------
 anbob      | postgres | SQL_ASCII | C          | C          |
 db         | postgres | SQL_ASCII | C          | C          |
 example_db | postgres | UTF8      | en_US.UTF8 | en_US.UTF8 |
 postgres   | postgres | SQL_ASCII | C          | C          |
 template0  | postgres | SQL_ASCII | C          | C          | =c/postgres          +
            |          |           |            |            | postgres=CTc/postgres
 template1  | postgres | SQL_ASCII | C          | C          | =c/postgres          +
            |          |           |            |            | postgres=CTc/postgres
(6 rows)

[local]:5432 postgres@example_db=# \c anbob
You are now connected to database "anbob" as user "postgres".
[local]:5432 postgres@anbob=# create table testc(id int,name varchar(20),addr  varchar(3000),otype char(1));
CREATE TABLE
[local]:5432 postgres@anbob=# insert into testc select x,x||'anbob',rpad('x',2000,'x'),'t' from generate_series(1,10000) as x;
INSERT 0 10000
[local]:5432 postgres@anbob=# create index on testc(name);
CREATE INDEX

[local]:5432 postgres@anbob=# \d testc
                       Table "public.testc"
 Column |          Type           | Collation | Nullable | Default
--------+-------------------------+-----------+----------+---------
 id     | integer                 |           |          |
 name   | character varying(20)   |           |          |
 addr   | character varying(3000) |           |          |
 otype  | character(1)            |           |          |
Indexes:
    "testc_name_idx" btree (name)

[local]:5432 postgres@anbob=# explain select * from testc where name like '1a%';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using testc_name_idx on testc  (cost=0.29..9.66 rows=1 width=50)
   Index Cond: (((name)::text >= '1a'::text) AND ((name)::text < '1b'::text))
   Filter: ((name)::text ~~ '1a%'::text)
(3 rows)

Note:
默认不需要指定collate和创建xxx_pattern_ops索引。

— over —