My Favorite PostgreSQL Extensions – Part Two
This is the second part of my blog series “My Favorite PostgreSQL Extensions”. In part one I introduced two PostgreSQL extensions, postgres_fdw and pg_partman; in this part I will explore three more.
pgAudit
The next PostgreSQL extension of interest serves the auditing requirements of various government, financial and other certifying bodies such as ISO, BSI and FISCAM. The standard logging facility which PostgreSQL offers natively with log_statement = all is useful for monitoring, but it does not provide the level of detail required to comply with, or pass, an audit. The pgAudit extension focuses on the details of what happened under the hood while the database was satisfying an application request.
pgAudit creates and updates an audit trail (audit log) through the standard logging facility provided by PostgreSQL, offering detailed session and/or object audit logging. The audit trail created by pgAudit can grow enormous depending on the audit settings, so care must be taken to decide beforehand what and how much auditing is required. A brief demo in the following section shows how pgAudit is configured and put to use.
The audit trail is written to the PostgreSQL cluster log found in the PGDATA/log location, but the audit log messages are prefixed with an “AUDIT: ” label to distinguish them from regular database background messages.
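Because the audit records share the server logfile with regular messages, that prefix makes them easy to filter out; a minimal sketch, assuming the default log directory and a typical log_filename setting (the actual filename depends on your configuration):
$ grep 'AUDIT:' $PGDATA/log/postgresql-*.log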
Demo
The official documentation of pgAudit explains that a separate version of pgAudit exists for each major version of PostgreSQL, in order to support the new functionality introduced in every PostgreSQL release. The version of PostgreSQL in this demo is 11, so the version of pgAudit is from the 1.3.x branch. pgaudit.log is the fundamental parameter that controls which classes of statements will be logged. It can be set with SET at the session level or in the postgresql.conf file to apply globally.
postgres=# set pgaudit.log = 'read, write, role, ddl, misc';
SET
$ grep pgaudit.log $PGDATA/postgresql.conf
pgaudit.log = 'read, write, role, ddl, misc'
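Alternatively, the setting can be applied cluster-wide without editing the file by hand; a minimal sketch using ALTER SYSTEM, assuming superuser access (note that pgAudit itself must already be listed in shared_preload_libraries, which requires a server restart):
postgres=# alter system set pgaudit.log = 'read, write, role, ddl, misc';
ALTER SYSTEM
postgres=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)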
db_replica=# show pgaudit.log;
pgaudit.log
------------------------------
read, write, role, ddl, misc
(1 row)
2020-01-29 22:51:49.289 AEDT 4710 db_replica postgres [local] psql LOG: AUDIT: SESSION,3,1,MISC,SHOW,,,show pgaudit.log;,
db_replica=# create table t1 (f1 integer, f2 varchar);
CREATE TABLE
2020-01-29 22:52:08.327 AEDT 4710 db_replica postgres [local] psql LOG: AUDIT: SESSION,4,1,DDL,CREATE TABLE,,,"create table t1 (f1 integer, f2 varchar);",
db_replica=# insert into t1 values (1,'one');
INSERT 0 1
db_replica=# insert into t1 values (2,'two');
INSERT 0 1
db_replica=# insert into t1 values (3,'three');
INSERT 0 1
2020-01-29 22:52:19.261 AEDT 4710 db_replica postgres [local] psql LOG: AUDIT: SESSION,5,1,WRITE,INSERT,,,"insert into t1 values (1,'one');",
2020-01-29 22:52:38.145 AEDT 4710 db_replica postgres [local] psql LOG: AUDIT: SESSION,6,1,WRITE,INSERT,,,"insert into t1 values (2,'two');",
2020-01-29 22:52:44.988 AEDT 4710 db_replica postgres [local] psql LOG: AUDIT: SESSION,7,1,WRITE,INSERT,,,"insert into t1 values (3,'three');",
db_replica=# select * from t1 where f1 >= 2;
f1 | f2
----+-------
2 | two
3 | three
(2 rows)
2020-01-29 22:53:09.161 AEDT 4710 db_replica postgres [local] psql LOG: AUDIT: SESSION,9,1,READ,SELECT,,,select * from t1 where f1 >= 2;,
db_replica=# grant select on t1 to usr_replica;
GRANT
2020-01-29 22:54:25.283 AEDT 4710 db_replica postgres [local] psql LOG: AUDIT: SESSION,13,1,ROLE,GRANT,,,grant select on t1 to usr_replica;,
db_replica=# alter table t1 add f3 date;
ALTER TABLE
2020-01-29 22:55:17.440 AEDT 4710 db_replica postgres [local] psql LOG: AUDIT: SESSION,23,1,DDL,ALTER TABLE,,,alter table t1 add f3 date;,
db_replica=# checkpoint;
CHECKPOINT
2020-01-29 22:55:50.349 AEDT 4710 db_replica postgres [local] psql LOG: AUDIT: SESSION,33,1,MISC,CHECKPOINT,,,checkpoint;,
db_replica=# vacuum t1;
VACUUM
2020-01-29 22:56:03.007 AEDT 4710 db_replica postgres [local] psql LOG: AUDIT: SESSION,34,1,MISC,VACUUM,,,vacuum t1;,
db_replica=# show log_statement;
log_statement
---------------
none
(1 row)
2020-01-29 22:56:14.740 AEDT 4710 db_replica postgres [local] psql LOG: AUDIT: SESSION,36,1,MISC,SHOW,,,show log_statement;,
The log entries shown in the demo above are normally written to the server logfile only when the parameter log_statement is set; in this case log_statement is not configured, yet the audit messages are still written, by virtue of the pgaudit.log parameter, as evidenced in the demo. There are more powerful options available to fulfill all of your database auditing requirements within PostgreSQL, which can be configured by following the official documentation of pgAudit or its GitHub repository.
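One such option is object audit logging via the pgaudit.role parameter: a dedicated role is granted exactly the privileges to be audited, and only statements exercising those privileges are logged as OBJECT audit entries. A minimal sketch, where the auditor role name is purely illustrative:
db_replica=# create role auditor nologin;
CREATE ROLE
db_replica=# set pgaudit.role = 'auditor';
SET
db_replica=# grant select on t1 to auditor;
GRANT
After this, every SELECT on t1 would be logged as an OBJECT audit entry, regardless of which role executed it.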
pg_repack
This is a favourite extension among many PostgreSQL engineers who are directly involved with managing and keeping the general health of a PostgreSQL cluster. The reason for that will be discussed a little later, but this extension offers the functionality to remove database bloat within a PostgreSQL database, which is one of the nagging concerns among very large PostgreSQL database clusters requiring database re-org.
As a PostgreSQL database undergoes constant and heavy writes (updates and deletes), the old data is marked as deleted while the new version of the row is inserted, but the old data is not actually wiped from the data block. This requires a periodic maintenance operation called vacuuming, an automated background procedure that clears all of the “marked as deleted” rows, a process sometimes colloquially referred to as garbage collection.
The vacuuming process generally yields to regular database operations during busier times. Vacuuming in the least restrictive manner, in favour of database operations, results in a large number of “marked as deleted” rows, causing databases to grow out of proportion; this is referred to as “database bloat”. There is a forceful vacuuming process called VACUUM FULL, but it acquires an exclusive lock on the database object being processed, stalling database operations on that object.
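Before reaching for a re-org tool, the extent of the bloat can be gauged from PostgreSQL's pg_stat_user_tables statistics view; a quick check along these lines (the database name and prompt are illustrative) lists the tables with the most dead rows:
mydb=# select relname, n_live_tup, n_dead_tup
mydb-# from pg_stat_user_tables order by n_dead_tup desc limit 5;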
It is for this reason that pg_repack is a hit among PostgreSQL DBAs and engineers: it does the job of a normal vacuuming process, but offers the efficiency of VACUUM FULL without holding an exclusive lock on the database object for the duration of the operation; in short, it works online. The official documentation explains more about the other methods of reorganizing a database, but a quick demo below will put things in the appropriate light for better understanding. Note that the target table must have either a primary key or a unique index on a NOT NULL column, which is the general norm in most production database setups.
Demo
The basic demo shows the installation and usage of pg_repack in a test environment. This demo uses version 1.4.5 of pg_repack, the latest version of this extension at the time of publishing this blog. A demo table t1 initially has 1,000,000 rows, which then undergo a massive delete operation that removes every 5th row of the table. An execution of pg_repack shows the size of the table before and after.
mydb=# CREATE EXTENSION pg_repack;
CREATE EXTENSION
mydb=# create table t1 (no integer primary key, f_name VARCHAR(20), l_name VARCHAR(20), d_o_b date);
CREATE TABLE
mydb=# insert into t1 (select generate_series(1,1000000,1),'a'||
mydb(# generate_series(1,1000000,1),'a'||generate_series(1000000,1,-1),
mydb(# cast( now() - '1 year'::interval * random() as date ));
INSERT 0 1000000
mydb=# SELECT pg_size_pretty( pg_total_relation_size('t1'));
pg_size_pretty
----------------
71 MB
(1 row)
mydb=# CREATE or replace FUNCTION delete5() RETURNS void AS $$
mydb$# declare
mydb$# counter integer := 0;
mydb$# BEGIN
mydb$#
mydb$# while counter <= 1000000
mydb$# loop
mydb$# delete from t1 where no=counter;
mydb$# counter := counter + 5;
mydb$# END LOOP;
mydb$# END;
mydb$# $$ LANGUAGE plpgsql;
CREATE FUNCTION
The delete5 function deletes 200,000 rows from the t1 table using a counter which increments in steps of 5, deleting every 5th row.
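As a side note, the same effect could be achieved with a single set-based statement, which is generally more idiomatic SQL than a row-at-a-time loop, and would report the same 200,000 deletions:
mydb=# delete from t1 where no % 5 = 0;
DELETE 200000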
mydb=# select delete5();
delete5
---------

(1 row)
mydb=# SELECT pg_size_pretty( pg_total_relation_size('t1'));
pg_size_pretty
----------------
71 MB
(1 row)
$ pg_repack -t t1 -N -n -d mydb -p 5433
INFO: Dry run enabled, not executing repack
INFO: repacking table "public.t1"
$ pg_repack -t t1 -n -d mydb -p 5433
INFO: repacking table "public.t1"
mydb=# SELECT pg_size_pretty( pg_total_relation_size('t1'));
pg_size_pretty
----------------
57 MB
(1 row)
As shown above, the original size of the table does not change after executing the delete5 function, which shows that the dead rows still occupy space in the table. The execution of pg_repack clears those “marked as deleted” rows from the t1 table, bringing the size of t1 down to 57 MB. One other good thing about pg_repack is the dry-run option (the -N flag), with which you can check what would be executed during an actual run.
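Beyond a single table, pg_repack can also process every eligible table in a database, and its -j flag runs multiple parallel jobs to speed up the rebuild. A sketch of such an invocation against the same demo database:
$ pg_repack -n -d mydb -p 5433 -j 4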
HypoPG
The next interesting extension is similar in spirit to a popular concept called invisible indexes among proprietary database servers. The HypoPG extension enables a DBA to see the effect of introducing a hypothetical index (one that does not actually exist) and whether it will improve the performance of one or more queries; hence the name HypoPG.
The creation of a hypothetical index does not require any CPU or disk resources; it does, however, consume a connection’s private memory. Because the hypothetical index is not stored in any database catalog tables, there is no impact of catalog bloat. For this same reason, a hypothetical index cannot be used in an EXPLAIN ANALYZE statement, while a plain EXPLAIN is a good way to assess whether a potential index would be used by a given problematic query. Here is a quick demo to explain how HypoPG works.
Demo
I am going to create a table containing 100000 rows using generate_series and execute a couple of simple queries to show the difference in cost estimates with and without hypothetical indexes.
olap=# CREATE EXTENSION hypopg;
CREATE EXTENSION
olap=# CREATE TABLE stock (id integer, line text);
CREATE TABLE
olap=# INSERT INTO stock SELECT i, 'line ' || i FROM generate_series(1, 100000) i;
INSERT 0 100000
olap=# ANALYZE STOCK;
ANALYZE
olap=# EXPLAIN SELECT line FROM stock WHERE id = 1;
QUERY PLAN
---------------------------------------------------------
Seq Scan on stock (cost=0.00..1791.00 rows=1 width=10)
Filter: (id = 1)
(2 rows)
olap=# SELECT * FROM hypopg_create_index('CREATE INDEX ON stock (id)') ;
indexrelid | indexname
------------+-----------------------
25398 | <25398>btree_stock_id
(1 row)
olap=# EXPLAIN SELECT line FROM stock WHERE id = 1;
QUERY PLAN
------------------------------------------------------------------------------------
Index Scan using <25398>btree_stock_id on stock (cost=0.04..8.06 rows=1 width=10)
Index Cond: (id = 1)
(2 rows)
olap=# EXPLAIN ANALYZE SELECT line FROM stock WHERE id = 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on stock (cost=0.00..1791.00 rows=1 width=10) (actual time=0.028..41.877 rows=1 loops=1)
Filter: (id = 1)
Rows Removed by Filter: 99999
Planning time: 0.057 ms
Execution time: 41.902 ms
(5 rows)
olap=# SELECT indexname, pg_size_pretty(hypopg_relation_size(indexrelid))
olap-# FROM hypopg_list_indexes() ;
indexname | pg_size_pretty
-----------------------+----------------
<25398>btree_stock_id | 2544 kB
(1 row)
olap=# SELECT pg_size_pretty(pg_relation_size('stock'));
pg_size_pretty
----------------
4328 kB
(1 row)
The above demo shows how the estimated total cost can be reduced from 1791 to 8.06 by adding an index to the “id” field of the table to optimize a simple query. It also proves that the index is not actually used when the query is executed with EXPLAIN ANALYZE, which executes the query in real time. There is also a way to find out approximately how much disk space the index would occupy, using the hypopg_relation_size function of the extension.
HypoPG has a few other functions to manage hypothetical indexes, and in addition it offers a way to find out whether partitioning a table will improve the performance of queries fetching a large dataset. There is a hypothetical partitioning option of the HypoPG extension, and more about it can be found in the official documentation.
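Among those management functions are hypopg_drop_index, which removes a single hypothetical index by its identifier, and hypopg_reset, which clears all hypothetical indexes from the current session; a brief sketch, reusing the indexrelid from the demo above:
olap=# select hypopg_drop_index(25398);
olap=# select hypopg_reset();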
Conclusion
As stated in part one, PostgreSQL has evolved over the years, only getting bigger, better and faster with rapid development both in the native source code and in plug-and-play extensions. Open source PostgreSQL can be a suitable fit for plenty of IT shops that are running one of the major proprietary database servers and looking to reduce their IT CAPEX and OPEX.
There are plenty of PostgreSQL extensions offering features ranging from monitoring to high availability, and from scaling to dumping binary datafiles in human-readable format. I hope the above demonstrations have shed light on the potential and power of a PostgreSQL database.