About the RDKit/Postgres Ordered Substructure Search Problem

another unrelated image of a festive cat

Last summer Richard Apodaca wrote a blog post about using the RDKit PostgreSQL cartridge to support substructure search queries, which included a detailed discussion of how the performances of the same query could significantly degrade when the substructure match constraint was combined with an "order by" clause, referencing another column from the same table.

I couldn't find the time to actually read that post until much later, but during this break I finally found a moment to reproduce the reported behavior, and also try to understand why it happens.

Very conveniently, the original blog post includes a complete description of how the test database was configured. Maybe any random subset of compounds from ChEMBL could also work, but the instructions still saved me some time, and helped ensure that my setup was reasonably similar [1].

The tests use a single table created with the statement below:

CREATE TABLE molecules (id SERIAL PRIMARY KEY,
                        mol mol NOT NULL,
                        mw NUMERIC NOT NULL,
                        fp bit(1024)
                    );

and filled with the molecule and molecular weight computed by the RDKit cartridge on the first 100000 [2] records from the downloaded eMolecules dataset. On this table, a gist index is created on the mol column, and a regular btree index is created on the mw column.

The reference test query consists in selecting the molecules matching a search for a benzene/phenyl substructure, and returning the first 25 hits:

ric@/tmp:emolecules> SELECT mol, mw FROM molecules WHERE mol@>'c1ccccc1' LIMIT 25;
+---------------------------------------------------------+---------+
| mol                                                     | mw      |
|---------------------------------------------------------+---------|
| Cc1ccc(OCCCCl)cc1                                       | 184.666 |
| O=C(CCc1ccccc1)c1ccc(O)cc1                              | 226.275 |
...
+---------------------------------------------------------+---------+
SELECT 25
Time: 0.011s

Searching for a very common substructure limits the ability for the gist index implementation to prune the search tree, and may therefore require the query execution to scan a larger portion of the index to identify the matching rows in the table. But in this case, things are actually made easier by the limit clause, because the index scan ends as soon as the first 25 matching records are found, which is likely to happen quite soon, exactly because phenyl rings are so frequently present in many compounds.

On the other hand, when a more selective query is performed, the index scan may happen to return a very small number of rows, and the query is again quite fast:

ric@/tmp:emolecules> SELECT mol, mw FROM molecules WHERE mol@>'C(=S)Cl' LIMIT 25;
+---------------------+---------+
| mol                 | mw      |
|---------------------+---------|
| Cc1ccc(OC(=S)Cl)cc1 | 186.663 |
+---------------------+---------+
SELECT 1
Time: 0.008s

But, even if substructure queries that include a limit clause tend to be less challenging, some problems become anyway apparent if an ordering clause is added:

ric@/tmp:emolecules> SELECT mol, mw FROM molecules WHERE mol@>'c1ccccc1' order by mw limit 25;
+-------------------------+---------+
| mol                     | mw      |
|-------------------------+---------|
| Cc1cccc(C)c1            | 106.168 |
| Cc1ccc(C)cc1            | 106.168 |
...
+-------------------------+---------+
SELECT 25
Time: 1.963s

Requiring the result records to be ordered by molecular weight made the same query almost 200x slower.

I think the problem is that with the available table and indices, the query planner is confronted with two main options. It could either use the index on the mw column to scan the table in increasing molecular weight order, and explicitly check the query constraint mol@>'c1ccccc1' to filter out the undesired records one by one, or it could use the index on the mol column to select the matching records, and then sort these records by molecular weight.

Because the index on the mol column has the potential for reducing the number of evaluated records, this is the plan that normally applies, as confirmed by looking at the output from EXPLAIN ANALYZE. The bottleneck, I think is not as much in whether the output records are efficiently sorted, but rather that for a substructure query with a very poor selectivity (like c1ccccc1), the index scan is required to process a very large fraction of the search tree associated to the mol column (in this specific case ~66K hits are initially selected from the index, about 2/3 of the overall data - a sequential scan of the unindexed column would be probably faster).

More selective queries can still perform fine also when the query includes the order by query (the mol index search tree is efficiently pruned, and a small number of records is selected), but searching for very common substructures can be very inefficient, and the limit clause in this case doesn't make things easier, because it only applies after the results are ordered.

The suggested workaround of turning off the use of explicit sorting steps in the query planner can indeed help:

ric@/tmp:emolecules> set enable_sort=off;

ric@/tmp:emolecules> SELECT mol, mw FROM molecules WHERE mol@>'c1ccccc1' order by mw limit 25;
+-------------------------+---------+
| mol                     | mw      |
|-------------------------+---------|
| Cc1ccc(C)cc1            | 106.168 |
| Cc1cccc(C)c1            | 106.168 |
...
+-------------------------+---------+
SELECT 25
Time: 0.014s

but I am afraid it could be just because in practice it prevents the planner from using the index on the mol column, and not using this index may impact negatively other queries that would normally show a high selectivity:

ric@/tmp:emolecules> SELECT mol, mw FROM molecules WHERE mol@>'C(=S)Cl' order by mw limit 25;
+---------------------+---------+
| mol                 | mw      |
|---------------------+---------|
| Cc1ccc(OC(=S)Cl)cc1 | 186.663 |
+---------------------+---------+
SELECT 1
Time: 2.652s

My impression is that one main limitation might be in the RDKit cartridge implementation, and more specifically in the selectivity estimation function associated to the substructure operator @>. If this function could return a better estimate of the selectivity associated to the given substructure query, the query planner could make a better decision about whether to use the index and how (maybe an interesting project for the next year?).

In the meantime, I think a two-column index [3] could actually help the query planner leverage the information from both the mw and mol columns at once.

The PostgreSQL gist index implementation provides support for multi-column indices, but a suitable "operator class" is required to be available for the indexed data types. The RDKit cartridge provides this operator class for the molecule, fingerprint and reaction data types it implements, and similarly, the btree_gist [4] extension provides gist operator classes for most numerical types that are available for PostgreSQL.

ric@/tmp:emolecules> CREATE EXTENSION btree_gist;

However, in order to leverage the gist index in supporting the ORDER BY query, we need to change the data type to use one whose operator class defines the distance operator and function that are required for nearest neighbor queries. This operator (<->) is not implemented for the NUMERIC data type [5], but it's available for the other numerical primitive types, including the floating point ones:

ric@/tmp:emolecules> ALTER TABLE molecules ALTER COLUMN mw TYPE real;

This way, we can then create a combined gist index on both the mol and mw columns:

ric@/tmp:emolecules> CREATE INDEX molecules_mol_mw on molecules USING gist(mol, mw);

And finally, a small change needs to apply to the executed query, in order to use the new index in the ORDER BY clause:

ric@/tmp:emolecules> SELECT mol, mw FROM molecules WHERE mol@>'c1ccccc1' order by mw <-> 0.0 limit 25;
+-------------------------+---------+
| mol                     | mw      |
|-------------------------+---------|
| Cc1cccc(C)c1            | 106.168 |
| Cc1ccc(C)cc1            | 106.168 |
...
+-------------------------+---------+
SELECT 25
Time: 0.013s

The output from EXPLAIN ANALYZE may help confirm that the two-column index is actually used:

ric@/tmp:emolecules> EXPLAIN ANALYZE SELECT mol, mw FROM molecules WHERE mol@>'c1ccccc1' order by mw <-> 0.0 limit 25;
+-----------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                              |
|-----------------------------------------------------------------------------------------------------------------------------------------|
| Limit  (cost=0.28..106.84 rows=25 width=409) (actual time=2.399..4.066 rows=25 loops=1)                                                 |
|   ->  Index Scan using molecules_mol_mw on molecules  (cost=0.28..426.53 rows=100 width=409) (actual time=2.396..4.057 rows=25 loops=1) |
|         Index Cond: (mol @> 'c1ccccc1'::mol)                                                                                            |
|         Order By: (mw <-> '0'::real)                                                                                                    |
| Planning Time: 0.202 ms                                                                                                                 |
| Execution Time: 4.146 ms                                                                                                                |
+-----------------------------------------------------------------------------------------------------------------------------------------+

Footnotes

Extending ChEMBL with ChemicaLite (and of course RDKit)

This blog post collects some draft notes about how to use ChemicaLite to integrate some chemistry logic directly into the SQLite version of the ChEMBL database [1].

an unrelated image of a festive cat
$ # you can download the ChEMBL database from ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/
$ tar xvf ./chembl_27_sqlite.tar.gz

At this time the easiest way (for linux and osx [2] users) to get ChemicaLite probably consists in installing the experimental conda packages I uploaded to anaconda.org, together with the RDKit package from conda-forge.

$ conda create -n chembl-chemicalite -c conda-forge -c ric/label/dev rdkit chemicalite
$ . activate chembl-chemicalite

SQLite drivers exist for basically all programming languages and environments, but to keep things simple, I will just use some SQL code interactively in the SQLite shell:

(chembl-chemicalite) $ sqlite3 ./chembl_27/chembl_27_sqlite/chembl_27.db
SQLite version 3.34.0 2020-12-01 16:14:00
Enter ".help" for usage hints.
sqlite>

First, the ChemicaLite extension needs to be loaded (because the extension is provided as a runtime loadable module, this operation is required on every new connection to the database, otherwise SQLite will complain that functions and types implemented in the extension are not known/available). Please note that loading extensions is automatically enabled in the SQLite shell. If you use a different driver or API, you may need to explicitly enable this operation.

sqlite> .load chemicalite

For reference (and to quickly smoke-test the loaded extension) one can briefly verify which versions of ChemicaLite and RDKit are available:

sqlite> SELECT chemicalite_version(), rdkit_version();
2020.12.5|2020.09.3

Once the extension is loaded, bindings to the RDKit toolkit become available as SQL functions and it is for example possible to compute molecular descriptors directly on the compounds in the ChEMBL tables. However, it is usually convenient to permanently extend the database schema with a table of RDKit molecules:

sqlite> CREATE TABLE rdkit_molecules(id INTEGER PRIMARY KEY, molregno BIGINT NOT NULL, mol MOL);
sqlite> INSERT INTO rdkit_molecules (molregno, mol) SELECT molregno, mol(canonical_smiles) FROM compound_structures;

This way chemical constraints (e.g. substructure searches) can be easily integrated in queries applying to the ChEMBL data:

sqlite> SELECT cs.molregno, cs.canonical_smiles FROM
   ...> compound_structures AS cs JOIN rdkit_molecules AS rm USING(molregno)
   ...> WHERE mol_is_substruct(rm.mol, 'c1cccc2c1nncc2') LIMIT 5
   ...> ;
9830|CC(C)Sc1ccc(CC2CCN(C3CCN(C(=O)c4cnnc5ccccc45)CC3)CC2)cc1
37351|Cc1cccc(NC(=O)Nc2ccc3nnccc3c2)c1
47898|COC(=O)c1c(C(F)(F)F)[nH]c2c(O)cc3c(c12)[C@H](CCl)C(c1cc2cc(NC(=O)c4cc5c(OC)c(OC)c(OC)cc5nn4)ccc2[nH]1)N3C=O
62519|N#Cc1nnc2ccccc2c1NCCC(c1ccccc1)c1ccccc1
109577|CNC(=O)[C@H](CC(C)C)C[C@H](O)[C@H](Cc1ccccc1)NC(=O)c1cnnc2ccccc12
sqlite> SELECT COUNT(*) FROM rdkit_molecules WHERE mol_is_substruct(mol, 'c1cccc2c1nncc2');
485

The above queries are nice and simple, but without a proper (chemistry aware) index, the whole rdkit_molecules table is scanned sequentially, the substructure constraint is checked explicitly on every molecule, resulting in some poor performances.

Differently from other database engines, SQLite doesn't support creating a custom index directly on the mol column. Instead, SQLite supports a virtual table mechanism, which allows wrapping the index data structure with a table-like interface. This way, an indexed search can be performed by joining the queried column with the virtual table that implements the index.

ChemicaLite provides an rdtree virtual table that implements an index structure for binary fingerprints. For substructure queries, an index based on the rdtree and the RDKit pattern fingerprint can be created with the following two statements:

sqlite> CREATE VIRTUAL TABLE rdkit_struct_index USING rdtree(id, s bits(2048), OPT_FOR_SUBSET_QUERIES);
sqlite> INSERT INTO rdkit_struct_index (id, s) SELECT id, mol_bfp_signature(mol, 2048) FROM rdkit_molecules WHERE mol IS NOT NULL;

The example substructure queries above can be refactored to leverage this index table, which will allow them to run significantly faster [3] :

sqlite> WITH matching AS (
...>     SELECT rm.molregno, rm.mol FROM rdkit_molecules AS rm JOIN rdkit_struct_index AS idx USING (id)
...>     WHERE idx.id MATCH rdtree_subset(mol_bfp_signature('c1cccc2c1nncc2', 2048))
...> )
...> SELECT cs.molregno, cs.canonical_smiles FROM compound_structures AS cs JOIN matching AS m USING (molregno)
...> WHERE mol_is_substruct(m.mol, 'c1cccc2c1nncc2') LIMIT 5
...> ;
1216581|Cc1cnnc2ccccc12
1233534|O=C(O)c1cnnc2ccccc12
947201|Cc1cc2c(N)cccc2nn1
501513|Cc1cc2cc(O)c(O)cc2nn1
1295924|O=[N+]([O-])c1cccc2nnccc12
sqlite> SELECT COUNT(*) FROM
...> rdkit_molecules AS rm JOIN rdkit_struct_index AS idx USING (id)
...> WHERE mol_is_substruct(rm.mol, 'c1cccc2c1nncc2')
...> AND idx.id MATCH rdtree_subset(mol_bfp_signature('c1cccc2c1nncc2', 2048));
485

Very similarly, an index can be created to support similarity queries, in this case using Morgan binary fingerprints:

sqlite> CREATE VIRTUAL TABLE rdkit_morgan_index USING rdtree(id, s bits(2048), OPT_FOR_SIMILARITY_QUERIES);
sqlite> INSERT INTO rdkit_morgan_index (id, s) SELECT id, mol_morgan_bfp(mol, 2, 2048) FROM rdkit_molecules WHERE mol IS NOT NULL;
sqlite> WITH similar AS (
...>     SELECT rm.molregno, idx.s FROM rdkit_molecules AS rm JOIN rdkit_morgan_index AS idx USING (id)
...>     WHERE idx.s MATCH rdtree_tanimoto(mol_morgan_bfp('Cc1ccc2nc(-c3ccc(NC(C4N(C(c5cccs5)=O)CCC4)=O)cc3)sc2c1', 2, 2048), 0.6)
...> )
...> SELECT
...>     cs.molregno, cs.canonical_smiles,
...>     bfp_tanimoto(mol_morgan_bfp('Cc1ccc2nc(-c3ccc(NC(C4N(C(c5cccs5)=O)CCC4)=O)cc3)sc2c1', 2, 2048), similar.s) AS similarity
...> FROM compound_structures AS cs JOIN similar USING (molregno)
...> ORDER BY similarity DESC
...> ;
472512|Cc1ccc2nc(-c3ccc(NC(=O)C4CCN(C(=O)c5cccs5)CC4)cc3)sc2c1|0.761194029850746
471317|Cc1ccc2nc(-c3ccc(NC(=O)C4CCCN(S(=O)(=O)c5cccs5)C4)cc3)sc2c1|0.64
471461|Cc1ccc2nc(-c3ccc(NC(=O)C4CCN(S(=O)(=O)c5cccs5)CC4)cc3)sc2c1|0.621621621621622
1032469|O=C(Nc1nc2ccc(Cl)cc2s1)[C@@H]1CCCN1C(=O)c1cccs1|0.614285714285714
471319|Cc1ccc2nc(-c3ccc(NC(=O)C4CCN(S(=O)(=O)c5cccs5)C4)cc3)sc2c1|0.613333333333333
751668|COc1ccc2nc(NC(=O)[C@@H]3CCCN3C(=O)c3cccs3)sc2c1|0.611111111111111
740754|Cc1ccc(NC(=O)C2CCCN2C(=O)c2cccs2)cc1C|0.606060606060606

I collected the modifying queries from this post into a small script (please note that the script doesn't include any error handling or graceful recovery from any error conditions):

This script may take some time to execute (about 50' on my home laptop, more than a half of which apparently spent computing and filling the SSS index). With the additional tables, the database file size (ignoring any storage temporarily used by SQLite while applying the changes) grows by about 2 GB.