Extending ChEMBL with ChemicaLite (and of course RDKit)
This blog post collects some draft notes about how to use ChemicaLite to integrate some chemistry logic directly into the SQLite version of the ChEMBL database [1].

$ # you can download the ChEMBL database from ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/ $ tar xvf ./chembl_27_sqlite.tar.gz
At this time the easiest way (for linux and osx [2] users) to get ChemicaLite probably consists in installing the experimental conda packages I uploaded to anaconda.org, together with the RDKit package from conda-forge.
$ conda create -n chembl-chemicalite -c conda-forge -c ric/label/dev rdkit chemicalite $ . activate chembl-chemicalite
SQLite drivers exist for basically all programming languages and environments, but to keep things simple, I will just use some SQL code interactively in the SQLite shell:
(chembl-chemicalite) $ sqlite3 ./chembl_27/chembl_27_sqlite/chembl_27.db SQLite version 3.34.0 2020-12-01 16:14:00 Enter ".help" for usage hints. sqlite>
First, the ChemicaLite extension needs to be loaded (because the extension is provided as a runtime loadable module, this operation is required on every new connection to the database, otherwise SQLite will complain that functions and types implemented in the extension are not known/available). Please note that loading extensions is automatically enabled in the SQLite shell. If you use a different driver or API, you may need to explicitly enable this operation.
sqlite> .load chemicalite
For reference (and to quickly smoke-test the loaded extension) one can briefly verify which versions of ChemicaLite and RDKit are available:
sqlite> SELECT chemicalite_version(), rdkit_version(); 2020.12.5|2020.09.3
Once the extension is loaded, bindings to the RDKit toolkit become available as SQL functions and it is for example possible to compute molecular descriptors directly on the compounds in the ChEMBL tables. However, it is usually convenient to permanently extend the database schema with a table of RDKit molecules:
sqlite> CREATE TABLE rdkit_molecules(id INTEGER PRIMARY KEY, molregno BIGINT NOT NULL, mol MOL); sqlite> INSERT INTO rdkit_molecules (molregno, mol) SELECT molregno, mol(canonical_smiles) FROM compound_structures;
This way chemical constraints (e.g. substructure searches) can be easily integrated in queries applying to the ChEMBL data:
sqlite> SELECT cs.molregno, cs.canonical_smiles FROM ...> compound_structures AS cs JOIN rdkit_molecules AS rm USING(molregno) ...> WHERE mol_is_substruct(rm.mol, 'c1cccc2c1nncc2') LIMIT 5 ...> ; 9830|CC(C)Sc1ccc(CC2CCN(C3CCN(C(=O)c4cnnc5ccccc45)CC3)CC2)cc1 37351|Cc1cccc(NC(=O)Nc2ccc3nnccc3c2)c1 47898|COC(=O)c1c(C(F)(F)F)[nH]c2c(O)cc3c(c12)[C@H](CCl)C(c1cc2cc(NC(=O)c4cc5c(OC)c(OC)c(OC)cc5nn4)ccc2[nH]1)N3C=O 62519|N#Cc1nnc2ccccc2c1NCCC(c1ccccc1)c1ccccc1 109577|CNC(=O)[C@H](CC(C)C)C[C@H](O)[C@H](Cc1ccccc1)NC(=O)c1cnnc2ccccc12
sqlite> SELECT COUNT(*) FROM rdkit_molecules WHERE mol_is_substruct(mol, 'c1cccc2c1nncc2'); 485
The above queries are nice and simple, but without a proper (chemistry aware) index, the whole rdkit_molecules table is
scanned sequentially, the substructure constraint is checked explicitly on every molecule, resulting in some poor performances.
Differently from other database engines, SQLite doesn't support creating a custom index directly on the mol column.
Instead, SQLite supports a virtual table mechanism, which allows wrapping the index data
structure with a table-like interface. This way, an indexed search can be performed by joining the queried column with the
virtual table that implements the index.
ChemicaLite provides an rdtree virtual table that implements an index structure for binary fingerprints. For
substructure queries, an index based on the rdtree and the RDKit pattern fingerprint can be created with the following
two statements:
sqlite> CREATE VIRTUAL TABLE rdkit_struct_index USING rdtree(id, s bits(2048), OPT_FOR_SUBSET_QUERIES); sqlite> INSERT INTO rdkit_struct_index (id, s) SELECT id, mol_bfp_signature(mol, 2048) FROM rdkit_molecules WHERE mol IS NOT NULL;
The example substructure queries above can be refactored to leverage this index table, which will allow them to run significantly faster [3] :
sqlite> WITH matching AS ( ...> SELECT rm.molregno, rm.mol FROM rdkit_molecules AS rm JOIN rdkit_struct_index AS idx USING (id) ...> WHERE idx.id MATCH rdtree_subset(mol_bfp_signature('c1cccc2c1nncc2', 2048)) ...> ) ...> SELECT cs.molregno, cs.canonical_smiles FROM compound_structures AS cs JOIN matching AS m USING (molregno) ...> WHERE mol_is_substruct(m.mol, 'c1cccc2c1nncc2') LIMIT 5 ...> ; 1216581|Cc1cnnc2ccccc12 1233534|O=C(O)c1cnnc2ccccc12 947201|Cc1cc2c(N)cccc2nn1 501513|Cc1cc2cc(O)c(O)cc2nn1 1295924|O=[N+]([O-])c1cccc2nnccc12
sqlite> SELECT COUNT(*) FROM ...> rdkit_molecules AS rm JOIN rdkit_struct_index AS idx USING (id) ...> WHERE mol_is_substruct(rm.mol, 'c1cccc2c1nncc2') ...> AND idx.id MATCH rdtree_subset(mol_bfp_signature('c1cccc2c1nncc2', 2048)); 485
Very similarly, an index can be created to support similarity queries, in this case using Morgan binary fingerprints:
sqlite> CREATE VIRTUAL TABLE rdkit_morgan_index USING rdtree(id, s bits(2048), OPT_FOR_SIMILARITY_QUERIES); sqlite> INSERT INTO rdkit_morgan_index (id, s) SELECT id, mol_morgan_bfp(mol, 2, 2048) FROM rdkit_molecules WHERE mol IS NOT NULL;
sqlite> WITH similar AS ( ...> SELECT rm.molregno, idx.s FROM rdkit_molecules AS rm JOIN rdkit_morgan_index AS idx USING (id) ...> WHERE idx.s MATCH rdtree_tanimoto(mol_morgan_bfp('Cc1ccc2nc(-c3ccc(NC(C4N(C(c5cccs5)=O)CCC4)=O)cc3)sc2c1', 2, 2048), 0.6) ...> ) ...> SELECT ...> cs.molregno, cs.canonical_smiles, ...> bfp_tanimoto(mol_morgan_bfp('Cc1ccc2nc(-c3ccc(NC(C4N(C(c5cccs5)=O)CCC4)=O)cc3)sc2c1', 2, 2048), similar.s) AS similarity ...> FROM compound_structures AS cs JOIN similar USING (molregno) ...> ORDER BY similarity DESC ...> ; 472512|Cc1ccc2nc(-c3ccc(NC(=O)C4CCN(C(=O)c5cccs5)CC4)cc3)sc2c1|0.761194029850746 471317|Cc1ccc2nc(-c3ccc(NC(=O)C4CCCN(S(=O)(=O)c5cccs5)C4)cc3)sc2c1|0.64 471461|Cc1ccc2nc(-c3ccc(NC(=O)C4CCN(S(=O)(=O)c5cccs5)CC4)cc3)sc2c1|0.621621621621622 1032469|O=C(Nc1nc2ccc(Cl)cc2s1)[C@@H]1CCCN1C(=O)c1cccs1|0.614285714285714 471319|Cc1ccc2nc(-c3ccc(NC(=O)C4CCN(S(=O)(=O)c5cccs5)C4)cc3)sc2c1|0.613333333333333 751668|COc1ccc2nc(NC(=O)[C@@H]3CCCN3C(=O)c3cccs3)sc2c1|0.611111111111111 740754|Cc1ccc(NC(=O)C2CCCN2C(=O)c2cccs2)cc1C|0.606060606060606
I collected the modifying queries from this post into a small script (please note that the script doesn't include any error handling or graceful recovery from any error conditions):
This script may take some time to execute (about 50' on my home laptop, more than a half of which apparently spent computing and filling the SSS index). With the additional tables, the database file size (ignoring any storage temporarily used by SQLite while applying the changes) grows by about 2 GB.