AtomGen: Streamlining Atomistic Modeling through Dataset and Benchmark Integration
August 9, 2024
By Ali Kore, Amrit Krishnan, David Emerson
The AtomGen project focuses on enhancing atomistic modeling capabilities through advanced machine-learning techniques. By integrating deep-learning models with extensive atomistic datasets, AtomGen aims to improve the prediction and understanding of system properties across various scales, from small organic compounds to adsorbate-catalyst systems.
At the core of the AtomGen project is Atomformer, an encoder-only transformer model adapted to work with three-dimensional atomistic structures. Inspired by the Uni-mol+ architecture, Atomformer incorporates 3D spatial information directly into the model architecture using Gaussian pair-wise positional embeddings. Specifically, this method leverages the Euclidean distances between atoms and a pair-type-aware Gaussian kernel, which is projected and added to the attention mask. It also incorporates metadata embeddings (atomic mass, radius, valency, etc.) based on the atomic species. This approach allows the model to capture the spatial relationships between atoms, which is essential for accurate property prediction.
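As a rough sketch of this mechanism (not AtomGen's actual implementation), the snippet below computes pairwise Euclidean distances, applies a pair-type-aware Gaussian kernel, and projects the result into a per-head bias that can be added to the attention scores. All class names, shapes, and hyperparameters here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GaussianPairBias(nn.Module):
    """Illustrative pair-type-aware Gaussian distance embedding projected into a
    per-head attention bias. Names, shapes, and sizes are assumptions, not AtomGen's code."""

    def __init__(self, num_kernels=128, max_z=128, num_heads=8):
        super().__init__()
        self.max_z = max_z
        self.means = nn.Parameter(torch.linspace(0.0, 12.0, num_kernels))  # kernel centres (Å)
        self.stds = nn.Parameter(torch.ones(num_kernels))                  # kernel widths
        # Pair-type-aware affine transform of the raw distance (one scale/shift per atom-pair type).
        self.pair_scale = nn.Embedding(max_z * max_z, 1)
        self.pair_shift = nn.Embedding(max_z * max_z, 1)
        self.proj = nn.Linear(num_kernels, num_heads)                      # kernels -> per-head bias

    def forward(self, coords, atomic_numbers):
        # coords: (B, N, 3); atomic_numbers: (B, N) integer species indices < max_z
        dist = torch.cdist(coords, coords)                                 # (B, N, N) Euclidean distances
        pair = atomic_numbers.unsqueeze(-1) * self.max_z + atomic_numbers.unsqueeze(-2)
        d = dist * self.pair_scale(pair).squeeze(-1) + self.pair_shift(pair).squeeze(-1)
        widths = self.stds.abs().clamp(min=1e-3)
        gauss = torch.exp(-0.5 * ((d.unsqueeze(-1) - self.means) / widths) ** 2)  # (B, N, N, K)
        return self.proj(gauss).permute(0, 3, 1, 2)                        # (B, heads, N, N) attention bias


# Example: bias for a batch of two systems with five atoms each.
bias = GaussianPairBias()(torch.randn(2, 5, 3), torch.randint(1, 84, (2, 5)))
```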
To train the Atomformer base model, the AtomGen team compiled the S2EF-15M dataset, a large-scale aggregation of atomic structures, their forces, and energies. This dataset aggregates information from the Open Catalyst Project (OC20, OC22, ODAC23), the Materials Project Trajectory Dataset (MPtrj), and SPICE 1.1.4. Figure 1 below visualizes a set of samples from each of the constituent datasets.
Figure 1. “Visualizations of atomic structures from the S2EF-15M dataset components: ODAC23 (top left), MPtrj (top right), SPICE (center), OC20 (bottom left), and OC22 (bottom right).”
The S2EF-15M dataset contains over 15 million systems (with MPtrj upsampled 2x and SPICE upsampled 3x), providing a diverse range of atomistic environments for training. It includes atomic numbers, 3D coordinates, forces, formation energy, total energy, and a boolean flag indicating the presence of valid formation energy data. The scale and diversity of the dataset, paired with multi-task learning of energies and forces, allow for comprehensive learning across different types of chemical environments.
A significant technical contribution of the AtomGen project is the development of efficient data processing pipelines for each source dataset. These pipelines handle the conversion of various data formats into a unified structure compatible with the HuggingFace Datasets library. The pipelines incorporate parallel processing for large datasets, handle compressed data formats, convert units, calculate derived properties, and manage dynamic padding and batching of variable-sized systems.
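A minimal sketch of what such a pipeline can look like with the HuggingFace Datasets library is shown below. The generator produces random systems purely so the example runs end to end; in the real pipelines it would parse the source formats, convert units, and compute derived properties, and the field names are assumptions rather than AtomGen's actual schema.

```python
import numpy as np
from datasets import Dataset, Features, Sequence, Value

def generate_systems(shards):
    """Stand-in generator. The real pipelines parse OC20/OC22/ODAC23/MPtrj/SPICE source
    files, convert units, and compute derived properties; random systems are used here
    purely so the example runs end to end."""
    for seed in shards:
        rng = np.random.default_rng(seed)
        for _ in range(1_000):
            n_atoms = int(rng.integers(4, 64))
            yield {
                "atomic_numbers": rng.integers(1, 84, size=n_atoms).tolist(),
                "coords": rng.normal(size=(n_atoms, 3)).tolist(),
                "forces": rng.normal(size=(n_atoms, 3)).tolist(),
                "total_energy": float(rng.normal()),
                "has_formation_energy": bool(rng.integers(0, 2)),
            }

features = Features({
    "atomic_numbers": Sequence(Value("int32")),
    "coords": Sequence(Sequence(Value("float32"))),
    "forces": Sequence(Sequence(Value("float32"))),
    "total_energy": Value("float32"),
    "has_formation_energy": Value("bool"),
})

# Each entry in the `shards` list is handled by one worker process (parallel processing).
ds = Dataset.from_generator(
    generate_systems,
    gen_kwargs={"shards": list(range(8))},
    features=features,
    num_proc=8,
)
ds.save_to_disk("s2ef_demo")  # persist the unified Arrow-backed dataset
```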
Figure 2. “Initialization and usage of an atom modeling data pipeline using AtomTokenizer and DataCollatorForAtomModeling for MaskGIT objective, with random dataset generation for batch processing.”
All datasets and pre-trained models are uploaded to the HuggingFace Hub.
AtomGen leverages the HuggingFace ecosystem for model development and dataset management. The project includes custom implementations of AtomformerConfig, AtomformerModel, and various task-specific models such as AtomformerForMaskedAM and Structure2EnergyAndForces. These implementations allow seamless integration with HuggingFace’s Trainer API and model hub, facilitating reproducibility and ease of use.
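As an illustration of how that integration might be used (the hub repository IDs below are placeholders, not confirmed paths), loading a checkpoint and dataset could look roughly like this:

```python
from datasets import load_dataset
from transformers import AutoConfig, AutoModel

# Placeholder hub IDs for illustration only; substitute the actual AtomGen repositories.
MODEL_ID = "example-org/atomformer-base"
DATASET_ID = "example-org/s2ef-15m"

# Checkpoints that ship custom model code on the hub typically need trust_remote_code=True
# so that the AtomformerConfig/AtomformerModel implementations are imported from the repo.
config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

# Pre-training data can be streamed rather than downloaded in full.
train_ds = load_dataset(DATASET_ID, split="train", streaming=True)
```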
The DataCollatorForAtomModeling class is a key technical component of AtomGen, offering flexibility in data preparation and supporting various pre-training techniques. This custom data collator handles the complexities of batching data, including dynamic padding of input sequences, computation of attention masks, and optional computation of Laplacian positional encodings. These encodings are used with the TokenGT implementation as node identifiers.
The data collator supports multiple self-supervised pre-training objectives. The coordinate-based and atom-based pre-training objectives can be used either simultaneously or on their own.
These diverse pre-training techniques enable Atomformer to learn rich representations of atomistic structures from multiple perspectives, potentially improving its performance on downstream tasks.
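A rough usage sketch in the spirit of Figure 2 is given below; the import path, constructor arguments, and objective flags are assumptions about the API rather than its confirmed signature.

```python
import torch
from atomgen.data import AtomTokenizer, DataCollatorForAtomModeling  # import path is an assumption

# Hypothetical initialization: argument names are illustrative, not the confirmed signature.
tokenizer = AtomTokenizer()  # real initialization may require a vocabulary file
collator = DataCollatorForAtomModeling(
    tokenizer=tokenizer,
    mam=True,             # atom-based (MaskGIT-style) masking objective, as in Figure 2
    coords_perturb=0.1,   # optional coordinate-noise objective
    return_lap_pe=False,  # Laplacian positional encodings (used as TokenGT node identifiers)
)

# Toy batch of two variable-sized systems; the collator pads them and builds attention masks.
# The example keys ("input_ids", "coords") are assumptions about the expected schema.
examples = [
    {"input_ids": [6, 6, 8, 1, 1], "coords": torch.randn(5, 3).tolist()},
    {"input_ids": [8, 1, 1], "coords": torch.randn(3, 3).tolist()},
]
batch = collator(examples)
print({k: v.shape for k, v in batch.items() if hasattr(v, "shape")})
```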
While the data collator supports various self-supervised pre-training techniques, as discussed earlier, these were not utilized in the current pre-training experiments; exploring these self-supervised approaches remains an exciting avenue for future research. Instead, AtomGen employs a two-stage approach: pre-training on the S2EF-15M dataset followed by fine-tuning on specific tasks. The pre-training phase uses the Structure2EnergyAndForces model, which predicts both per-atom forces and per-system energy. The model was trained on 4xA40 GPUs for two weeks, totalling 1,344 GPU hours.
For fine-tuning, AtomGen utilizes the ATOM3D benchmark suite, which includes a range of atomistic modeling tasks such as Small Molecule Properties (SMP), Mutation Stability Prediction (MSP), and Ligand Binding Affinity (LBA). Each of the eight tasks in ATOM3D is pre-processed and formatted to be compatible with the data collator and the HuggingFace Datasets library.
The AtomGen project includes optimizations for training on large-scale datasets, including gradient checkpointing for memory-efficient training, mixed-precision training for faster computation, and distributed training support for multi-GPU setups.
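A hedged sketch of how these options map onto HuggingFace's Trainer API, continuing from the earlier snippets (the model, datasets, and collator are assumed to come from the AtomGen classes discussed above, and the hyperparameter values are purely illustrative):

```python
from transformers import Trainer, TrainingArguments

# `model`, `collator`, `train_ds`, and `eval_ds` are assumed to come from the AtomGen
# classes shown earlier (e.g. Structure2EnergyAndForces and DataCollatorForAtomModeling).
args = TrainingArguments(
    output_dir="atomformer-s2ef-pretrain",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    num_train_epochs=1,
    gradient_checkpointing=True,   # trade extra compute for lower activation memory
    bf16=True,                     # mixed-precision training (use fp16=True on older GPUs)
    dataloader_num_workers=4,
    logging_steps=100,
    save_steps=5_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=collator,
)

# Multi-GPU (DDP) training is handled by launching the same script with, for example:
#   torchrun --nproc_per_node=4 pretrain.py
trainer.train()
```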
On the SMP task, the fine-tuned Atomformer model achieved a test MAE of 1.077, compared to 1.13 for a model trained from scratch, demonstrating the effectiveness of the pretraining approach. We also provide the breakdown of the MAE across all 20 targets:
| Target | Pre-trained (eval) | Pre-trained (test) | Scratch (eval) | Scratch (test) |
|---|---|---|---|---|
| Rotational constant A [GHz] | 0.1216 | 18.1177 / 0.1566* | 0.1621 | 18.1576 / 0.1972* |
| Rotational constant B [GHz] | 0.0541 | 0.0677 | 0.0657 | 0.0794 |
| Rotational constant C [GHz] | 0.0283 | 0.0404 | 0.0296 | 0.0421 |
| Dipole moment [D] | 0.3044 | 0.3014 | 0.4128 | 0.4107 |
| Isotropic polarizability [a0^3] | 0.3388 | 0.3392 | 0.534 | 0.5358 |
| Energy of HOMO [Ha] | 0.0104 | 0.0104 | 0.0107 | 0.0106 |
| Energy of LUMO [Ha] | 0.014 | 0.0143 | 0.0174 | 0.0174 |
| Gap (LUMO-HOMO) [Ha] | 0.0165 | 0.0169 | 0.0194 | 0.0197 |
| Electronic spatial extent [a0^2] | 2.0202 | 1.9656 | 2.2719 | 2.2126 |
| Zero point vibrational energy [Ha] | 0.0006 | 0.0006 | 0.0007 | 0.0007 |
| Internal energy at 0 K [Ha] | 0.0697 | 0.0783 | 0.1307 | 0.1291 |
| Internal energy at 298.15 K [Ha] | 0.0697 | 0.0783 | 0.1307 | 0.1291 |
| Enthalpy at 298.15 K [Ha] | 0.0697 | 0.0783 | 0.1307 | 0.1291 |
| Free energy at 298.15 K [Ha] | 0.0697 | 0.0783 | 0.1307 | 0.1291 |
| Heat capacity at 298.15 K [cal/(mol·K)] | 0.1531 | 0.1518 | 0.2325 | 0.2337 |
| Internal energy at 0 K (thermochem) [Ha] | 0.0117 | 0.0116 | 0.0185 | 0.0183 |
| Internal energy at 298.15 K (thermochem) [Ha] | 0.0117 | 0.0116 | 0.0185 | 0.0183 |
| Enthalpy at 298.15 K (thermochem) [Ha] | 0.0117 | 0.0116 | 0.0186 | 0.0183 |
| Free energy at 298.15 K (thermochem) [Ha] | 0.0116 | 0.0116 | 0.0185 | 0.0183 |
| Heat capacity at 298.15 K (thermochem) [cal/(mol·K)] | 0.1519 | 0.151 | 0.2323 | 0.2324 |
Table 1. Breakdown of the MAE across the 20 SMP targets for the pre-trained model and the model trained from scratch. Asterisk (*) indicates the MAE after filtering out one test-set sample (index 1469) with an outlier Rotational constant A of 232e3.
The AtomGen project represents a significant effort in applying transformer-based models to atomistic modeling tasks. By focusing on efficient data processing, model architecture design, and integration with popular deep learning frameworks, AtomGen provides a robust foundation for research in computational chemistry and materials science. The project’s contributions in data aggregation, processing, and model development, along with its support for diverse pre-training techniques, have the potential to accelerate progress in areas such as drug discovery and materials design.
AtomGen integrates numerous techniques for efficient pre-training on large-scale datasets, including gradient checkpointing, mixed-precision training, and multiple pre-training objectives. These optimizations demonstrate best practices for pre-training and fine-tuning molecular transformer models on a diverse set of datasets. By successfully applying these techniques at scale, AtomGen provides valuable insights into overcoming the practical challenges of applying deep learning to complex molecular systems.