AtomGen: Streamlining Atomistic Modeling through Dataset and Benchmark Integration

August 9, 2024


By Ali Kore, Amrit Krishnan, David Emerson

The AtomGen project focuses on enhancing atomistic modeling capabilities through advanced machine-learning techniques. By integrating deep-learning models with extensive atomistic datasets, AtomGen aims to improve the prediction and understanding of system properties across various scales, from small organic compounds to adsorbate-catalyst systems.

Atomformer: A Transformer for 3D Atomistic Systems

At the core of the AtomGen project is Atomformer, an encoder-only transformer model adapted for working with three-dimensional atomistic structures. Inspired by the Uni-Mol+ architecture, Atomformer incorporates 3D spatial information directly into the model architecture using Gaussian pair-wise positional embeddings. Specifically, this method leverages the Euclidean distances between atoms and a pair-type aware Gaussian kernel, which is projected and added to the attention mask. It also incorporates metadata embeddings (atomic mass, radius, valency, etc.) based on the atomic species. This approach allows the model to capture the spatial relationships between atoms, which is essential for accurate property prediction.
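As a rough illustration of this idea (not the exact Atomformer implementation), the pairwise Euclidean distances can be expanded with a pair-type aware Gaussian kernel and projected into a per-head additive attention bias. The module below is a minimal PyTorch sketch; parameter names, shapes, and defaults are assumptions:

```python
import torch
import torch.nn as nn

class GaussianPairBias(nn.Module):
    """Pair-type aware Gaussian distance embedding projected to an attention bias."""

    def __init__(self, num_atom_types=128, num_kernels=128, num_heads=8):
        super().__init__()
        self.num_atom_types = num_atom_types
        # Learnable Gaussian centers and widths.
        self.means = nn.Parameter(torch.linspace(0.0, 12.0, num_kernels))
        self.stds = nn.Parameter(torch.ones(num_kernels))
        # Pair-type aware affine transform applied to the raw distance.
        self.scale = nn.Embedding(num_atom_types * num_atom_types, 1)
        self.shift = nn.Embedding(num_atom_types * num_atom_types, 1)
        # Projection from kernel features to one bias value per attention head.
        self.proj = nn.Linear(num_kernels, num_heads)

    def forward(self, coords, atom_types):
        # coords: (B, N, 3) Cartesian positions; atom_types: (B, N) atomic numbers.
        dist = torch.cdist(coords, coords)                                   # (B, N, N)
        pair = atom_types.unsqueeze(-1) * self.num_atom_types + atom_types.unsqueeze(-2)
        d = self.scale(pair).squeeze(-1) * dist + self.shift(pair).squeeze(-1)
        # Expand each distance against the Gaussian kernels.
        gauss = torch.exp(-0.5 * ((d.unsqueeze(-1) - self.means) / self.stds.abs()) ** 2)
        return self.proj(gauss).permute(0, 3, 1, 2)                          # (B, heads, N, N)

# Example: bias for a batch of 2 systems with 5 atoms each.
bias = GaussianPairBias()(torch.randn(2, 5, 3), torch.randint(1, 100, (2, 5)))
```

The resulting (batch, heads, N, N) tensor would be added to the attention logits before the softmax, which is how the pairwise geometry enters the attention computation.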

S2EF-15M: A Diverse Dataset of Atomic Structures with Energies and Forces

To train the Atomformer base model, the AtomGen team compiled the S2EF-15M dataset, a large-scale aggregation of atomic structures, their forces, and energies. This dataset aggregates information from the Open Catalyst Project (OC20, OC22, ODAC23), the Materials Project Trajectory Dataset (MPtrj), and SPICE 1.1.4.  Figure 1 below visualizes a set of samples from each of the constituent datasets.

Figure 1. “Visualizations of atomic structures from the S2EF-15M dataset components: ODAC23 (top left), MPtrj (top right), SPICE (center), OC20 (bottom left), and OC22 (bottom right).”

The S2EF-15M dataset contains over 15 million systems (with MPtrj upsampled 2x and SPICE upsampled 3x), providing a diverse range of atomistic environments for training. Each system includes atomic numbers, 3D coordinates, forces, formation energy, total energy, and a boolean flag indicating whether valid formation energy data is present. The scale and diversity of the dataset, paired with multi-task learning of energies and forces, allow for comprehensive learning across different types of chemical environments.
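For intuition, a single system can be pictured as a flat record of these fields. The field names and values below are purely illustrative (a made-up water molecule), not the dataset's exact schema:

```python
# Illustrative S2EF-15M-style record; field names and values are assumptions.
sample = {
    "input_ids": [8, 1, 1],                  # atomic numbers (O, H, H)
    "coords": [[0.00, 0.00, 0.12],           # 3D coordinates
               [0.00, 0.76, -0.48],
               [0.00, -0.76, -0.48]],
    "forces": [[0.0, 0.0, 0.0],              # per-atom force vectors
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]],
    "formation_energy": -2.5,                # only valid when the flag below is True
    "total_energy": -2080.0,
    "has_formation_energy": True,            # flag for valid formation energy data
}
```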

Data Processing and Integration

A significant technical contribution of the AtomGen project is the development of efficient data processing pipelines for each source dataset. These pipelines handle the conversion of various data formats into a unified structure compatible with the HuggingFace Datasets library. The pipelines incorporate parallel processing for large datasets, handle compressed data formats, convert units, calculate derived properties, and manage dynamic padding and batching of variable-sized systems.
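As a sketch of what one such step might look like with the HuggingFace Datasets API, the snippet below converts energies to a common unit and derives the formation-energy flag in parallel. The file pattern and field names are assumptions, not the project's exact pipeline:

```python
from datasets import load_dataset

EV_PER_HARTREE = 27.211386245988  # unit conversion factor

def postprocess(batch):
    # Convert energies from Hartree to eV and add a validity flag for formation energy.
    batch["total_energy"] = [e * EV_PER_HARTREE for e in batch["total_energy"]]
    batch["has_formation_energy"] = [e is not None for e in batch["formation_energy"]]
    return batch

# Hypothetical raw source files; num_proc fans the work out across worker processes.
ds = load_dataset("json", data_files="spice_raw/*.jsonl", split="train")
ds = ds.map(postprocess, batched=True, num_proc=8)
```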

Figure 2. “Initialization and usage of an atom modeling data pipeline using AtomTokenizer and DataCollatorForAtomModeling for MaskGIT objective, with random dataset generation for batch processing.”

HuggingFace Integration

All datasets and pre-trained models are uploaded to the HuggingFace Hub. AtomGen leverages the HuggingFace ecosystem for model development and dataset management. The project includes custom implementations of AtomformerConfig, AtomformerModel, and various task-specific models such as AtomformerForMaskedAM and Structure2EnergyAndForces. These implementations allow seamless integration with HuggingFace's Trainer API and Model Hub, facilitating reproducibility and ease of use.
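The snippet below sketches how such custom classes can be wired into the transformers auto-class machinery. AutoConfig.register and AutoModel.register are standard transformers APIs; the atomgen import path is an assumption about how the package exposes these classes:

```python
from transformers import AutoConfig, AutoModel
from atomgen.models import AtomformerConfig, AtomformerModel  # import path is an assumption

# Registering the custom config/model pair lets users instantiate Atomformer
# through the familiar AutoConfig/AutoModel interface.
AutoConfig.register("atomformer", AtomformerConfig)
AutoModel.register(AtomformerConfig, AtomformerModel)

config = AutoConfig.for_model("atomformer")   # default hyperparameters
model = AutoModel.from_config(config)
```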

Data Collation and Pre-training Techniques

The DataCollatorForAtomModeling class is a key technical component of AtomGen, offering flexibility in data preparation and supporting various pre-training techniques. This custom data collator handles the complexities of batching data, including dynamic padding of input sequences, computation of attention masks, and optional computation of Laplacian positional encodings.  These encodings are used with the TokenGT implementation as node identifiers.
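For context, Laplacian positional encodings of this kind are typically obtained from the eigenvectors of the normalized graph Laplacian of an atom-connectivity graph (for example, one built from a distance cutoff). The function below is a generic sketch of that computation, not the collator's exact code:

```python
import torch

def laplacian_positional_encoding(adj: torch.Tensor, k: int) -> torch.Tensor:
    """Return k non-trivial Laplacian eigenvectors as per-node identifiers.

    adj: (N, N) symmetric adjacency matrix of the atom graph.
    """
    deg = adj.sum(dim=-1)
    d_inv_sqrt = torch.where(deg > 0, deg.pow(-0.5), torch.zeros_like(deg))
    # Symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}
    lap = torch.eye(adj.size(0)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = torch.linalg.eigh(lap)
    # Skip the trivial constant eigenvector; keep the next k as node identifiers.
    return eigvecs[:, 1 : k + 1]

# Example: adjacency from a distance cutoff on random coordinates.
coords = torch.randn(10, 3) * 3
adj = (torch.cdist(coords, coords) < 5.0).float()
adj.fill_diagonal_(0)
node_ids = laplacian_positional_encoding(adj, k=4)   # shape (10, 4)
```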

The data collator supports multiple self-supervised pre-training objectives. The coordinate-based and atom-based objectives can be used simultaneously or on their own, as sketched in the example after the list below.

  1. Masked Atom Modeling (MAM): Similar to BERT's masked language modeling, this technique randomly masks atomic identities for the model to predict, helping it learn atomic context and interactions. Setting the mam argument to a floating-point number randomly masks that fraction of atoms.
  2. Coordinate Denoising: This perturbs atomic coordinates and tasks the model with recovering the original positions, enhancing its understanding of molecular geometry. The degree of perturbation is controlled by the coords_perturb hyperparameter, a floating-point number giving the standard deviation of the zero-mean Gaussian noise added to the coordinates.
  3. MaskGIT Objective: Inspired by the MaskGIT approach in image generation, this technique applies a more challenging masking strategy to atoms, potentially improving the model's grasp of global system structure. It is activated by setting the mam argument to True, which replaces the standard MAM objective.
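Under the argument semantics described above, configuring these objectives might look like the following. The import path, checkpoint name, and tokenizer argument are assumptions; Figure 2 shows the project's own usage example:

```python
from atomgen.data import AtomTokenizer, DataCollatorForAtomModeling  # import path is an assumption

tokenizer = AtomTokenizer.from_pretrained("path/to/atomformer-checkpoint")  # hypothetical checkpoint

# Masked Atom Modeling plus coordinate denoising: mask 15% of atoms and add
# zero-mean Gaussian noise with standard deviation 0.1 to the coordinates.
collator_mam = DataCollatorForAtomModeling(
    tokenizer=tokenizer,
    mam=0.15,
    coords_perturb=0.1,
)

# MaskGIT-style objective: mam=True switches from fixed-fraction masking
# to the MaskGIT masking schedule.
collator_maskgit = DataCollatorForAtomModeling(
    tokenizer=tokenizer,
    mam=True,
)
```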

These diverse pre-training techniques enable Atomformer to learn rich representations of atomistic structures from multiple perspectives, potentially improving its performance on downstream tasks.

Pre-training and Fine-tuning Approach

AtomGen employs a two-stage approach: pre-training on the S2EF-15M dataset followed by fine-tuning on specific tasks. While the data collator supports the self-supervised pre-training techniques discussed earlier, these were not used in the current pre-training experiments; exploring them remains an avenue for future research. The pre-training phase uses the Structure2EnergyAndForces model, which predicts both per-atom forces and per-system energy. The model was trained on four A40 GPUs for two weeks, totalling 1,344 GPU hours.
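Conceptually, the Structure2EnergyAndForces objective combines a per-system energy term with a per-atom force term. The function below is a generic sketch of such a multi-task loss; the L1 reduction and equal weighting are assumptions, not the project's exact settings:

```python
import torch.nn.functional as F

def s2ef_loss(pred_energy, pred_forces, target_energy, target_forces,
              energy_weight=1.0, force_weight=1.0):
    """Multi-task structure-to-energy-and-forces loss.

    pred_energy / target_energy: (B,) per-system energies.
    pred_forces / target_forces: (B, N, 3) per-atom force vectors.
    """
    energy_loss = F.l1_loss(pred_energy, target_energy)
    force_loss = F.l1_loss(pred_forces, target_forces)
    return energy_weight * energy_loss + force_weight * force_loss
```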


For fine-tuning, AtomGen utilizes the ATOM3D benchmark suite, which includes a range of atomistic modeling tasks such as Small Molecule Properties (SMP), Mutation Stability Prediction (MSP), and Ligand Binding Affinity (LBA). Each of the eight tasks in ATOM3D is pre-processed and formatted to be compatible with the data collator and HuggingFace Datasets library.

Performance and Scalability

The AtomGen project includes optimizations for training on large-scale datasets, including gradient checkpointing for memory-efficient training, mixed-precision training for faster computation, and distributed training support for multi-GPU setups.  
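These options correspond to standard HuggingFace TrainingArguments flags. A representative configuration is sketched below; the specific values are illustrative, not the project's actual settings:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="atomformer-pretrain",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,        # memory-efficient training
    bf16=True,                          # mixed-precision training
    ddp_find_unused_parameters=False,   # multi-GPU distributed data parallel
    learning_rate=1e-4,
    max_steps=500_000,
)
```

Launched with torchrun or accelerate, the same Trainer-based script runs data-parallel across multiple GPUs.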

On the SMP task, the fine-tuned Atomformer model achieved a test MAE of 1.077, compared to 1.13 for a model trained from scratch, demonstrating the effectiveness of the pre-training approach. Table 1 provides the breakdown of the MAE across all 20 targets:

| Target | Pre-trained (eval) | Pre-trained (test) | Scratch (eval) | Scratch (test) |
|---|---|---|---|---|
| Rotational constant A [GHz] | 0.1216 | 18.1177 / 0.1566* | 0.1621 | 18.1576 / 0.1972* |
| Rotational constant B [GHz] | 0.0541 | 0.0677 | 0.0657 | 0.0794 |
| Rotational constant C [GHz] | 0.0283 | 0.0404 | 0.0296 | 0.0421 |
| Dipole moment [D] | 0.3044 | 0.3014 | 0.4128 | 0.4107 |
| Isotropic polarizability [a0^3] | 0.3388 | 0.3392 | 0.534 | 0.5358 |
| Energy of HOMO [Ha] | 0.0104 | 0.0104 | 0.0107 | 0.0106 |
| Energy of LUMO [Ha] | 0.014 | 0.0143 | 0.0174 | 0.0174 |
| Gap (LUMO-HOMO) [Ha] | 0.0165 | 0.0169 | 0.0194 | 0.0197 |
| Electronic spatial extent [a0^2] | 2.0202 | 1.9656 | 2.2719 | 2.2126 |
| Zero point vibrational energy [Ha] | 0.0006 | 0.0006 | 0.0007 | 0.0007 |
| Internal energy at 0 K [Ha] | 0.0697 | 0.0783 | 0.1307 | 0.1291 |
| Internal energy at 298.15 K [Ha] | 0.0697 | 0.0783 | 0.1307 | 0.1291 |
| Enthalpy at 298.15 K [Ha] | 0.0697 | 0.0783 | 0.1307 | 0.1291 |
| Free energy at 298.15 K [Ha] | 0.0697 | 0.0783 | 0.1307 | 0.1291 |
| Heat capacity at 298.15 K [cal/(mol·K)] | 0.1531 | 0.1518 | 0.2325 | 0.2337 |
| Internal energy at 0 K (thermochem) [Ha] | 0.0117 | 0.0116 | 0.0185 | 0.0183 |
| Internal energy at 298.15 K (thermochem) [Ha] | 0.0117 | 0.0116 | 0.0185 | 0.0183 |
| Enthalpy at 298.15 K (thermochem) [Ha] | 0.0117 | 0.0116 | 0.0186 | 0.0183 |
| Free energy at 298.15 K (thermochem) [Ha] | 0.0116 | 0.0116 | 0.0185 | 0.0183 |
| Heat capacity at 298.15 K (thermochem) [cal/(mol·K)] | 0.1519 | 0.151 | 0.2323 | 0.2324 |

Table 1. MAE breakdown across the 20 targets of the SMP task for both the fine-tuned pre-trained model and the model trained from scratch. Asterisk (*) indicates the MAE after filtering one sample (index 1469) from the test set with an outlier rotational constant A of approximately 232,000 GHz.

Conclusion

The AtomGen project represents a significant effort in applying transformer-based models to atomistic modeling tasks. By focusing on efficient data processing, model architecture design, and integration with popular deep learning frameworks, AtomGen provides a robust foundation for research in computational chemistry and materials science. The project’s contributions in data aggregation, processing, and model development, along with its support for diverse pre-training techniques, have the potential to accelerate progress in areas such as drug discovery and materials design.

AtomGen integrates numerous techniques for efficient pre-training on large-scale datasets, including gradient checkpointing, mixed-precision training, and multiple pre-training objectives. These optimizations demonstrate best practices for pre-training and fine-tuning molecular transformer models on a diverse set of datasets. By applying these techniques at scale, AtomGen provides valuable insights into overcoming the practical challenges of applying deep learning to complex molecular systems.
