Designing Materials with Cluster Expansion and MatOpt
Introduction
Computational materials design requires balancing multiple competing interests such as accuracy, computational cost, interpretability, etc. One classical approach to trade off computational cost and accuracy is to fit a cluster expansion on first-principle data. Due to its interpretable and simple structure, cluster expansion, once trained, can be used to formulate a mathematical optimization problem to search the (globally) optimal arrangement of atomic species on a chosen lattice or supercell in terms of a target property (e.g., mixing energy), potentially under constraints such as composition, geometry, etc. In this post, we will demonstrate how to use two Python packages, icet and matopt, to train a cluster expansion using DFT data and then formulate and solve the a MILP to find the optimal arrangement of atomic species for the best mixing energy.
Cluster Expansions
A cluster expansion approximates a property (\(\mathcal{P}\)) of a crystal structure by decomposing that structure into clusters of sites (singlets, pairs, triplets, etc.) and summing up their contributions. Mathematically, if (\(\boldsymbol{\sigma}\)) is the configuration specifying which species occupy each site, we write
\[ \mathcal{P}(\boldsymbol{\sigma}) =\sum_{\alpha} J_{\alpha},\bigl\langle \Gamma_{\alpha}(\boldsymbol{\sigma})\bigr\rangle, \]
where:
- \(\alpha\) indexes distinct symmetry orbits of clusters (e.g., all pairs of sites separated by the same distance/orientation in the crystal),
- \(J_{\alpha}\) is the effective cluster interaction (ECI) for orbit \(\alpha\),
- \(\langle \Gamma_{\alpha}(\boldsymbol{\sigma})\rangle\) is the average of certain basis functions (often orthonormal) over the individual clusters in orbit \(\alpha\).
To train a cluster expansion, we first sample a set of training structures/configurations and compute their property values via DFT or other methods. Then we do a linear regression (often with regularization) to obtain the ECIs ({\(J_{\alpha}\)}). The typical workflow looks like this:
- Picking a prototype structure (primitive cell).
- Defining cutoffs for 2-body, 3-body (and higher) distances so we only keep clusters within certain radii.
- Computing the DFT energies of various sampled structures/configurations of the same underlying lattice.
- Solving a linear regression problem to learn those “effective cluster interactions” (ECIs).
These ECIs can then be used to predict the property of new configurations cheaply. Tools like icet automate and streamline the above workflow.
MILP Reformulation
To effectively utilize a trained cluster expansion in an optimization problem, we need to reformulate/transform the parameters. Below we show how to do this for a Mixed-Integer Linear Programming (MILP) formulation, where we have explicit linear expressions and binary variables.
Site Indicator
For each site (\(i = 1,\dots,N\)) and each species (\(k\)) (e.g., Cu, Ni, Pd, Ag), we introduce a binary decision variable:
\[ Y_{i,k} = \begin{cases} 1, & \text{if site $i$ is assigned species $k$,}\\ 0, & \text{otherwise.} \end{cases} \]
We will also impose:
\[ \sum_{k} Y_{i,k} = 1 \quad \text{for each site } i, \]
so that exactly one species is chosen for each site.
Cluster Indicator
We index a single cluster by \(n\), and define another binary variable (\(Z_n\)) to indicate that the cluster is present in the configuration:
\[ Z_{n} = \begin{cases} 1, & \text{if cluster $n$ exists in the current configuration,}\\ 0, & \text{otherwise.} \end{cases} \]
We will also need to impose the logic that:
\[ Z_n = \prod_{(i,k) \in \text{cluster } n} Y_{i,k} \quad \text{for all } n, \]
In matopt, these cluster-indicator variables \(Z_n\) and underlying logic (reformulated as linear constraints) will be built automatically to ensure that (\(Z_n = 1\)) happens exactly when the relevant combination of occupant species occurs across that cluster.
Objective Function
Next, we need to find coefficients (\(c_{n}\)) for each cluster \(n\) such that the property/search objective can be linearly represented as:
\[ \mathcal{P} = \sum_{n} c_{n}Z_{n}, \]
subject to the constraints. Hence, the cluster expansion parameters must be fully expanded in the sense that each cluster in the geometry - including each possible combination of sites and species - will correspond to a numeric coefficient value. We designed a helper function to do this automatically that:
- Read an
icet.ClusterExpansion
object. - Enumerate all clusters in your “design supercell” (the structure you want to optimize).
- Output “cluster specs” + “coefficients” so that each cluster monomial (e.g., site 10 = Ni & site 11 = Cu) has its unique coefficient.
Finally, we can feed these coefficients into matopt to build and solve the MILP optimization problem.
An End-to-end Example
In the sections below, we show how to:
- Load DFT data for Cu–Ni–Pd–Ag structures
- Build a
ClusterSpace
and train the expansion withLassoCV
- Fully expand the learned cluster expansion onto a “design supercell”
- Build a
matopt
MILP and solve for an optimal arrangement.
Imports and Helper Function
Let’s start by importing the key packages and defining a helper to “expand” the learned parameters. This function inspects the cluster expansion, enumerates the geometry for your chosen design supercell, and returns a list of cluster specs and coefficients.
Click to expand the code
import itertools
import pandas as pd
import ase
from collections import Counter
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
# icet imports
from icet import ClusterSpace, ClusterExpansion, StructureContainer
from icet.core.local_orbit_list_generator import LocalOrbitListGenerator
from icet.core.structure import Structure
# MatOpt imports
from matopt.materials.atom import Atom
from matopt.materials.canvas import Canvas
from matopt.aml.expr import SumSites, SumClusters
from matopt.aml.rule import EqualTo, FixedTo
from matopt.aml.model import MatOptModel
def parameterize_cluster_expansion(cs, ce, design_atoms, weight=1.0, tol=1e-15):
"""
Expand a trained ClusterExpansion over a 'design_atoms' structure into
fully enumerated cluster specs and coefficients for use in MatOpt.
"""
= []
cluster_specs = []
cluster_coeffs
# (0) Zero-let (constant term)
= ce.parameters[0] * weight
zero_coeff if abs(zero_coeff) > tol:
cluster_specs.append([])
cluster_coeffs.append(zero_coeff)
# (1) Generate orbit list for the design structure
= LocalOrbitListGenerator(
lolg
cs.orbit_list,
Structure.from_atoms(design_atoms),
ce.fractional_position_tolerance,
)= lolg.generate_full_orbit_list().get_orbit_list()
orbit_list
# (2) We assume chemical_symbols used in cs match the "elements" list
= list(cs.species_maps[0].keys())
elements
# (3) Expand each orbit's cluster-vector elements
= 1 # param[0] is zero-let
param_idx for orbit in orbit_list:
= orbit.cluster_vector_elements
cves if not cves:
continue
= cves[0]["multiplicity"]
multiplicity
for cve in cves:
= ce.parameters[param_idx]
param_eci += 1
param_idx = (param_eci / multiplicity) * weight
outer_coeff
if abs(outer_coeff) < tol:
continue
# (4) For each actual cluster in that orbit
for cluster in orbit.clusters:
= [site.index for site in cluster.lattice_sites]
site_indices
# (5) Loop over all species combos for those sites
for species_combo in itertools.product(range(len(elements)), repeat=len(site_indices)):
= outer_coeff
final_coeff if abs(final_coeff) > tol:
= list(zip(site_indices, species_combo))
cluster_spec
cluster_specs.append(cluster_spec)
cluster_coeffs.append(final_coeff)
return cluster_specs, cluster_coeffs
Load and Prepare DFT Data
Next, suppose you have a dataset of Cu–Ni–Pd–Ag structures, each with a final DFT energy. We’ll convert those energies into mixing energies (relative to pure elemental references) and store them in a DataFrame
.
Click to expand the code
# Reference energies for pure elements (per atom)
= {
reference_energies "Cu": -240.11121086,
"Ni": -352.62417749,
"Pd": -333.69496589,
"Ag": -173.55506507,
}= 155
reference_structure_indice = [5, 5]
cutoffs
# Read from JSON database
= ase.io.read("structures_CuNiPdAg.json", ":")
db = pd.read_csv("properties_CuNiPdAg.csv", index_col=0)
dft_data = dft_data.index.values
indices
= pd.DataFrame(columns=["atoms", "MixingEnergy", "TotalEnergy"])
df for idx, i in enumerate(indices):
= db[i]
st = dft_data.loc[i, "final_energy"]
energy = dict(Counter(st.get_chemical_symbols()))
dict_representation = len(st)
NumAtoms
# Mixing energy calculation
= (
mixing_energy
energy- sum(
[/ NumAtoms * reference_energies[k]
v for k, v in dict_representation.items()
]
)/ NumAtoms
)
= [st, mixing_energy, energy] df.loc[idx]
Here, df["atoms"]
holds an ase.Atoms
object, and "MixingEnergy"
is the property we want to fit. In a real scenario, you’d have your own structures plus energies.
Build and Train the Cluster Expansion
We now define a ClusterSpace
for a chosen reference structure (one from our dataset). Then we store all data in a StructureContainer
and fit the expansion with LassoCV
:
Click to expand the code
# 4-species expansion: Cu, Ni, Pd, Ag
= ClusterSpace(
cs =df.loc[reference_structure_indice, "atoms"], # pick a reference
structure=cutoffs,
cutoffs=list(reference_energies.keys()),
chemical_symbols
)
# Collect training data
= StructureContainer(cluster_space=cs)
sc for i in df.index:
sc.add_structure(=df.loc[i, "atoms"],
structure={"mixing_energy": df.loc[i, "MixingEnergy"]},
properties
)
# Extract design matrix (X) and target (y)
= sc.get_fit_data(key="mixing_energy")
X, y
# Fit with LassoCV
= train_test_split(X, y, test_size=0.2, random_state=42)
x_train, x_test, y_train, y_test = LassoCV(cv=5, random_state=42)
lasso_model
lasso_model.fit(x_train, y_train)
print("Train R2:", lasso_model.score(x_train, y_train))
print("Test R2:", lasso_model.score(x_test, y_test))
print("Test MAE:", mean_absolute_error(lasso_model.predict(x_test), y_test))
# Build the cluster expansion object
= ClusterExpansion(cluster_space=cs, parameters=lasso_model.coef_)
ce
# Intercept goes into the zero-let
0] += lasso_model.intercept_ ce.parameters[
At this point, ce
holds all the learned parameters. You can use ce.predict(supercell_atoms)
to estimate mixing energies. But we’re going further: turning it into a MILP model.
Fully Expand the Learned CE for Our Design Structure
Let’s pick one structure from the DataFrame as our “design space,” or you could replicate it to enlarge the supercell. We call our helper parameterize_cluster_expansion(...)
:
Click to expand the code
= df.loc[0, "atoms"] # for demonstration
design_atoms = parameterize_cluster_expansion(
cluster_specs, cluster_coeffs =1.0
cs, ce, design_atoms, weight
)
# Example of what we get:
# cluster_specs might look like [ [(40,3),(47,0),(44,3)], [(44,3),(57,1)], ... ]
# cluster_coeffs like [5.8e-05, -1.85e-04, ...]
We now have two lists describing the objective in terms of explicit site–species combos. Notice how we break down the orbits into explicit binary-variable products and store them as [(site_index, species_index), ...]
.
Construct a matopt
Model and Solve
Now we build a matopt
Canvas
for the chosen geometry (site positions, neighbor lists if needed), define which species are allowed, and specify any constraints. Then we create an objective that sums the cluster terms:
Click to expand the code
= design_atoms.get_positions().tolist()
points_list = len(design_atoms)
num_sites = Canvas(points_list)
Canv
# Define the candidate atoms in the same order as "reference_energies.keys()"
= [Atom(s) for s in reference_energies.keys()]
AllAtoms
# Convert cluster_specs to "site-Atom" pairs
= []
cluster_specs_expanded for c in cluster_specs:
= []
c_expanded for st, spidx in c:
c_expanded.append((st, AllAtoms[spidx]))
cluster_specs_expanded.append(c_expanded)
# Create the MatOpt model
= MatOptModel(Canv, AllAtoms, clusters=cluster_specs_expanded[1:])
m
# Optional: add composition constraints
= {
CompBounds 0]: (0, num_sites),
AllAtoms[1]: (0, num_sites),
AllAtoms[2]: (5, 10), # e.g., require 5–10 Pd
AllAtoms[3]: (5, 10), # e.g., require 5–10 Ag
AllAtoms[
}"Composition", bounds=CompBounds, rules=EqualTo(SumSites(desc=m.Yik)))
m.addGlobalTypesDescriptor(
# Force the structure to be fully occupied (or any other site-based constraint)
1.0))
m.Yi.rules.append(FixedTo(
# Build the objective from the cluster coefficients
= SumClusters(desc=m.Zn, coefs=cluster_coeffs[1:])
obj_expr
# Solve the MILP
= m.minimize(obj_expr, solver="neos-cplex", tilim=60) D
A few highlights:
m.Zn
: matopt variable for “is cluster \(n\) present?”- We linked
m.Zn
tom.Yik
automatically, soZn[n] = 1
only if all site–species pairs in that cluster are chosen. - Compositional constraints can restrict how many sites are allowed to be each type.
m.minimize(obj_expr)
calls the solver. Usem.maximize(...)
if needed.
Analyzing the Results
After solving, D
(a Design
object) gives the final configuration. We can export or convert to common structure formats:
Click to expand the code
"result.pdb")
D.toPDB("result.xyz")
D.toXYZ("result.vasp")
D.toPOSCAR(
from pymatgen.core import Structure
= D.to_pymatgen()
structure "cif", "result.cif") structure.to(
That’s it! You have a fully integrated pipeline from DFT data → cluster expansion → MILP-based discrete optimization → final structure. You could run a DFT check on that final arrangement or evaluate further properties.
Remarks
Using icet
to train a cluster expansion and matopt
to solve the resulting discrete optimization problem provides a powerful path toward rational materials design. You can rapidly propose candidate structures that minimize or maximize a desired property, subject to constraints (composition, geometry, etc.). While we showed a small demonstration on a single Cu–Ni–Pd–Ag cell, you can scale up to larger supercells, more complex constraints, or different properties.
Citation
@online{yin2025,
author = {Yin, Xiangyu},
title = {Designing {Materials} with {Cluster} {Expansion} and
{MatOpt}},
date = {2025-01-27},
url = {https://xiangyu-yin.com/content/post_cluster_expansion.html},
langid = {en}
}