Designing Materials with Cluster Expansion and MatOpt

Author
Published

January 27, 2025

Introduction

Computational materials design requires balancing multiple competing interests such as accuracy, computational cost, interpretability, etc. One classical approach to trade off computational cost and accuracy is to fit a cluster expansion on first-principle data. Due to its interpretable and simple structure, cluster expansion, once trained, can be used to formulate a mathematical optimization problem to search the (globally) optimal arrangement of atomic species on a chosen lattice or supercell in terms of a target property (e.g., mixing energy), potentially under constraints such as composition, geometry, etc. In this post, we will demonstrate how to use two Python packages, icet and matopt, to train a cluster expansion using DFT data and then formulate and solve the a MILP to find the optimal arrangement of atomic species for the best mixing energy.

Cluster Expansions

A cluster expansion approximates a property (\(\mathcal{P}\)) of a crystal structure by decomposing that structure into clusters of sites (singlets, pairs, triplets, etc.) and summing up their contributions. Mathematically, if (\(\boldsymbol{\sigma}\)) is the configuration specifying which species occupy each site, we write

\[ \mathcal{P}(\boldsymbol{\sigma}) =\sum_{\alpha} J_{\alpha},\bigl\langle \Gamma_{\alpha}(\boldsymbol{\sigma})\bigr\rangle, \]

where:

  • \(\alpha\) indexes distinct symmetry orbits of clusters (e.g., all pairs of sites separated by the same distance/orientation in the crystal),
  • \(J_{\alpha}\) is the effective cluster interaction (ECI) for orbit \(\alpha\),
  • \(\langle \Gamma_{\alpha}(\boldsymbol{\sigma})\rangle\) is the average of certain basis functions (often orthonormal) over the individual clusters in orbit \(\alpha\).

To train a cluster expansion, we first sample a set of training structures/configurations and compute their property values via DFT or other methods. Then we do a linear regression (often with regularization) to obtain the ECIs ({\(J_{\alpha}\)}). The typical workflow looks like this:

  • Picking a prototype structure (primitive cell).
  • Defining cutoffs for 2-body, 3-body (and higher) distances so we only keep clusters within certain radii.
  • Computing the DFT energies of various sampled structures/configurations of the same underlying lattice.
  • Solving a linear regression problem to learn those “effective cluster interactions” (ECIs).

These ECIs can then be used to predict the property of new configurations cheaply. Tools like icet automate and streamline the above workflow.

MILP Reformulation

To effectively utilize a trained cluster expansion in an optimization problem, we need to reformulate/transform the parameters. Below we show how to do this for a Mixed-Integer Linear Programming (MILP) formulation, where we have explicit linear expressions and binary variables.

Site Indicator

For each site (\(i = 1,\dots,N\)) and each species (\(k\)) (e.g., Cu, Ni, Pd, Ag), we introduce a binary decision variable:

\[ Y_{i,k} = \begin{cases} 1, & \text{if site $i$ is assigned species $k$,}\\ 0, & \text{otherwise.} \end{cases} \]

We will also impose:

\[ \sum_{k} Y_{i,k} = 1 \quad \text{for each site } i, \]

so that exactly one species is chosen for each site.

Cluster Indicator

We index a single cluster by \(n\), and define another binary variable (\(Z_n\)) to indicate that the cluster is present in the configuration:

\[ Z_{n} = \begin{cases} 1, & \text{if cluster $n$ exists in the current configuration,}\\ 0, & \text{otherwise.} \end{cases} \]

We will also need to impose the logic that:

\[ Z_n = \prod_{(i,k) \in \text{cluster } n} Y_{i,k} \quad \text{for all } n, \]

In matopt, these cluster-indicator variables \(Z_n\) and underlying logic (reformulated as linear constraints) will be built automatically to ensure that (\(Z_n = 1\)) happens exactly when the relevant combination of occupant species occurs across that cluster.

Objective Function

Next, we need to find coefficients (\(c_{n}\)) for each cluster \(n\) such that the property/search objective can be linearly represented as:

\[ \mathcal{P} = \sum_{n} c_{n}Z_{n}, \]

subject to the constraints. Hence, the cluster expansion parameters must be fully expanded in the sense that each cluster in the geometry - including each possible combination of sites and species - will correspond to a numeric coefficient value. We designed a helper function to do this automatically that:

  1. Read an icet.ClusterExpansion object.
  2. Enumerate all clusters in your “design supercell” (the structure you want to optimize).
  3. Output “cluster specs” + “coefficients” so that each cluster monomial (e.g., site 10 = Ni & site 11 = Cu) has its unique coefficient.

Finally, we can feed these coefficients into matopt to build and solve the MILP optimization problem.

An End-to-end Example

In the sections below, we show how to:

  1. Load DFT data for Cu–Ni–Pd–Ag structures
  2. Build a ClusterSpace and train the expansion with LassoCV
  3. Fully expand the learned cluster expansion onto a “design supercell”
  4. Build a matopt MILP and solve for an optimal arrangement.

Imports and Helper Function

Let’s start by importing the key packages and defining a helper to “expand” the learned parameters. This function inspects the cluster expansion, enumerates the geometry for your chosen design supercell, and returns a list of cluster specs and coefficients.

Click to expand the code
import itertools
import pandas as pd
import ase
from collections import Counter
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# icet imports
from icet import ClusterSpace, ClusterExpansion, StructureContainer
from icet.core.local_orbit_list_generator import LocalOrbitListGenerator
from icet.core.structure import Structure

# MatOpt imports
from matopt.materials.atom import Atom
from matopt.materials.canvas import Canvas
from matopt.aml.expr import SumSites, SumClusters
from matopt.aml.rule import EqualTo, FixedTo
from matopt.aml.model import MatOptModel

def parameterize_cluster_expansion(cs, ce, design_atoms, weight=1.0, tol=1e-15):
    """
    Expand a trained ClusterExpansion over a 'design_atoms' structure into
    fully enumerated cluster specs and coefficients for use in MatOpt.
    """
    cluster_specs = []
    cluster_coeffs = []

    # (0) Zero-let (constant term)
    zero_coeff = ce.parameters[0] * weight
    if abs(zero_coeff) > tol:
        cluster_specs.append([])
        cluster_coeffs.append(zero_coeff)

    # (1) Generate orbit list for the design structure
    lolg = LocalOrbitListGenerator(
        cs.orbit_list,
        Structure.from_atoms(design_atoms),
        ce.fractional_position_tolerance,
    )
    orbit_list = lolg.generate_full_orbit_list().get_orbit_list()

    # (2) We assume chemical_symbols used in cs match the "elements" list
    elements = list(cs.species_maps[0].keys())

    # (3) Expand each orbit's cluster-vector elements
    param_idx = 1  # param[0] is zero-let
    for orbit in orbit_list:
        cves = orbit.cluster_vector_elements
        if not cves:
            continue
        multiplicity = cves[0]["multiplicity"]

        for cve in cves:
            param_eci = ce.parameters[param_idx]
            param_idx += 1
            outer_coeff = (param_eci / multiplicity) * weight

            if abs(outer_coeff) < tol:
                continue

            # (4) For each actual cluster in that orbit
            for cluster in orbit.clusters:
                site_indices = [site.index for site in cluster.lattice_sites]

                # (5) Loop over all species combos for those sites
                for species_combo in itertools.product(range(len(elements)), repeat=len(site_indices)):
                    final_coeff = outer_coeff
                    if abs(final_coeff) > tol:
                        cluster_spec = list(zip(site_indices, species_combo))
                        cluster_specs.append(cluster_spec)
                        cluster_coeffs.append(final_coeff)

    return cluster_specs, cluster_coeffs

Load and Prepare DFT Data

Next, suppose you have a dataset of Cu–Ni–Pd–Ag structures, each with a final DFT energy. We’ll convert those energies into mixing energies (relative to pure elemental references) and store them in a DataFrame.

Click to expand the code
# Reference energies for pure elements (per atom)
reference_energies = {
    "Cu": -240.11121086,
    "Ni": -352.62417749,
    "Pd": -333.69496589,
    "Ag": -173.55506507,
}
reference_structure_indice = 155
cutoffs = [5, 5]

# Read from JSON database
db = ase.io.read("structures_CuNiPdAg.json", ":")
dft_data = pd.read_csv("properties_CuNiPdAg.csv", index_col=0)
indices = dft_data.index.values

df = pd.DataFrame(columns=["atoms", "MixingEnergy", "TotalEnergy"])
for idx, i in enumerate(indices):
    st = db[i]
    energy = dft_data.loc[i, "final_energy"]
    dict_representation = dict(Counter(st.get_chemical_symbols()))
    NumAtoms = len(st)

    # Mixing energy calculation
    mixing_energy = (
        energy
        - sum(
            [
                v / NumAtoms * reference_energies[k]
                for k, v in dict_representation.items()
            ]
        )
    ) / NumAtoms

    df.loc[idx] = [st, mixing_energy, energy]

Here, df["atoms"] holds an ase.Atoms object, and "MixingEnergy" is the property we want to fit. In a real scenario, you’d have your own structures plus energies.

Build and Train the Cluster Expansion

We now define a ClusterSpace for a chosen reference structure (one from our dataset). Then we store all data in a StructureContainer and fit the expansion with LassoCV:

Click to expand the code
# 4-species expansion: Cu, Ni, Pd, Ag
cs = ClusterSpace(
    structure=df.loc[reference_structure_indice, "atoms"],  # pick a reference
    cutoffs=cutoffs,
    chemical_symbols=list(reference_energies.keys()),
)

# Collect training data
sc = StructureContainer(cluster_space=cs)
for i in df.index:
    sc.add_structure(
        structure=df.loc[i, "atoms"],
        properties={"mixing_energy": df.loc[i, "MixingEnergy"]},
    )

# Extract design matrix (X) and target (y)
X, y = sc.get_fit_data(key="mixing_energy")

# Fit with LassoCV
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lasso_model = LassoCV(cv=5, random_state=42)
lasso_model.fit(x_train, y_train)

print("Train R2:", lasso_model.score(x_train, y_train))
print("Test  R2:", lasso_model.score(x_test, y_test))
print("Test MAE:", mean_absolute_error(lasso_model.predict(x_test), y_test))

# Build the cluster expansion object
ce = ClusterExpansion(cluster_space=cs, parameters=lasso_model.coef_)

# Intercept goes into the zero-let
ce.parameters[0] += lasso_model.intercept_

At this point, ce holds all the learned parameters. You can use ce.predict(supercell_atoms) to estimate mixing energies. But we’re going further: turning it into a MILP model.

Fully Expand the Learned CE for Our Design Structure

Let’s pick one structure from the DataFrame as our “design space,” or you could replicate it to enlarge the supercell. We call our helper parameterize_cluster_expansion(...):

Click to expand the code
design_atoms = df.loc[0, "atoms"]  # for demonstration
cluster_specs, cluster_coeffs = parameterize_cluster_expansion(
    cs, ce, design_atoms, weight=1.0
)

# Example of what we get:
# cluster_specs might look like [ [(40,3),(47,0),(44,3)], [(44,3),(57,1)], ... ]
# cluster_coeffs like [5.8e-05, -1.85e-04, ...]

We now have two lists describing the objective in terms of explicit site–species combos. Notice how we break down the orbits into explicit binary-variable products and store them as [(site_index, species_index), ...].

Construct a matopt Model and Solve

Now we build a matopt Canvas for the chosen geometry (site positions, neighbor lists if needed), define which species are allowed, and specify any constraints. Then we create an objective that sums the cluster terms:

Click to expand the code
points_list = design_atoms.get_positions().tolist()
num_sites = len(design_atoms)
Canv = Canvas(points_list)

# Define the candidate atoms in the same order as "reference_energies.keys()"
AllAtoms = [Atom(s) for s in reference_energies.keys()]

# Convert cluster_specs to "site-Atom" pairs
cluster_specs_expanded = []
for c in cluster_specs:
    c_expanded = []
    for st, spidx in c:
        c_expanded.append((st, AllAtoms[spidx]))
    cluster_specs_expanded.append(c_expanded)

# Create the MatOpt model
m = MatOptModel(Canv, AllAtoms, clusters=cluster_specs_expanded[1:])

# Optional: add composition constraints 
CompBounds = {
    AllAtoms[0]: (0, num_sites),
    AllAtoms[1]: (0, num_sites),
    AllAtoms[2]: (5, 10),  # e.g., require 5–10 Pd
    AllAtoms[3]: (5, 10),  # e.g., require 5–10 Ag
}
m.addGlobalTypesDescriptor("Composition", bounds=CompBounds, rules=EqualTo(SumSites(desc=m.Yik)))

# Force the structure to be fully occupied (or any other site-based constraint)
m.Yi.rules.append(FixedTo(1.0))

# Build the objective from the cluster coefficients
obj_expr = SumClusters(desc=m.Zn, coefs=cluster_coeffs[1:])

# Solve the MILP
D = m.minimize(obj_expr, solver="neos-cplex", tilim=60)

A few highlights:

  • m.Zn: matopt variable for “is cluster \(n\) present?”
  • We linked m.Zn to m.Yik automatically, so Zn[n] = 1 only if all site–species pairs in that cluster are chosen.
  • Compositional constraints can restrict how many sites are allowed to be each type.
  • m.minimize(obj_expr) calls the solver. Use m.maximize(...) if needed.

Analyzing the Results

After solving, D (a Design object) gives the final configuration. We can export or convert to common structure formats:

Click to expand the code
D.toPDB("result.pdb")
D.toXYZ("result.xyz")
D.toPOSCAR("result.vasp")

from pymatgen.core import Structure
structure = D.to_pymatgen()
structure.to("cif", "result.cif")

That’s it! You have a fully integrated pipeline from DFT data → cluster expansion → MILP-based discrete optimization → final structure. You could run a DFT check on that final arrangement or evaluate further properties.

Remarks

Using icet to train a cluster expansion and matopt to solve the resulting discrete optimization problem provides a powerful path toward rational materials design. You can rapidly propose candidate structures that minimize or maximize a desired property, subject to constraints (composition, geometry, etc.). While we showed a small demonstration on a single Cu–Ni–Pd–Ag cell, you can scale up to larger supercells, more complex constraints, or different properties.

Back to top

Citation

BibTeX citation:
@online{yin2025,
  author = {Yin, Xiangyu},
  title = {Designing {Materials} with {Cluster} {Expansion} and
    {MatOpt}},
  date = {2025-01-27},
  url = {https://xiangyu-yin.com/content/post_cluster_expansion.html},
  langid = {en}
}
For attribution, please cite this work as:
Yin, Xiangyu. 2025. “Designing Materials with Cluster Expansion and MatOpt.” January 27, 2025. https://xiangyu-yin.com/content/post_cluster_expansion.html.