1 Background

Ordinary differential equation (ODE)-based simulation is a popular choice for modeling real-world systems, including metabolic networks, gene regulatory systems, and epidemiology (Battogtokh et al. 2002; Yu et al. 2007; Keeling and Rohani 2008). For the model to be representative, an inverse problem needs to be solved: estimating unknown parameters from experimental measurements. While an ODE model can often capture real-world dynamics, solving the inverse problem is hard, requiring large numbers of ODE solutions and nonlinear optimization (Battogtokh et al. 2002).

Inference of model parameters is often posed as a nonlinear optimization problem: model predictions are compared with measurements through an objective function, and better-fitting model parameters are searched for (Figure 1.1). This process requires many iterations of producing and evaluating different models. For example, a simplified model of central metabolism involves 32 equations and ~200 parameters and is fit with ~1000 data points. A Markov chain Monte Carlo (MCMC, see 5.1) algorithm (Battogtokh et al. 2002) needs around \(10^7\) ODE solutions. Even with a modern ODE solver (adaptive backward differentiation), this takes about a week to finish. When a parameter set makes the system stiff, considerably more, if not prohibitive, simulation time is needed. Previously, we constructed a program to systematically ignore stiff parameter sets (pearc2019). This reduces the simulation time but unavoidably introduces biases.

Figure 1.1: Inverse problem diagram. The goal is to find a model (in topology and parameters) that approximates the unknown generating model. The measurements are produced by an unknown underlying model and corrupted by random noise. By comparing predictions with the measurements, the assumed model can be improved. Such comparisons are often implemented through an objective function, which evaluates the parameters of the assumed model.

While ODE solvers have variable time costs, neural networks have a near-constant cost, which is also often small. There has been success in simulating dynamic systems with neural networks (Breen et al. 2020; Lutter, Ritter, and Peters 2019; Greydanus, Dzamba, and Yosinski 2019), but high-dimensional ODE initial value problems have not been carefully studied yet. In an ODE initial value problem, the input is the parameters of the assumed model and the output is the time dynamics. This is often the inner loop of the inverse problem. In this project, we explored the possibility of replacing the ODE solver with a trained neural network to solve the initial value problem.

2 Methods

The dynamics (\(Y(t)\)) of an ODE system depend solely on the initial conditions (\(Y_{ini}\)), the model parameters (\(\theta\)), and time (\(t\)). For both the ODE solver and the neural network, we can treat the vector \(X=[\theta, Y_{ini}, t]\) as the input and the vector \(Y(t)\) as the output. Solving such a system with an ODE solver produces pairs \([X, Y(t)]\) that serve as training data.
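As a toy illustration of this input/output layout (a hedged sketch with hypothetical variable names; the actual encoding is given in 5.2 and Table 2.1), one training pair for the smallest system in Table 2.1 could be assembled as follows.

import numpy as np

# Hypothetical sketch: assemble one training pair [X, Y(t)] for the
# 2-dimensional complex system of Table 2.1 (6 theta, 4 Y values, 1 time point).
theta = np.random.uniform(-1, 1, 6)      # model parameters
y_ini = np.random.uniform(-1, 1, 4)      # initial conditions (real and imaginary parts)
t = 0.5                                  # a single time point

x = np.concatenate([theta, y_ini, [t]])  # network input, length 11
# y_t = solve_ode(theta, y_ini, t)       # network target Y(t), length 4 (solver placeholder)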

2.1 Neural network architecture

ResNet has been a popular architecture for multiple tasks, including image classification, and it simplifies the training process through identity (shortcut) mappings (He et al. 2016). In our modified architecture, we replaced the convolutional layers with linear layers. For example, resnet18_mlp is similar to the original resnet18 (He et al. 2016), except that we used nn.Linear in place of nn.Conv2d (see the code block below; a minimal training-loop sketch follows it). The hidden layer size was controlled by its ratio (\(R\)) to the input layer size, so network complexity was adjusted automatically for dynamic systems of different sizes. Batch normalization and ReLU (rectified linear unit) activations were used in the architecture (Ioffe and Szegedy 2015). Adam was used as the optimizer (Kingma and Ba 2015). Hyperparameter tuning was done both manually and with Optuna (Akiba et al. 2019). Mean squared error (MSE) was the objective function, and 20% of the samples were held out for testing. Besides ResNet, multilayer perceptron (MLP) and recurrent architectures were also tested. The neural network was implemented in PyTorch and shared on GitHub.

import torch.nn as nn
import torch.nn.functional as F

# line1d is the linear-layer helper that replaces the convolution helpers of the
# original ResNet; as described above, it is essentially nn.Linear.
def line1d(in_features, out_features):
    return nn.Linear(in_features, out_features)

class BasicBlock(nn.Module):
    expansion=1
    __constants__=['downsample']

    def __init__(self,inplanes,planes,downsample=None,groups=1,
                 base_width=64,norm_layer=None,p=0.0):
        # inplanes: input size
        # planes: internal size
        # downsample: module that downsamples the identity mapping. Default: None
        # groups: used for "Aggregated Residual Transformations" (not used currently). Default: 1
        # base_width: used for "Wide Residual Networks" and "Aggregated Residual Transformations". Default: 64
        # norm_layer: batch normalization function. Default: None (nn.BatchNorm1d)
        # p: dropout probability. Default: 0.0
        super(BasicBlock,self).__init__()
        if norm_layer is None:
            norm_layer=nn.BatchNorm1d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        self.fc1=line1d(inplanes,planes)
        self.bn1=norm_layer(planes)
        self.relu=nn.ReLU(inplace=True)
        self.fc2=line1d(planes,inplanes)
        self.bn2=norm_layer(inplanes)
        self.downsample=downsample
        self.p=p

    def forward(self,x):
        identity=x
        out=F.dropout(self.fc1(x),training=self.training,p=self.p)
        out=self.bn1(out)
        out=self.relu(out)
        out=F.dropout(self.fc2(out),training=self.training,p=self.p)
        out=self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out=self.relu(out)
        return out
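For completeness, a minimal training-loop sketch under the choices stated above (Adam, MSE, a 20% test hold-out, and a hidden size set by the ratio \(R\)) might look like the following. This is not our exact training script: the data are random stand-ins with the shapes of the 3-dimensional system in Table 2.1, a plain MLP stands in for resnet18_mlp, and the learning rate and batch size are illustrative.

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, random_split

# Random stand-in data (17 inputs, 6 outputs); the real X, Y come from the simulation in 2.2.
X = torch.randn(10000, 17)
Y = torch.randn(10000, 6)

dataset = TensorDataset(X, Y)
n_test = int(0.2 * len(dataset))                   # hold out 20% for testing
train_set, test_set = random_split(dataset, [len(dataset) - n_test, n_test])
train_loader = DataLoader(train_set, batch_size=256, shuffle=True)

R = 8                                              # hidden layer size = R * input size
hidden = R * X.shape[1]
# Plain MLP stand-in for resnet18_mlp, illustrating only the hidden-size ratio.
model = nn.Sequential(nn.Linear(X.shape[1], hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
                      nn.Linear(hidden, Y.shape[1]))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(500):
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()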

The model was trained on P100 GPUs on both Sapelo2 at GACRC and PSC Bridges at XSEDE.

2.2 Training data simulation

We simulated dynamic systems of different complexities as training samples. A coupled linear ODE system was simulated with different dimensions (4–32 real-valued equations, Table 2.1), fixed topologies, and 10,000 randomly generated parameter sets. We shifted the eigenvalues to have negative real parts, to resemble most real-world (non-diverging) systems. For the linear ODE, we produced analytic solutions without an ODE solver; the detailed procedure is listed in 5.2.

Table 2.1: Dimensions of the simulated data sets. The equations are simulated in complex numbers, so the number of Ys (real and imaginary parts) is double the dimension of the system. Refer to 5.2 for simulation details.
| dimension of the system | number of theta | number of Ys (real + imaginary) | number of inputs | number of outputs |
|---|---|---|---|---|
| 2 | 6 | 4 | 11 | 4 |
| 3 | 10 | 6 | 17 | 6 |
| 4 | 14 | 8 | 23 | 8 |
| 8 | 30 | 16 | 47 | 16 |
| 12 | 54 | 24 | 79 | 24 |
| 16 | 74 | 32 | 107 | 32 |
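The column counts follow directly from the encoding in 5.2, step 7: each non-zero element of \(H\) and each element of \(Y_{ini}\) contributes a real and an imaginary input, and a single time value is counted per sample. A small sketch (hypothetical helper name) reproduces the counts.

def io_sizes(n_dim, n_nonzero):
    # n_dim: dimension of the complex system; n_nonzero: non-zero elements of H
    n_theta = 2 * n_nonzero          # real and imaginary parts of the H entries
    n_ys = 2 * n_dim                 # real and imaginary parts of Y_ini
    n_inputs = n_theta + n_ys + 1    # plus one time value per sample
    n_outputs = 2 * n_dim            # real and imaginary parts of Y(t)
    return n_theta, n_ys, n_inputs, n_outputs

print(io_sizes(2, 3))   # (6, 4, 11, 4), the first row of Table 2.1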

3 Results

We have successfully approximated solutions for low-dimensional ODE systems. For ODE systems with six equations, the neural network achieves an MSE of 0.2 on both the training and testing sets after 500 epochs (Figure 3.1). Representative training and testing cases are presented in Figure 3.2. Because the performance is similar for the training and testing sets, further training and hyperparameter tuning can probably improve the results. Figure 3.2 also shows that qualitatively different dynamics can be approximated. Similar performance can be achieved with an MLP.

Figure 3.1: MSE over epochs for multiple training settings. Preliminary training results are presented. Solid (dotted) lines represent the MSE on the training (testing) set. The best model is based on ResNet18, has a hidden layer size eight times that of the input layer, and uses the Adam optimizer. Details on the training process can be found in Methods 2.2.

Figure 3.2: Fitting results of the preliminary trained model. The first (second) row shows the model performance on the training (testing) set. The x-axis is time and the y-axis is normalized \(Y\). Two cases are presented for each of the training and testing sets. The blue curve is the ground truth (exact ODE solution) and the orange curve is the neural network estimate. Details of the neural network training can be found in Methods 2.2.

The neural network performance degrades for higher-dimensional ODE systems. Specifically, for ODE systems with 32 equations, the model performance is still not acceptable after extensive hyperparameter tuning (including with Optuna (Akiba et al. 2019)) and after trying different technical options. Multiple schedulers, recurrent architectures, and optimizers were tried, as well as larger training samples. Among these trials, the ReduceLROnPlateau scheduler, the Adam optimizer, and a larger sample size helped to improve performance.

4 Next steps

  1. Training the neural network on larger data sets. Efficient approaches are needed for generating large data sets and training on them; both processes will need parallelization. A separate data-loading process is also necessary because of the memory cost.

  2. Data augmentation is needed to aid training. A novel transformation of the time dynamics has been implemented and is currently being tested.

5 FAQ

5.1 Why is MCMC used instead of other nonlinear optimizers?

In the ensemble simulation procedure (Battogtokh et al. 2002), MCMC converges to a local fit region in the model parameter space. This local region contains many parameter sets (the model ensemble), and each set in the MCMC sample is of similar fit quality given the experimental data. An ensemble of similarly well-fitted models gives a natural uncertainty estimate for the fit. This is particularly crucial when only sparse experimental data are available for a model of high complexity.
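For readers unfamiliar with the ensemble idea, a minimal Metropolis-style sketch is given below. It is not the sampler of Battogtokh et al. (2002); it assumes a generic Gaussian proposal and a chi-square-like objective chi2(theta) obtained by solving the ODE model and comparing with data.

import numpy as np

def mcmc_ensemble(chi2, theta0, n_steps=10000, step_size=0.05, temperature=1.0):
    # Minimal Metropolis sampler (illustrative only): after convergence, the
    # collected parameter sets form an ensemble of similar fit quality, and
    # their spread gives an uncertainty estimate for the fit.
    theta = np.asarray(theta0, dtype=float)
    cost = chi2(theta)
    ensemble = []
    for _ in range(n_steps):
        proposal = theta + step_size * np.random.randn(theta.size)
        cost_new = chi2(proposal)
        if np.log(np.random.rand()) < -(cost_new - cost) / temperature:
            theta, cost = proposal, cost_new
        ensemble.append(theta.copy())
    return np.array(ensemble)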

5.2 The detailed procedure for simulating linear ODE

The procedure solves the linear ODE \(\frac{dY(t)}{dt}=HY(t)\) analytically and formats the input and output for the neural network. There are \(N\) dimensions (complex-valued equations) and \(M\) time points. A compact code sketch of the whole procedure follows the list.

  1. Randomly select the non-zero elements (NE) of the \(H\) matrix. Diagonal elements are always included; they represent the contribution of a variable to its own derivative. This topology (the locations of the non-zero elements) is fixed for the following simulations and recorded in \(I_{conn}\).

  2. Generate \(\widetilde{H}\)

    For each NE, sample real and imaginary parts from \(Unif(-1,1)\)

  3. Eigenvalue shifting

    Eigendecomposition \(\widetilde{H}=U'D'V'\) (with \(V'=U'^{-1}\)) and \(d'=diag(D')\)

    \(\Delta{d'}=|max(real(d'))|\)

    \(H=\widetilde{H}-S \Delta d' I\), where \(S=1.01\) ensures that all eigenvalues have negative real parts so that the dynamic system does not diverge.

    Eigendecomposition \(H=UDV\) (with \(V=U^{-1}\))

  4. Generate \(Y_{ini}\)

    Sample real and imaginary parts from \(Unif(-1,1)\)

  5. Generate the time series on the time grid \(\hat{t}=[t_1, ..., t_M]\)

    \(A=VY_{ini}\)

    \(B=[u_1 a_1, \dots, u_k a_k, \dots, u_N a_N]=U\,diag(A)\), where \(u_k\) is the \(k\)-th column vector of \(U\) and \(a_k\) is the \(k\)-th element of \(A\)

    \(E_{km}=e^{\hat{d}_k t_m}\), where \(\hat{d}=diag(D)\)

    \(Y(t)=BE\)

  6. Rescale time series

    For every equation \(n\) in \(1,\dots,N\), the normalized result is

    \(\widetilde{Y_n(t_m)} = \frac{Y_n(t_m)}{\sqrt{\Omega_{n}}}\) where \(\Omega_{n}=\sum^{M}_{m=1}{|Y_n(t_m)|^2}\)

  7. Formulate the input and output for the neural network

    Input: \([Re(H(I_{conn})), Im(H(I_{conn})), Re(Y_{ini}), Im(Y_{ini}),\hat{t}]\)

    Output: \([Re(Y(t)), Im(Y(t))]\)
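A compact NumPy sketch of steps 1–7 is given below for reference. The function name, the off-diagonal connection probability, and the time-grid range are illustrative assumptions, not the exact values used in our simulations; the full time grid is appended to the input as in step 7.

import numpy as np

def simulate_linear_ode(N=3, M=20, S=1.01, seed=None):
    # Sketch of the procedure above; returns one (input, output) pair.
    rng = np.random.default_rng(seed)

    # 1. Topology I_conn: all diagonal elements plus random off-diagonal couplings
    #    (the 0.3 connection probability is illustrative).
    conn = np.eye(N, dtype=bool) | (rng.random((N, N)) < 0.3)
    idx = np.where(conn)
    n_ne = idx[0].size

    # 2. H_tilde: real and imaginary parts of the non-zero elements ~ Unif(-1, 1).
    H = np.zeros((N, N), dtype=complex)
    H[idx] = rng.uniform(-1, 1, n_ne) + 1j * rng.uniform(-1, 1, n_ne)

    # 3. Eigenvalue shift so all real parts are negative, then re-decompose.
    d_tilde = np.linalg.eigvals(H)
    H -= S * abs(d_tilde.real.max()) * np.eye(N)
    d, U = np.linalg.eig(H)
    V = np.linalg.inv(U)

    # 4. Random complex initial condition.
    y_ini = rng.uniform(-1, 1, N) + 1j * rng.uniform(-1, 1, N)

    # 5. Analytic solution Y(t) = U diag(A) E, with A = V Y_ini, on an illustrative time grid.
    t = np.linspace(0.0, 5.0, M)
    A = V @ y_ini
    B = U * A                            # column k of U scaled by a_k, i.e. U diag(A)
    E = np.exp(np.outer(d, t))           # E[k, m] = exp(d_k * t_m)
    Y = B @ E                            # shape (N, M)

    # 6. Rescale each equation by the square root of its summed squared magnitude.
    Y /= np.sqrt((np.abs(Y) ** 2).sum(axis=1, keepdims=True))

    # 7. Flatten into the network input and output vectors.
    x = np.concatenate([H[idx].real, H[idx].imag, y_ini.real, y_ini.imag, t])
    y = np.concatenate([Y.real.ravel(), Y.imag.ravel()])
    return x, y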

6 Acknowledgements

This work is supported by NSF MCB-1713746. We thank GACRC and XSEDE for computational resources and technical support. XSEDE is supported by National Science Foundation grant number ACI-1548562. Our start-up allocation on XSEDE (TG-MCB180198) used GPU resources on PSC Bridges. We thank the Georgia Research Alliance, the Institute of Bioinformatics, and the Complex Carbohydrate Research Center for supporting this work.

References

Akiba, Takuya, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. “Optuna: A Next-Generation Hyperparameter Optimization Framework.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Anchorage, AK, USA, 2623–31. doi:10.1145/3292500.3330701.

Battogtokh, D., D. K. Asch, M. E. Case, J. Arnold, and H. B. Schuttler. 2002. “An Ensemble Method for Identifying Regulatory Circuits with Special Reference to the Qa Gene Cluster of Neurospora Crassa.” Proc Natl Acad Sci U S A 99 (26): 16904–9. doi:10.1073/pnas.262658899.

Breen, P. G., C. N. Foley, T. Boekholt, and S. P. Zwart. 2020. “Newton Versus the Machine: Solving the Chaotic Three-Body Problem Using Deep Neural Networks.” Monthly Notices of the Royal Astronomical Society 494 (2): 2465–70. doi:10.1093/mnras/staa713.

Greydanus, Samuel, Misko Dzamba, and Jason Yosinski. 2019. “Hamiltonian Neural Networks.” Advances in Neural Information Processing Systems 32, 15379–89. http://papers.nips.cc/paper/9672-hamiltonian-neural-networks.pdf.

He, K. M., X. Y. Zhang, S. Q. Ren, and J. Sun. 2016. “Deep Residual Learning for Image Recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–78. doi:10.1109/CVPR.2016.90.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” Proceedings of the 32nd International Conference on Machine Learning. Lille, France, 448–56.

Kingma, Diederik P., and Jimmy Ba. 2015. “Adam: A Method for Stochastic Optimization.” http://arxiv.org/abs/1412.6980.

Lutter, Michael, Christian Ritter, and Jan Peters. 2019. “Deep Lagrangian Networks: Using Physics as Model Prior for Deep Learning,” arXiv:1907.04490. https://ui.adsabs.harvard.edu/abs/2019arXiv190704490L.

Yu, Y., W. Dong, C. Altimus, X. Tang, J. Griffith, M. Morello, L. Dudek, J. Arnold, and H. B. Schuttler. 2007. “A Genetic Network for the Clock of Neurospora Crassa.” Proc Natl Acad Sci U S A 104 (8): 2809–14. doi:10.1073/pnas.0611005104.


  1. yue.wu@uga.edu