Weather forecasting is a tricky problem. Traditionally, it has been done by manually modelling weather dynamics using differential equations, but this approach is highly dependent on us getting the equations right. To avoid this problem, we can use machine learning to directly predict the weather, which let’s us make predictions without modelling the dynamics. However, this approach requires huge amounts of data to reach good performance. Fortunately, there is a middle ground: What if we instead use machine learning to model the dynamics of the weather?
Instead of trying to model how the weather will look in the next time step, what if we instead model how the weather changes between time steps? More concretely: What if we learn the differential equations that govern the change in weather? In this blog article we are going to use Julia and the SciML ecosystem to do just that. We are going to see how neural ordinary differential equations (neural ODEs) relate to “regular” networks, how to train them and see how they can extrapolate time series from just a tiny amount of training data.
Neural ODEs for time series
To start us off, let’s talk about how neural ODEs are used for time series modeling. Recall that in a standard machine learning setting we would assume a discrete set of observations \(y_0, y_1, \dots y_k, \) \( y_i \in \mathbb{R}^n \) at time points \(t_0, t_1, \dots t_k, \), \(t_i \in \mathbb{R} \) to be related through some function \( y_i = f(t_i; \theta) \) where \( \theta \) are learnable parameters. However, in a neural ODE we consider a continuous setting and instead assume that the change in \(y\) is governed by an ODE
\[ \frac{\partial y}{\partial t} = f(y; \theta). \]
The goal is hence not to learn the relationship between \(y\) and \(t\), but the underlying dynamics of change. If the dynamics do not change too much, this has very powerful generalisation capabilities. It is helpful to think of this formulation as “a neural network inside an ODE” or maybe “an ODE with learnable parameters”. In fact, the “forward pass” through a neural ODE is equivalent to solving an initial value problem, where \( y(t_0) \) is the input features and we replace hand-crafted equations with a neural network. This means that a single forward pass gives us an entire trajectory in contrast to e.g. RNNs, where each forward pass through the model gives a single prediction in time.
To make this more concrete, consider a forward pass for \(y(t_0) = 1.0\) where the model has been trained on \(f^{\star}(y) = \exp(1.5y)\). The forward pass consist of inputting \(y(t_0)\) and using an ODE solver to step forward in time to solve \[ y(t) = y(t_0) + \int_{t_0}^t \frac{\partial y}{\partial t} \partial t = y(t_0) + \int_{t_0}^t f^{\star}(y(t)) \partial t, \] where we use a neural network to model \(f^{\star}\).
So how do we train a network inside an ODE? As long as we can
take gradients of \(\theta\) with respect to the loss we can train it
using standard gradient based methods. Fortunately,
DiffEqFlux.jl
takes care of everything required to do this for us.
There are several strategies that can be specified to compute
gradients, and depending on
the problem you might prefer one over the other. However, for this article
the default InterpolatingAdjoint
will be perfectly fine.
Model implementation
Now that we have a good conceptual idea of the model, let’s see it
in practice. We are going to use DiffEqFlux.jl to do the heavy lifting, which combines the ODE
solvers of DifferentialEquations.jl with the differentiable
programming capabilities of Flux.jl. Using DiffEqFlux
, we can simply
construct a neural network to model \(f\) and plug that into a NeuralODE
object. The NeuralODE
object itself has a few additional important
hyper-parameters though. Firstly, we have to specify an ODE solver and a time span to
solve on. We will use the Tsit5
solver, which uses an explicit
method. Secondly, the parameters reltol
and abstol
let us configure the solution
error tolerance to trade off accuracy and training time.
Recall that a forward pass means solving an initial value problem,
hence a lower tolerance gives a more accurate solution, and in turn
better gradient estimates. But of course, this requires
more function evaluations and are consequently slower to compute.
using DiffEqFlux
function neural_ode(t, data_dim; saveat = t)
f = FastChain(FastDense(data_dim, 64, swish),
FastDense(64, 32, swish),
FastDense(32, data_dim))
node = NeuralODE(f, (minimum(t), maximum(t)), Tsit5(),
saveat = saveat, abstol = 1e-9,
reltol = 1e-9)
end
The Delhi dataset
The dataset we are going to use comprises daily measurements of the climate in Delhi over several years. The entire dataset is a single time series, where the last part is set aside for testing. We will combine both the train and test set though, since we won’t even need half of the training set to fit a good model. Let’s load up the data and visualize it.
using DataFrames, CSV
delhi_train = CSV.read("data/timeseries/dehli-temp/DailyDelhiClimateTrain.csv")
delhi_test = CSV.read("data/timeseries/dehli-temp/DailyDelhiClimateTest.csv")
delhi = vcat(delhi_train, delhi_test)
Something is off with the air pressure measurements, indicated by the extreme outliers after 2016. However, prior to 2016 the measurements show a nice periodic behavior.
As an aside, if you read the dataset description it claims that the Mean pressure
variable is measured in atm. Now, I have never been to Delhi, but I have my doubts since that would
mean that the city of Delhi suffers under a thousand atmospheres
worth of pressure. While such a phenomenon would be pretty cool and without a doubt get
scientists excited, I think it is more likely the units simply got
mixed up. Based on that, I have chosen to present the data in
hectopascal, which is commonly used in meteorological forecasts
instead of using its documented unit. Anyway, let’s prepare the data for model training.
Data pre-processing
All the metrics vary a lot from day to day, so to emphasis the overall trend in the data we will average the observations into months.
using Statistics
using Base.Iterators: take, cycle
delhi[:,:year] = Float64.(year.(delhi[:,:date]))
delhi[:,:month] = Float64.(month.(delhi[:,:date]))
df_mean = by(delhi, [:year, :month],
:meantemp => mean,
:humidity => mean,
:wind_speed => mean,
:meanpressure => mean)
rename!(df_mean, [:year, :month, :meantemp,
:humidity, :wind_speed, :meanpressure])
df_mean[!,:date] .= df_mean[:,:year] .+ df_mean[:,:month] ./ 12;
In addition to averaging, we will normalize the data. The features are normalized to have zero mean and unit variance, and the temporal dimension is shifted to start at \(0\). Finally, we take the first \(20\) observations as our training data and leave the remaining for testing.
features = [:meantemp, :humidity, :wind_speed, :meanpressure]
t = df_mean[:, :date] |>
t -> t .- minimum(t) |>
t -> reshape(t, 1, :)
y = df_mean[:, features] |>
y -> Matrix(y)' |>
y -> (y .- mean(y, dims = 2)) ./ std(y, dims = 2)
T = 20
train_dates = df_mean[1:T, :date]
test_dates = df_mean[T+1:end, :date]
train_t, test_t = t[1:T], t[T:end]
train_y, test_y = y[:,1:T], y[:,T:end];
Since we are training our model using gradient descent, we run the risk of getting stuck into a bad local minima. This is particularly true when training on a periodic time series, since the model can easily settle with predicting the time series mean. Mini-batching can be used to mitigate this, but for this problem we are going to employ a different method. We are going to train on the first couple of observations until convergence, then introduce a few more observations. Then train until convergence, introduce some more observations, etc… By doing this, we let the model adapt to local changes, resulting in a better fit. The code below implements this training procedure.
using OrdinaryDiffEq, Flux, Random
function train_one_round(node, θ, y, opt, maxiters,
y0 = y[:, 1]; kwargs...)
predict(θ) = Array(node(y0, θ))
loss(θ) = begin
ŷ = predict(θ)
Flux.mse(ŷ, y)
end
θ = θ == nothing ? node.p : θ
res = DiffEqFlux.sciml_train(
loss, θ, opt,
maxiters = maxiters;
kwargs...
)
return res.minimizer
end
function train(θ = nothing, maxiters = 150, lr = 1e-2)
log_results(θs, losses) =
(θ, loss) -> begin
push!(θs, copy(θ))
push!(losses, loss)
false
end
θs, losses = [], []
num_obs = 4:4:length(train_t)
for k in num_obs
node = neural_ode(train_t[1:k], size(y, 1))
θ = train_one_round(
node, θ, train_y[:, 1:k],
ADAMW(lr), maxiters;
cb = log_results(θs, losses)
)
end
θs, losses
end
Random.seed!(1)
θs, losses = train();
The model trains in a minute or two and fits nicely to the training data. Of course, a more interesting question is whether the model generalises or not. To find out we solve the initial value problem for \(f\) with trained parameters and extrapolate into the future.
Plotting the predictions show the model has successfully learned the dynamics in the training data and does a really good job extrapolating. Recall that the model was only presented with the first observation in prediction time.
Ending notes
In this article we took a look at neural ODEs and how to use them for time series modelling in Julia. They are a very cool class of models, and while they really shine when modeling physical systems or time series, they are of course not a silver bullet. However, I only covered half the story of neural ODEs in this article. Remember how I talked about them as “neural networks in ODEs”? It turns out that there is nothing stopping us from doing the opposite and plugging an ODE inside a neural network, or using an ODE as stand-alone function approximator. In fact, this has proven particularly useful in normalizing flows, since ODEs naturally model homeomorphisms.
Not being able to model arbitrary functions is both a strength and a weaknesses. Because of this, ODEs come with a very strong inductive bias, which is how they can extrapolate from so little data. That is not to say they are inexpressive though. In fact they can be used as universal function approximators much like fully connected networks through augmented neural ODEs.
Finally I would like to touch upon the amount of knobs and dials to tweak when training neural ODEs. While the neural network used to model the dynamics does not have to be particularly large, we still have the regular parameters to worry about such as network architecture, batch size and learning rate. In addition, we now have to worry about which ODE solver we use, the stiffness of the system, and which sensitivity algorithm we use. While neural ODEs combines the best from differential equations and differentiable programming, it is important to remember that it also combines the complexities.
Thank you for reading! Let me know if you have any questions or comments.