Derivative Free Optimization

The book I want to start with is on derivative free optimization[1]. My plan is to do about a chapter a day, though since this text has some denser mathematics than some of the others I'll be looking at, I might slow that down a little bit. We'll see.

As is the case with most books, the first chapter is an introduction, or in other words, it answers the question "Why should I care?" And it's certainly true in the context of optimization that the words "derivative free" should raise some eyebrows. In introductory calculus classes, we define minima (or optima) as being points for which the derivative of a function vanishes. If we're just thinking conceptually, the best way to find such points is to simply "slide downhill", that is, follow the derivative to the point where it bottoms out. Doing that carefully and robustly is not always simple, of course, and increasing the number of dimensions tends to make things a bit more complicated, but that's the basic idea behind most derivative based optimization schemes.

So, why throw away all of that juicy information contained in the derivatives? The answer, of course, is that if you can use the derivatives, you should. Quoting the authors (of a book on derivative-free optimization!):

Finally, we want to make a strong statement that often councils against the use of derivative-free methods: if you can obtain clean derivatives (even if it requires considerable effort) and the functions defining your problem are smooth and free of noise you should not use derivative-free methods.

Buried in that warning, however, are the reasons you might be forced to use derivative-free optimization. Taking them in reverse:

  • Your function may be noisy. This is especially true of situations where your objective function is the output of some larger simulation (numerical noise), but can also be the case if there are sources of noise in the function itself. As computer power increases, and (borrowing a term from J.P. Boyd) arithmurgists become more bold, these kinds of noisy problems will be more and more common. The real world is noisy, after all.
  • Your function may have discontinuities or other features that make derivatives problematic or nonexistent. Noise makes this kind of problem worse, which is why I put it first, but even in the absence of noise, you may want to optimize a discontinuous function.
  • Finding the derivatives for your function may be too expensive. If function evaluations are especially costly, and if the dimensionality of the system you are modeling is high, even finite difference approximations to the derivative (which may or may not be well-behaved enough to actually use; see notes above) can be prohibitively expensive. If you're relying on code that is either proprietary (so you don't have the source) or legacy (so you can't puzzle out the source in a reasonable amount of time) you won't be able to use automatic differentiation methods, either. So, you're stuck.

If it is the case that derivative-free methods are what you want, then you'd like some assurance that they're going to work. The authors give three criteria which, if they hold, should guarantee convergence. They are:

  1. "They incorporate some mechanism to impose descent away from stationarity." I take this to mean that the proposed solution has to get better as the method iterates.
  2. "They must guarantee some form of control of the geometry of the sample sets where the function is evaluated." Call this shape control. (I'll come back to this in a moment.)
  3. "They must drive the step size parameter to zero." Call this size control.

I needed to unpack this a little (and my comments above start that process). If we were doing a finite difference approximation to the derivative, we would have to sample the space at enough points to actually make the approximation---hence the comment about cost above. If that's too expensive, we want to take fewer samples of the space, but we can't get entirely away from sampling the space at some number of points; given a set of inputs our algorithm will predict a next guess, and we then need to evaluate the function at that point to see if the the value improved. These three rules simply constrain the process of guessing. The first says that the guesses have to get better. In the case of a simplex algorithm like Nelder-Mead, this is done by moving away from the worst point in the simplex. Other algorithms take other approaches (which will no doubt be covered later in the book).

The third rule says that the distance of each guess from the inputs we started with has to shrink as the algorithm evolves, until eventually the algorithm has found a fixed point. In the absence of derivatives, this is the best we can do for a stopping condition; we don't have another way to assess whether there is something better out there for us to find. But, if the motion stops, (the step size gets so small that it won't be going anywhere in the time we're willing to wait), we're done.

The second rule is the tricky one. It requires that the algorithm be able to constrain the shape of the points used for sampling. This is similar to the caveat used when evaluating the limit in proving the divergence theorem; we make the volume go to zero in such a way that all of the radii stay roughly the same. This rule isn't quite as explicit, but the purpose is the same: avoid pathological cases where the volume goes to zero but the points are far apart. The authors note that the lack of such control make Nelder-Mead fail for certain problems and initial conditions.

I'll wrap up with a one-off from the middle of the chapter, referring to genetic algorithms, particle swarms, neural networks, and other heuristic methods:

The authors think of these methods as a last resort (that is, applicable to problems where the search space is necessarily large, complex, or poorly understood and more sophisticated mathematical analysis is not applicable) and would use them only if nothing better is available.

Particle swarms are where much of my attention has been focused for the last little while. So, I'm looking forward to the more sophisticated, better methods the authors have in their toolboxes.

[1] A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative Free Optimization
[Bibtex]
@book{DFO:2008,
author = {Andrew R. Conn and Katya Scheinberg and Luis N. Vicente},
title = {Introduction to Derivative Free Optimization},
publisher = {Society for Industrial and Applied Mathematics}
year = {2009}
}

Comments are disabled for this post