Multi-objective optimization (MOO) aims to simultaneously optimize multiple conflicting objectives and has found important applications in machine learning, such as simultaneously minimizing classification and fairness losses. At an optimum, further optimizing one objective will necessarily increase at least one other objective, and decision-makers need to comprehensively explore multiple optima to pinpoint one final solution. We address the efficiency of exploring the Pareto front that contains all optima. First, stochastic multi-gradient descent (SMGD) is slow to converge to the Pareto front with large neural networks and datasets. Instead, we explore the Pareto front as a manifold from a few initial optima, based on a predictor-corrector method. Second, for each exploration step, the predictor iteratively solves a large-scale linear system that scales quadratically in the number of model parameters and requires one backpropagation to evaluate a second-order Hessian-vector product per iteration of the solver. We propose a Gauss-Newton approximation that scales linearly and requires only a first-order inner product per iteration. Third, we explore different linear-system solvers, including the MINRES and conjugate gradient methods, for approximately solving the linear systems. These innovations make the predictor-corrector method efficient for large networks and datasets. Experiments on a fair misinformation detection task show that 1) the predictor-corrector method can find Pareto fronts better than or similar to those of SMGD in less time, and 2) the proposed first-order method does not harm the quality of the Pareto front identified by the second-order method, while further reducing running time.
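As an illustration of why conjugate gradient pairs naturally with Hessian-vector products, the sketch below solves a small symmetric positive-definite system while touching the matrix only through products H·v. This is a generic matrix-free CG, not the paper's implementation; `hvp`, `H`, and `b` are illustrative stand-ins (the actual predictor would supply Hessian- or Gauss-Newton-vector products computed via backpropagation).

```python
import numpy as np

def conjugate_gradient(hvp, b, tol=1e-10, max_iter=100):
    """Solve H x = b using only matrix-vector products hvp(v) = H v."""
    x = np.zeros_like(b)
    r = b - hvp(x)           # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Small SPD test system standing in for the predictor's linear system.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T + 5 * np.eye(5)      # symmetric positive definite
b = rng.standard_normal(5)
x = conjugate_gradient(lambda v: H @ v, b)
```

Because the solver never materializes H, swapping the dense product for an automatic-differentiation Hessian-vector (or Gauss-Newton-vector) routine changes nothing in the loop.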
Low-Rank Univariate Sum of Squares Has No Spurious Local Minima
We study the problem of decomposing a polynomial p into a sum of r squares by minimizing a quadratically penalized objective f_p(u) = ‖∑_{i=1}^r u_i^2 − p‖^2. This objective is nonconvex and is equivalent to the rank-r Burer-Monteiro factorization of a semidefinite program (SDP) encoding the sum of squares decomposition. We show that for all univariate polynomials p, if r ≥ 2 then f_p(u) has no spurious second-order critical points, showing that all local optima are also global optima. This is in contrast to previous work showing that for general SDPs, in addition to genericity conditions, r has to be roughly the square root of the number of constraints (the degree of p) for there to be no spurious second-order critical points. Our proof uses tools from computational algebraic geometry and can be interpreted as constructing a certificate using the first- and second-order necessary conditions. We also show that by choosing a norm based on sampling equally-spaced points on the circle, the gradient ∇f_p can be computed in nearly linear time using fast Fourier transforms. Experimentally we demonstrate that this method has very fast convergence using first-order optimization algorithms such as L-BFGS, with near-linear scaling to million-degree polynomials.
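To make the objective concrete, the sketch below evaluates f_p(u) = ‖∑_i u_i^2 − p‖^2 directly in monomial-coefficient space (polynomial squaring is self-convolution of coefficient vectors) and minimizes it with L-BFGS. This uses the plain coefficient norm and SciPy's finite-difference gradients, not the paper's circle-sampled norm or FFT-based near-linear-time gradient; `fp`, `r`, and `k` are illustrative names.

```python
import numpy as np
from scipy.optimize import minimize

def fp(u_flat, p, r):
    """Penalized objective f_p(u) = || sum_i u_i^2 - p ||^2 in coefficient space."""
    k = u_flat.size // r          # each u_i has k coefficients (degree k - 1)
    u = u_flat.reshape(r, k)
    s = np.zeros(2 * k - 1)       # coefficients of sum_i u_i(x)^2
    for ui in u:
        s += np.convolve(ui, ui)  # squaring a polynomial = self-convolution
    return np.sum((s - p) ** 2)

# p(x) = (1 + x^2)^2 = 1 + 2x^2 + x^4, a sum of squares of degree-2 polynomials.
p = np.array([1.0, 0.0, 2.0, 0.0, 1.0])
r, k = 2, 3                       # r = 2 squares, each of degree <= 2
rng = np.random.default_rng(1)
res = minimize(fp, rng.standard_normal(r * k), args=(p, r), method="L-BFGS-B")
```

With r ≥ 2, the result in the abstract predicts that such a random first-order run should reach the global optimum value 0 rather than stall at a spurious local minimum.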
Award ID(s): 1835443
NSF-PAR ID: 10490265
Publisher / Repository: SIAM
Journal Name: SIAM Journal on Optimization
Volume: 33
Issue: 3
ISSN: 1052-6234
Page Range / eLocation ID: 2041 to 2061
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this


We study a deep-learning-inspired formulation for the blind demodulation problem, the task of recovering two unknown vectors from their entrywise multiplication. We consider the case where the unknown vectors are in the range of known deep generative models, G(1): R^n → R^l and G(2): R^p → R^l. When the networks corresponding to the generative models are expansive, the weight matrices are random, and the dimension of the unknown vectors satisfies l = Omega(n^2 + p^2), up to log factors, we show that the empirical risk objective has a favorable landscape for optimization. That is, the objective function has a descent direction at every point outside of a small neighborhood around four hyperbolic curves. We also characterize the local maximizers of the empirical risk objective and, hence, show that there do not exist any other stationary points outside of these neighborhoods around the four hyperbolic curves and the set of local maximizers. We also implement a gradient descent scheme inspired by the geometry of the landscape of the objective function. In order to converge to a global minimizer, this gradient descent scheme exploits the fact that exactly one of the hyperbolic curves corresponds to the global minimizer, and thus points near this curve have a lower objective value than points close to the other, spurious hyperbolic curves. We show that this gradient descent scheme can effectively remove distortions synthetically introduced to the MNIST dataset.
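A heavily simplified sketch of the empirical-risk formulation: here the two generative models are replaced by fixed random linear maps `G1` and `G2` (the paper uses expansive deep networks), and plain gradient descent is run from a point near the truth. All names, dimensions, and the step size are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_dim, l = 4, 4, 64
G1 = rng.standard_normal((l, n))        # linear stand-ins for the deep
G2 = rng.standard_normal((l, p_dim))    # generative models in the paper
a_true = rng.standard_normal(n)
b_true = rng.standard_normal(p_dim)
y = (G1 @ a_true) * (G2 @ b_true)       # observed entrywise product

def loss_and_grads(a, b):
    """Empirical risk ||(G1 a) . (G2 b) - y||^2 and its gradients."""
    w1, w2 = G1 @ a, G2 @ b
    res = w1 * w2 - y
    return res @ res, 2 * G1.T @ (res * w2), 2 * G2.T @ (res * w1)

# Start near the true latent codes and run plain gradient descent.
a = a_true + 0.1 * rng.standard_normal(n)
b = b_true + 0.1 * rng.standard_normal(p_dim)
loss0, _, _ = loss_and_grads(a, b)
lr = 1e-4
for _ in range(3000):
    loss, ga, gb = loss_and_grads(a, b)
    a -= lr * ga
    b -= lr * gb
```

Note the inherent scale ambiguity (a, b) → (c·a, b/c): the minimizers form exactly the kind of hyperbolic curve described in the abstract, so gradient descent converges to a point on that curve rather than to a unique pair.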

This paper addresses the problem of identifying error-in-variables switched linear models from experimental input/output data. This problem is known to be generically NP-hard and thus computationally expensive to solve. To address this difficulty, several relaxations have been proposed in the past few years. While solvable in polynomial time, these (convex) relaxations tend to scale poorly with the number of points and the number/order of the subsystems, effectively limiting their applicability to scenarios with a relatively small number of data points. To address this difficulty, in this paper we propose an efficient method that only requires performing a number of singular value decompositions equal to the number of subsystems, on matrices whose size is independent of the number of points. The underlying idea is to obtain a sum-of-squares polynomial approximation of the support of each subsystem one at a time, and to use these polynomials to segment the data into sets, each generated by a single subsystem. As shown in the paper, exploiting ideas from Christoffel functions allows these polynomial approximations to be found simply by performing SVDs. The parameters of each subsystem can then be identified from the segmented data using existing error-in-variables (EIV) techniques.
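A toy illustration of the Christoffel-function idea on 1-D data: build the empirical moment matrix of monomial features, invert it through an SVD, and evaluate the resulting sum-of-squares polynomial, which is small near the data's support and large away from it. This is not the paper's full segmentation pipeline; `deg`, `reg`, and the 1-D monomial feature map are illustrative choices.

```python
import numpy as np

def inverse_christoffel(X, deg=2, reg=1e-8):
    """Inverse Christoffel function v(x)^T M^{-1} v(x) from samples X.

    Small values flag points near the support of the data.  The inverse of
    the empirical moment matrix M is formed via an SVD, mirroring how the
    paper reduces the support approximation to SVD computations."""
    def features(x):
        # monomial features [1, x, ..., x^deg] for 1-D inputs (illustrative)
        x = np.atleast_1d(x).ravel()
        return np.stack([x ** k for k in range(deg + 1)], axis=1)

    V = features(X)
    M = V.T @ V / len(V)                  # empirical moment matrix
    U, s, _ = np.linalg.svd(M)            # SVD instead of explicit inversion
    Minv = U @ np.diag(1.0 / (s + reg)) @ U.T
    return lambda x: features(x) @ Minv @ features(x).T

rng = np.random.default_rng(0)
data = rng.normal(0.0, 0.5, size=200)     # samples concentrated near 0
lam = inverse_christoffel(data)
```

Thresholding `lam` then separates points generated by this subsystem from the rest, which is the segmentation step the paper repeats one subsystem at a time.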

Variance-reduction techniques like SVRG provide simple and fast algorithms for optimizing a convex finite-sum objective. For nonconvex objectives, these techniques can also find a first-order stationary point (with small gradient). However, in nonconvex optimization it is often crucial to find a second-order stationary point (with small gradient and almost-PSD Hessian). In this paper, we show that Stabilized SVRG (a simple variant of SVRG) can find an ε-second-order stationary point using only O(n^{2/3}/ε^2 + n/ε^{1.5}) stochastic gradients. To the best of our knowledge, this is the first second-order guarantee for a simple variant of SVRG. The running time almost matches the known guarantees for finding ε-first-order stationary points.
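For reference, the sketch below shows the basic SVRG update (an anchor full gradient plus variance-reduced stochastic steps) on a least-squares finite sum; the Stabilized variant analyzed in the paper adds a stabilizing shift on top of this loop, which is omitted here. Problem sizes, step size, and epoch counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                    # noiseless targets: x_star minimizes every term

def grad_i(i, x):
    """Gradient of the i-th summand (a_i^T x - b_i)^2."""
    return 2 * (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    """Full-batch gradient of the finite-sum objective (1/n) sum_i (a_i^T x - b_i)^2."""
    return 2 * A.T @ (A @ x - b) / n

x = np.zeros(d)
lr, m = 0.01, 100
for _ in range(30):               # outer epochs
    x_ref = x.copy()
    mu = full_grad(x_ref)         # anchor gradient at the reference point
    for _ in range(m):            # inner loop of variance-reduced steps
        i = rng.integers(n)
        x = x - lr * (grad_i(i, x) - grad_i(i, x_ref) + mu)
```

The correction term `grad_i(i, x) - grad_i(i, x_ref) + mu` keeps each step unbiased while its variance vanishes as `x` and `x_ref` approach the optimum, which is what yields the fast rates the abstract cites.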

The maximal coding rate reduction (MCR²) objective for learning structured and compact deep representations is drawing increasing attention, especially after its recent use in the derivation of fully explainable and highly effective deep network architectures. However, it lacks a complete theoretical justification: only the properties of its global optima are known, and its global landscape has not been studied. In this work, we give a complete characterization of the properties of all of its local and global optima, as well as other types of critical points. Specifically, we show that each (local or global) maximizer of the MCR² problem corresponds to a low-dimensional, discriminative, and diverse representation, and furthermore, each critical point of the objective is either a local maximizer or a strict saddle point. Such a favorable landscape makes MCR² a natural choice of objective for learning diverse and discriminative representations via first-order optimization methods. To validate our theoretical findings, we conduct extensive experiments on both synthetic and real data sets.
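A small numeric sketch of the MCR² objective itself, as I understand its standard form: the coding rate R(Z) = ½ logdet(I + d/(nε²) ZZᵀ) of the whole representation minus the class-size-weighted rates of the per-class parts. Representations whose classes span distinct subspaces score higher than ones collapsed onto a shared subspace, matching the characterization of maximizers above; ε and the toy data are illustrative.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T) for a d x n representation Z."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * Z @ Z.T)
    return 0.5 * logdet

def mcr2(Z, labels, eps=0.5):
    """Rate reduction: whole-set rate minus class-size-weighted per-class rates."""
    n = Z.shape[1]
    expand = coding_rate(Z, eps)
    compress = sum((np.sum(labels == c) / n) * coding_rate(Z[:, labels == c], eps)
                   for c in np.unique(labels))
    return expand - compress

rng = np.random.default_rng(0)
labels = np.array([0] * 20 + [1] * 20)
# Two classes spanning orthogonal 2-D subspaces of R^4 ("diverse and discriminative") ...
Z_good = np.concatenate([np.eye(4)[:, :2] @ rng.standard_normal((2, 20)),
                         np.eye(4)[:, 2:] @ rng.standard_normal((2, 20))], axis=1)
# ... versus both classes collapsed onto the same 2-D subspace.
Z_bad = np.concatenate([np.eye(4)[:, :2] @ rng.standard_normal((2, 20)),
                        np.eye(4)[:, :2] @ rng.standard_normal((2, 20))], axis=1)
```

Maximizing `mcr2` therefore pushes the whole representation to expand while each class compresses, which is exactly the structure the landscape result says every local maximizer attains.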