Neuro-Algorithmic Policies Enable Fast Combinatorial Generalization


Continuing the blackbox-differentiation line of work, we propose using time-dependent shortest-path solvers to enhance the generalization capabilities of neural network policies. By imposing a prior on the underlying goal-conditioned MDP structure, we extract well-performing policies through imitation learning that employ blackbox solvers for receding-horizon planning at execution time. As before, this comes at no cost to the optimality of the solver used.
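To make the blackbox-differentiation idea concrete, below is a minimal sketch of the underlying trick for a shortest-path solver, assuming (as a simplified stand-in for the time-dependent solvers above) a grid in which paths move only down or right. The solver is treated as a black box; its "gradient" with respect to the cost inputs is obtained by re-solving on perturbed costs and differencing the two solutions. The function names, the grid restriction, and the hyperparameter `lam` are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def shortest_path(costs):
    """Blackbox solver: DP shortest path on a grid from top-left to
    bottom-right (moves: down or right). Returns a 0/1 path indicator."""
    h, w = costs.shape
    dist = np.full((h, w), np.inf)
    dist[0, 0] = costs[0, 0]
    for i in range(h):
        for j in range(w):
            if i > 0:
                dist[i, j] = min(dist[i, j], dist[i - 1, j] + costs[i, j])
            if j > 0:
                dist[i, j] = min(dist[i, j], dist[i, j - 1] + costs[i, j])
    # Backtrack from the goal to recover the optimal path.
    path = np.zeros_like(costs)
    i, j = h - 1, w - 1
    path[i, j] = 1
    while (i, j) != (0, 0):
        if i > 0 and np.isclose(dist[i, j], dist[i - 1, j] + costs[i, j]):
            i -= 1
        else:
            j -= 1
        path[i, j] = 1
    return path

def blackbox_grad(costs, path, grad_output, lam=20.0):
    """Informative gradient through the blackbox solver: perturb the
    costs with the incoming loss gradient, re-solve, and return the
    scaled difference of the two solutions (Vlastelica et al., 2020)."""
    path_lam = shortest_path(costs + lam * grad_output)
    return -(path - path_lam) / lam
```

In an imitation-learning setup, `grad_output` would be the gradient of a loss comparing the solver's path to an expert path; `blackbox_grad` then yields a descent direction for the network that predicts the cost map, even though the solver itself is piecewise constant.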