# Field reports from ICML 2017 in Sydney

My colleague Mikolaj Binkowski of Hellebore Capital attended the 34th International Conference on Machine Learning (ICML 2017) in Sydney to represent the company and present his work on deep learning for time series: Autoregressive Convolutional Neural Networks for Asynchronous Time Series.

Below are some of his highlights from ICML 2017:

#### Day 1 - tutorials

Attention-based recurrent neural network models were among the most discussed topics during the tutorials. The ‘Sequence-to-sequence modelling’ tutorial (by O. Vinyals and N. Jaitly from Google DeepMind/Google Brain) sadly focused mostly on machine translation. Although the attention RNN is the model of choice at the moment, RNNs in general are deemed too time-consuming to train, and it is expected that some form of convolutional network will take the lead in the future. WaveNet (van den Oord et al., 2016) is one successful example.

The tutorial on Deep Learning for Health Care Applications also focused on modelling time series. The speakers presented their own ‘RETAIN’ attention model, based on two recurrent networks, where one computes weights for the outputs generated by the other, in a manner very similar to Hellebore’s Significance-Offset networks.

The speakers also discussed one of the main problems in applying ML to healthcare: model interpretability. Since the emergence of successful deep neural networks, many problems in healthcare have been approached using deep learning, which has left the current state-of-the-art models as black boxes. Researchers try to tackle this problem by mimicking the state of the art with interpretable ML models (such as boosting or regression): the interpretable model is trained to predict the outputs (or intermediate features) of the ‘best’ deep network instead of the ground truth. However, the problem remains open for convolutional architectures used in medical imaging.
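The mimicking idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `teacher_predict` is a hypothetical stand-in for a trained deep network, the data are synthetic, and the interpretable student is plain least-squares regression rather than any model from the tutorial.

```python
import numpy as np

def teacher_predict(X):
    # Hypothetical stand-in for a trained black-box network:
    # a fixed nonlinear function of three input features.
    return np.tanh(1.5 * X[:, 0] - 0.5 * X[:, 1]) + 0.1 * X[:, 2] ** 2

def fit_mimic(X, teacher_outputs):
    # Interpretable student: linear regression with an intercept,
    # fitted to the teacher's outputs instead of ground-truth labels.
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, teacher_outputs, rcond=None)
    student_pred = A @ coef
    # Fidelity: R^2 of the student measured against the teacher.
    ss_res = np.sum((teacher_outputs - student_pred) ** 2)
    ss_tot = np.sum((teacher_outputs - teacher_outputs.mean()) ** 2)
    return coef, 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # synthetic inputs (assumption)
coef, fidelity = fit_mimic(X, teacher_predict(X))
print("student coefficients:", coef[:3])
print("fidelity (R^2 against the teacher):", fidelity)
```

The student's coefficients can be read directly as feature effects, and the fidelity score indicates how much of the black box's behaviour the simpler model actually captures.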

#### Day 2

The day started with an invited talk by Bernhard Scholkopf, who presented several theoretical results on causality between random variables and on inference in such settings. He also showed applications in empirical astronomy that are quite unusual for the ML community, such as de-noising light signals for exoplanet detection.

The rest of the day was split between a number of thematic sessions, such as deep learning theory, optimisation, deep generative models, reinforcement learning and recurrent networks, all of which focused on fresh research papers from each domain.

Vitaly Kuznetsov (Google Research) discussed challenges for neural networks such as the lack of theory, the difficulty of choosing the right architecture, and non-convex optimisation. The proposed AdaNet procedure (adaptive neural network) aims to tackle these problems through step-by-step simultaneous learning of the structure (i.e. depth, width) and the parameters of the network itself. AdaNet possesses some desirable theoretical guarantees on the optimality of the learned architecture. Promising empirical results were also shown for simple ‘toy’ datasets as well as for bigger publicly available ones.

Another interesting and much-awaited paper was Wasserstein Generative Adversarial Networks by M. Arjovsky, who proposed training adversarial networks using an approximation of the Wasserstein distance. This approach has already been improved upon and was mentioned in many other talks and papers.

Also worth mentioning is the presentation from UC Berkeley on Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, which focused on initialisation procedures that can be used to obtain dramatically faster training in a multi-task reinforcement learning setting.

#### Day 3

One of the most interesting papers from the recurrent neural network session introduced the SRU, or Statistical Recurrent Unit (from B. Poczos' group at CMU). The idea behind the model is to simplify recurrent networks by keeping only multiple exponential moving averages in the memory cells (instead of the more complicated features obtained through input and forget gates, as in LSTM and other recurrent architectures). The SRU, however, still allows nonlinear features through an activated output gate and possibly additional feed-forward layers.
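A minimal sketch of the memory idea described above: one exponential moving average per decay rate, updated at every time step. This is a simplification under stated assumptions, not the SRU itself: the averages here are taken over the raw input rather than over learned features, and the function name `multi_scale_ema` and the decay rates are chosen for illustration.

```python
import numpy as np

def multi_scale_ema(x, alphas):
    """Track one exponential moving average of x per decay rate in alphas.

    Update rule for each alpha: mu_t = alpha * mu_{t-1} + (1 - alpha) * x_t.
    Returns an array of shape (len(x), len(alphas)).
    """
    alphas = np.asarray(alphas, dtype=float)
    mu = np.zeros_like(alphas)              # memory starts at zero
    out = np.empty((len(x), len(alphas)))
    for t, x_t in enumerate(x):
        mu = alphas * mu + (1.0 - alphas) * x_t
        out[t] = mu
    return out

# Illustrative input: a sine wave seen at four time scales.
x = np.sin(np.linspace(0, 6 * np.pi, 200))
memory = multi_scale_ema(x, alphas=[0.0, 0.5, 0.9, 0.99])
print(memory[-1])
```

Small decay rates follow the most recent inputs closely, while rates near 1 retain a long history, so a handful of averages together summarise the sequence at several time scales.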

DeepBach: a Steerable Model for Bach Chorales Generation (from Frank Nielsen and his collaborators) proposes a quite complicated neural network architecture coupled with Gibbs sampling to generate quite realistic Bach chorales. In one of the evaluation experiments the authors asked ~1300 people to judge whether a piece was generated by computer or was an original composition by J.S. Bach; the best model was able to trick 50% of the participants.

Tensor-Train Recurrent Neural Networks (from a Siemens and Univ. of Munich team) have been proposed to tackle dimensionality problems in video modelling, as the current CNN + RNN approaches suffer from the heavy matrices that process the input. The proposed idea utilises the Novikov approximation of 4/5-dimensional tensors and lets the authors achieve the second-best results in the task with networks having ‘just’ 6 million parameters.

The day ended with an invited talk by Peter Donnelly, a statistician from Oxford-based Genomics Plc., who discussed achievements and future challenges in statistical medicine that emerge with the increasing feasibility of human genome sequencing.

#### Day 4

In the morning plenary talk, Raia Hadsell from Google DeepMind presented DeepMind’s lab for experiments in reinforcement learning, focusing mostly on virtual 3D environments with real-world or video-game physics. Complicated algorithms that consist of multiple LSTMs for memorising policy, movements and environment features were shown to be successful in teaching a humanoid model to run and bypass obstacles in such environments.

Dynamic Word Embeddings from Disney Research proposes new algorithms for training word embeddings based on Ornstein-Uhlenbeck processes. The stochastic construction of the model adds uncertainty to the word meaning in the embedded space. A particularly interesting experiment showed the paths of word embeddings as the training proceeded from 19th century literature to modern books; the model showed how distances between certain pairs of words changed with, e.g., the emergence of technology.
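For readers unfamiliar with the Ornstein-Uhlenbeck process, here is a minimal simulation sketch. It is not the paper's model: a single scalar path stands in for one embedding coordinate drifting over time, and the function `simulate_ou`, its parameter values and the Euler-Maruyama discretisation are all assumptions made for illustration.

```python
import numpy as np

def simulate_ou(n_steps, theta=0.5, mu=0.0, sigma=0.3, x0=1.0, dt=0.1, seed=0):
    """Euler-Maruyama simulation of dX = theta * (mu - X) dt + sigma dW.

    theta: speed of mean reversion, mu: long-run mean,
    sigma: noise scale, x0: starting value, dt: time step.
    Returns an array of length n_steps + 1.
    """
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for t in range(n_steps):
        noise = rng.normal() * np.sqrt(dt)   # Brownian increment over dt
        x[t + 1] = x[t] + theta * (mu - x[t]) * dt + sigma * noise
    return x

path = simulate_ou(200)
print(path[:5])
```

The process is a random walk that is continually pulled back towards its mean, which is one way to let a quantity change smoothly over time without wandering off arbitrarily far.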

M. Cuturi proposed a new approach to Dynamic Time Warping (an old measure of discrepancy between time series that is robust to dilations and shifts) which makes it differentiable and therefore amenable to gradient-descent optimisation. Although mostly a theoretical paper, Soft-DTW: a Differentiable Loss Function for Time-Series presents some interesting applications to time series classification and to interpolation between time series (which could possibly be useful for interpolating quotes between different market makers, at least from a statistical point of view).
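A minimal sketch of the idea, assuming the usual dynamic-programming form of DTW with the hard minimum replaced by a smooth one. The squared-difference cost, the function names and the default `gamma` are choices made for illustration; this is a plain NumPy forward pass, not the authors' implementation, and it does not compute gradients.

```python
import numpy as np

def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between two 1-D series with squared-difference cost."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    cost = (x[:, None] - y[None, :]) ** 2
    # R[i, j] holds the soft alignment cost of x[:i] and y[:j].
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = cost[i - 1, j - 1] + soft_min(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma
            )
    return R[n, m]

a = [0.0, 1.0, 2.0]
b = [0.0, 1.0, 1.0, 2.0]   # same shape as a, with one value repeated
print(soft_dtw(a, b, gamma=0.01))
print(soft_dtw(a, b, gamma=1.0))
```

As `gamma` shrinks towards zero the smooth minimum approaches the hard minimum and the value approaches classical DTW; larger values of `gamma` give a smoother function of the inputs.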

Image-to-Markup Generation with Coarse-to-Fine Attention (from Harvard University) uses convolutional networks with recurrent autoencoders to extract text from pictures. The experiments were carried out on 100K LaTeX documents with complex mathematical formulas.

The main conference finished with an invited talk by Latanya Sweeney, Professor of Data Security at Harvard University. The talk focused on the social issues that artificial intelligence is causing or may cause in the future, and on the impact that state policies and regulations have on these issues (which, in general, can be positive or negative depending on the situation).