[Bloomberg Meetup] The Forefront of Technologies in Finance

Interesting talk by Gary Kazantsev, head of Bloomberg’s Machine Learning Engineering team, who presented the problems his team is working on.

Among the many problems they work on, he considers the following two the most interesting:

  • market impact indicators (of the news)
  • question answering

Market impact indicators:

They have a tremendous amount of data and very imbalanced classes: only a tiny fraction of news items actually move the market. How can they build a dataset to train a classifier that decides whether a given piece of news may impact the market or not?

How they approach it (a rough sketch in code follows the list):

  1. anomaly detection in the price time series
  2. gathering the news published shortly before each anomaly
  3. human labeling of these news items
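
A minimal sketch of that pipeline, with made-up data structures (a price series indexed by timestamps and a news table with a `published` column); the z-score rule below is just a crude stand-in for whatever anomaly detector they actually use:

```python
import pandas as pd

def detect_anomalies(prices: pd.Series, z_thresh: float = 4.0) -> pd.DatetimeIndex:
    """Flag timestamps whose return is an extreme outlier (crude anomaly detection)."""
    returns = prices.pct_change().dropna()
    z = (returns - returns.mean()) / returns.std()
    return returns.index[z.abs() > z_thresh]

def candidate_news(news: pd.DataFrame, anomalies: pd.DatetimeIndex,
                   window: pd.Timedelta = pd.Timedelta("30min")) -> pd.DataFrame:
    """Gather the news items published shortly before each anomaly;
    these are the candidates handed to human labelers."""
    mask = pd.Series(False, index=news.index)
    for t in anomalies:
        mask |= (news["published"] > t - window) & (news["published"] <= t)
    return news[mask]
```

The human labels on the surviving items are what turns this into a training set for the market impact classifier.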

And that’s how they end up with a training set for the market impact task. Notice that it is super costly due to the sheer amount of data and the cost of human labor.

Even then, training on such imbalanced classes remains a challenge: a classifier that simply declares every piece of news to be non-market-moving would be right 99.999999% of the time, which shows how misleading a plain accuracy metric is here.
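
A toy illustration of the point (the class ratio is made up): the do-nothing classifier scores near-perfect accuracy while catching none of the market-moving news, which is why precision and recall on the positive class are the metrics to watch.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)
n = 1_000_000
y_true = (rng.random(n) < 1e-4).astype(int)  # ~0.01% of news items move the market
y_pred = np.zeros(n, dtype=int)              # always predict "not market-moving"

print(accuracy_score(y_true, y_pred))                    # ~0.9999: looks great
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0: misses every moving item
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0: useless in practice
```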

Question Answering:

He live-demoed a new Bloomberg feature that takes a request in natural language and returns the relevant information. Quite impressive. It seemed to work smoothly even when querying across the whole 30-year news history.

One last thing that interested me was the brief mention of their information extractor, a deep net with an attention mechanism. According to him, it achieves super-human performance. Once again, Bloomberg has the benefit of having been a data company for a long time, and thus has serious data processes: their machine learning engineers have access to a huge amount of labeled information. Anecdotally, Gary told us that thanks to someone at Bloomberg 10 years ago, human annotators, besides just extracting the information, also had to draw a bounding box marking where the information sits in the original document. With a lot of these annotations, it seems entirely feasible to build an end-to-end approach that goes from the pdf (viewed as an image, at the pixel level) to the information extracted into the appropriate cells of an Excel spreadsheet.
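
To make the "pixels in, fields out" idea concrete, here is a toy sketch of what such an end-to-end extractor could look like; the architecture (a small conv encoder plus learned per-field queries attending over the page, DETR-style) is my own guess, not Bloomberg's actual model:

```python
import torch
import torch.nn as nn

class ToyExtractor(nn.Module):
    def __init__(self, n_fields: int = 8, d: int = 128):
        super().__init__()
        # Small conv encoder over the rendered page image.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, d, 3, stride=2, padding=1), nn.ReLU(),
        )
        # One learned query per field to extract (e.g. "revenue", "net income").
        self.field_queries = nn.Parameter(torch.randn(n_fields, d))
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.box_head = nn.Linear(d, 4)    # bounding box of the field on the page
        self.score_head = nn.Linear(d, 1)  # is the field present at all?

    def forward(self, page: torch.Tensor):
        # page: (batch, 1, H, W) grayscale rendering of the pdf page
        feats = self.encoder(page)              # (B, d, H', W')
        mem = feats.flatten(2).transpose(1, 2)  # (B, H'*W', d) sequence of page patches
        q = self.field_queries.unsqueeze(0).expand(page.shape[0], -1, -1)
        out, _ = self.attn(q, mem, mem)         # each field attends over the page
        return self.box_head(out), self.score_head(out)

boxes, scores = ToyExtractor()(torch.randn(2, 1, 128, 128))
print(boxes.shape, scores.shape)  # (2, 8, 4) and (2, 8, 1)
```

The bounding-box annotations Gary mentioned are exactly the kind of supervision such a model would need.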

There were three other presentations of lesser interest to me (rough code sketches of each idea follow the list):

  • Kevin Kwan, a Bloomberg ML engineer, on using K-means clustering to group companies according to their fundamental metrics. His notebook showcased BQL and BQuant, Bloomberg's query and analysis tools.
  • Ken Lau, from Datatact, explaining the very basics of Word2Vec through examples. He used Colab (Google cloud) notebooks, which have access to GPUs: it seemed to run amazingly fast.
  • Sam Ho from ThinkCol, a data science consulting start-up, explaining the AutoML approach: basically automating the automation. In practice, this means running different models with different parameters in parallel and keeping the best one. There are a couple of frameworks available for that (some of them open-source).
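
The K-means idea from Kevin Kwan's talk, reduced to a few lines with plain scikit-learn instead of BQL/BQuant; the fundamentals and company names below are made up:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

fundamentals = pd.DataFrame(
    {"pe_ratio": [12.0, 35.0, 8.5, 40.0],
     "debt_to_equity": [0.4, 1.2, 0.3, 1.5],
     "revenue_growth": [0.02, 0.25, 0.01, 0.30]},
    index=["UtilityCo", "GrowthTech", "OldIndustrial", "CloudCo"],  # hypothetical names
)

X = StandardScaler().fit_transform(fundamentals)  # the metrics live on very different scales
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(fundamentals.index, labels)))      # e.g. a value-ish group and a growth-ish group
```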
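
For the Word2Vec basics, the classic example is vector arithmetic such as king - man + woman ≈ queen. A minimal gensim version, trained on a toy corpus of my own (meaningful analogies only emerge with a large corpus or pretrained vectors):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "walks"],
    ["a", "woman", "walks"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50, seed=0)

# With good vectors, the top answer to king - man + woman is typically "queen".
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```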
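
And the AutoML approach in its crudest form, as Sam Ho described it: try several models and hyper-parameters and keep the one that cross-validates best (real frameworks such as auto-sklearn or H2O AutoML search far more cleverly, and in parallel):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset; the talk was about business data
candidates = [
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    RandomForestClassifier(n_estimators=100, random_state=0),
    RandomForestClassifier(n_estimators=300, max_depth=5, random_state=0),
]
best = max(candidates, key=lambda m: cross_val_score(m, X, y, cv=5).mean())
print(type(best).__name__)  # whichever candidate scored best on cross-validation
```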