18:00 - 20:00 | Customer Experience Theatre
ABOUT
BROUGHT TO YOU BY DATA SCIENCE LONDON MEETUP GROUP
Typically, discussions of distributed machine learning focus on building models on ever-larger datasets, using the extra information in that data to build better and better models, a process that Apache Spark and Spark ML make very simple. At Databricks, however, we increasingly see a different requirement: training a very large number of relatively small models. In this talk we will discuss how we solved this problem, using PySpark's pandas UDFs to parallelise the training of over 100,000 models for a predictive maintenance use case. What's more, model management becomes a real challenge when building models at this scale, and we will demonstrate how we used MLflow to solve it. Finally, we will also discuss how this method can be used to parallelise hyperparameter tuning.
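
The abstract does not include code, so the sketch below is an illustrative assumption rather than the speakers' implementation. It shows the general pattern the talk describes: group the data by the entity that needs its own model, hand each group to an ordinary Python function through applyInPandas (the current Spark spelling of the grouped-map pandas UDF), and log every fitted model to MLflow so the resulting fleet of models stays trackable. The column names device_id, sensor_reading, and failure, the /data/telemetry path, and the use of scikit-learn are all hypothetical.

import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Schema of the small summary DataFrame each group returns.
result_schema = StructType([
    StructField("device_id", StringType()),
    StructField("train_accuracy", DoubleType()),
])

def train_one_model(pdf: pd.DataFrame) -> pd.DataFrame:
    """Runs once per group, i.e. once per device, on a Spark worker."""
    device_id = pdf["device_id"].iloc[0]
    X = pdf[["sensor_reading"]]
    y = pdf["failure"]
    model = LogisticRegression().fit(X, y)
    acc = model.score(X, y)
    # Each worker logs its model as a separate MLflow run, so the
    # 100k+ models remain searchable and retrievable by device_id.
    # Assumes the MLflow tracking server is reachable from the workers
    # (as it is on Databricks).
    with mlflow.start_run(run_name=f"device-{device_id}"):
        mlflow.log_param("device_id", device_id)
        mlflow.log_metric("train_accuracy", acc)
        mlflow.sklearn.log_model(model, artifact_path="model")
    return pd.DataFrame([{"device_id": device_id, "train_accuracy": acc}])

telemetry = spark.read.parquet("/data/telemetry")  # hypothetical input path
results = telemetry.groupBy("device_id").applyInPandas(
    train_one_model, schema=result_schema
)
results.show()

The same grouping trick covers the hyperparameter point at the end of the abstract: cross-join the training data with a DataFrame of candidate hyperparameter settings and group by the setting instead of the device, so each Spark task fits and evaluates one configuration in parallel.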