CiML 2018


Challenges in Machine Learning
Machine Learning Challenges "in the wild"

Invited speakers


Mikhail Burtsev, MIPT Moscow, Russia

Bich-Liên Doan, Centrale-Supelec, France

Esteban Arcaute, Facebook, USA

Laura Seaman, Draper Inc., USA

Larry Jackel, North-C Tech., USA

Daniel Polani, U. Hertfordshire, UK

Antoine Marot, RTE, France

     Mikhail Burtsev and Varvara Logacheva

Wild evaluation of chat-bots

Today, evaluation of dialogue agents is severely limited by the absence of accurate formal metrics. Existing statistical measures such as perplexity, BLEU, and recall are not sufficiently correlated with human evaluation. Blind assessment of communication quality by humans is a straightforward solution, famously proposed by Alan Turing as a test for machine intelligence. Unfortunately, human assessment is time- and resource-consuming. In a series of machine learning competitions, we have tried to attract volunteers to score bots. The talk will summarize our experience with chat-bot evaluation in the wild.
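To make the notion of a metric being "correlated with human evaluation" concrete, here is a minimal sketch of one common way to measure it: Spearman rank correlation between an automatic metric and human ratings over the same set of dialogues. The function names and the toy numbers are illustrative assumptions and do not come from the talk or from any competition data.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equally long lists with at least two items")
    rx, ry = average_ranks(metric_scores), average_ranks(human_scores)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(vx * vy)


# Hypothetical per-dialogue scores (made up for illustration only).
automatic_metric = [0.12, 0.30, 0.25, 0.41, 0.18]
human_rating = [2, 3, 5, 4, 1]
print(round(spearman(automatic_metric, human_rating), 3))
```

A value near 1 would mean the metric orders dialogues much as human judges do; a value near 0 would mean the metric says little about perceived quality.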

Short Bio: Mikhail Burtsev is the head of the Neural Networks and Deep Learning Laboratory at the Moscow Institute of Physics and Technology. In 2005, he received a Ph.D. degree from the Keldysh Institute of Applied Mathematics of the Russian Academy of Sciences. From 2011 to 2016 he was head of a lab in the Department of Neuroscience at the Kurchatov NBIC Centre. He is now one of the organizers of the NIPS Conversational Intelligence Challenges and of a series of deep learning hackathons. His research interests lie in the fields of Natural Language Processing, Machine Learning, Artificial Intelligence, and Complex Systems.

    Julien Hay, Bich-Liên Doan and Fabrice Popineau

Renewal news recommendation platform 

Renewal is a news recommendation platform that can be powered by code submitted by challenge participants. It was developed for research and academic purposes, in cooperation between CentraleSupélec, the LRI and the Octopeek company. It allows teams of researchers to submit recommendation algorithms as part of an evaluation campaign. To our knowledge, this platform is the only one that offers a user application fully dedicated to the cross-website and cross-language news article recommendation task. It also offers a wide range of contextual and demographic clues and a long-term user history through a dedicated mobile app.

Julien Hay is a PhD student at CentraleSupélec and a member of the LRI (Paris-Saclay University and CNRS). He obtained a master's degree in software development and artificial intelligence. He applies NLP and machine learning techniques to text data from social networks in order to enhance user profiling in recommender systems. In his thesis, he pays special attention to evaluation and model training through large-scale data collection for machine learning and online evaluation for recommender systems.

Bich-Liên Doan is a professor of Computer Science at CentraleSupélec and a member of the LRI (Paris-Saclay University and CNRS). She specializes in information retrieval, and her current research concerns contextual, and more particularly personalized, information retrieval and recommender systems. She is interested in new models from NLP, deep learning, semiotics and quantum mechanics that can help in representing information. In 2005 and 2007 she co-organized the context-based information retrieval workshop held in conjunction with the international “Context” conference. In 2008, 2009, and 2010 she co-organized the Contextual Information Access, Seeking and Retrieval Evaluation (CIRSE) Workshop in conjunction with the European Conference on Information Retrieval (ECIR).

Fabrice Popineau is a professor of computer science at CentraleSupélec and a full member of the "Laboratoire de Recherche en Informatique" (UMR8623 of the Paris-Saclay University and the CNRS). For the past fifteen years or so, his research has focused on the contributions of artificial intelligence to the personalization of the user experience on web platforms. He is particularly interested in personalized recommendation in the context of social networks and of online educational platforms. Besides his research activities, he is the French translation supervisor for the Russell & Norvig AIMA book. In a previous life, he also contributed heavily to the TeXLive project.

       Laura Seaman 

Project Alloy – Machine Learning Challenges for Researching Human-Machine Teaming

Project Alloy (funded by DARPA’s Agile Teams program) aimed to develop and implement intelligent machine agents that team with humans (hybrid teams) in meaningful and supportive ways. Hosting challenges on a citizen science competition platform provided a unique environment to develop and test hybrid team hypotheses. By providing dynamic sensing of each team’s state and progress towards a solution, we enable testing analytical formulations of hybrid team management. The combination of team performance monitoring and leaderboard scoring transformed the contest platform into a machine intelligence laboratory. By providing machine agents that augment or substitute human roles, we can explore a tighter synthesis of human and machine strengths for greater resilience and agility under changing project goals and constraints. 

Laura Seaman is a Senior Machine Intelligence Scientist at Draper and earned her PhD in Bioinformatics from the University of Michigan. Her work focuses on applying machine learning and graph algorithms to a variety of applications with a human focus, from human-machine teaming to medical imaging.

      Esteban Arcaute and Umut Ozertem 

Facebook project on developing benchmarks of algorithms in realistic settings

Esteban D. Arcaute was born in Mexico City, and earned his Ph.D. from Stanford University in Computational and Mathematical Engineering. He is currently the Director of Data Science at Facebook for the Artificial Intelligence organization. Prior to that, he was a Sr. Director of Data Science at @WalmartLabs, where he spearheaded the adoption of Machine Learning techniques in various domains, leading to the development of a new search engine and search experience. His current interests range from learning and control systems with a human in the loop to advanced analytics and experimentation.

Umut Ozertem is a Research Manager in the Integrity Solutions team, part of Facebook’s Applied Machine Learning group. The team builds state-of-the-art AI solutions to keep people safe and well. Its work spans a broad range of areas: hate(ful) speech, misinformation, polarization, suicide prevention, and blood donations, to name a few. Prior to Facebook, Umut was on the speech recognition team at Microsoft AI, where he developed machine learning and data mining solutions for the speech interfaces of Cortana and Xbox One. Before that, he was a scientist at Yahoo! Labs, working on search relevance, autosuggest and related queries. Earlier still, he was in grad school at the Oregon Graduate Institute (2004-2008), where he received his PhD.

       Larry Jackel 

Measuring Progress in Robotics

It is relatively straightforward to benchmark performance in many tasks where machine learning is used. For example, in image recognition, the MNIST and ImageNet databases provide clear and compelling tests. However, in robotics tasks, benchmarking is often much more difficult. This difficulty is most prominent in outdoor autonomous navigation by ground robots. It is also a tough challenge to benchmark robotic manipulation tasks. Obtaining convincing benchmarking results with robots is hard because of two main factors: (1) it is difficult or impossible to control the test environment, especially outdoors, making comparisons between tests at different times and locations unreliable; and (2) no two robots are identical, and many tests include robots with differing capabilities. For the past two decades I have been involved in measuring the performance of learned robotic systems, and I have used various methods to gauge progress. In this talk I will describe how robot benchmarking was done in two DARPA programs (“Learning Applied to Ground Robots” and “Learning Locomotion”) and two industrial efforts: learned autonomous driving and robotic manipulation. Some of the key techniques for reliable benchmarking that we used include comparing tested robot performance to that of a reference system, standardizing test environments where possible, using nearly identical robots, and testing in simulated environments. Much of this presentation will consist of representative videos.
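As an illustration of comparing a tested robot to a reference system, the sketch below scores each trial as the ratio of the reference system's completion time to the tested robot's time on the same course in the same session, so that shared environmental conditions largely cancel out. The function name, the choice of completion time as the measure, and the numbers are assumptions made for this example, not the procedure used in the programs described in the talk.

```python
from statistics import median


def relative_performance(paired_times):
    """Median of reference_time / test_time over paired trials.

    Each pair holds the completion times (in seconds) of the tested robot
    and of the reference system on the same course in the same session.
    A result above 1.0 means the tested robot was typically faster.
    """
    if not paired_times:
        raise ValueError("at least one paired trial is required")
    ratios = []
    for test_time, reference_time in paired_times:
        if test_time <= 0 or reference_time <= 0:
            raise ValueError("completion times must be positive")
        ratios.append(reference_time / test_time)
    # The median limits the influence of a single unusually good or bad run.
    return median(ratios)


# Hypothetical paired trials: (tested robot, reference system), in seconds.
trials = [(10.0, 20.0), (30.0, 30.0), (5.0, 20.0)]
print(relative_performance(trials))  # 2.0
```

Because each ratio is formed within a single session, scores from different days or sites can be compared more safely than raw completion times.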

Larry Jackel is President of North-C Technologies, where he does professional consulting. From 2003 to 2007 he was a DARPA Program Manager in the IPTO and TTO offices. He conceived and managed programs in Universal Network-Based Document Storage and in Autonomous Ground Robot Navigation and Locomotion. For most of his scientific career Jackel was a manager and researcher at Bell Labs and then AT&T Labs. He has created and managed research groups in Microscience and Microfabrication, in Machine Learning and Pattern Recognition, and in Carrier-Scale Telecom Services. Jackel holds a PhD in Experimental Physics from Cornell University with a thesis in superconducting electronics. He is a Fellow of the American Physical Society and the IEEE.

      Daniel Polani 

Competitions to Challenge Artificial Intelligence: from the L-Game to RoboCup 

Since the early times of imagined Artificial Intelligence, competitive games, especially chess, have been considered a primary scenario for challenging what an intelligent system should be able to achieve (e.g. Maelzel's Mechanical Turk or Ambrose Bierce's short story, "Moxon's Master"). It is no surprise that, with the advent of actual automated computation, Chess, as the prototypical complex computational game, became the gold standard for competitive probing of an AI's abilities.
Today, 21 years after Kasparov's defeat at Chess by Deep Blue, and about one year after AlphaZero's tremendous success in self-learning strategies for not only Chess but also Go and other challenging board games, it is a good time to review what competitions test about the abilities of AIs, how they test it, and what insights they give into the nature of the learned skills. From De Bono's minimalistic but (for humans) nonobvious L-Game to the RoboCup robotic competition series (which incidentally began in 1997, the year Kasparov was defeated), there are particular principles as to what constitutes an insightful and productive challenge. This talk will discuss how AI and robotics competitions can provide a basis to challenge and evaluate the achievements of intelligent systems and, on the other hand, how to avoid common pitfalls.

Daniel Polani is a Professor at the University of Hertfordshire, UK. He is an experienced participant in and organizer of robotics competitions, and President of RoboCup.

      Antoine Marot

Learning to run a power network

The power grid is a complex system that must carry electrical energy between places of production and consumption while keeping the two in balance at every moment. The French long-distance, high-voltage electricity transmission network operated by “Réseau de Transport d’Électricité” (RTE) is undergoing profound changes. The influx of new intermittent generation sources coming from “renewable energies” (solar, wind, tides, etc.) and the development of electricity trading between countries are changing electricity flows, making them more variable and less predictable. With the current limitations on adding new transmission lines, there is an urgent need to adapt the historical way of operating the power grid. Machine learning, and in particular deep learning and reinforcement learning, offers new possibilities that are under-exploited. We are preparing a Reinforcement Learning (RL) challenge to encourage the development of new machine learning algorithms applied to the power grid. The challenge will be inspired by the methodology developed by Google DeepMind for learning to play the game of Go and by the recent RL competition on “Learning to Run”. The challenge will make use of a game emulating the power network, in which “players” are network operators in charge of modifying the network structure in real time so that the network is operated “safely” and “efficiently” under the constraint of balancing supply and demand of energy. The game is based on the Matpower environment.

Antoine Marot is an R&D engineer at RTE (Réseau de Transport d’Électricité) in charge of a research team on an ambitious project called Apogee, which aims to build a personal assistant for control room operators, in contrast to today's ever-growing, fragmented, multi-application, multi-screen working environment. He holds a double master's degree in Engineering from Stanford and Ecole Centrale de Paris, with specific interests in both machine learning and the field of energy. After interning at Tesla Motors, he joined RTE on the Apogee project 4 years ago in Paris and is looking forward to building new fruitful research collaborations, in addition to the existing ones with INRIA, to build this envisioned personal assistant with a machine learning system. However, it appears that the machine learning for power systems community needs to be reinforced to make faster progress, and he is eager to contribute to building such a community, especially by means of challenges. He previously gave a similar talk at the NIPS 2016 spatial-temporal workshop for the See4c European Challenge.