[AI Talk] Learn Data Science by Doing a Kaggle Competition
Taiwan AI Conference (台灣人工智慧年會), Day 1, 11/9 (Thu) 11:00-11:45, A0
Once sequenced, a cancer tumor can have thousands of genetic mutations. The challenge is to distinguish the mutations that contribute to tumor growth from the neutral ones. Currently, genetic mutations are interpreted manually, which is time-consuming and demands expert knowledge. For this reason, Classifying Clinically Actionable Genetic Mutations challenged the Kaggle community to develop algorithms that automatically classify genetic variations based on evidence from text-based clinical literature.
As a natural language processing (NLP) and machine learning problem, this Kaggle competition is not a trivial task. The main difficulties are threefold. First, interpreting clinical evidence from the literature is very challenging even for human specialists, since it takes domain expertise and lengthy reading to identify the key information and classify a mutation accordingly. Second, only 3,321 training samples are provided, far fewer than in most other Kaggle challenges, which increases the risk of overfitting. Third, much of the test data is machine-generated, which adds to the complexity of the task.
To tackle this challenge, we set up a collaboration between data science and clinical medicine teams and developed algorithms with insights from both fields. To extract the key information about genetic mutations from the text, we did hand-crafted feature engineering with several state-of-the-art NLP methods. To obtain effective representations of the texts, we also consulted medical experts about how specialists read the literature and classify genetic mutations, and adjusted our approaches accordingly. Furthermore, we put effort into the classifiers to prevent the overfitting that results from the small training dataset.
Drawing on our experience in this competition, we show how collaboration across different areas of expertise can bring additional insight to challenging data science problems such as this one.
許聞廉 (Wen-Lian Hsu) / Institute of Information Science, Academia Sinica
- Taiwan AI Labs / Researcher
- Retired from the Department of Electrical Engineering, National Chiao Tung University (2017/7)