Speaker


Learn Data Science by Doing a Kaggle Competition

Abstract

Once sequenced, a cancer tumor can have thousands of genetic mutations. The challenge is to distinguish the mutations that contribute to tumor growth from the neutral ones. Currently, genetic mutations are interpreted manually, which is time-consuming and demands specialist knowledge. The Kaggle competition "Classifying Clinically Actionable Genetic Mutations" therefore challenged the Kaggle community to develop algorithms that automatically classify genetic variations based on evidence from text-based clinical literature.

As a problem in natural language processing (NLP) and machine learning, this competition is far from trivial. The main difficulties are threefold. First, interpreting clinical evidence from the literature is challenging even for human specialists: it takes domain expertise and long hours of reading to identify the key information in a paper and classify a mutation accordingly. Second, only 3,321 training samples are provided, far fewer than in other Kaggle challenges, which increases the risk of overfitting. Third, much of the test data is machine-generated, which adds to the complexity of the task.

To tackle this challenge, we set up a collaboration between data science and clinical medicine teams and developed algorithms that draw on insights from both fields. To extract the key information about genetic mutations from the text, we hand-crafted features using several state-of-the-art NLP methods. To obtain effective representations of the texts, we also consulted medical experts about how specialists read the literature and classify genetic mutations, and adjusted our approaches accordingly. Finally, we put effort into the classifiers to prevent the overfitting that results from a small training dataset.

Drawing on our experience in this competition, we show how collaboration across different areas of expertise can bring further insight to a challenging data science problem such as this one.
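As a rough illustration of the general shape of such a pipeline (text turned into features, a classifier fitted on them, and a held-out check to watch for overfitting on a small dataset), a minimal Python sketch follows. It is not the method used in the competition: TF-IDF and a nearest-centroid classifier are simple stand-ins chosen here for brevity, and the four sentences and the labels `driver` and `neutral` are invented toy data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; real clinical text needs far more care.
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in docs."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(doc, idf):
    """L2-normalised sparse TF-IDF vector; terms unseen in training are dropped."""
    tf = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def fit_centroids(docs, labels, idf):
    """Average the TF-IDF vectors of each class into one centroid per label."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for t, v in tfidf(doc, idf).items():
            sums[label][t] += v
    return {
        label: {t: v / counts[label] for t, v in terms.items()}
        for label, terms in sums.items()
    }


def predict(doc, idf, centroids):
    """Return the label whose centroid has the largest dot product with doc."""
    vec = tfidf(doc, idf)

    def score(label):
        centroid = centroids[label]
        return sum(v * centroid.get(t, 0.0) for t, v in vec.items())

    return max(sorted(centroids), key=score)


def leave_one_out_accuracy(docs, labels):
    """Refit on all-but-one document each time and test on the one left out."""
    hits = 0
    for i in range(len(docs)):
        train_docs = docs[:i] + docs[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        idf = fit_idf(train_docs)
        centroids = fit_centroids(train_docs, train_labels, idf)
        hits += predict(docs[i], idf, centroids) == labels[i]
    return hits / len(docs)


# Invented toy data, for illustration only (not from the competition).
docs = [
    "mutation increases kinase activity and drives tumor growth",
    "variant activates signaling and promotes tumor growth",
    "substitution has no effect on protein function",
    "variant shows no effect and normal protein function",
]
labels = ["driver", "driver", "neutral", "neutral"]

idf = fit_idf(docs)
centroids = fit_centroids(docs, labels, idf)
print(predict("mutation promotes tumor growth", idf, centroids))
print(leave_one_out_accuracy(docs, labels))
```

The held-out estimate matters more than the fit on the training texts: with only a few thousand samples, a model that scores well on the data it was fitted to may still generalize poorly.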

About the Speaker

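周志成
  • Researcher, Taiwan AI Labs (台灣人工智慧實驗室)
  • Retired from the Department of Electrical Engineering, National Chiao Tung University (國立交通大學), July 2017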
