Distributed Data Analysis and Mining

Code 687AA
Credits 6

Learning outcomes

Big data mining has become an active research area. Current analytical methods and software tools, run on a single personal computer, cannot deal efficiently with very large datasets. Distributed computing platforms offer a scalable solution for big data mining: a large problem is divided into smaller ones that are solved concurrently by many processors or machines. This course teaches the basic theoretical concepts behind the MapReduce distributed computing paradigm, and Hadoop in particular, and builds expertise in the practical use of high-performance computing tools for data engineering, analysis and mining. In particular, students will learn how classical data mining algorithms can be applied to big data using Hadoop (Spark). Real, open-source datasets will be used to present examples and to let students build their own projects. Half of the lessons will consist of practice (Lab) and half of lectures.

Syllabus

-Motivations: what Distributed Data Mining is and why it is needed in a Big Data scenario
-Recap of parallel and distributed computing notions
-Introduction to Hadoop
-Hadoop Ecosystem
-Interacting with HDFS (LAB)
-MapReduce Programming Patterns
-Basic Spark
-Recap of Python programming (LAB)
-Data Analysis with Spark (LAB)
-Data Mining and Machine Learning with Spark (LAB)
-Example of how to prepare a project
-Real Case Studies
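
As an illustration of the MapReduce idea named in the learning outcomes (dividing a large problem into smaller ones that can be solved concurrently), the following is a minimal single-machine sketch in plain Python. It is not course material and does not use Hadoop or Spark; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the two sample documents are assumptions made only for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values that share one key into a single count."""
    return key, sum(values)


def word_count(documents):
    # On a cluster each document could be mapped on a different machine;
    # in this sketch the map calls simply run one after another.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical input, chosen only to make the example self-contained.
docs = ["big data mining", "mining big datasets"]
counts = word_count(docs)
```

In a distributed framework the three phases keep the same roles, but the map and reduce calls run in parallel on different machines and the grouping step is carried out by the framework itself.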