Big Data Mining and Applications, Spring 2018

This course discusses issues in the scalable mining of big data across various applications, a topic that has gained popularity in recent years.
The major focus of this course is using parallel programming on distributed platforms to mine different kinds of big data efficiently.
The course is offered at the graduate level and will be taught in English.

[*** NOTE ***]: This course is NOT an introductory course for undergraduates. You are required to have taken "Introduction to Big Data Analytics" or "Data Mining" at the undergraduate level as a prerequisite.
It involves a very heavy load of parallel programming, which is a totally different concept from sequential programming.
Please think twice before taking this course!

Please read the course notes very carefully for the requirements of the programming homework and quiz.

Course Information

Latest News

(Tentative) Schedule

Week | Date | Content | Reading | Note
1 | Mar. 2, 2018 | Course Overview | |
2 | Mar. 9, 2018 | Introduction to distributed platforms & MapReduce: Hadoop, Spark | [MMDS2] Ch.1 & 2; [MR] Ch.1 & 2 | HW#1
3 | Mar. 16, 2018 | MapReduce programming: pairs & stripes, relational algebra | [MMDS2] Ch.2; [MR] Ch.3 |
4 | Mar. 23, 2018 | MapReduce programming: matrix multiplication, relational joins | [MMDS2] Ch.2; [MR] Ch.3 | HW#2
5 | Mar. 30, 2018 | Finding similar items | [MMDS2] Ch.3 | 3/30 Due: HW#1
6 | Apr. 6, 2018 | (Compensation Leave for Spring Break) | |
7 | Apr. 13, 2018 | TA: Homework Q&A, term project proposal, and team member registration (Leave for ACM SAC 2018) | | HW#3; Term Project Proposal
8 | Apr. 20, 2018 | (4/21 (Sat.), 9:00am-12:00: Quiz -- Hadoop and Spark) | | 4/20 (extended) Due: HW#2
9 | Apr. 27, 2018 | Clustering | [MMDS2] Ch.7 | HW#4
10 | May 4, 2018 | Dimension reduction; Recommender systems | [MMDS2] Ch.11; [MMDS2] Ch.9 | Due: HW#3
11 | May 11, 2018 | (Midterm Exam) | |
12 | May 18, 2018 | Link analysis - PageRank & HITS; More about PageRank | [MMDS2] Ch.5 | Due: Proposal; Due: HW#4; HW#5
13 | May 25, 2018 | Mining social network graphs; Community Detection in Graphs | [MMDS2] Ch.10 |
14 | Jun. 1, 2018 | Large-scale machine learning: SVM | [MMDS2] Ch.12 |
15 | Jun. 8, 2018 | Term Project Presentation (Week 1): 6 teams completed. | | Due: HW#5
16 | Jun. 15, 2018 | Term Project Presentation (Week 2) | |
17 | Jun. 22, 2018 | Term Project Presentation (Week 3) | |
18 | Jun. 29, 2018 | Term Project Presentation (Week 4) | |

Programming Assignments and Projects

Please hand in your assignments before the deadline according to the following instructions.

Program Submission Instructions

NOTE: Programs or projects in electronic files must be submitted directly to the TA online at Open Cyber Classrooms.
Please follow the instructions before your first login for this course.

If you cannot successfully submit your work, please contact the TA or the instructor.

Distributed Programming Environment

You can use your own computers, or virtual machines on any cloud computing service, to set up your distributed programming environment for this course.
  1. Computers:
    You can build your own distributed environment from commodity computers. Hadoop/Spark usually does not require each machine to be very powerful.
  2. VMs on the cloud:
    You can use existing cloud computing services from Google, Amazon, or Microsoft Azure to set up a distributed environment. Most cloud services provide a free quota for limited usage (CPU, memory, network bandwidth, storage).
  3. VMs on CSIE server: See the following section for more details.

VM Application on CSIE server

To serve undergraduate students who might not have a fixed computer on which to complete the homework, students may apply for virtual machines on the CSIE server.
Please read the Rules for VM Application before applying for this service.

Important Notes

Homework

There will be about five programming exercises (and some written exercises) that target different data analysis tasks.
Given a large dataset, all homework assignments require parallel programming using either Hadoop or Spark in distributed mode.
Since this is totally different from sequential programming, it takes much more effort and time to complete. Please think twice before taking this course!
  1. HW#1: Hadoop/Spark distributed mode setup & simple calculation in MapReduce
    Details about Task and Data
    Due: extended to Mar. 30, 2018
  2. HW#2: Statistics of various data types in MapReduce (co-occurrence)
    Due: extended (again) to Apr. 20, 2018
  3. HW#3: Similarity estimation using MapReduce (for computing minhash signatures, LSH, and KNN search)
    Due: extended to May 4, 2018
  4. HW#4: Matrix multiplication using MapReduce (for dimension reduction using SVD, or CUR)
    Due: May 18, 2018
  5. Bonus Question Set#1: Statistics of crime data using MapReduce
  6. HW#5: Analyzing web graphs in MapReduce (Connectivity, PageRank, ...)
    Due: Jun. 8, 2018
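
To illustrate how the MapReduce style named in the homework differs from sequential programming, the following is a minimal single-process sketch of co-occurrence counting with the "pairs" approach. It is not the homework specification and does not use Hadoop or Spark; the function names (map_pairs, shuffle, reduce_sum, cooccurrence) and the choice to count every ordered pair of words on the same line are assumptions made only for illustration.

```python
from collections import defaultdict


def map_pairs(line):
    """Mapper: emit ((w, u), 1) for every ordered pair of distinct positions in a line."""
    words = line.split()
    for i, w in enumerate(words):
        for j, u in enumerate(words):
            if i != j:
                yield (w, u), 1


def shuffle(mapped):
    """Group mapper output by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_sum(key, values):
    """Reducer: sum the partial counts collected for one key."""
    return key, sum(values)


def cooccurrence(lines):
    """Run map, shuffle, and reduce in one process (a simulation, not a cluster job)."""
    mapped = (kv for line in lines for kv in map_pairs(line))
    return dict(reduce_sum(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    counts = cooccurrence(["a b c", "a b"])
    for pair in sorted(counts):
        print(pair, counts[pair])
```

The mapper and reducer see only one line or one key at a time and share no state, which is what allows a real framework to run many copies of them on different machines.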

Projects

  1. Term Project: paper presentation or system demonstration
    Item: Proposal
    Description: You are required to submit a proposal for the term project one week after the midterm exam.
    Time: extended to May 18, 2018 (Fri.)

    Item: Topics
    Description: For paper presentations, the paper quality will *greatly* affect your term project score. Please *carefully* select good papers to read.

    Item: Schedule
    Description: Due to our time limits, we might have to start the term project presentations as early as Jun. 8, 2018 (Fri.).
    You can check the current schedule of term project presentations. (As of Jun. 15, 2018, 38 teams in total)
    * [NOTE] All presentations *must* be finished within the scheduled time slots, which will be the last *four* weeks of this semester. No other time slots will be available.
    Time: Jun. 8, 15, 22, 29, 2018

    Item: Report
    Description: Each team is *required* to upload the final report after finishing its presentation.
    The final report should contain at least the following:
    1. presentation slides (for all teams), and
    2. source code, installation/execution instructions, team members and task responsibility (for system projects)
    Time: Jun. 29, 2018 (Fri.)

Exams

  1. Quiz: Apr. 21, 2018 (Sat.)
  2. Midterm Exam: Apr. 23-27, 2018
  3. Final Exam: Jun. 25-29, 2018

Scores

Please check the homework submission site for more details.
E-mail: jhwang AT csie . ntut . edu . tw
Created: Mar. 2, 2018.
Last Updated: Jul. 3, 2018.