Big Data Mining and Applications, Fall 2019

This course discusses issues in scalable mining of big data across various applications, a topic that has become increasingly popular in recent years.
The major focus of this course is using parallel programming on distributed platforms to mine different kinds of big data efficiently.
The course is offered at the graduate level and will be taught in English.

[*** NOTE ***]: This course is NOT an introductory course for undergraduates. You are required to have taken a first course such as "Introduction to Big Data Analytics" or "Data Mining" at the undergraduate level as a prerequisite.
It involves a heavy load of parallel programming, which is conceptually very different from sequential programming.
Please think twice before taking this course!

Please read the course notes very carefully for the requirements on the programming homework and quizzes.

Course Information

Latest News

(Tentative) Schedule

Week | Date | Content | Reading | Note
1 | Sep. 13, 2019 | (Leave for Mid-Autumn Festival) | |
2 | Sep. 20, 2019 | Course Overview; Ch.1, Introduction to distributed platforms & MapReduce: Hadoop, Spark | [MMDS2] Ch.1 & 2; [MR] Ch.1-3 |
3 | Sep. 27, 2019 | (Leave for TANET 2019); TA: Package installation, platform usage demo | | HW#1
4 | Oct. 4, 2019 | (Leave for ROCLING 2019); TA: Homework Q&A, term project proposal, and team member registration | |
5 | Oct. 11, 2019 | (Compensation Leave for National Day) | |
6 | Oct. 18, 2019 | MapReduce programming: the basics | [MMDS2] Ch. 2; [MR] Ch. 3 | Term Project Proposal
7 | Oct. 25, 2019 | MapReduce Algorithm Design: design patterns (pairs & stripes), language models | | Due: HW#1; HW#2
8 | Nov. 1, 2019 | Ch.3, Finding similar items | [MMDS2] Ch.3 |
9 | Nov. 8, 2019 | Ch.7, Clustering; (Ch.11, Dimension reduction) | [MMDS2] Ch.7; (skim through [MMDS2] Ch.11) | HW#3
10 | Nov. 15, 2019 | Ch.9, Recommender systems | [MMDS2] Ch.9 | Due: HW#2
11 | Nov. 22, 2019 | (Leave for TAAI 2019); Invited talk about FinTech (in Chinese) | |
12 | Nov. 29, 2019 | (Midterm Exam) | | Due: Proposal; HW#4
13 | Dec. 6, 2019 | Ch.5, Link analysis - PageRank & HITS; More about PageRank | [MMDS2] Ch.5 | Due: HW#3
14 | Dec. 13, 2019 | Ch.10, Mining social network graphs; Community Detection in Graphs | [MMDS2] Ch.10 | HW#5
15 | Dec. 20, 2019 | Ch.12, Large-scale machine learning: SVM | [MMDS2] Ch.12 |
16 | Dec. 27, 2019 | Term Project Presentation (Week 1) | | Due: HW#4
17 | Jan. 3, 2020 | Term Project Presentation (Week 2) | |
18 | Jan. 10, 2020 | Term Project Presentation (Week 3) | | Due: HW#5

Programming Assignments and Projects

Please hand in your assignments before the deadline according to the following instructions.

Program Submission Instructions

NOTE: Programs or projects in electronic files must be submitted directly to the TA online at Open Cyber Classrooms.
Please follow the instructions before your first login for this course.

If you cannot successfully submit your work, please contact the TA or the instructor.

Distributed Programming Environment

You can use your own computers, or virtual machines on any cloud computing service, to set up your distributed programming environment for this course.
  1. Computers:
    You can build your own distributed environment from commodity computers. Hadoop/Spark usually does not require each machine to be very powerful.
  2. VMs on the cloud:
    You can use existing cloud computing services from Google, Amazon, or Microsoft Azure to set up a distributed environment. Most cloud services provide a free quota for limited usage (CPU, memory, network bandwidth, storage).

Homework

There will be about five programming exercises (and some written exercises) that target different data analysis tasks.
Given a large dataset, all homework assignments require parallel programming using either Hadoop or Spark in distributed mode (for CS students).
Since this is totally different from sequential programming, it takes much more effort and time to complete. Please think twice before taking this course!

Even for very large datasets, you still need to analyze every data object in the whole dataset. Please design your program to partition the data into several batches and merge the partial results into the final result.
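
As a rough sketch of the batch-and-merge idea above, the following plain Python example (not Hadoop or Spark code) counts words one batch at a time and then merges the partial counts. The function names `read_batches`, `count_batch`, and `merge_counts`, the batch size, and the sample lines are illustrative assumptions, not part of the course material.

```python
from collections import Counter
from itertools import islice


def read_batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def count_batch(batch):
    """Compute partial word counts for one batch (the 'map' side)."""
    counts = Counter()
    for line in batch:
        counts.update(line.lower().split())
    return counts


def merge_counts(partials):
    """Merge partial counts into the final result (the 'reduce' side)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical sample input; a real dataset would be streamed from disk.
lines = ["big data mining", "mining big graphs", "data data data"]
total = merge_counts(count_batch(b) for b in read_batches(lines, batch_size=2))
```

Because each batch is counted independently, only one batch needs to be held in memory at a time, and the same split between per-batch work and a final merge carries over to a mapper and reducer.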

  1. HW#1: Big data analysis platform setup & simple statistics in MapReduce (or Python in Jupyter Notebook)
    Due: Extended to Oct. 25, 2019
  2. HW#2: Statistics (counting, co-occurrence) of multiple data sets in MapReduce (or Python in Jupyter Notebook)
    Due: Extended to Nov. 15, 2019
  3. HW#3: Finding similar documents (by calculating minhash signatures, LSH, and KNN search) in MapReduce (or Python in Jupyter Notebook)
    Due: Extended to Dec. 6, 2019
  4. HW#4: Recommender systems using collaborative filtering in MapReduce (or Python in Jupyter Notebook)
    Due: Extended to Dec. 27, 2019
  5. HW#5: Analyzing web graphs in MapReduce (Connectivity, PageRank, ...)
    Due: Extended to Jan. 10, 2020 (No more extensions possible)
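
To give a feel for the minhash step named in HW#3, here is a minimal single-machine sketch in Python. It is not the assignment solution and makes several assumptions: word-level shingles, `zlib.crc32` to map shingles to integers, and random linear hash functions modulo a prime. All function names and sample documents are hypothetical.

```python
import random
import zlib

PRIME = 2_147_483_647  # 2^31 - 1, used as the modulus for the hash family


def shingles(text, k=3):
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """For each hash function, keep the minimum hash value over all shingles."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


params = make_hash_params(100)
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
sig_a = minhash_signature(shingles(doc_a), params)
sig_b = minhash_signature(shingles(doc_b), params)
similarity = estimate_jaccard(sig_a, sig_b)
```

The signatures are much shorter than the shingle sets, which is what makes the later LSH and nearest-neighbor steps feasible on large collections. In a MapReduce setting, one plausible design is to compute each document's signature in the map phase, since signatures are independent of one another.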

Projects

Note: For datasets available for potential term project topics, please check the competitions.
  1. Term Project: paper presentation or system demonstration
    Item: Proposal
    Description: You are required to submit a proposal for the term project one week after the midterm exam.
    Time: Nov. 22, 2019 (Fri.)

    Item: Schedule
    Description: Due to our time limits, we might have to start the term project presentations as early as Dec. 27, 2019 (Fri.).
    The current schedule for term project presentation (as of Dec. 27, 2019)
    *** [NOTE] Since we have many more teams than expected, we can only allow 15 minutes for each presentation. Please plan accordingly.
    * [NOTE] All presentations *must* be finished within the scheduled time slots, which will be the last *three* weeks of this semester. No other time slots will be available.
    Time: Dec. 27, 2019, Jan. 3, 10, 2020

    Item: Report
    Description: Each team is *required* to upload the final report after finishing its presentation.
    The final report should contain at least the following:
    1. presentation slides (for all teams), and
    2. source code, installation/execution instructions, team members and task responsibility (for system projects)
    Time: Jan. 10, 2020 (Fri.)

Exams

  1. Midterm Exam: Nov. 4-8, 2019
  2. Final Exam: Jan. 6-10, 2020

Scores

Please check the homework submission site for more details.
E-mail: jhwang AT csie . ntut . edu . tw
Created: Sep. 16, 2019.
Last Updated: Jan. 16, 2020.