Big Data Mining and Applications, Fall 2021

This course discusses issues in the scalable mining of big data across various applications, a topic that has gained popularity in recent years.
The major focus of this course is using parallel programming on distributed platforms to mine different kinds of big data efficiently.
The course is offered at the graduate level and will be taught in English.

Notes on Online Courses

All students enrolled in this course have been added to the team in Microsoft Teams.
You can attend the online course in Microsoft Teams, in the team created for Big Data Mining [course number: 293679].

[*** NOTE ***]: This course is NOT an introductory course for undergraduates. As a prerequisite, you are required to have taken the undergraduate-level course "Introduction to Big Data Analytics" or "Data Mining".
It involves a very heavy load of parallel programming, which is a totally different concept from sequential programming.
Please think twice before taking this course!

Please read the course notes very carefully regarding the requirements for the programming homework and quiz.

Course Information

Latest News

(Tentative) Schedule

Week | Date | Content | Reading | Note
1 | Sep. 21, 2021 | (Leave for Mid-Autumn Festival) | |
2 | Sep. 28, 2021 | Course Overview; Ch.1, Introduction to distributed platforms & MapReduce: Hadoop, Spark | [MMDS3] Ch.1 & 2; [MR] Ch.1-3 |
3 | Oct. 5, 2021 | MapReduce programming: the basics | [MMDS3] Ch.2; [MR] Ch.3 |
4 | Oct. 12, 2021 | MapReduce Algorithm Design: design patterns (pairs & stripes), language models | [MMDS3] Ch.2; [MR] Ch.3 | HW#1
5 | Oct. 19, 2021 | (Leave for taking COVID-19 vaccines); TA: Multi-node Hadoop/Spark installation & configuration, platform usage demo; TA: Homework Q&A, term project proposal, and team member registration | |
6 | Oct. 26, 2021 | Ch.3, Finding similar items | [MMDS3] Ch.3 |
7 | Nov. 2, 2021 | Ch.7, Clustering | [MMDS3] Ch.7 | HW#2; Due: HW#1
8 | Nov. 9, 2021 | Ch.11, Dimension reduction (skim through) | [MMDS3] Ch.11 | Term Project Proposal
9 | Nov. 16, 2021 | Ch.9, Recommender systems | [MMDS3] Ch.9 | HW#3
10 | Nov. 23, 2021 | (Midterm Exam) | | Due: HW#2
11 | Nov. 30, 2021 | (Leave for taking COVID-19 vaccines) | [MMDS3] Ch.5 | Due: Proposal
12 | Dec. 7, 2021 | Ch.5, Link analysis - PageRank & HITS; Part II of link analysis | [MMDS3] Ch.5 | HW#4; Due: HW#3
13 | Dec. 14, 2021 | Ch.10, Mining social network graphs: Community Detection | [MMDS3] Ch.10 |
14 | Dec. 21, 2021 | Part II: Overlapping Communities | [MMDS3] Ch.10 | HW#5
15 | Dec. 28, 2021 | Ch.12, Large-scale machine learning: kNN, Perceptron; Ch.12, Large-scale machine learning: SVM | [MMDS3] Ch.12 | Due: HW#4
16 | Jan. 4, 2022 | Term Project Presentation (Week 1) | |
17 | Jan. 11, 2022 | Term Project Presentation (Week 2) | | Due: HW#5
18 | Jan. 18, 2022 | Term Project Presentation (Week 3) | |

Programming Assignments and Projects

Please hand in your assignments before the deadline according to the following instructions.

Program Submission Instructions

NOTE: Programs or projects in electronic files must be submitted directly to the TA online at iSchool+.

If you cannot successfully submit your work, please contact the TA or the instructor.

Distributed Programming Environment

You can use your own computers, or virtual machines on any cloud computing service, to set up your distributed programming environment for this course.
  1. Computers:
    You can build your own distributed environment from commodity computers. Hadoop/Spark usually does not require each machine to be very powerful.
  2. VMs on the cloud:
    You can use existing cloud computing services from Google, Amazon, or Microsoft Azure to set up a distributed environment. Most cloud services provide a free quota for limited usage (CPU, memory, network bandwidth, storage).

Homework

There will be about five programming exercises (and some written exercises) that target different data analysis tasks.
Given a large dataset, all homework assignments require parallel programming using either Hadoop or Spark in distributed mode (for CS students).
Since this is totally different from sequential programming, it takes much more time and effort to complete. Please think twice before taking this course!

In the case of very large datasets, you still need to analyze all data objects in the whole dataset. Please design your program accordingly: partition the data into several batches, then merge the partial results into the final result.
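
The batch-and-merge idea can be sketched as follows. This is only a minimal single-machine illustration in plain Python, not the required Hadoop/Spark solution; the function names (`batches`, `count_batch`, `count_all`) and the word-count task are assumptions chosen for the example.

```python
from collections import Counter
from itertools import islice


def batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_batch(batch):
    """Compute the partial result for one batch: word frequencies."""
    counts = Counter()
    for line in batch:
        counts.update(line.lower().split())
    return counts


def count_all(records, batch_size=1000):
    """Process every record one batch at a time and merge the partial results."""
    total = Counter()
    for batch in batches(records, batch_size):
        # Merging works because word counts from disjoint batches simply add up.
        total.update(count_batch(batch))
    return total
```

Because `records` can be any iterable (for example, an open file object), only one batch needs to be held in memory at a time. The same structure carries over to a distributed setting only when the merge step is associative, as addition of counts is here.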

  1. HW#1 : Big data analysis platform setup & simple statistics in MapReduce (or Python in Jupyter Notebook)
    Due: Nov. 2, 2021
  2. HW#2 : Statistics (counting, co-occurrence) of multiple data sets in MapReduce (or Python in Jupyter Notebook)
    Due: Nov. 23, 2021
  3. HW#3 : Finding similar documents (by calculating minhash signatures, LSH, and KNN search) in MapReduce (or Python in Jupyter Notebook)
    Due: Dec. 7, 2021
  4. HW#4 : Recommender systems using collaborative filtering in MapReduce (or Python in Jupyter Notebook)
    Due: Extended to Dec. 28, 2021
  5. HW#5 : Analyzing web graphs in MapReduce (Connectivity, PageRank, ...)
    Due: Jan. 11, 2022
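
For orientation, the minhash signatures mentioned in HW#3 can be sketched as follows. This is a hedged single-machine illustration, not the assignment's required MapReduce implementation; the word-level shingling, the `(a*x + b) mod p` hash family, the use of `zlib.crc32` to map shingles to integers, and all function names are assumptions made for the example.

```python
import random
import zlib

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, used as the hash modulus


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed word-level shingling)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for the hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """For each hash function, keep the minimum hash value over all shingles."""
    # crc32 gives a deterministic integer id per shingle across runs.
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimated_jaccard(sig1, sig2):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)
```

Both documents must be hashed with the same `params` list for their signatures to be comparable. More hash functions give a less noisy similarity estimate at the cost of longer signatures.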

Projects

Note: For datasets available for potential term project topics, please check the competitions.
  1. Term Project: paper presentation or system demonstration
    Item | Description | Time
    Proposal | You are required to submit a proposal for the term project one week after the midterm exam. | Nov. 23, 2021 (Tue.)
    Schedule | Due to our time limits, we might have to start the term project presentations as early as Dec. 28, 2021 (Tue.). [NOTE] All presentations *must* be finished within the scheduled time slots, which will be the last *three* weeks of this semester. No other time slots will be available. | (Dec. 28, 2021,) Jan. 4, 11, 18, 2022
    Report | Each team is *required* to upload the final report after finishing its presentation. The final report should contain at least the following: 1. presentation slides (for all teams), and 2. source code, installation/execution instructions, team members and task responsibilities (for system projects). | Jan. 22, 2022 (Fri.)

Exams

  1. Midterm Exam: Nov. 15-19, 2021
  2. Final Exam: Jan. 17-22, 2022

Scores

Please check the homework submission site for more details.
E-mail: jhwang AT ntut . edu . tw
Created: Sep. 15, 2021.
Last Updated: Dec. 21, 2021.