Big Data Mining and Applications, Fall 2023

This course covers scalable mining of big data across various applications, a topic that has gained popularity in recent years.
The major focus of this course is using parallel programming on distributed platforms to mine different kinds of big data efficiently.
The course is offered at the graduate level and is taught in English.

Notes on Online Sessions of this Course

All students enrolled in this course have been added to two separate teams in Microsoft Teams, one for each course number: 321524 and 323392.
The main link to the online course sessions and recordings in Microsoft Teams is: the team created for Big Data Mining [course number: 321524]

[NOTE] Those who cannot attend class in person for unforeseeable reasons may join the corresponding online sessions in Teams. During course sessions, remember to sign in with your name and student ID in Teams when the TA reminds you.

Please read the course notes on the requirements for the programming homework and quiz very carefully. [*** NOTE ***]: This course is NOT an introductory course for undergraduates. You are strongly advised to first take an undergraduate-level course such as "Introduction to Big Data Analytics" or "Data Mining" as a prerequisite.
This course involves a heavy load of parallel programming, which is a totally different concept from sequential programming.
Please think twice before taking this course!

Course Information

Latest News

(Tentative) Schedule

Week | Date | Content | Reading | Note
1 | Sep. 14, 2023 | Course Overview | |
2 | Sep. 21, 2023 | Ch.1, Introduction to distributed platforms & MapReduce: Hadoop, Spark | [MMDS3] Ch.1 & 2; [MR] Ch.1-3 |
3 | Sep. 28, 2023 | MapReduce programming: the basics; TA: Spark Cluster: Installation & configuration, platform usage demo | [MMDS3] Ch. 2; [MR] Ch. 3 |
4 | Oct. 5, 2023 | MapReduce Algorithm Design: design patterns (pairs & stripes), language models | [MMDS3] Ch. 2; [MR] Ch. 3 |
5 | Oct. 12, 2023 | Ch.3, Finding similar items; TA: Homework Q&A, term project proposal, and team member registration | [MMDS3] Ch.3 | HW#0
6 | Oct. 19, 2023 | Ch.3, Finding similar items | [MMDS3] Ch.3 | Due: HW#0; HW#1
7 | Oct. 26, 2023 | Ch.7, Clustering | [MMDS3] Ch.7 |
8 | Nov. 2, 2023 | Ch.11, Dimension reduction skim through | [MMDS3] Ch.11 | Term Project Proposal; Due: HW#1; HW#2
9 | Nov. 9, 2023 | Ch.9, Recommender systems - Part I; Ch.9, Recommender systems - Part II | [MMDS3] Ch.9 |
10 | Nov. 16, 2023 | (11/16: Midterm Exam) | |
11 | Nov. 23, 2023 | Ch.5, Link analysis - PageRank & HITS | [MMDS3] Ch.5 | Due: HW#2; Due: Proposal; HW#3
12 | Nov. 30, 2023 | Part II of link analysis: TrustRank, WebSpam | [MMDS3] Ch.5 |
13 | Dec. 7, 2023 | Ch.10, Mining social network graphs: Community Detection | [MMDS3] Ch.10 | Due: HW#3; HW#4; HW#5 (optional)
14 | Dec. 14, 2023 | (Leave for IEEE BigData 2023); (TA: Questions about homeworks and term projects); (Part II: Overlapping Communities) | [MMDS3] Ch.10 |
15 | Dec. 21, 2023 | (Leave for IEEE BigData 2023); (TA: Questions about homeworks and term projects); (Ch.12, Large-scale machine learning: kNN, Perceptron; Ch.12, Large-scale machine learning: SVM) | [MMDS3] Ch.12 |
16 | Dec. 28, 2023 | Term Project Presentation (Week 1) | |
17 | Jan. 4, 2024 | Term Project Presentation (Week 2) | | Due: HW#4; Due: HW#5
18 | Jan. 11, 2024 | Term Project Presentation (Week 3) | |

Programming Assignments and Projects

Please hand in your assignments before the deadline according to the following instructions.

Program Submission Instructions

NOTE: Programs or projects in electronic files must be submitted directly to the TA online at iSchool+.

If you cannot successfully submit your work, please contact the TA or the instructor.

Distributed Programming Environment

You can use your own computers, or virtual machines on any cloud computing service, to set up your distributed programming environment for this course.
  1. Computers:
    You can build your own distributed environment from commodity computers. Hadoop/Spark usually does not require very powerful machines.
  2. VMs on the cloud:
    You can use existing cloud computing services from Google, Amazon, or Microsoft Azure to set up a distributed environment. Most cloud services provide a free quota for limited usage (CPU, memory, network bandwidth, storage).

Homework

There will be about five programming exercises (and some written exercises) that target different data analysis tasks.
Given large datasets, all homework assignments require parallel programming using either Hadoop or Spark in distributed mode. (for CS students)
Since this is totally different from sequential programming, it takes much more effort and time to complete. Please think twice before taking this course!
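To give a rough feel for how the MapReduce style differs from a sequential loop, the following is a minimal single-machine sketch in plain Python. It only imitates the map, shuffle, and reduce phases; it is not Hadoop or Spark code, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order on an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data mining", "mining big graphs"]
    print(word_count(sample))
```

In a real cluster, each mapper and reducer call could run on a different machine, so each function should depend only on its own input and not on shared state.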

Even for very large datasets, you still need to analyze all data objects in the whole dataset. Design your program to first partition the data into several batches, and then merge the partial results into the final result.
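The partition-then-merge idea can be sketched as follows. This is a minimal illustration in plain Python, assuming numeric records and simple statistics (count, sum, min, max, mean); the helper names batches, summarize, merge, and analyze are hypothetical and not part of any assignment template.

```python
from itertools import islice


def batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def summarize(batch):
    """Compute partial statistics for one non-empty batch."""
    return {
        "count": len(batch),
        "total": sum(batch),
        "min": min(batch),
        "max": max(batch),
    }


def merge(left, right):
    """Combine two partial results into one."""
    return {
        "count": left["count"] + right["count"],
        "total": left["total"] + right["total"],
        "min": min(left["min"], right["min"]),
        "max": max(left["max"], right["max"]),
    }


def analyze(records, batch_size=1000):
    """Process every record batch by batch, then derive the mean."""
    merged = None
    for batch in batches(records, batch_size):
        partial = summarize(batch)
        merged = partial if merged is None else merge(merged, partial)
    if merged is None:
        return None  # no records at all
    merged["mean"] = merged["total"] / merged["count"]
    return merged


if __name__ == "__main__":
    print(analyze(range(1, 11), batch_size=3))
```

Note that the mean is derived from the merged count and total rather than by averaging per-batch means, which would be wrong when batches have different sizes.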

  1. HW#0 : Big data analysis platform setup & simple statistics in MapReduce (or Python in Jupyter Notebook)
    Due: Oct. 19, 2023
  2. HW#1 : Analyzing documents in MapReduce (or Python in Jupyter Notebook)
    Due: Nov. 2, 2023
  3. HW#2 : Statistics (counting, co-occurrence) of various data types in MapReduce (or Python in Jupyter Notebook)
    Due: Nov. 23, 2023
  4. HW#3 : Finding similar documents (by calculating minhash signatures, LSH, and KNN search) in MapReduce (or Python in Jupyter Notebook)
    Due: Dec. 7, 2023
  5. HW#4 : Recommender systems using collaborative filtering in MapReduce (or Python in Jupyter Notebook)
    Due: Jan. 4, 2024
  6. HW#5 (Optional) : Analyzing web graphs in MapReduce (Connectivity, PageRank, ...)
    Due: Jan. 4, 2024
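
As a rough illustration of the minhash signatures mentioned in HW#3, here is a minimal single-machine sketch in plain Python. It is not the assignment solution and does not cover LSH or KNN search. The word-level shingling, the hash family (a*x + b) mod p, the signature length, and all function names are assumptions made for this example.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime, larger than any crc32 value


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed tokenization)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """For each hash function, keep the minimum hash value over all shingles."""
    # crc32 gives a deterministic integer id per shingle across runs.
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    params = make_hash_params(num_hashes=100)
    doc_a = shingles("the quick brown fox jumps over the lazy dog")
    doc_b = shingles("the quick brown fox jumps over the sleepy cat")
    sig_a = minhash_signature(doc_a, params)
    sig_b = minhash_signature(doc_b, params)
    print(estimate_jaccard(sig_a, sig_b))
```

More hash functions give a less noisy estimate at the cost of longer signatures. In a distributed setting, each document's signature could be computed independently in a map step.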

Projects

Notes: For datasets available for potential term project topics, please check information about recent competitions.
  1. Term Project: paper presentation or system demonstration
    Item | Description | Time
    Proposal | You are required to submit a term project proposal one week after the midterm exam. | Nov. 23, 2023 (Thu.)
    Schedule | Due to time limits, we might have to start the term project presentations as early as Dec. 28, 2023 (Thu.). [NOTE] All presentations *must* be finished within the scheduled time slots, which will be the last *three* weeks of this semester. No other time slots will be available. | Dec. 28, 2023, Jan. 4, 11, 2024
    Report | Each team is *required* to upload the final report after finishing its presentation. The final report should contain at least the following: (1) presentation slides (for all teams), and (2) source code, installation/execution instructions, team members and task responsibility (for system projects). | Jan. 12, 2024 (Fri.)

Exams

  1. Midterm Exam: Nov. 6-10, 2023
  2. Final Exam: Jan. 8-12, 2024

Scores

Please check the homework submission site for more details.
E-mail: jhwang AT ntut . edu . tw
Created: Sep. 1, 2023.
Last Updated: Dec. 6, 2023.