Big Data Mining and Applications, Fall 2025
This course discusses issues in scalable mining of big data in various applications, a topic that has been gaining popularity in recent years.
The major focus of this course is using parallel programming on distributed platforms to efficiently mine different kinds of big data.
The course is offered at the graduate level and will be taught in English.
Notes on Online Sessions of this Course
All students enrolled in this course will be added to the Microsoft Teams team for the corresponding course number.
A channel for the team will be created for Big Data Mining [course number: xxx] (TBD)
[NOTE] If you really cannot attend the class in person due to unforeseeable reasons, you may join the corresponding online session in Teams.
During course sessions, remember to sign in with your name and student ID in Teams when the TA reminds you.
Please read the course notes on the requirements for the programming homework and quiz very carefully.
[*** NOTE ***]: This course is NOT an introductory course for undergraduates.
You are strongly advised to first take an undergraduate-level course such as "Introduction to Big Data Analytics" or "Data Mining" as a prerequisite.
This course involves a heavy load of parallel programming, which is conceptually very different from sequential programming.
Please think twice before taking this course!
Course Information
- Instructor: Jenq-Haur Wang
- E-mail: mailbox at the domain of NTUT, and my username is jhwang
- Office: R1534, Technology Building (ext. 4238)
- Office Hours: Tue. & Fri. 9:10-12:00
- Class Hours: Mon. 15:10-18:00, R234, Technology Building
- TA: (TBD) (R1424, Technology Building)
e-mail: (TBA)
- Prerequisite: data structures (and algorithms), database systems, discrete math (probability), linear algebra, basic data mining,
and working experience with high-level programming languages
- Textbook: (Selected chapters)
- Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman, Mining of Massive Datasets, 3rd Edition, Cambridge University Press, Feb. 2020. ([MMDS3])
- Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010. ([MR])
- References:
- Jiawei Han, Jian Pei, and Hanghang Tong, Data Mining: Concepts and Techniques, 4th ed., Morgan Kaufmann Publishers, Oct. 2022. ([DM4])
- Tom White, Hadoop: The Definitive Guide, 4th ed., O'Reilly Media, 2015.
- Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia, Learning Spark: Lightning-Fast Big Data Analysis, O'Reilly Media, January 2015.
- Others: Web documents, tutorials, and selected academic papers.
Latest News
- Aug. 14, 2025: The homepage for Big Data Mining and Applications (Fall 2025) has been set up.
(Tentative) Schedule
Week | Date | Content | Reading | Note |
1 | Sep. 8, 2025 | Course Overview | | |
2 | Sep. 15, 2025 | Ch.1, Introduction to distributed platforms & MapReduce: Hadoop, Spark | [MMDS3] Ch.1 & 2; [MR] Ch.1-3 | |
3 | Sep. 22, 2025 | MapReduce programming: the basics | [MMDS3] Ch.2; [MR] Ch.3 | |
4 | Sep. 29, 2025 | MapReduce Algorithm Design: design patterns (pairs & stripes), language models | [MMDS3] Ch.2; [MR] Ch.3 | HW#0 |
5 | Oct. 6, 2025 | Ch.3, Finding similar items; TA: Spark Cluster: Installation & configuration, platform usage demo | [MMDS3] Ch.3 | |
6 | Oct. 13, 2025 | Ch.3, Finding similar items; TA: Homework Q&A, term project proposal, and team member registration | [MMDS3] Ch.3 | Due: HW#0; HW#1 |
7 | Oct. 20, 2025 | Ch.7, Clustering | [MMDS3] Ch.7 | |
8 | Oct. 27, 2025 | Ch.11, Dimension reduction | skim through [MMDS3] Ch.11 | Term Project Proposal; Due: HW#1; HW#2 |
9 | Nov. 3, 2025 | Ch.9, Recommender systems - Part I; Ch.9, Recommender systems - Part II | [MMDS3] Ch.9 | |
10 | Nov. 10, 2025 | (11/10: Midterm Exam) | | |
11 | Nov. 17, 2025 | Ch.5, Link analysis - PageRank & HITS (Part I) | [MMDS3] Ch.5 | Due: HW#2; Due: Proposal; HW#3 |
12 | Nov. 24, 2025 | Link analysis: TrustRank, WebSpam (Part II) | [MMDS3] Ch.5 | |
13 | Dec. 1, 2025 | Ch.10, Mining social network graphs: Community Detection | [MMDS3] Ch.10; ([MMDS3] Ch.12) | Due: HW#3; HW#4 |
14 | Dec. 8, 2025 | (Part II: Overlapping Communities); (Ch.12, Large-scale machine learning: kNN, Perceptron; Ch.12, Large-scale machine learning: SVM) | | |
15 | Dec. 15, 2025 | Term Project Presentation (Week 1) | | Due: HW#4; HW#5 |
16 | Dec. 22, 2025 | Term Project Presentation (Week 2) | | |
17 | Dec. 29, 2025 | Term Project Presentation (Week 3) | | Due: HW#5 |
18 | Jan. 5, 2026 | Term Project Presentation (Week 4) | | |
Programming Assignments and Projects
Please hand in your assignments before the deadline according to the following instructions.
Program Submission Instructions
NOTE: Programs or projects in electronic files must be submitted directly to the TA online at iSchool+.
If you cannot successfully submit your work, please contact the TA or the instructor.
Distributed Programming Environment
You can use your own computers, or virtual machines on any cloud computing service, to set up your distributed programming environment for this course.
- Computers:
You can build your own distributed environment from commodity computers.
Hadoop/Spark usually does not require each machine to be very powerful.
- VMs on the cloud:
You can use existing cloud computing services from Google, Amazon, or Microsoft Azure to set up a distributed environment.
Most cloud services provide a free quota for limited usage (CPU, memory, network bandwidth, storage).
Homework
There will be about 5 programming exercises (and some written exercises) that target different data analysis tasks.
Given large datasets, all homework assignments require parallel programming using either Hadoop or Spark, in distributed mode. (for CS students)
Since this is totally different from sequential programming, it takes much more effort and time to complete.
Please think twice before taking this course!
Even for very large datasets, you still need to analyze all data objects in the whole dataset.
Please design your program accordingly: first partition the data into several batches, process each batch, and then merge the partial results into the final result.
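As an illustration only, the partition-then-merge idea can be sketched on a single machine as follows. This assumes word counting as the task; the function names (partition, count_words, merge) and the sample data are hypothetical and are not part of any homework specification. On Hadoop or Spark, the framework itself handles the partitioning and the merging across machines.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def partition(records: Iterable, batch_size: int) -> Iterator[List]:
    """Yield consecutive batches of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def count_words(batch: List[str]) -> Counter:
    """Compute a partial word count for one batch (the 'map' side)."""
    counts = Counter()
    for line in batch:
        counts.update(line.lower().split())
    return counts


def merge(partials: Iterable[Counter]) -> Counter:
    """Combine the partial counts into the final result (the 'reduce' side)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical sample data: every record is processed exactly once,
# but never more than batch_size records are held in a batch at a time.
lines = ["big data mining", "mining big graphs", "data data"]
result = merge(count_words(batch) for batch in partition(lines, batch_size=2))
print(result["data"], result["mining"], result["big"])  # prints: 3 2 2
```

Because word counts are additive, merging the partial counters gives the same answer as counting the whole dataset at once. Tasks whose partial results are not simply additive (for example, medians) need a different merge step.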
- HW#0: Big data analysis platform setup & simple statistics in MapReduce (or Python in Jupyter Notebook)
- HW#1: Analyzing documents in MapReduce (or Python in Jupyter Notebook)
- HW#2: Statistics (counting, co-occurrence) of various data types in MapReduce (or Python in Jupyter Notebook)
- HW#3: Finding similar documents (by calculating minhash signatures, LSH, and KNN search) in MapReduce (or Python in Jupyter Notebook)
- HW#4: Recommender systems using collaborative filtering in MapReduce (or Python in Jupyter Notebook)
- HW#5: Analyzing web graphs in MapReduce (Connectivity, PageRank, ...)
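To give a rough idea of the kind of computation involved in HW#3, here is a minimal single-machine sketch of minhash signatures over character shingles. It is not the required solution or the homework specification: the function names, the shingle length k=3, the number of hash functions, the seed, and the sample strings are all assumptions chosen for illustration, and a homework solution would still need to be expressed in MapReduce or Spark.

```python
import random
import zlib
from typing import List, Set, Tuple

# A prime just above 2**32, so it exceeds every 32-bit crc32 value.
PRIME = 4_294_967_311


def shingles(text: str, k: int = 3) -> Set[str]:
    """Return the set of character k-shingles of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def make_hash_params(num_hashes: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items: Set[str], params: List[Tuple[int, int]]) -> List[int]:
    """For each hash function, keep the minimum hash value over all items."""
    if not items:
        raise ValueError("cannot compute a signature of an empty set")
    # crc32 gives a deterministic integer id per shingle across runs.
    ids = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_jaccard(sig1: List[int], sig2: List[int]) -> float:
    """Fraction of signature positions where the two signatures agree."""
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)


params = make_hash_params(num_hashes=100, seed=42)
sig_a = minhash_signature(shingles("mining of massive datasets"), params)
sig_b = minhash_signature(shingles("mining of massive data sets"), params)
print(estimate_jaccard(sig_a, sig_b))  # an estimate of the Jaccard similarity
```

Both documents must use the same hash parameters for their signatures to be comparable. The estimate becomes more reliable as the number of hash functions grows, at the cost of longer signatures.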
Projects
Note: For datasets available for potential term project topics, please check information about recent competitions.
- Term Project: paper presentation or system demonstration
Item | Description | Time |
Proposal | You are required to submit a proposal for the term project one week after the midterm exam. | Nov. 17, 2025 (Mon.) |
Schedule | Due to time limits, we might have to start the term project presentations as early as Dec. 15, 2025 (Mon.). * [NOTE] All presentations *must* be finished within the scheduled time slots, which will be the last *four* weeks of this semester. No other time slots will be available. | Dec. 15, 22, 29, 2025, Jan. 5, 2026 |
Report | Each team is *required* to upload the final report after finishing its presentation. The final report should contain at least the following: presentation slides (for all teams), and source code, installation/execution instructions, team members and task responsibility (for system projects) | Jan. 9, 2026 (Fri.) |
Exams
- Midterm Exam: Nov. 3-7, 2025
- Date: Nov. 10, 2025 (Mon.)
- Time: 15:10-18:00
- Location: R234, Technology Building
- Range: Ch. 1, 3, 7, 9, 11
- Final Exam: Jan. 5-9, 2026
- Note: There will be no final exam in this course. Instead, you are required to finish a term project for system development.
Scores
Please check the homework submission site for more details.
E-mail: jhwang AT ntut . edu . tw
Created: Aug. 14, 2025.
Last Updated: Aug. 14, 2025.