Statistical Learning (IEMS-402, Northwestern)
Graduate course, Northwestern IEMS, 2025
Required graduate course on mathematical foundations of data science. Syllabus: [link]
Objectives:
This course provides foundational and advanced concepts in statistical learning theory, essential for analyzing complex data and making informed predictions. Students will delve into both asymptotic and non-asymptotic analyses of machine learning algorithms, addressing critical challenges such as model bias, variance, and robustness in uncertain environments. Toward the end of the course, students will apply these principles to modern machine learning contexts, including the scaling laws and benign overfitting of deep learning, generative AI, and language models (e.g., the Neural Tangent Kernel, the mean-field limit of neural networks, and in-context learning).
ChatGPT Tutor: The [link] here provides a Large Language Model agent that is specifically trained for this course.
If you have a question, we strongly encourage you to post it to the class Piazza forum to get a quick response from the teaching staff. For suggestions to improve Yiping’s teaching, you can use the (anonymous) form. You can also raise a private matter here. If you wish to contact me via email, please include the tag “[IEMS402]” in the subject line so that I do not overlook your message.
Syllabus
Textbook
- [Learning Theory from First Principles] Francis Bach [LTFP]
- [Asymptotic Statistics] A. W. van der Vaart [AS]
Prerequisites
- Calculus, Linear Algebra
- IEMS 302 Probability
- Probability and Statistics: Strong Law of Large Numbers, Central Limit Theorem, Big-O and little-o notation
- Optimization Theory: Lagrangian Duality (IEMS 450-2: Mathematical Optimization II; interestingly, IEMS 450-1 is not required)
Here is a review of all the preliminaries we’ll use in this class:
- Probability and Optimization Review: [link]
- The different notions of convergence in probability are useful but not required in our class; we will only give intuitive proofs.
Schedule
- Lecture 1-2: Concept 1.6/1.8 [Feedback for Lecture 1] [Feedback for Lecture 2]
- Supervised Learning, Empirical Risk Minimization, Risk Decomposition ([LTFP] Section 2)
- Design of Loss Functions: Maximum Likelihood, Surrogate Loss
- Difference between IEMS402 (Statistical Learning) and IEMS401 (Applied Statistics)
- Loss Convergence vs. Parameter Recovery
- Non-parametric vs. Parametric
- Degrees of Freedom
- Why the success of Deep Learning is a mystery:
- Curse of Dimensionality for Approximation
- Benign Overfitting
- Non-convexity of the Optimization Landscape, Selection among Multiple Minima
- Distribution Mismatch
- Suggested Reading:
- Zhang C, Bengio S, Hardt M, et al. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 2021, 64(3): 107-115.
- Kaplan J, McCandlish S, Henighan T, et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Nakkiran P, Kaplun G, Bansal Y, et al. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021, 2021(12): 124003.
- Curth A, Jeffares A, van der Schaar M. A U-turn on double descent: Rethinking parameter counting in statistical learning. Advances in Neural Information Processing Systems, 2024, 36.
- Xiao L. Rethinking conventional wisdom in machine learning: From generalization to scaling. arXiv preprint arXiv:2409.15156, 2024.
- (Required) Mallinar N, Simon J, Abedsoltan A, et al. Benign, tempered, or catastrophic: Toward a refined taxonomy of overfitting. Advances in Neural Information Processing Systems, 2022, 35: 1182-1195.
- Li H, Xu Z, Taylor G, et al. Visualizing the loss landscape of neural nets. Advances in neural information processing systems, 2018, 31.
- Keskar N S, Mudigere D, Nocedal J, et al. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
- Wipf D P, Rao B D. Sparse Bayesian learning for basis selection. IEEE Transactions on Signal Processing, 2004, 52(8): 2153-2164.
- Mücke N, Steinwart I. Empirical Risk Minimization in the Interpolating Regime with Application to Neural Network Learning. arXiv preprint arXiv:1905.10686, 2019.
- Lecture 3: 1.13
- Bias-Variance Trade-off of Kernel Smoothing Estimator, Curse of Dimensionality ([LTFP] Section 6)
- Advanced Reading:
- Xing Y, Song Q, Cheng G. Benefit of interpolation in nearest neighbor algorithms. SIAM Journal on Mathematics of Data Science, 2022, 4(2): 935-956.
- Chhor J, Sigalla S, Tsybakov A B. Benign overfitting and adaptive nonparametric regression. Probability Theory and Related Fields, 2024: 1-32.
- Lecture 4: Concept 1.15
- Diagram of Learning: Supervised, Unsupervised, Semi-supervised, Self-supervised, Generative AI
- Spectral Clustering and t-SNE
- Infomax and Self-supervised Learning
- Relationship between Spectral Clustering, t-SNE, and Self-supervised Learning
- Suggested Reading:
- Zhou X, Belkin M. Semi-supervised learning by higher order regularization. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2011: 892-900.
- Hjelm R D, Fedorov A, Lavoie-Marchildon S, et al. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
- Linderman G C, Steinerberger S. Clustering with t-SNE, provably. SIAM journal on mathematics of data science, 2019, 1(2): 313-332.
- HaoChen J Z, Wei C, Gaidon A, et al. Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems, 2021, 34: 5000-5011.
- Advanced Reading:
- X. Cheng and N. Wu. “Eigen-convergence of Gaussian kernelized graph Laplacian by manifold heat interpolation”. Applied and Computational Harmonic Analysis, 61, 132-190 (2022)
- Cai T T, Ma R. Theoretical foundations of t-SNE for visualizing high-dimensional clustered data. Journal of Machine Learning Research, 2022, 23(301): 1-54.
- Lecture 5: 1.20 [No Class] (Martin Luther King Jr. Day)
- Lecture 6: 1.22
- Asymptotic normality
- Inverse function theorem, (Implicit) Delta Method
- Moment Methods
- Lecture 7: 1.27
- Influence Function, Fisher Information
- Cramér-Rao Bound and the Cons of Unbiased Estimators (James-Stein Estimator)
- Suggested Reading:
- Koh P W, Liang P. Understanding black-box predictions via influence functions. International Conference on Machine Learning. PMLR, 2017: 1885-1894. (ICML 2017 Best Paper)
- Basu S, Pope P, Feizi S. Influence functions in deep learning are fragile.
- Lecture 8: 1.29
- Concentration Inequality
- Application of Concentration Inequality: Johnson-Lindenstrauss Lemma
- Lecture 9: 2.3
- Generalization via Uniform Bound, Covering Number
- Lecture 10: 2.5
- Symmetrization, Rademacher complexity
- Advanced Reading:
- Kur G, Putterman E, Rakhlin A. On the variance, admissibility, and stability of empirical risk minimization. Advances in Neural Information Processing Systems, 2024, 36.
- Kur G, et al. Minimum norm interpolation meets the local theory of Banach spaces. Forty-first International Conference on Machine Learning, 2024.
- Xu Y, Zeevi A. Towards optimal problem dependent generalization error bounds in statistical learning theory. Mathematics of Operations Research, 2024.
- Lecture 11: 2.10 [Advanced Topic]
- Generalization Theory of Neural Networks [Textbook]
- Other Ideas of Generalization:
- PAC-Bayes, Algorithm Stability, …
- Suggested Reading:
- Bartlett, P.L., 1998. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE transactions on Information Theory, 44(2), pp.525-536
- Bartlett, P.L., Foster, D.J. and Telgarsky, M., 2017, December. Spectrally-normalized margin bounds for neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 6241-6250)
- Jiang Y, Neyshabur B, Mobahi H, et al. Fantastic generalization measures and where to find them. arXiv preprint arXiv:1912.02178, 2019.
- Jiang Y, Nagarajan V, Baek C, et al. Assessing generalization of SGD via disagreement. arXiv preprint arXiv:2106.13799, 2021.
- Lecture 12: [Midterm] 2.12
- Lecture 13: 2.17
- Ledoux-Talagrand Contraction Principle
- Dudley’s theorem
- Lecture 14: 2.19 [Advanced Topic]
- Localized Complexity
- Non-parametric Least Squares
- Lecture 15: 2.24
- Reproducing Kernel Hilbert Space
- Advanced Reading:
- Haas M, Holzmüller D, Luxburg U, et al. Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension. Advances in Neural Information Processing Systems, 2024, 36.
- Schölpple M, Steinwart I. Which Spaces can be Embedded in Reproducing Kernel Hilbert Spaces? arXiv preprint arXiv:2312.14711, 2023.
- Lu Y, Lin D, Du Q. Which Spaces can be Embedded in Lp-type Reproducing Kernel Banach Space? A Characterization via Metric Entropy. arXiv preprint arXiv:2410.11116, 2024.
- Lecture 16: 2.26
- Distribution Shift (Stanford CS329D Machine Learning Under Distribution Shift)
- Distributionally Robust Optimization
- Suggested Reading:
- Toward an inductive modeling language for distribution shifts
- Geirhos R, Jacobsen J H, Michaelis C, et al. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2020, 2(11): 665-673.
- Sagawa S, Koh P W, Hashimoto T B, et al. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization.
- Duchi J, Namkoong H. Variance-based regularization with convex objectives. Journal of Machine Learning Research, 2019, 20(68): 1-55. (Neurips 2017 Best paper)
- Duchi J C, Namkoong H. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 2021, 49(3): 1378-1406.
- Hu W, Niu G, Sato I, et al. Does distributionally robust supervised learning give robust classifiers? International Conference on Machine Learning. PMLR, 2018: 2029-2037.
- Lecture 17: 3.3 [Advanced Topic]
- Optimal Transport, Kantorovich Duality
- Statistics of Optimal Transport and Smoothed Optimal Transport
- Suggested Reading (Optimal Transport):
- (Textbook) Computational Optimal Transport
- (Central Limit Theorem) Goldfeld Z, Greenewald K. [Gaussian-smoothed optimal transport: Metric structure and statistical efficiency] International Conference on Artificial Intelligence and Statistics. PMLR, 2020: 3327-3337.
- (Central Limit Theorem) Si N, Blanchet J, Ghosh S, et al. Quantifying the empirical Wasserstein distance to a set of measures: Beating the curse of dimensionality. Advances in Neural Information Processing Systems, 2020, 33: 21260-21270.
- (Optimization) Li J, Tang J, Kong L, et al. A convergent single-loop algorithm for relaxation of Gromov-Wasserstein in graph data. arXiv preprint arXiv:2303.
- Lecture 18: 3.5 [Advanced Topic]
- Sampling via Wasserstein Gradient Flow (JKO Scheme)
- Generative Modeling and Diffusion Model
- Suggested Reading (Sampling):
- Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
- Liu Q, Wang D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 2016, 29.
- Log-concave sampling
- Chen A Y, Sridharan K. From Optimization to Sampling via Lyapunov Potentials. arXiv preprint arXiv:2410.02979, 2024.
- Suggested Reading
- Song Y, Sohl-Dickstein J, Kingma D P, et al. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- Lipman Y, Chen R T Q, Ben-Hamu H, et al. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- Albergo M S, Boffi N M, Vanden-Eijnden E. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
- Lecture 19: 3.10
- Deep Learning Theory: ([LTFP] Section 12.3, 12.5)
- Neural Tangent Kernel
- Mean-Field Limit of Neural Networks
- Suggested Reading:
- Bach F. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 2017, 18(19): 1-53.
- Chizat L, Oyallon E, Bach F. On lazy training in differentiable programming. Advances in neural information processing systems, 2019, 32.
- Advanced Reading:
- Chizat L, Bach F. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 2018, 31.
- Yang G, Hu E J, Babuschkin I, et al. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
- Hayou S, Ghosh N, Yu B. Lora+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354, 2024.
- Ishikawa S, Karakida R. On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width. arXiv preprint arXiv:2312.12226, 2023.
- Lecture 20: 3.12 [Advanced Topic]
- Implicit Bias ([LTFP] Section 12.1)
- Large Language Model
- In-context Learning, Chain-of-Thought and Circuit Theory, Alignment of AI
- Advanced Reading:
- Kim J, Nakamaki T, Suzuki T. Transformers are minimax optimal nonparametric in-context learners. arXiv preprint arXiv:2408.12186, 2024.
- Von Oswald J, Niklasson E, Randazzo E, et al. Transformers learn in-context by gradient descent. International Conference on Machine Learning. PMLR, 2023: 35151-35174.
- Shen L, Mishra A, Khashabi D. Do pretrained Transformers Really Learn In-context by Gradient Descent?. arXiv preprint arXiv:2310.08540, 2023.
- Giannou A, Yang L, Wang T, et al. How Well Can Transformers Emulate In-context Newton’s Method?. arXiv preprint arXiv:2403.03183, 2024.
- Feng G, Zhang B, Gu Y, et al. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 2024, 36.
- Lee J, Xie A, Pacchiano A, et al. Supervised pretraining can learn in-context reinforcement learning. Advances in Neural Information Processing Systems, 2024, 36.
Homework
Your grade will be computed as max(HW1, HW8) + max(HW2, HW3) + max(HW4, HW5) + max(HW6, HW7); see the short sketch after the homework list below.
- [Homework 1]: Review of Probability Statistics and Optimization
- [Homework 2]: Bias and Variance Trade-off 1: Overparameterization
- [Homework 3]: Bias and Variance Trade-off 2: Kernel Smoothing
- [Homework 4]: Asymptotic Theory 1
- [Homework 5]: Asymptotic Theory 2 and Concentration Inequality
- [Homework 6]: Generalization Error
- [Homework 7]: Complexity of Hypothesis Space
- [Homework 8]: RKHS and Robust Learning
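For concreteness, here is a minimal Python sketch of the grading rule above. Only the max-pairing formula comes from the syllabus; the `course_grade` helper and the example scores are hypothetical placeholders.

```python
# Minimal sketch of the grading rule:
# max(HW1, HW8) + max(HW2, HW3) + max(HW4, HW5) + max(HW6, HW7).
# The scores below are hypothetical; only the pairing rule comes from the syllabus.

def course_grade(hw):
    """Combine the eight homework scores using the max-pairing rule."""
    pairs = [("HW1", "HW8"), ("HW2", "HW3"), ("HW4", "HW5"), ("HW6", "HW7")]
    return sum(max(hw[a], hw[b]) for a, b in pairs)

# Hypothetical example: eight homework scores out of 100.
scores = {"HW1": 80, "HW2": 95, "HW3": 70, "HW4": 88,
          "HW5": 92, "HW6": 60, "HW7": 85, "HW8": 90}
print(course_grade(scores))  # 90 + 95 + 92 + 85 = 362
```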
Exams
- [Practice Mid-Term Exam]
- Modern Machine Learning Concepts, Bias and Variance Trade-off
- Kernel Smoothing, Asymptotic Theory, Influence Function, Concentration Inequality, Uniform Bound
- [Practice Final Exam]
- Rademacher complexity, Covering Number, Dudley’s theorem
- RKHS, Optimal Transport, Robust Learning
Other Reading
General Policies
In general, we do not grant extensions on assignments/exams. There are several exceptions:
- Medical Emergencies: If you are sick and unable to complete an assignment or attend class, please go to University Health Services. For minor illnesses, we expect grace days or our late penalties to provide sufficient accommodation. For medical emergencies (e.g. prolonged hospitalization), students may request an extension afterward by contacting their Student Liaison or Academic Advisor and having them reach out to the instructor on their behalf. Please plan ahead if possible.
- Family/Personal Emergencies: If you have a family emergency (e.g. death in the family) or a personal emergency (e.g. mental health crisis), please contact your academic adviser or Counseling and Psychological Services (CaPS). In addition to offering support, they will reach out to the instructors for all your courses on your behalf to request an extension.
- University-Approved Absences: If you are attending an out-of-town university-approved event (e.g. multi-day athletic/academic trip organized by the university), you may request an extension for the duration of the trip. You must provide confirmation of your attendance, usually from a faculty or staff organizer of the event.
Accommodations for Students with Disabilities: If you have a disability and have an accommodation letter from the Disability Resources office, I encourage you to discuss your accommodations and needs with the Moses Center for Student Accessibility as early as possible in the semester for assistance. (Telephone: 212-998-4980, Website: http://www.nyu.edu/csd) We will work with you to ensure that accommodations are provided as appropriate. If you suspect that you may have a disability and would benefit from accommodations but are not yet registered with the Office of Disability Resources, I encourage you to contact them. Please note that it is your responsibility to schedule exams at the Moses Center, and to ensure that you are receiving all accommodations you are approved for.
Collaboration among Students: The purpose of student collaboration is to facilitate learning, not to circumvent it. Studying the material in groups is strongly encouraged. You may also seek help from other students in understanding the material needed to solve a particular homework problem, provided no written notes (including code) are shared or taken at that time, and provided learning is facilitated, not circumvented. The actual solution must be written up by each student alone.
The presence or absence of any form of help or collaboration, whether given or received, must be explicitly stated and disclosed in full by all involved. Specifically, each assignment solution must include answering the following questions:
- Did you receive any help whatsoever from anyone in solving this assignment? Yes / No. If you answered ‘yes’, give full details: ____ (e.g. “Jane Doe explained to me what is asked in Question 3.4”)
- Did you give any help whatsoever to anyone in solving this assignment? Yes / No. If you answered ‘yes’, give full details: _____ (e.g. “I pointed Joe Smith to section 2.3 since he didn’t know how to proceed with Question 2”)