Aim: rates of learning over function classes with a uniform convergence property (for example, Hölder functions), studied in different settings.
Textbook and Reference
Syllabus
Week 1: Concentration Inequality and Basic Asymptotic Statistics
 Basic concentration inequalities for sub-Gaussian/sub-exponential random variables
 Asymptotic normality, delta method, method of moments
 Efficiency of estimators, local asymptotic normality
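As a quick illustration of the Week 1 material, the sketch below (my addition, not part of the syllabus; the Bernoulli example, sample size, and tolerance are arbitrary choices) checks Hoeffding's sub-Gaussian tail bound by simulation:

```python
# Empirically checking the Hoeffding bound
#   P(|S_n/n - mu| >= t) <= 2 exp(-2 n t^2)
# for i.i.d. Bernoulli(1/2) variables, which are bounded and hence sub-Gaussian.
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 200, 20_000, 0.1

samples = rng.integers(0, 2, size=(trials, n))   # Bernoulli(1/2) draws
deviations = np.abs(samples.mean(axis=1) - 0.5)  # |empirical mean - mu|

empirical = (deviations >= t).mean()             # Monte Carlo tail probability
hoeffding = 2 * np.exp(-2 * n * t**2)            # Hoeffding upper bound

print(f"empirical tail: {empirical:.4f}  Hoeffding bound: {hoeffding:.4f}")
```

The simulated tail probability sits well below the bound, as expected for bounded variables.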
Reference:
 [HDP] Chapter 1
 [AS] Chapters 2-5
 [SemiPara] Chapter 2 and Chapter 3
Reading:
 Han J, Hu R, Long J. A class of dimension-free metrics for the convergence of empirical measures. Stochastic Processes and their Applications, 2023, 164: 242-287.
Week 2: Empirical Process Basics
 Symmetrization
 Covering numbers, Dudley's entropy integral
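To make the symmetrization device concrete, here is an illustrative sketch (my addition; the threshold class, sample size, and number of sign draws are arbitrary) that Monte Carlo estimates an empirical Rademacher complexity and compares it to the finite-class (Massart) bound obtained from the shattering count:

```python
# Empirical Rademacher complexity of the threshold class {x -> 1{x <= t}}:
# average over random signs of the supremum of the symmetrized process
#   sup_t (1/n) sum_i sigma_i 1{x_i <= t}.
import numpy as np

rng = np.random.default_rng(1)
n, draws = 500, 2_000

x = np.sort(rng.uniform(size=n))               # fixed sample; only the ordering
                                               # of the x_i matters for thresholds
sigma = rng.choice([-1, 1], size=(draws, n))   # Rademacher signs

# The sup over t is attained between order statistics, so it equals the
# maximum prefix sum of the signs (or 0, for t below all sample points).
prefix = np.cumsum(sigma, axis=1)
sup_vals = np.maximum(prefix.max(axis=1), 0) / n
rademacher = sup_vals.mean()

# Massart's finite-class bound with n+1 distinct labelings of the sample
bound = np.sqrt(2 * np.log(n + 1) / n)
print(f"empirical Rademacher complexity ~ {rademacher:.4f}, Massart bound ~ {bound:.4f}")
```

The estimate lands well below the log-covering bound, consistent with the sharper O(1/sqrt(n)) behavior of this VC class.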
Reference:
 [WCEP] Chapter 2
Week 3: Uniform Central Limit Theorems
 Motivation: goodness-of-fit testing, Kolmogorov-Smirnov statistics
 Uniform central limit theorem, functional delta method
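As a hands-on companion to the goodness-of-fit motivation (my addition; sample size and seed are arbitrary), the sketch below computes the Kolmogorov-Smirnov statistic directly from the empirical CDF; by the uniform CLT, sqrt(n) * D_n converges to the supremum of a Brownian bridge:

```python
# Kolmogorov-Smirnov statistic D_n = sup_x |F_n(x) - F(x)| computed by hand
# for a standard normal sample tested against the standard normal CDF.
import math
import numpy as np

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

rng = np.random.default_rng(2)
n = 1_000
sample = np.sort(rng.normal(size=n))

F = np.array([normal_cdf(v) for v in sample])  # model CDF at the order statistics
i = np.arange(1, n + 1)
# The supremum is attained at an order statistic: compare F against the
# empirical CDF just after (i/n) and just before ((i-1)/n) each jump.
D_n = max(np.max(i / n - F), np.max(F - (i - 1) / n))

# Asymptotic 95% critical value of the Kolmogorov distribution is about 1.358
print(f"D_n = {D_n:.4f},  sqrt(n)*D_n = {math.sqrt(n) * D_n:.3f}")
```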
Reference:
 Applications of Empirical Process Theory, Section 4.2
Reading:
 Quantifying the Empirical Wasserstein Distance to a Set of Measures: Beating the Curse of Dimensionality. NeurIPS, 2020.
 Gretton A, Borgwardt K M, Rasch M J, et al. A kernel two-sample test. The Journal of Machine Learning Research, 2012, 13(1): 723-773.
Week 4: Nonparametric Statistics 1: RKHS (Optional)
Reading:
 Dieuleveut A, Bach F. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 2016.
 Fischer S, Steinwart I. Sobolev norm learning rates for regularized least-squares algorithms. Journal of Machine Learning Research, 2020, 21(205): 1-38.
Week 5: Minimax Lower Bounds
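As a one-line pointer to the kind of argument covered here, Le Cam's two-point method (a standard statement, not taken from the linked course notes) lower-bounds the minimax risk by the difficulty of testing between two hypotheses:

```latex
% Le Cam's two-point method: if d(\theta_0, \theta_1) \ge 2\delta, then
\inf_{\hat\theta}\ \sup_{\theta \in \{\theta_0, \theta_1\}}
  \mathbb{E}_\theta\, d\bigl(\hat\theta, \theta\bigr)
  \;\ge\; \frac{\delta}{2}\,
  \bigl(1 - \mathrm{TV}(P_{\theta_0}, P_{\theta_1})\bigr)
```

A pair of parameters that are far apart in the loss d but whose distributions are close in total variation forces every estimator to incur risk of order delta.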
Reference:
 https://web.stanford.edu/class/ee378c/
Week 6: Nonparametric Statistics 2: Localized Complexity
Reference:
 Applications of Empirical Process Theory, Section 4.3
 Geoffrey Chinot's thesis: Section 1
 Lecture Note 2: Generalization Errors by Taiji Suzuki (in Japanese)
 Towards Problem-dependent Optimal Learning Rates (more detailed version: Xu Y, Zeevi A. Towards Optimal Problem Dependent Generalization Error Bounds in Statistical Learning Theory. arXiv preprint arXiv:2011.06186, 2020.)
 Rakhlin A, Sridharan K, Tsybakov A B. Empirical entropy, minimax regret and minimax risk. Bernoulli, 2017, 23(2): 789-824.
 Variance-regularized slow rates in the Orthogonal Statistical Learning paper
Reading:
 2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization by Vladimir Koltchinskii
 Yang J, Sun S, Roy D M. Fast-rate PAC-Bayes generalization bounds via shifted Rademacher processes. NeurIPS, 2019: 10802-10812.
 Kur G, Putterman E, Rakhlin A. On the Variance, Admissibility, and Stability of Empirical Risk Minimization. Advances in Neural Information Processing Systems, 2023, 36.
Week 8: Nonparametric Statistics 4: Deep Learning
 Neural scaling laws, and explaining neural scaling laws via nonparametric statistics
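To see what "explaining scaling laws" means operationally, the sketch below (my addition, on synthetic data; the constants and noise level are arbitrary) fits the power-law exponent of a loss curve L(N) ~ C * N^(-alpha), the quantity that nonparametric rates predict from the smoothness of the target class:

```python
# Fit a power-law exponent to synthetic loss-vs-dataset-size data by
# least squares on the log-log scale.
import numpy as np

true_alpha, C = 0.5, 3.0
N = np.array([1e3, 3e3, 1e4, 3e4, 1e5])          # dataset sizes
rng = np.random.default_rng(3)
# synthetic losses with small multiplicative noise
loss = C * N ** (-true_alpha) * np.exp(rng.normal(0, 0.02, size=N.size))

# log L = log C - alpha * log N, so the slope on log-log axes is -alpha
slope, intercept = np.polyfit(np.log(N), np.log(loss), deg=1)
alpha_hat = -slope
print(f"fitted exponent alpha ~ {alpha_hat:.3f} (true {true_alpha})")
```

Matching such fitted exponents to minimax rates for smoothness classes is the bridge the Week 8 references develop.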
Reference:
 Explaining Neural Scaling Laws by Yasaman Bahri et al. arXiv:2102.06701
 Suzuki T. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. ICLR 2019.
 Chen M, Jiang H, Liao W, et al. Nonparametric regression on low-dimensional manifolds using deep ReLU networks. NeurIPS 2019.
 Singh S, Uppal A, Li B, et al. Nonparametric density estimation under adversarial losses. NeurIPS 2018.
 Local Rademacher complexity of DNNs: http://ibis.t.utokyo.ac.jp/suzuki/lecture/2020/intensive2/%E8%AC%9B%E7%BE%A92.pdf
Week 9: Nonparametric Statistics 5: Classification
 Tsybakov A B. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 2004, 32(1): 135-166.
 Kim Y, Ohn I, Kim D. Fast convergence rates of deep neural networks for classification. arXiv preprint arXiv:1812.03599, 2018.
Using L∞ (sup-norm) covering numbers
 Bos T, Schmidt-Hieber J. Convergence rates of deep ReLU networks for multiclass classification. arXiv preprint arXiv:2108.00969, 2021.
 Audibert J Y, Tsybakov A B. Fast learning rates for plug-in classifiers. The Annals of Statistics, 2007, 35(2): 608-633.
Week 10: Semiparametric Statistics
Reference:
 [SemiPara] Sections 3 and 4
 Foster D J, Syrgkanis V. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
 Mackey L, Syrgkanis V, Zadik I. Orthogonal machine learning: Power and limitations. International Conference on Machine Learning. PMLR, 2018: 3375-3383.

 Causal Inference Tutorial by Rahul Singh
 Stats361: Causal Inference by Stefan Wager
Reading:
 Knowledge Distillation as SemiParametric Inference ICLR 2021
 Demirer M, Syrgkanis V, Lewis G, et al. Semiparametric efficient policy learning with continuous actions. arXiv preprint arXiv:1905.10116, 2019.
 Chernozhukov V, Chetverikov D, Demirer M, et al. Double/debiased machine learning for treatment and structural parameters. 2018.
Reading:
 Chernozhukov V, Newey W, Singh R, et al. Adversarial Estimation of Riesz Representers. arXiv preprint arXiv:2101.00009, 2020.
https://arxiv.org/pdf/2105.15197.pdf
Week 11: Nonparametric Bandits
Reference:
 Foster D, Rakhlin A. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. International Conference on Machine Learning. PMLR, 2020: 3199-3210.
 Hu Y, Kallus N, Mao X. Smooth contextual bandits: Bridging the parametric and non-differentiable regret regimes. Conference on Learning Theory. PMLR, 2020: 2007-2010.
 Rigollet P, Zeevi A. Nonparametric bandits with covariates. Conference on Learning Theory (COLT), 2010: 54.
 Krishnamurthy S K, Hadad V, Athey S. Adapting to misspecification in contextual bandits with offline regression oracles. arXiv preprint arXiv:2102.13240, 2021.
Reading:
 Wang T, Rudin C. Bandits for BMO Functions. International Conference on Machine Learning. PMLR, 2020: 9996-10006.
 Wang T, Ye W, Geng D, et al. Towards practical Lipschitz stochastic bandits. arXiv preprint arXiv:1901.09277, 2019.
Week 12: Nonparametric Weakly Supervised Learning
 Active learning, disagreement-coefficient-based methods
 Nonparametric transfer learning
Reference:
 Works by Samory Kpotufe
 Cai T T, Wei H. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics, 2021, 49(1): 100-128.
 Minsker S. Plug-in approach to active learning. Journal of Machine Learning Research, 2012, 13: 67-90.
 Castro R M, Nowak R D. Minimax bounds for active learning. IEEE Transactions on Information Theory, 2008, 54(5): 2339-2353.
Week 13: Nonconvex Landscapes and Uniform Convergence of Gradients
Reference:
 https://github.com/tengyuma/cs229m_notes/blob/main/Winter2021/pdf/02172021.pdf
 https://github.com/tengyuma/cs229m_notes/blob/main/Winter2021/pdf/03012021.pdf
 https://github.com/tengyuma/cs229m_notes/blob/main/Winter2021/pdf/03032021.pdf
 Mei S, Bai Y, Montanari A. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 2018, 46(6A): 2747-2774.
 Foster D J, Sekhari A, Sridharan K. Uniform convergence of gradients for non-convex learning and optimization. arXiv preprint arXiv:1810.11059, 2018.
Week 14: Drawbacks of Empirical Process Theory, Interpolation Estimators
Reference:
 Yang Z, Bai Y, Mei S. Exact gap between generalization error and uniform convergence in random feature models. arXiv preprint arXiv:2103.04554, 2021.
 Average-case complexity, reference:
Week 15: Non-P-Donsker Classes
Examples: log-concave densities, convex functions; cases where ERM is not optimal
References:
 Kur G, Dagan Y, Rakhlin A. Optimality of maximum likelihood for log-concave density estimation and bounded convex regression. arXiv preprint arXiv:1903.05315, 2019.
 Kur G, Rakhlin A. On the Minimal Error of Empirical Risk Minimization. arXiv preprint arXiv:2102.12066, 2021.
Week 16: Learning Dynamical Systems
 Hang H, Steinwart I. A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning. The Annals of Statistics, 2017, 45(2): 708-743.
 Foster D, Sarkar T, Rakhlin A. Learning nonlinear dynamical systems from a single trajectory. Learning for Dynamics and Control. PMLR, 2020: 851-861.
 Nickl R, Ray K. Nonparametric statistical inference for drift vector fields of multi-dimensional diffusions. The Annals of Statistics, 2020, 48(3): 1383-1408.
 Hang H, Steinwart I. Fast learning from α-mixing observations. Journal of Multivariate Analysis, 2014, 127: 184-199.
Reading List