2022 Reading Group
Aim: study the rate of learning when the hypothesis class is a function space with a uniform convergence property (for example, Hölder functions), across different settings.
Textbooks and References
- Weak Convergence and Empirical Processes, by van der Vaart and Wellner [WCEP]
- Concentration Inequalities: A Nonasymptotic Theory of Independence, by Boucheron et al. [CI]
- Asymptotic Statistics, by van der Vaart [AS]
- High Dimensional Probability, by Roman Vershynin [HDP]
- High Dimensional Statistics, by Martin Wainwright [HDS]
- Empirical Processes in M-Estimation, by Sara van de Geer [M-Estimate]
- Introduction to Semi-parametric Statistics by Bodhisattva Sen [Semi-Para]
- Introduction to Nonparametric Statistics by Bodhisattva Sen [Non-para]
- Mathematical Statistics: A Non-Asymptotic Approach, by A. Rakhlin
Syllabus
Week 1: Concentration Inequality and Basic Asymptotic Statistics
- Basic concentration inequalities for sub-Gaussian/sub-exponential random variables (a standard tail bound is sketched after this week's reading)
- Asymptotic normality, delta method, moment method
- Efficiency of estimators, local asymptotic normality
Reference:
- [HDP] Chapter 1
- [AS] Chapters 2-5
- [Semi-Para] Chapters 2 and 3
Reading:
- Han J, Hu R, Long J. A class of dimension-free metrics for the convergence of empirical measures. Stochastic Processes and their Applications, 2023, 164: 242-287.
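For quick reference on the first bullet above, here is one standard statement of the sub-Gaussian tail bound; the constants follow a common convention and differ slightly between [HDP] and [CI].

```latex
% Sub-Gaussian random variable: X is sub-Gaussian with parameter sigma if
\mathbb{E}\, e^{\lambda (X - \mathbb{E}X)} \;\le\; e^{\lambda^2 \sigma^2 / 2}
\quad \text{for all } \lambda \in \mathbb{R}.
% Hoeffding-type bound for independent sub-Gaussian X_1, ..., X_n (via the Chernoff argument):
\mathbb{P}\Big( \sum_{i=1}^n (X_i - \mathbb{E}X_i) \ge t \Big)
\;\le\; \exp\!\Big( -\frac{t^2}{2 \sum_{i=1}^n \sigma_i^2} \Big).
```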
Week 2: Uniform Law of Large Numbers
- Symmetrization
- Covering numbers, Dudley's entropy integral (both sketched below)
Reference:
- [WCEP] Chapter 2
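The two topics above can be compressed into two displays; constants are suppressed and vary across sources ([WCEP] Chapter 2, [HDP]).

```latex
% Symmetrization: for i.i.d. X_1, ..., X_n and independent Rademacher signs eps_i,
\mathbb{E} \sup_{f \in \mathcal{F}} \big| P_n f - P f \big|
\;\le\; 2\, \mathbb{E} \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big|.
% Dudley's entropy integral bound (C an absolute constant, N the covering number):
\mathbb{E}_{\varepsilon} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i)
\;\le\; \frac{C}{\sqrt{n}} \int_0^\infty \sqrt{\log N\big(\epsilon, \mathcal{F}, L_2(P_n)\big)}\; d\epsilon.
```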
Week 3: Uniform Central Limit Theorem (Optional)
- Motivation: goodness-of-fit testing, the Kolmogorov-Smirnov statistic
- Uniform central limit theorem, functional delta method
Reference:
- Applications of Empirical Process Theory, Section 4.2
Reading:
- Quantifying the Empirical Wasserstein Distance to a Set of Measures: Beating the Curse of Dimensionality (NeurIPS), 2020
- Gretton A, Borgwardt K M, Rasch M J, et al. A kernel two-sample test. The Journal of Machine Learning Research, 2012, 13(1): 723-773.
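To make the kernel two-sample test concrete, here is a minimal NumPy sketch of the unbiased MMD² statistic from the Gretton et al. paper; the Gaussian kernel, the fixed bandwidth, and the toy data are illustrative choices, and the permutation test used to calibrate the rejection threshold is omitted.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of MMD^2 between samples X (m x d) and Y (n x d)."""
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    # Drop the diagonal terms so the within-sample averages are unbiased.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

# Toy usage: two samples whose means differ slightly.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))
print(mmd2_unbiased(X, Y))
```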
Week 4: Nonparametric Statistics 1: RKHS (Optional)
Reading:
- Dieuleveut A, Bach F. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 2016.
- Fischer S, Steinwart I. Sobolev norm learning rates for regularized least-squares algorithms. Journal of Machine Learning Research, 2020, 21(205): 1-38.
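A minimal NumPy sketch of kernel ridge regression, the regularized least-squares estimator whose learning rates the Fischer-Steinwart paper analyzes; the Gaussian kernel, bandwidth, and regularization level here are arbitrary illustrative choices rather than the tuned values the theory prescribes.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_ridge_fit(X, y, lam=1e-2, bandwidth=0.5):
    """Solve (K + n*lam*I) alpha = y; the estimator is f_hat(x) = sum_i alpha_i k(x, X_i)."""
    n = len(X)
    K = rbf_kernel(X, X, bandwidth)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def kernel_ridge_predict(X_train, alpha, X_test, bandwidth=0.5):
    return rbf_kernel(X_test, X_train, bandwidth) @ alpha

# Toy usage: noisy regression of a smooth function on [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(100)
alpha = kernel_ridge_fit(X, y)
X_test = np.linspace(0, 1, 5).reshape(-1, 1)
print(kernel_ridge_predict(X, alpha, X_test))
```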
Week 5: Minimax Lower Bounds
Reference:
- https://web.stanford.edu/class/ee378c/
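As a pointer to what this week covers, one standard form of Fano's method (the exact version and constants in the EE378C notes may differ slightly):

```latex
% Fano's method: let theta_1, ..., theta_M be 2*delta-separated under a metric d, let V be
% uniform on {1, ..., M}, and let X ~ P_{theta_V}. Then
\inf_{\hat\theta} \sup_{\theta} \mathbb{E}\, d(\hat\theta, \theta)
\;\ge\; \delta \Big( 1 - \frac{I(V; X) + \log 2}{\log M} \Big),
\qquad
I(V; X) \;\le\; \max_{i, j} D_{\mathrm{KL}}\big( P_{\theta_i} \,\|\, P_{\theta_j} \big).
```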
Week 6: Nonparametric Statistics 2: Localized Complexity
Reference:
- Applications of Empirical Process Theory Section 4.3
- Geoffrey Chinot's PhD thesis, Section 1
- Lecture Note 2: Generalization Errors, by Taiji Suzuki (in Japanese)
- Towards Problem-dependent Optimal Learning Rates (More detailed version: Xu Y, Zeevi A. Towards Optimal Problem Dependent Generalization Error Bounds in Statistical Learning Theory. arXiv preprint arXiv:2011.06186, 2020.)
- Rakhlin A, Sridharan K, Tsybakov A B. Empirical entropy, minimax regret and minimax risk. Bernoulli, 2017, 23(2): 789-824.
- Variance-regularized slow rate in the Orthogonal Statistical Learning paper (Foster & Syrgkanis)
Reading:
- 2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization by Vladimir Koltchinskii
- Yang J, Sun S, Roy D M. Fast-rate PAC-Bayes Generalization Bounds via Shifted Rademacher Processes. NeurIPS 2019: 10802-10812.
- Kur G, Putterman E, Rakhlin A. On the Variance, Admissibility, and Stability of Empirical Risk Minimization. Advances in Neural Information Processing Systems, 2024, 36.
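A compressed statement of the localized-complexity mechanism behind this week's readings; the constants and the precise sub-root/Bernstein conditions are spelled out in the Koltchinskii lecture and the Xu-Zeevi paper.

```latex
% Localized Rademacher complexity and its fixed point:
\psi_n(r) \;=\; \mathbb{E}\, \mathcal{R}_n\big( \{ f \in \mathcal{F} : P f^2 \le r \} \big),
\qquad
r^* \;=\; \inf\{ r > 0 : \psi_n(r) \le r \}.
% Under a sub-root condition on psi_n and a Bernstein-type relation between the variance and
% the excess risk, ERM satisfies, with probability at least 1 - e^{-t},
L(\hat f) - L(f^*) \;\lesssim\; r^* + \frac{t}{n}.
```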
Week 8: Nonparametric Statistics 4: Deep Learning
- Neural scaling laws, and explaining them via nonparametric statistics (the classical minimax rate is recalled after this week's list)
Reference:
- Explaining Neural Scaling Laws, by Yasaman Bahri et al. arXiv:2102.06701
- Suzuki T. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. ICLR 2019.
- Chen M, Jiang H, Liao W, et al. Nonparametric regression on low-dimensional manifolds using deep ReLU networks. NeurIPS 2019.
- Singh S, Uppal A, Li B, et al. Nonparametric density estimation under adversarial losses. NeurIPS 2018.
- Local Rademacher complexity of DNNs (lecture notes by Taiji Suzuki): http://ibis.t.u-tokyo.ac.jp/suzuki/lecture/2020/intensive2/%E8%AC%9B%E7%BE%A92.pdf
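For orientation, the classical minimax rate for β-smooth (e.g. Hölder) regression in d dimensions, which is the nonparametric benchmark these scaling-law and deep-network papers compare against:

```latex
% Minimax rate for estimating a beta-smooth regression function on [0,1]^d in squared L_2 loss:
\inf_{\hat f} \sup_{f \in \mathcal{H}^{\beta}([0,1]^d)} \mathbb{E}\, \| \hat f - f \|_{L_2}^2
\;\asymp\; n^{-\frac{2\beta}{2\beta + d}}.
% Read as a power law in the sample size n, with exponent governed by smoothness beta and dimension d.
```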
Week 9: Nonparametric Statistics 5: Classification
- Tsybakov A B. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 2004, 32(1): 135-166.
- Kim Y, Ohn I, Kim D. Fast convergence rates of deep neural networks for classification. arXiv preprint arXiv:1812.03599, 2018. (Uses L-infinity covering numbers.)
- Bos T, Schmidt-Hieber J. Convergence rates of deep ReLU networks for multiclass classification. arXiv preprint arXiv:2108.00969, 2021.
- Audibert J Y, Tsybakov A B. Fast learning rates for plug-in classifiers. The Annals of Statistics, 2007, 35(2): 608-633.
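The fast classification rates in the Tsybakov and Audibert-Tsybakov papers hinge on the margin (low-noise) condition, which in one common form reads:

```latex
% Tsybakov margin condition: with eta(x) = P(Y = 1 | X = x), assume for some C, alpha > 0
\mathbb{P}\big( 0 < |\eta(X) - \tfrac{1}{2}| \le t \big) \;\le\; C\, t^{\alpha}
\quad \text{for all } t > 0.
% Larger alpha means less mass near the decision boundary, and correspondingly faster
% attainable rates for the excess classification risk.
```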
Week 10: Semi-parametric Statistics
Reference:
- [Semi-Para] Sections 3 and 4
- Foster D J, Syrgkanis V. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
- Mackey L, Syrgkanis V, Zadik I. Orthogonal machine learning: Power and limitations. International Conference on Machine Learning, PMLR, 2018: 3375-3383.
- Causal Inference Tutorial by Rahul Singh
- Stats 361: Causal Inference, by Stefan Wager
Reading:
- Knowledge Distillation as Semi-Parametric Inference. ICLR 2021.
- Demirer M, Syrgkanis V, Lewis G, et al. Semi-parametric efficient policy learning with continuous actions. arXiv preprint arXiv:1905.10116, 2019.
- Chernozhukov V, Chetverikov D, Demirer M, et al. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 2018.
- Chernozhukov V, Newey W, Singh R, et al. Adversarial Estimation of Riesz Representers. arXiv preprint arXiv:2101.00009, 2020. https://arxiv.org/pdf/2105.15197.pdf
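A minimal NumPy sketch of cross-fitted double/debiased ML for the partially linear model Y = θ·D + g(X) + ε, D = m(X) + v, in the spirit of the Chernozhukov et al. paper; the k-NN nuisance estimators, the two folds, and the toy data are illustrative simplifications, not the paper's recommended choices.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=10):
    """Simple k-nearest-neighbour regression, used as a stand-in nuisance estimator."""
    dists = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(dists, axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

def dml_plr(Y, D, X, n_folds=2, k=10, seed=0):
    """Cross-fitted residual-on-residual estimate of theta in Y = theta*D + g(X) + eps."""
    n = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    num, den = 0.0, 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        # Nuisances E[Y|X] and E[D|X] are fit on the complementary fold (cross-fitting).
        ell_hat = knn_predict(X[train], Y[train], X[test], k)
        m_hat = knn_predict(X[train], D[train], X[test], k)
        v = D[test] - m_hat
        num += np.sum(v * (Y[test] - ell_hat))
        den += np.sum(v * v)
    return num / den

# Toy usage with true theta = 1.5 and nonlinear nuisances.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
D = np.sin(X[:, 0]) + 0.5 * rng.standard_normal(400)
Y = 1.5 * D + np.cos(X[:, 1]) + 0.5 * rng.standard_normal(400)
print(dml_plr(Y, D, X))
```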
Week 11: Nonparametric Bandits
Reference:
- Foster D, Rakhlin A. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. International Conference on Machine Learning, PMLR, 2020: 3199-3210.
- Hu Y, Kallus N, Mao X. Smooth contextual bandits: Bridging the parametric and non-differentiable regret regimes. Conference on Learning Theory, PMLR, 2020: 2007-2010.
- Philippe Rigollet and Assaf Zeevi. Nonparametric bandits with covariates. Conference on Learning Theory (COLT), page 54, 2010.
- Krishnamurthy S K, Hadad V, Athey S. Adapting to misspecification in contextual bandits with offline regression oracles. arXiv preprint arXiv:2102.13240, 2021.
Reading:
- Wang T, Rudin C. Bandits for BMO Functions. International Conference on Machine Learning, PMLR, 2020: 9996-10006.
- Wang T, Ye W, Geng D, et al. Towards practical Lipschitz stochastic bandits. arXiv preprint arXiv:1901.09277, 2019.
Week 12: Nonparametric Weakly Supervised Learning
- Active learning, disagreement-coefficient-based methods
- Nonparametric transfer learning
Reference:
- Works by Samory Kpotufe
- Cai T T, Wei H. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics, 2021, 49(1): 100-128.
- Minsker S. Plug-in approach to active learning. Journal of Machine Learning Research, 2012, 13(Jan): 67-90.
- Castro R M, Nowak R D. Minimax bounds for active learning. IEEE Transactions on Information Theory, 2008, 54(5): 2339-2353.
Week 13: Empirical Landscape: Uniform Convergence of Gradients
Reference:
- https://github.com/tengyuma/cs229m_notes/blob/main/Winter2021/pdf/02-17-2021.pdf
- https://github.com/tengyuma/cs229m_notes/blob/main/Winter2021/pdf/03-01-2021.pdf
- https://github.com/tengyuma/cs229m_notes/blob/main/Winter2021/pdf/03-03-2021.pdf
- Mei S, Bai Y, Montanari A. The landscape of empirical risk for nonconvex losses. Annals of Statistics, 2018, 46(6A): 2747-2774.
- Foster D J, Sekhari A, Sridharan K. Uniform convergence of gradients for non-convex learning and optimization. arXiv preprint arXiv:1810.11059, 2018.
Week 14: Drawbacks of Empirical Process Theory, Interpolation Estimators
Reference:
- Yang Z, Bai Y, Mei S. Exact gap between generalization error and uniform convergence in random feature models. arXiv preprint arXiv:2103.04554, 2021.
- Average case complexity, reference:
Week 15: Non-P-Donsker Classes
Examples: log-concave densities, convex functions; cases where ERM is not optimal.
References:
- Kur, Gil, Yuval Dagan, and Alexander Rakhlin. “Optimality of maximum likelihood for log-concave density estimation and bounded convex regression.” arXiv preprint arXiv:1903.05315 (2019).
- Kur G, Rakhlin A. On the Minimal Error of Empirical Risk Minimization. arXiv preprint arXiv:2102.12066, 2021.
Week 16: Learning Dynamical Systems
- Hang H, Steinwart I. A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning. The Annals of Statistics, 2017, 45(2): 708-743.
- Foster D, Sarkar T, Rakhlin A. Learning nonlinear dynamical systems from a single trajectory. Learning for Dynamics and Control, PMLR, 2020: 851-861.
- Nickl R, Ray K. Nonparametric statistical inference for drift vector fields of multi-dimensional diffusions. The Annals of Statistics, 2020, 48(3): 1383-1408.
- Hang H, Steinwart I. Fast learning from α-mixing observations. Journal of Multivariate Analysis, 2014, 127: 184-199.
Reading List
- Fusheng Liu, Haizhao Yang, Soufiane Hayou, and Qianxiao Li. Connecting optimization and generalization via gradient flow path length. arXiv preprint arXiv:2202.10670, 2022.
- Adaptive learning rates for support vector machines working on data with low intrinsic dimension
- Which Spaces can be Embedded in Reproducing Kernel Hilbert Spaces?