2022 Reading Group

7 minute read


Aim: rate of learning is a function space with uniform convergence property(As an example: Holder functions) in different settings.

Textbook and Reference


Week 1: Concentration Inequality and Basic Asymptotic Statistics

  • Basic Concentration Inequality for Sub-gaussian/Sub-exp random variables
  • Asymptotic normality, delta method, moment method
  • Efficiency of the models, local asymptotic normality Reference:
  • [HDP] Chapter 1
  • [AS] Chapter 2 3 4 5
  • [Semi-Para] Chapter 2 and Chapter 3 Reading:
  • Han J, Hu R, Long J. A class of dimension-free metrics for the convergence of empirical measures. Stochastic Processes and their Applications, 2023, 164: 242-287.

Week 2: Uniform Law of Large Number

  • Symmetrization
  • Covering Number, Duley Integral Reference:
  • [WCEP] Chapter 2

Week3: Uniform Center Limit Theorem(Optional)

  • Motivation: Goodness-of-fit test, Kolmogorov and Smironv Statistics
  • Uniform Center Limit Theorem, functional delta methods Reference:
  • Applications of Empirical Process Theory Section 4.2 Reading:
  • Quantifying the Empirical Wasserstein Distance to a Set of Measures: Beating the Curse of Dimensionality (NeurIPS), 2020
  • Gretton A, Borgwardt K M, Rasch M J, et al. A kernel two-sample test. The Journal of Machine Learning Research, 2012, 13(1): 723-773.

Week4: Nonparametric Statistics 1: RKHS(Optional)


  • Dieuleveut A, Bach F. Nonparametric stochastic approximation with large step-sizes
  • Fischer S, Steinwart I. Sobolev norm learning rates for regularized least-squares algorithms. Journal of Machine Learning Research, 2020, 21(205): 1-38.

Week5: Min-Max Lower Bound


  • https://web.stanford.edu/class/ee378c/

Week6: Nonparametric Statistics 2: Localized Complexity


  • Applications of Empirical Process Theory Section 4.3
  • geoffrey chinot‘s thesis : section 1
  • Lecture Note 2: Generalization Errors by Taiji Suzuki(In Japaness)
  • Towards Problem-dependent Optimal Learning Rates (More detailed version: Xu Y, Zeevi A. Towards Optimal Problem Dependent Generalization Error Bounds in Statistical Learning Theory. arXiv preprint arXiv:2011.06186, 2020.)
  • Rakhlin A, Sridharan K, Tsybakov A B. Empirical entropy, minimax regret and minimax risk. Bernoulli, 2017, 23(2): 789-824.
  • Variance Regularized Slow rate in [Orthogonal Statistical Learning paper] Reading:
  • 2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization by Vladimir Koltchinskii
  • Yang J, Sun S, Roy D M. Fast-rate PAC-Bayes Generalization Bounds via Shifted Rademacher Processes NeurIPS. 2019: 10802-10812.
  • Rakhlin A, Sridharan K, Tsybakov A B. Empirical entropy, minimax regret and minimax risk. Bernoulli, 2017, 23(2): 789-824.
  • Kur G, Putterman E, Rakhlin A. On the Variance, Admissibility, and Stability of Empirical Risk Minimization. Advances in Neural Information Processing Systems, 2024, 36.

Week8: Non-parametric Statistics 4: Deep Learning

  • Neural Scaling Laws and explaining Neural Scaling Laws by non-parametric statistics Reference:
  • Explaining Neural Scaling Laws by Yasaman Bahr et al. arxiv:2102.06701
  • Suzuki T. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality ICLR2019
  • Chen M, Jiang H, Liao W, et al. Nonparametric regression on low-dimensional manifolds using deep relu networks Neurips 2019
  • Singh S, Uppal A, Li B, et al. Nonparametric density estimation under adversarial losses Neurips 2018
  • local radamencher complexity of DNN: http://ibis.t.u-tokyo.ac.jp/suzuki/lecture/2020/intensive2/%E8%AC%9B%E7%BE%A92.pdf

Week 9: Non-parametric Statistics 5: Classification

  • Tsybakov A B. Optimal aggregation of classifiers in statistical learning[J]. The Annals of Statistics, 2004, 32(1): 135-166.
  • Kim Y, Ohn I, Kim D. Fast convergence rates of deep neural networks for classification[J]. arXiv preprint arXiv:1812.03599, 2018. Using Inf covering number
  • Bos T, Schmidt-Hieber J. Convergence rates of deep ReLU networks for multiclass classification. arXiv preprint arXiv:2108.00969, 2021.
  • Audibert J Y, Tsybakov A B. Fast learning rates for plug-in classifiers[J]. The Annals of statistics, 2007, 35(2): 608-633.

Week9: Semi-parametric Statistics


  • [Semi-Para] section 3 4
  • Foster D J, Syrgkanis V. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
  • Mackey L, Syrgkanis V, Zadik I. Orthogonal machine learning: Power and limitations International Conference on Machine Learning. PMLR, 2018: 3375-3383.
    • Causal Inference Tutorial by Rahul Singh
  • Stats361: Causal Inference by Stefan Wager Reading:
  • Knowledge Distillation as Semi-Parametric Inference ICLR 2021
  • Demirer M, Syrgkanis V, Lewis G, et al. Semi-parametric efficient policy learning with continuous actions. arXiv preprint arXiv:1905.10116, 2019.
  • Chernozhukov V, Chetverikov D, Demirer M, et al. Double/debiased machine learning for treatment and structural parameters. 2018. Reading:
  • Chernozhukov V, Newey W, Singh R, et al. Adversarial Estimation of Riesz Representers. arXiv preprint arXiv:2101.00009, 2020. https://arxiv.org/pdf/2105.15197.pdf

Week11: Non-parametric Bandits


  • Foster D, Rakhlin A. Beyond UCB: Optimal and efficient contextual bandits with regression oracles International Conference on Machine Learning. PMLR, 2020: 3199-3210.
  • Hu Y, Kallus N, Mao X. Smooth contextual bandits: Bridging the parametric and non-differentiable regret regimes Conference on Learning Theory. PMLR, 2020: 2007-2010.
  • Philippe Rigollet and Assaf Zeevi. Nonparametric bandits with covariates. Conference on Learning Theory (COLT), page 54, 2010.
  • Krishnamurthy S K, Hadad V, Athey S. Adapting to misspecification in contextual bandits with offline regression oracles[J]. arXiv preprint arXiv:2102.13240, 2021. Reading
  • Wang T, Rudin C. Bandits for BMO Functions International Conference on Machine Learning. PMLR, 2020: 9996-10006.
  • Wang T, Ye W, Geng D, et al. Towards practical Lipschitz stochastic bandits. arXiv preprint arXiv:1901.09277, 2019.

Week12: Non-parametric Weakly Supervised Learning

  • Active Learning, Disagreement factor-based methods
  • Non-parametric transfer learning Reference:
  • Works by Samory Kpotufe
  • Cai T T, Wei H. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics, 2021, 49(1): 100-128.
  • Stanislav Minsker. Plug-in approach to active learning. Journal of Machine Learning Research, 13 (Jan):67–90, 2012a.
  • Castro R M, Nowak R D. Minimax bounds for active learning[J]. IEEE Transactions on Information Theory, 2008, 54(5): 2339-2353.

Week13: Empirical Landscape: uniform convergence of gradients


  • https://github.com/tengyuma/cs229m_notes/blob/main/Winter2021/pdf/02-17-2021.pdf
  • https://github.com/tengyuma/cs229m_notes/blob/main/Winter2021/pdf/03-01-2021.pdf
  • https://github.com/tengyuma/cs229m_notes/blob/main/Winter2021/pdf/03-03-2021.pdf
  • Mei S, Bai Y, Montanari A. The landscape of empirical risk for nonconvex losses. Annals of Statistics, 2018, 46(6A): 2747-2774.
  • Foster D J, Sekhari A, Sridharan K. Uniform convergence of gradients for non-convex learning and optimization[J]. arXiv preprint arXiv:1810.11059, 2018.

Week 14: Drawback Of Empirical Process, interpolation estimators


  • Yang Z, Bai Y, Mei S. Exact gap between generalization error and uniform convergence in random feature models[J]. arXiv preprint arXiv:2103.04554, 2021.
  • Average case complexity, reference:

Week 15: non-P-Donscker

Example: log-concave densities, convex functions. Cases that ERM is not optimal References:

  • Kur, Gil, Yuval Dagan, and Alexander Rakhlin. “Optimality of maximum likelihood for log-concave density estimation and bounded convex regression.” arXiv preprint arXiv:1903.05315 (2019).
  • Kur G, Rakhlin A. On the Minimal Error of Empirical Risk Minimization. arXiv preprint arXiv:2102.12066, 2021.

Week16 Learning Dynamic systems

  • Hang H, Steinwart I. A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning[J]. The Annals of Statistics, 2017, 45(2): 708-743.
  • Foster D, Sarkar T, Rakhlin A. Learning nonlinear dynamical systems from a single trajectory Learning for Dynamics and Control. PMLR, 2020: 851-861.
  • Nickl R, Ray K. Nonparametric statistical inference for drift vector fields of multi-dimensional diffusions. The Annals of Statistics, 2020, 48(3): 1383-1408.
  • Hang H, Steinwart I. Fast learning from α-mixing observations[J]. Journal of Multivariate Analysis, 2014, 127: 184-199.

Reading List