Advanced Data Analysis from an Elementary Point of View

by Cosma Rohilla Shalizi

This is a draft textbook on data analysis methods, intended for a one-semester course for advanced undergraduate students who have already taken classes in probability, mathematical statistics, and linear regression. It began as the lecture notes for 36-402 at Carnegie Mellon University.

By making this draft generally available, I am not promising to provide any assistance or even clarification whatsoever. Comments are, however, generally welcome.

The book is under contract to Cambridge University Press; it should be turned over to the press when I can manage (earlier targets, running from the end of 2013 through 2019, have all lapsed). A copy of the next-to-final version will remain freely accessible here permanently.

What you're probably looking for

Complete draft in PDF

Directory of chapter-by-chapter R files for examples

Directory of data sets used in examples

Table of contents


    I. Regression and Its Generalizations
  1. Regression Basics
  2. The Truth about Linear Regression
  3. Model Evaluation
  4. Smoothing in Regression
  5. Simulation
  6. The Bootstrap
  7. Splines
  8. Additive Models
  9. Testing Regression Specifications
  10. Weighting and Variance
  11. Logistic Regression
  12. Generalized Linear Models and Generalized Additive Models
  13. Classification and Regression Trees
    II. Distributions and Latent Structure
  14. Density Estimation
  15. Principal Components Analysis
  16. Factor Models
  17. Mixture Models
  18. Graphical Models
    III. Causal Inference
  19. Graphical Causal Models
  20. Identifying Causal Effects
  21. Estimating Causal Effects
  22. Discovering Causal Structure
    IV. Dependent Data
  23. Time Series
  24. Simulation-Based Inference
    Online-only Appendices
    • Big O and Little o Notation
    • Taylor Expansions
    • Propagation of Error, and Standard Errors for Derived Quantities
    • Optimization
    • Relative Distributions and Smooth Tests of Goodness of Fit
    • Nonlinear Dimensionality Reduction
    • Rudimentary Graph Theory
    • Missing Data
    • Writing R Functions

    Data-Analysis Assignments

Planned changes

  • Remove redundant versions of the data-analysis assignments; provide solutions as a separate document through publisher
  • Unified treatment of information theory as an appendix
  • Improved (=correct) treatment of nonparametric instrumental variables
  • Trim time-series chapter so it's less of a catalog of everything that might be useful
  • Break out stuff on heuristic essential asymptotics as a separate appendix
  • Make sure notation is consistent throughout: insist that vectors are always matrices, or use more geometric notation?
  • Figure out how to cut at least 50 pages
  • Index: currently (8 February 2025) done for most chapters but not proofed or unified, and possibly missing for some sections... Revisit to standardize terms and limit levels of hierarchy

(Text last updated 8 February 2025; this page last updated 15 January 2024)
