List of Skills Taught:
SAS
Interface
- Introduction to the SAS interface
Basics
- Comments in SAS
- Permanent & temporary folders
- Introduction to data steps
- Introduction to procedures
- Viewing and printing data
- Basic functions & operators
- If-then statements
- Creating indicator variables
- Keep & drop statements
- Where statements
- Missing data
- Date basics
- Time intervals
- Sorting data
- Duplicate observations
- Creating lag and lead variables
- Creating count variables
- Creating sub-samples
- Stacking data
- Merging with a data step
- Merging with Proc SQL
- Merging and dates
- Summarizing data using Proc Means
- Summarizing data using Proc SQL
- Calculating buy-and-hold raw returns
- Ranking variables
- Creating groups or portfolios
- Regrouping across time
- Summarizing data with groups or portfolios
Outliers
- Winsorizing & trimming
- Robust regression
MACRO Basics
- Creating MACRO variables
- Creating and running MACROS
- MACRO repositories
WRDS
- Remote connecting to WRDS
- Running code at WRDS
- Pulling Compustat data
- Pulling CRSP data
- Merging Compustat and CRSP
- Uploading & downloading data
Common Statistics Procedures
- Proc Means, Univariate, Corr, Ttest, and Npar1way
- Proc reg (along with fixed effects, clustered errors, predicted and residual values)
- Proc surveyreg, logistic, surveylogistic, genmod, qlim, and robustreg
Python
Basics
- Setting up Python
- Anaconda installation
- Common libraries (pandas, numpy)
- Basic commands
- Basic markdown
- Reading in, importing, exporting data
- Data types, dataframes and data configuration
- Merging data, stacking data
- Subsetting, dropping data
- Cleaning data, missing values, backfilling, winsorize/truncate
- Exploring data, descriptive statistics
- Variables: creating variables, dropping variables, leads and lags, dealing with dates, ranked variables
- Grouping, frequency tables
- Basic regression
WRDS
- Installing the WRDS library
- Connecting to WRDS via Python
- Employing SQL within Python
- Acquiring Compustat data
- Creating common financial statement variables
- Acquiring CRSP data
- Calculating cumulative and buy-and-hold abnormal returns
- Merging Compustat and CRSP
Web scraping
- Introduction to HTML structure
- Identifying HTML xpaths
- Scraping with scrapy
- Splitting HTML as text
- Pandas read_html
- Interacting with a web browser using Selenium (sending text to search boxes, clicking buttons, extracting page source)
- Scraping EDGAR
- EDGAR file management and master IDX files
- Introduction to EDGAR XBRL data
- Scraping Yahoo! Finance
Robotic process automation
- Obtaining and validating user input
- Automating the keyboard (basic keyboard functions, special keys)
- Automating the mouse (position, movement, clicking, scrolling)
- Basics of operating system file management
- Deleting files
- Copying files
- Moving files
- Listing files in a directory
- Creating folders
Textual analysis
- Introduction to natural language processing
- Text functions (replace, strip, upper, lower, count, join, find, startswith, endswith)
- Introduction to regular expressions (match, search, split, findall, sub, groups)
- Tokenization (word_tokenize, sent_tokenize, regexp_tokenize)
- Text pre-processing (stemming, stop words, punctuation)
- Counting words
- Calculating disclosure tone
- Calculating the Fog Index
Advanced Python
- Machine learning
- Intro to machine learning
- What is machine learning?
- Types of machine learning models: supervised vs. unsupervised
- Applications of machine learning in practice
- Applications of machine learning in academic research
- The basic machine learning process
- Logistic regression
- Introduce machine learning concepts for predicting binary dependent variables
- Predict default using financial statement and stock price data
- True negatives, true positives, false negatives, false positives
- Accuracy vs. sensitivity vs. specificity
- statsmodels vs. scikit-learn in Python
- Exercise: predict employee attrition
- Linear regression
- Introduce machine learning concepts for continuous dependent variables
- Predict CDS spreads using financial statement and stock price data
- Evaluating the prediction
- Exercise: predict future revenue growth
- Random forest classification
- Introduction to decision trees
- From decision trees to random forest (including a discussion of overfitting)
- RandomForestClassifier from scikit-learn
- Feature importance
- Exercise: predict employee attrition
- Random forest regression
- Difference between random forest classification and random forest regression
- RandomForestRegressor from scikit-learn
- Exercise: predict future revenue growth
- Neural network classification
- What are neural networks? (e.g., input layer, hidden layers, output layer)
- MLPClassifier from scikit-learn
- Standardizing input variables
- Permutation feature importance
- Exercise: predict banking customer churn
- Neural network regression
- Difference between neural network classification and neural network regression
- MLPRegressor from scikit-learn
- Exercise: predict housing prices
- Support vector classification
- The basics of support vector machines (SVM) (e.g., hyperplanes, data separation, margin, kernels, decision boundaries, cost parameter, support vectors)
- SVC from scikit-learn
- Standardizing input variables
- Exercise: predict banking customer churn
- Support vector regression
- Difference between support vector classification and support vector regression
- SVR from scikit-learn
- Cherkassky and Ma (2004) method for obtaining epsilon and C parameters
- Exercise: predict housing prices
- Cross validation and grid search
- What is cross validation?
- Holdout cross validation
- K-fold cross validation
- Data independence in K-fold cross validation
- Rolling window cross validation for panel data
- Hyperparameter tuning with grid search
- Exercise: predict default using cross validation techniques
- Machine learning and textual analysis introduction
- Machine learning with a large number of independent variables (e.g., words or ngrams)
- CountVectorizer from sklearn
- Stop words
- Stemming using the Porter Stemmer algorithm
- Creating matrices for use in the machine learning models
- Feature importance
- Exercise: predict restaurant review ratings based on comments
- Machine learning and textual analysis – TF-IDF
- An introduction to term frequency-inverse document frequency (tf-idf)
- TfidfVectorizer from scikit-learn
- Exercise: predict restaurant review ratings based on comments using tf-idf
- Machine learning and textual analysis – Creating customized vectorizers
- Limitations with built-in vectorizers
- How to create your own customizable vectorizer
- Sparse data
- Running textual analysis machine learning models on documents stored in your file system
- Exercise: predict restaurant review ratings based on a customized vectorizer
- Latent Dirichlet Allocation
- Unsupervised machine learning model
- What is Latent Dirichlet Allocation (LDA) and how does it work?
- gensim
- Exercise: Implement LDA on a set of earnings conference call transcripts
- Machine learning project
- Use the machine learning models taught in the course to predict daily abnormal stock returns using conference call transcripts
- ChatGPT
- Introduction to ChatGPT
- What is ChatGPT and why you should use it in coding
- An overview with examples of how to use ChatGPT to enhance your coding (e.g., to write, optimize, and debug code)
- Use ChatGPT to write code
- Step by step guide to writing prompts and interacting with responses to write code
- Example: We ask ChatGPT to write code to extract Apple’s net income from its 2022 10-K on the SEC’s EDGAR site.
- Exercise: Use ChatGPT to write code to extract the tone of words spoken by analysts on earnings conference calls.
- Use ChatGPT to optimize code
- Multiple examples of inefficient code with prompts to ChatGPT to optimize and remove inefficiencies.
- Exercise: Write code to answer basic coding questions. Then use ChatGPT to remove any inefficiencies in your code.
- Use ChatGPT to debug code
- Multiple examples of coding errors with prompts to ChatGPT to debug and remove these errors.
- Integrated Development Environments
- Introduction to Integrated Development Environments (IDEs)
- Python IDLE
- Examples of IDEs for Python
- Reasons for using a more sophisticated IDE
- Visual Studio Code
- Overview of the Visual Studio Code environment
- Explorer Window
- Extensions
- Color Themes
- Command Palette
- Commands and features (e.g., IntelliSense, multiple cursors, clicking functions, warnings, errors)
- Debugging
- Jupyter Notebook Extension
- Markdown vs. code cells
- Debugging within cells
- Keyboard shortcuts and video tutorials
- PyCharm
- PyCharm Installation
- Overview of the PyCharm Environment
- Projects explorer
- Debugging
- Find Action
- Color Themes
- Commands and features (e.g., code completion, multiple cursors, clicking functions, wrapping code, extract method)
- Local history to restore deleted code
STATA
Basics
- Understanding the STATA environment
- Reading files from SAS and Python
- Viewing a dataset
- Generating variables
- Replacing variables
- Using local variables
- Keeping and dropping variables
- Sorting data
- Using for loops
Model Output
- Running basic regression models
- Creating Excel output tables using outreg2
- Modifying Excel output table options
- Adding statistics to Excel output tables (Adj. R2, Pseudo R2)
- Appending additional columns
- Adding column headers
- Excluding variables from Excel output tables