Chapter 2 Introduction

2.1 Structural Estimation and Counterfactual Analysis

2.1.1 Example

  • Igami (2017) “Estimating the Innovator’s Dilemma: Structural Analysis of Creative Destruction in the Hard Disk Drive Industry, 1981-1998”.

  • Question:

    • Does “Innovator’s Dilemma” (Christensen, 1997) or the delay of innovation among incumbents exist?
    • Christensen argued that old winners tend to lag behind entrants even when introducing a new technology is not too difficult, with a case study of the HDD industry.
  • Apple’s smartphone vs. Nokia’s feature phones.

  • Amazon vs. Borders.

  • Kodak’s digital camera.

  • If it exists, what is the reason for that?

  • How do we empirically answer this question?

Figure 1 of Igam (2017)

Figure 2.1: Figure 1 of Igam (2017)

  • Hypotheses:

  • Identify potentially competing hypotheses to explain the phenomenon.

    1. Cannibalization: Because of cannibalization, the benefits of introducing a new product are smaller for incumbents than for entrants.
    2. Different costs: The incumbents may have higher costs for innovation due to the organizational inertia, but at the same time they may have some cost advantage due to accumulated R&D and better financial access.
    3. Preemption: The incumbents have additional incentive for innovation to preempt potential rivals.
    4. Institutional environment: The impacts of the three components differ across different institutional contexts such as the rules governing patents and market size.
  • Casual empiricists pick up their favorite factors to make up a story.

  • Serious empiricists should try to separate the contributions of each factor from data.

  • To do so, the author develops an economic model that explicitly incorporates the above mentioned factors, while keeping the model parameters flexible enough to let the data tell the sign and size of the effects of each factor on innovation.

  • Economic model:

  • The time is discrete with finite horizon \(t = 1, \cdots, T\).

  • In each year, there is a finite number of firms indexed by \(i\).

  • Each firm is in one of the technological states: \[\begin{equation} s_{it} \in \{\text{old only, both, new only, potential entrant}\}, \end{equation}\] where the first two states are for incumbents (stick to the old technology or start using the new technology) and the last two states are for actual and potential entrants (enter with the new technology or stay outside the market).

  • In each year:

    • Pre-innovation incumbent (\(s_{it} =\) old): exit or innovate by paying a sunk cost \(\kappa^{inc}\) (to be \(s_{i, t + 1} =\) both).
    • Post-innovation incumbent (\(s_{it} =\) both): exit or stay to be both.
    • Potential entrant (\(s_{it} =\) potential entrant): give up entry or enter with the new technology by paying a sunk cost \(\kappa^{net}\) (to be \(s_{i, t + 1} =\) new).
    • Actual entrant (\(s_{it} =\) new): exit or stay to be new.
  • Given the industry state \(s_t = \{s_{it}\}_i\), the product market competition opens and the profit of firm \(i\), \(\pi_t(s_{it}, s_{-it})\), is realized for each active firm.

  • As the product market competition closes:

    • Pre-innovation incumbents draw private cost shocks and make decisions: \(a_t^{pre}\).
    • Observing this, post-innovation incumbents draw private cost shocks and make decisions: \(a_t^{post}\).
    • Observing this, actual entrants draw private cost shocks and make decisions: \(a_t^{act}\).
    • Observing this, potential entrants draw private cost shocks and make decisions: \(a_t^{pot}\).
  • This is a dynamic game. The equilibrium is defined by the concept of Markov-perfect equilibrium (Maskin & Tirole, 1988b).

  • The representation of the competing theories in the model:

    • The existence of cannibalization is represented by the assumption that an incumbent maximizes the joint profits of old and new technology products.
    • The size of cannibalization is captured by the shape of profit function.
    • The difference in the cost of innovation is captured by the difference in the sunk costs of innovation.
    • The preemptive incentive for incumbents are embodied in the dynamic optimization problem for each incumbent.
  • Econometric model:

  • The author then turns the economic model into an econometric model.

  • This amounts to specify which part of the economic model is observed/known and which part is unobserved/unknown.

  • The author collects the data set of the HDD industry during 1977-99.

  • Based on the data, the author specify the identities of active firms and their products and the technologies embodied in the products in each year to code their state variables.

  • Moreover, by tracking the change in the state, the author code their action variables.

  • Thus, the state and action variables, \(s_t\) and \(a_t\). These are the observables.

  • The author does not observe:

    • Profit function \(\pi_t(\cdot)\).
    • Sunk cost of innovation for pre-innovation incumbents \(\kappa^{inc}\).
    • Sunk cost of entry for potential entrants \(\kappa^{net}\).
    • Private cost shocks.
  • These are the unobservables.

  • Among the unobservables, the profit function and sunk costs are the parameter of interets and the private cost shocks are nuissance parameters in the sense only the knowledge about the distribution of the latter is demanded.

  • Identification:

  • Can we infer the unobservables from the observables and the restrictions on the distribution of observable by the economic theory?

  • The profit function is identified from estimating the demand function for each firm’s product, and estimating the cost function for each firm from using their price setting behavior.

  • The sunk costs of innovation are identified from the conditional probability of innovation across various states. If the cost is low, the probability should be high.

  • Estimation:

    • The identification established that in principle we can uncover the parameters of interests from observables under the restrictions of economic theory.
  • Finally, we apply a statistical method to the econometric model and infer the parameters of interest.

  • Counterfactual analysis:

    • If we can uncover the parameters of interest, we can conduct comparative statics: study the change in the endogenous variables when the exogenous variables including the model parameters are set different. In the current framework, this exercise is often called the counterfactual analysis.
  • What if there was no cannibalization?:

    • An incumbents separately maximizes the profit from old technology and new technology instead of jointly maximizing the profits. Solve the model under this new assumption everything else being equal.
  • Free of cannibalization concerns, 8.95 incumbents start producing new HDDs in the first 10 years, compared with 6.30 in the baseline.

  • The cumulative numbers of innovators among incumbents and entrants differ only by 2.8 compared with 6.45 in the baseline.

  • Thus cannibalization can explain a significant part of the incumbent-entrant innovation gap.

  • What if there was no preemption?:

    • A potential entrant ignores the incumbents’ innovations upon making entry decisions.
      • Without the preemptive motives, only 6.02 incumbents would innovate in the first 10 7ears, compared with 6.30 in the baseline.
      • The cumulative incumbent-entrant innovation gap widen to 8.91 compared with 6.45 in the baseline.
  • The sunk cost of entry is smaller for incumbents than for entrants in the baseline.

  • Interpretations and policy/managerial implication:

  • Despite the cost advantage and the preemptive motives, the speed of innovation is slower among incumbents due to the strong cannibalization effect.

  • Incumbents that attempt to avoid the “innovator’s dilemma” should separate the decision makings between old and new sections inside the organization so that it can avoid the concern for cannibalization.

2.1.2 Recap

  • The structural approach in empirical industrial organization consists of the following components:
  1. Research question.
  2. Competing hypotheses.
  3. Economic model.
  4. Econometric model
  5. Identification.
  6. Data collection.
  7. Data cleaning.
  8. Estimation.
  9. Counterfactual analysis.
  10. Coding.
  11. Interpretations and policy/managerial implications.
  • The goal of this course is to be familiar with the standard methodology to complete this process.
  • The methodology covered in this class is mostly developed to analyze the standard framework to dynamic or oligopoly competition.
  • The policy implications are centered around competition policies.
  • But the basic idea can be extend to different class of situations such as auction, matching, voting, contract, marketing, and so on.
  • Note that the depth of the research question and the relevance of the policy/managerial implications are the most important part of the research.
  • Focusing on the methodology in this class is to minimize the time to allocate to less important issues and maximize the attention and time to the most valuable part in the future research.
  • Given a research question, what kind of data is necessary to answer the question?
  • Given data, what kind of research questions can you address? Which question can be credibly answered? Which question can be an over-stretch?
  • Given a research question and data, what is the best way to answer the question? What type of problem can you avoid using the method? What is the limitation of your approach? How will you defend the possible referee comments?
  • Given a result, what kinds of interpretation can you credibly derive? What kinds of interpretation can be contested by potential opponents? What kinds of contribution can you claim?
  • To address these issues is necessary to publish a paper and it is necessary to be familiar with the methodology to do so.

2.1.3 Historical Remark

  • The words reduced-form and structural-form date back to the literature of estimation of simultaneous equations in macroeconomics (Hsiao, 1983).
  • Let \(y\) be the vector of observed endogenous variables, \(x\) be the vector of observed exogenous variables, and \(\epsilon\) be the vector of unobserved exogenous variables.
  • The equilibrium condition for \(y\) on \(x\) and \(\epsilon\) is often written as: \[\begin{equation} Ay + Bx = \Sigma \epsilon. \tag{2.1} \end{equation}\]
  • These equations implicitly determine the vector of endogenous variables \(y\) .
  • If \(A\) is invertible, we can solve the equations for \(y\) to obtain: \[\begin{equation} y = - A^{-1} B x + A^{-1} \Sigma \epsilon. \tag{2.2} \end{equation}\]
  • These equations explicitly determine the vector of endogenous variables \(y\).
  • Equation (2.1) is the structural-form and (2.2) is the reduced-form.
  • If \(y\) and \(x\) are observed and \(x\) is of full column rank, then \(A^{-1}B\) and \(A^{-1} \Sigma A^{-1}\) will be estimated by regression for (2.2). But this does not mean that \(A, B\) and \(\Sigma\) are separately estimated.
  • This was the traditional identification problems.
  • Thus, reduced-form does not mean either of:
    • Regression analysis;
    • Statistical analysis free from economic assumptions.
  • Recent development in this line of literature of identification is found in Matzkin (2007).
  • In econometrics, the idea of imposing restrictions from economic theories seems to have been formalized by the work of Manski (1994) and Matzkin (1994).

2.2 Setting Up The Environment

  • Assume that R, RStudio and LaTex are all installed in the local computer.

2.2.1 RStudio Project

  • The assignments should be conducted inside a project folder for this course.
  • File > New Project...> New Directory > New Directory > R Package using RcppEigen.
  • Name the directory EmpiricalIO and place in your favorite location.
  • You can open this project from the upper right menu of RStudio or by double clicking the EmpiricalIO.Rproj file in the EmpiricalIO directory.
  • This navigates you to the root directory of the project.
  • In the root directory, make folders named:
    • assignment.
    • input.
    • output.
    • figuretable.
  • We will store R functions in R folder, C/C++ functions in src folder, and data in input folder, data generated from the code in output, and figures and tables in figurtable folder.
  • Open src/Makevars and erase the content. Then, write: PKG_CPPFLAGS = -w -std=c++11 -O3
  • Open src/ and erase the content. Then, write: PKG_CPPFLAGS = -w -std=c++11

2.2.2 Basic Programming in R

  • File > New File > R Script to open Untitled file.
  • Ctrl (Cmd) + S to save it with test.R in assignment folder.
  • In the console, type and push enter:
## [1] 2
##  [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
## [21] 120 121 122 123 124 125 126 127 128 129 130
  • This is the interactive way of using R functionalities.

  • In test.R, write:

  • Then, save the file and push Run.

  • Alternatively, place the mouse over the 1 + 1 line in test.R file.

  • Then, Ctrl (Cmd) + Enter to run the line.

  • In this way, we can write procedures in the file and send to the console to run.

  • There are functions to conduct basic calculations:

## [1] 3
## [1] 6
## [1] 3
## [1] 3
## [1] 8
  • We can define objects and assign values to them.
## [1] 1
## [1] 3
  • In addition to scalar object, we can define a vector by:
## [1]  2  3  4  5  6  7  8  9 10
##  [1]  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
## [1]  2  3  5  9 10
## [1] 1 3 5 7 9
  • seq is a function with initial value, end values, and the increment value.

  • By typing seq in the help, we can read the manual page of the function.

  • seq {base} means that this function is named seq and is contained in the library called base.

  • Some libraries are automatically called when the R is launched, but some are not.

  • Some libraries are even not installed.

  • We can install a library from a repository called CRAN.

  • To use the package, we have to load by:

  • Use qplot function in ggplot2 library to draw a scatter plot.

## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

- Instead of loading a package by library, you can directly call it as:

  • We can write own functions.
## [1] 5
## [1] 6
## [1] 32
## [1] 41
  • We can set.seed to obtain the same realization of random variables.
## [1] 30
## [1] 30
  • When a variable used in a function is not given as its argument, the function calls the variable in the global environment:
## [1] 2
## [1] 3
  • However, you should NOT do this. All variables used in a function should be given as its arguments:
## [1] 5
## [1] 6
  • The best practice is to use findGlobals function in codetools to check global variables in a funciton:
## [1] "{"      "+"      "return" "y"
## [1] "{"      "+"      "return"
  • This function returns the list of global variables used in a function. If this returns a global variable other than the system global gariables, you should include it as the argument of the function.

  • You can write the functions in the files with executing codes.

  • But I recommend you to separate files for writing functions and executing codes.

  • File > New File > R Rcript and name it as functions.R and save to R folder.

  • Cut the function you wrote and paste it in functions.R.

  • There are two ways of calling a function in functions.R from test.R.

  • One way is to use source function.

  • When this line is read, the codes in the file are executed.

  • The other way is to bundle functions as a package and load it.

  • Choose Build > Clean and Rebuild.

  • This compiles files in src folder and bundle functions in R folder and build a package named ECON6120I.

  • Now, the functions in R folder and src folder can be used by loading the package by:

  • Best practice:

    1. Write functions in the scratch file.
    2. As the functions are tested, move them to R/functions.R.
    3. Clean and rebuild and load them as a package.

2.2.3 Reproducible Reports using Rmarkdown

  • Reporting in empirical studies involves:
  1. Writing texts;
  2. Writing formulas;
  3. Writing and implementing programs;
  4. Demonstrating the results with figures and tables.
  • Moreover, this has to be done in a reproducible manner: Whoever can reproduce the output from the scratch.

  • “Whoever” includes yourself in the future. Because the revision process of structural papers is usually lengthy, you often have to remember the content few weeks or few months later. It is inefficient if you cannot recall what you have done.

  • We use Rmarkdown to achieve this goal.

  • This assumes that you have LaTex installed.

  • Install package Rmarkdown:

  • File > New File > R Markdown... > HTML with title Test.

  • Save it in assignment folder with name test.Rmd.

  • From Knit tab, choose Knit to HTML.

  • This outputs the content to html file.

  • You can also choose Knit to PDF from Knit tab to obtain output in pdf file.

  • Reports should be knit to pdf or html to submit.

  • Refer to the help page for further information.

2.2.4 Coding rule

  • Use verb in the file name, section name, and function name
    • 8_entry_model to 8_estimate_entry_model
    • # travel frequency to adjecent mesh to # make travel frequency to adjecent mesh
  • Do not use . in the file, object, and variable names. Use _ instead.
    • _ works in C/C++ but . does not. Write object names robust to translation into C/C++.
  • Use Ctrl+Shift+R to make a section.
    • By doing so, the table of contents are automatically generated in RStudio.
    • When making a subsection, add one # and remove one -.
  • Put a space after ,, ), } and so on. Put spaces before and after =.
  • Break lines when there are more than one arguments in a function-like:
gpu_output <-
  foreach (
    i = 1:nrow(gpu),
    .combine = "rbind"
    ) %dopar% {
    model_i <- 
            gpu[i, ] %>% 
    release_i <- 
            gpu[i, ] %>% 
    time_i <- paste(release_i, current_date)
    trend_i <- 
        keyword = model_i,
        time = time_i
    trend_i <- trend_i$interest_over_time
  • Call a function in a library in dplyr::mutate format.
    • To avoid name space conflictions.
    • To make the dependence on library explicit.
    • To make it compatible with a multi-node parallelization.
    • Exceptions are magrittr for %>% and foreach for %do% and %dopar%, and packages for reporting such as ggplot2, modelsummary, etc.
  • Avoid using an upper case letter in the function name and variable names.
    • Exceptions are a single letter variable for constant and matrix/dataframe to make the notations the same with the model.
  • Pipe %>% should be used only for a sequence of operations comprising a single meaningful task.
    • For example, if you drop rows including NA and/or outliers and then construct some new variables, you should separate these two tasks, and should not connect with %>%.
  • When we join data frames, only use dplyr::left_join. Never use other joins such as dplyr::full_join and dplyr::inner_join.
    • You have a list of observations on the left data frame. We never change this list by a join operation.
    • If you need to join and drop rows with some NA, do this explicitly by dplyr::left_join followed by dplyr::filter.
    • No operation that mystifies changes the list of observations is allowed.
  • When obtaining multiple results, standardize the format of the result and hold them as elements of a list. Write a function applicable to the standard and happy to each element of the list by purrr::map.
    • For example, when running multiple regressions, prepare a list of formulas and pass it to purrrr::map to run multiple regressions and obtain the results in another list.
  • The file names of the same category should have the same format.
    • For example, if there is a figure for a long window and a short window, the file name should be figure_long.png and figure_short.png. They should not be like figure.png and figure_short.png. If you added figure_short.png later, make sure to rename figure.png to figure_long.png.
  • Do not ignore any warnings.
    • If you ignore “ignorable” warnings, you will also ignore dangerous warnings.
    • Warnings are often signals of unexpected behavior or a bug.
  • Do not access columns of dataframe by index. It is not robust to the change in column structure. They should be accessed by the column name. Access by an index is in principle only allowed for the elements of a list which does not depend on the order of elements.
  • Save and read files by saveRDS and readRDS using .rds extension. Do not use save and load, because the object names become unknown.

2.2.5 GitHub and GitHub Classroom

  • The assignments should be conducted through GitHub Classroom. Steps to set up Git environment and to submit assignments through Git are provided as follows.

  • Sign up in GitHub

  • In the GitHub Classroom, a Rstudio project folder has been created. What you need to do is to clone the project in the GitHub (the remote repository) to your local computer so that you can work on the assignments locally.

  • To do this, download Git (Windows) or Git (MacOS). This allows you to use Git Bash.

  • Then, set up your Git username and Git email address following the instructions: Set up your username and Set up your email address.

  • To connect to GitHub, you need to generate a ssh key , add the ssh key to the ssh-agent, and also add it to your GitHub account.This saves you from entering your username and password everytime you interact with GitHub repository. A detailed instruction can be found here.

  • Open the link and accept the invitation to an sample assignment in GitHub classroom. If you are asked to select your GitHub account, choose skip to the next step. The next page gives you a repository url.

  • Open Rstudio: File > New Project...> Version Control.

  • Copy the above-mentioned repository url into the first blank, name the directory Assignment and place the project in your favorite location. Now the project has been copied from GitHub.

  • You can open this project from the upper right menu of RStudio or by double clicking the Assignment.Rproj file in the Assignment directory.

  • Try Build > Clean and Rebuild to make sure the project can be compiled properly (or in the latest version of Rstudio Build > Install Package) .

  • the Git option in the upper right window navigates you between your local repository and the remote repository in GitHub. Before you make any changes, use Pull button to pull all new changes from the remote repository so that your local repository is up to date.

  • Use Push button in the Git option to upload your changes to the remote repository.

  • To submit the assignments, please follow the steps:

    1. Pull all new changes from the remote repository.
    2. Work on the assignment and save your assignment.
    3. If everything is ok, click Commit button, input some commit message, and commit your changes. This saves your changes locally and creates a subject to be submitted to the remote repository.
    4. Push your changes to the remote repository so that the instructor can view your assignment.
  • Some advice:

  1. More git command can be found here. They can be exerted in Git > More > Shell, or Rstudio terminal. You can also download Git Desktop with user interface which is easy to handle.
  2. Pull before you make new changes, otherwise there might be conflicts when you push the changes.
  3. Commit the changes only if you want to push the changes. If you just want to temporarily save the changes locally so that they will not be covered by the remote repository in case you pull, use git stash.
  4. If you are still not familiar with Git, make a copy of your changes before you try some new commands.


Christensen, C. M. (1997). The innovator’s dilemma : When new technologies cause great firms to fail. New York: HarperBusiness.
Hsiao, C. (1983). Chapter 4 Identification. Handbook of Econometrics, 1, 223–283.
Igami, M. (2017). Estimating the Innovator’s Dilemma: Structural Analysis of Creative Destruction in the Hard Disk Drive Industry, 1981. Journal of Political Economy, 125(3), 798–847.
Manski, C. F. (1994). Chapter 43 Analog estimation of econometric models. Handbook of Econometrics, 4, 2559–2582.
Maskin, E., & Tirole, J. (1988b). A Theory of Dynamic Oligopoly, I: Overview and Quantity Competition with Large Fixed Costs (No. 3) (Vol. 56, p. 549).
Matzkin, R. L. (1994). Chapter 42 Restrictions of economic theory in nonparametric methods. Handbook of Econometrics, 4, 2523–2558.
Matzkin, R. L. (2007). Chapter 73 Nonparametric identification. Handbook of Econometrics, 6, 5307–5368.