Case Study – Flu Risk Forecasting

Project Design & Data Preparation

Research Questions


To guide the geospatial analysis and the assessment of population-level risk, three initial research questions were framed around known vulnerable groups and infection prevention.

  • RQ 1 – If a state has a higher proportion of residents aged 65 and older, then it will experience a higher number of influenza-related deaths during the influenza season.
  • RQ 2 – If a state has a higher proportion of children under 5, then it will also experience a higher number of influenza-related deaths during the influenza season.
  • RQ 3 – If a state has lower influenza vaccination rates among children under 5, then it will experience a higher number of influenza-related deaths during the influenza season.

Data Limitations


  • Number of influenza-related deaths

    The number of deaths may be underreported because deaths are typically attributed to a single underlying cause, so influenza may be omitted when it is a contributing factor in cases involving comorbidities or delayed complications.

    Additionally, the CDC suppresses death counts between 1 and 9 to protect privacy, which disproportionately affects smaller populations and younger age groups, such as children under five. This can obscure the true impact of influenza in these groups.
  • Population demographic data

    Population demographics are collected through the decennial census, so figures for the years between census dates rely on estimates. Inaccurate assumptions in these estimates could introduce errors and reduce the reliability of the data.

Data Integrity & Quality


  • Number of influenza-related deaths

    Extensive data cleaning was required due to suppression and missing values. Out of 66,096 total records, 54,013 (approximately 82%) were suppressed. This included all 11,016 records for children under five, and 5,508 records for individuals with the age group marked as “Not Stated.”

    The “Not Stated” records were removed from the dataset, as they could not be linked to any specific age group and were therefore not suitable for analysis.

    The remaining suppressed records were imputed using random integers between 1 and 9 to support broader trend identification. While this approach allowed for inclusion of the 0–4 age group in high-level analysis, it limits the precision of age-specific mortality estimates, particularly for smaller populations.
  • Population demographic data

    In this dataset, 3,278 duplicate records were identified and removed to maintain dataset integrity and prevent overrepresentation in the analysis.
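The cleaning steps described above can be sketched in Python with pandas. This is a minimal illustration, not the project's actual code: the column names (`age_group`, `deaths`), the literal marker `"Suppressed"`, the fixed random seed, and both function names are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean_deaths(deaths: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Drop 'Not Stated' age rows and impute suppressed counts with 1-9.

    Assumed schema (hypothetical): an "age_group" column and a "deaths"
    column holding either a count or the text "Suppressed".
    """
    # Rows with no stated age cannot be assigned to any age group.
    out = deaths[deaths["age_group"] != "Not Stated"].copy()

    # Remember which rows were suppressed before converting to numbers.
    suppressed = out["deaths"].astype(str).eq("Suppressed")
    counts = pd.to_numeric(out["deaths"], errors="coerce").astype("float64")

    # Generator.integers excludes the upper bound, so high=10 gives 1..9.
    rng = np.random.default_rng(seed)
    counts.loc[suppressed] = rng.integers(1, 10, size=int(suppressed.sum()))

    out["deaths"] = counts.astype("int64")
    out["imputed"] = suppressed  # keep a flag so imputed rows stay traceable
    return out


def drop_duplicate_rows(population: pd.DataFrame) -> pd.DataFrame:
    """Remove exact duplicate records from the population table."""
    return population.drop_duplicates().reset_index(drop=True)
```

Keeping an `imputed` flag is a design choice of this sketch: it lets later analysis separate observed counts from random placeholders, which matters because the imputed values carry no real age-specific information.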

Research Hypothesis


Because every influenza-related death record for children under five was suppressed, and the imputed values are random rather than observed, Research Questions 2 and 3 could not be tested reliably. As a result, the focus of the analysis shifted to a single, testable research hypothesis:

  • Research Hypothesis – If a state has a higher proportion of residents aged 65 and older, then it will experience a higher number of influenza-related deaths during the influenza season.

This hypothesis reflects the well-established vulnerability of older adults and serves as the foundation for the statistical analysis conducted in this project.

Data Modelling


To support the analysis, the datasets were integrated at the state-year level, linking mortality figures with population data per age group.
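One way such a state-year integration might look in pandas is sketched below. The column names (`state`, `year`, `age_group`, `deaths`, `population`) and the function name are assumptions for illustration, as is the idea that the mortality table is finer-grained than one row per state, year and age group and therefore needs summing first.

```python
import pandas as pd

# Hypothetical join keys; the project's actual column names are not given.
KEYS = ["state", "year", "age_group"]


def merge_state_year(deaths: pd.DataFrame, population: pd.DataFrame) -> pd.DataFrame:
    """Aggregate both tables to state-year-age level, then join them."""
    deaths_agg = deaths.groupby(KEYS, as_index=False)["deaths"].sum()
    pop_agg = population.groupby(KEYS, as_index=False)["population"].sum()

    # validate="one_to_one" raises if either side still has duplicate keys,
    # which guards against silently double-counting after the join.
    return deaths_agg.merge(pop_agg, on=KEYS, how="inner", validate="one_to_one")
```

An inner join keeps only state-year-age combinations present in both tables; a left join with an indicator column would be an alternative if unmatched mortality rows need to be inspected.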

Age groups were standardised into 10-year intervals, and two consolidated categories (0–64 years and 65+ years) were added. New variables were then derived for the death rate in each age group.
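The consolidated categories and derived death rates could be computed along the following lines. This is a sketch under stated assumptions: age labels are taken to start with their lower bound (for example "55-64" or "85+"), the rate is expressed per 100,000 population, and the column and function names are hypothetical.

```python
import pandas as pd


def add_broad_age_band(df: pd.DataFrame) -> pd.DataFrame:
    """Label each row as '0-64' or '65+' from its age-group lower bound."""
    out = df.copy()
    # Assumes labels such as "0-4", "55-64", "85+": leading digits = lower bound.
    lower = out["age_group"].str.extract(r"^(\d+)")[0].astype(int)
    out["age_band"] = lower.ge(65).map({True: "65+", False: "0-64"})
    return out


def death_rates(df: pd.DataFrame, per: int = 100_000) -> pd.DataFrame:
    """Sum deaths and population per state, year and band, then compute a rate."""
    grouped = df.groupby(["state", "year", "age_band"], as_index=False)[
        ["deaths", "population"]
    ].sum()
    # The "per 100,000" scaling is an assumption; the source does not state a unit.
    grouped["death_rate"] = grouped["deaths"] / grouped["population"] * per
    return grouped
```

Summing deaths and population before dividing gives a population-weighted rate for each band, rather than an average of the individual age-group rates.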