
Project Design & Data Preparation
Data Limitations
- The Instacart Online Grocery Shopping Dataset
The dataset dates from 2017, so it may not reflect current market conditions.
- Supplementary Customer dataset
The dataset includes fictional data and does not reflect real market trends.
Data Integrity and Quality
- The Instacart Online Grocery Shopping Dataset
The product data contained four records with duplicate product names linked to different product IDs, and sixteen records with missing product names. Because these accounted for less than 0.01% of the population, they were removed. Five duplicate records were also removed. Two extreme outliers were identified in the price variable and adjusted based on the prices of comparable products.
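The cleaning steps described above could be sketched roughly as follows with pandas. This is only an illustration on a toy table: the column names (`product_id`, `product_name`, `price`), the `PRICE_CEILING` threshold, and the choice to drop every row that shares an ambiguous name are assumptions, not the project's actual code.

```python
import pandas as pd

# Hypothetical toy product table; column names and values are assumptions.
products = pd.DataFrame({
    "product_id": [1, 2, 3, 4, 4, 5],
    "product_name": ["Milk", "Milk", None, "Bread", "Bread", "Eggs"],
    "price": [3.5, 3.6, 2.0, 2.5, 2.5, 99999.0],
})

# Drop records with a missing product name.
cleaned = products.dropna(subset=["product_name"])

# Drop records that are identical in every column.
cleaned = cleaned.drop_duplicates()

# Drop product names that are linked to more than one product ID.
ids_per_name = cleaned.groupby("product_name")["product_id"].transform("nunique")
cleaned = cleaned[ids_per_name == 1]

# Flag extreme prices for manual review (assumed threshold, not from the project).
PRICE_CEILING = 100.0
flagged = cleaned[cleaned["price"] > PRICE_CEILING]

print(cleaned)
print(flagged)
```

Flagging the outliers rather than overwriting them keeps the adjustment step explicit, since a comparable price has to be chosen per product.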
- Supplementary Customer dataset
The customer dataset contained personally identifiable information (PII). These fields were removed to safeguard customer privacy and ensure compliance with data protection standards; removing them did not affect the analysis.
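A minimal sketch of removing PII fields with pandas is shown below. The column names in `PII_COLUMNS` and the toy records are hypothetical, since the actual fields are not listed here.

```python
import pandas as pd

# Hypothetical toy customer table; all names and values are invented.
customers = pd.DataFrame({
    "user_id": [1, 2],
    "first_name": ["Ann", "Bob"],
    "last_name": ["Lee", "Ray"],
    "age": [34, 52],
    "income": [52000, 87000],
})

# Assumed list of PII columns; adjust to the real dataset.
PII_COLUMNS = ["first_name", "last_name"]

# Keep only the fields needed for analysis.
customers_anon = customers.drop(columns=PII_COLUMNS)

print(list(customers_anon.columns))
```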
Data Modelling
During the data merging process, 0.095% of the data did not merge successfully, most likely because the matching product records had been removed during data cleaning. The loss is minimal and unlikely to have a significant impact on the analysis.
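One way to measure the share of rows that fail to merge is the `indicator` option of pandas `merge`. The sketch below uses invented toy tables and assumed column names (`order_id`, `product_id`, `product_name`); it is not the project's actual merge code.

```python
import pandas as pd

# Hypothetical toy tables; product 99 has no match, as if removed in cleaning.
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "product_id": [1, 2, 3, 99],
})
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Milk", "Bread", "Eggs"],
})

# indicator=True adds a "_merge" column recording where each row was found.
merged = orders.merge(products, on="product_id", how="left", indicator=True)

# Share of order rows with no matching product record.
unmatched_share = (merged["_merge"] == "left_only").mean()

# Keep only the rows that matched on both sides.
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(f"Unmatched share: {unmatched_share:.3%}")
```

Checking this share before discarding unmatched rows makes it easy to confirm that the loss stays small.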
Several variables were derived through statistical analysis of customer order behaviour, in conjunction with customer segmentation based on demographic information.
A detailed breakdown of the column derivations and aggregations is available in the GitHub repository.
