Sdam071 (new), May 2026

Question 8: Data Preparation and Feature Engineering (23 marks)

a) You are given a mixed dataset (numerical, categorical, timestamps). Outline a concrete preprocessing pipeline suitable for modeling, including encoding, scaling, and handling time features. Provide a brief justification for each step. (14 marks)

b) Design two new features (name plus formula or construction) that could improve model performance for a predictive task, and explain why. (9 marks)

| Concept | Formula / Command | When to Use |
|---------|-------------------|-------------|
| Mean | `mean(x)` | Central tendency for symmetric data. |
| Standard Deviation | `sd(x)` | Dispersion around the mean. |
| t-test | `t.test(x, y)` | Compare means of two groups (normally distributed). |
| Linear Model | `lm(y ~ x1 + x2, data = df)` | Predict a continuous outcome. |
| Residual Plot | `plot(lm_model, which = 1)` | Check linearity and homoscedasticity. |
| AIC | `AIC(lm_model)` | Compare non-nested models (lower = better). |
| Cross-validation | `train(y ~ ., data = df, method = "lm", trControl = trainControl(method = "cv", number = 5))` (caret) | Estimate out-of-sample performance. |
| Bootstrap CI | `boot.ci(boot.out, type = "perc")` | Non-parametric confidence intervals. |
| Effect Size (Cohen’s d) | `cohen.d(x, y)` (effsize) | Quantify magnitude of mean differences. |
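Several of the base-R entries in the table above can be tried together in one short session. The following is a minimal sketch, not a prescribed workflow. It assumes the built-in `mtcars` dataset and the variable names `auto_mpg`, `manual_mpg`, `fit_small`, and `fit_large`, none of which come from the source. The caret, boot, and effsize commands are left out because they need extra packages.

```r
# Illustrative only: mtcars and all object names here are assumptions.
data(mtcars)

# Split fuel economy by transmission type (am: 0 = automatic, 1 = manual).
auto_mpg   <- mtcars$mpg[mtcars$am == 0]
manual_mpg <- mtcars$mpg[mtcars$am == 1]

# Central tendency and dispersion of the outcome.
mean_mpg <- mean(mtcars$mpg)
sd_mpg   <- sd(mtcars$mpg)

# Two-sample t-test; t.test() defaults to the Welch variant,
# which does not assume equal variances.
tt <- t.test(auto_mpg, manual_mpg)

# Two candidate linear models for a continuous outcome.
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp, data = mtcars)

# AIC for both models; the lower value is preferred
# when the models are fitted to the same data.
aic_values <- AIC(fit_small, fit_large)

print(c(mean = mean_mpg, sd = sd_mpg))
print(tt$p.value)
print(aic_values)

# Residuals vs fitted plot; uncomment in an interactive session.
# plot(fit_large, which = 1)
```

Passing several fitted models to `AIC()` returns a data frame with one row per model, which makes the comparison easier to read than separate calls.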

Additionally, the code appears to be highly sensitive, requiring specific conditions to function properly. This has led some to speculate that sdam071 may be more than a simple algorithm: it could be a key component of a much larger system.