FIT 2086 Assignment 3
Name: Yu Chen
ID: 35301341
Question 1.1
- Code output

- We fitted a multiple linear regression in R with nine predictors. The model is highly significant (F = 54.34, p < 2.2e-16) with good fit (R² = 0.657, adj. R² = 0.645) and residual SD ≈ 1.688 km/L. Based on t-tests and coefficient signs, the predictors plausibly associated with fuel efficiency are: engine displacement (estimate −1.331, so each extra litre lowers km/L; p = 3.22e-12), number of gears (−0.194, p = 0.00019), lock-up torque converter (Y vs N, −0.562, p = 0.0046), aspiration (vs N: SC ≈ −0.799, p ≈ 0.050; TC ≈ −1.217, p ≈ 5.3e-08; TS ≈ −1.351, p ≈ 0.045), and drive system (FWD vs 4WD, +1.535, p ≈ 2.41e-07); the others show insufficient or only marginal evidence (e.g., Drive.SysP, Fuel.TypeGP). The three strongest predictors are engine displacement, drive system (overall, especially F vs 4), and aspiration (overall), because they have the strongest statistical evidence and the largest, directionally consistent effects: about −1.33 km/L per extra litre of displacement, +1.54 km/L for FWD vs 4WD, and −1.22 km/L for turbocharged vs naturally aspirated.
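A minimal sketch of how such a model can be fitted, assuming the data frame is named fuel and the response column is Fuel.Efficiency (both names are assumptions):

```r
# Fit the full multiple linear regression; `fuel` / `Fuel.Efficiency` are assumed names.
fit <- lm(Fuel.Efficiency ~ ., data = fuel)
summary(fit)  # per-coefficient t-tests, R^2, adjusted R^2, overall F-statistic
```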
Question 1.2

- Using a Bonferroni correction, the per-test threshold is 0.05/17 ≈ 0.00294. Compared with the unadjusted 0.05 level, fewer terms remain significant: only Eng.Displacement, No.Gears, Aspiration (TC vs N), and Drive.Sys (F vs 4) have p-values below 0.00294; the previously significant Lockup.Torque.Converter (Y vs N) and Aspiration (SC/TS vs N) are no longer significant. Hence the assessment becomes more conservative, retaining only these four.
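A short sketch of this check, assuming fit is the full-model object from the Q1.1 sketch:

```r
# Bonferroni: compare each coefficient p-value against 0.05 / 17 ~= 0.00294
pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]
names(pvals)[pvals < 0.05 / 17]  # terms that survive the correction
```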
Question 1.3
- Holding other variables constant, engine displacement has a coefficient of −1.331: each additional litre of displacement reduces mean fuel efficiency by about 1.33 km/L (significant). Drive.SysF compares front-wheel drive with the 4-wheel-drive reference; its coefficient is +1.535, meaning FWD cars average about 1.54 km/L higher mean fuel efficiency than 4WD cars, all else equal (significant). In short, based on the estimate table, displacement is a continuous effect of about −1.33 km/L per extra litre, while Drive.SysF is a categorical contrast of about +1.54 km/L relative to 4WD.
Question 1.4
- Code output

- Using BIC stepwise selection (k = log n, direction = "both") from the full model, the final model keeps Eng.Displacement, No.Gears, Aspiration, Lockup.Torque.Converter, and Drive.Sys, and removes Model.Year, No.Cylinders, Max.Ethanol, and Fuel.Type. Reference levels are Aspiration = N, Drive.Sys = 4, and Lockup = N; the pruned regression equation is given in the code output above.
The model is highly significant (F = 82.64, p < 2.2e−16) with R² = 0.651 (adjusted R² = 0.643); coefficients for factor levels are differences relative to the reference categories.
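A sketch of the selection step, assuming fit and fuel are the full-model object and data frame from the Q1.1 sketch:

```r
# BIC-based stepwise selection: penalty k = log(n)
n <- nrow(fuel)
bic.fit <- step(fit, direction = "both", k = log(n), trace = 0)
summary(bic.fit)
```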
Question 1.5
- Code Output

a) Using the BIC model, the mean fuel efficiency for the car in row 33 is about 13.37 km/L, with a 95% confidence interval of [12.99, 13.75] km/L.
b) Because 13 km/L lies inside this mean CI, the model does not suggest the new car is better than your current car; the evidence is inconclusive.
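A sketch of the prediction, assuming the BIC model is bic.fit and the new cars sit in a data frame fuel.test (both names are assumptions):

```r
# 95% confidence interval for the mean fuel efficiency of the car in row 33
predict(bic.fit, newdata = fuel.test[33, ],
        interval = "confidence", level = 0.95)
```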
Question 2.1
- Code Output

- Using the tree package and the course wrapper, I fitted a classification tree for HD and selected the tree size by 10-fold CV repeated 5,000 times, minimizing misclassification error; the best tree uses CP, SLOPE, OLDPEAK, CHOL, CA, THAL and AGE, and has 8 terminal nodes.
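A sketch using the plain tree package (the course wrapper automates the 5,000 CV repetitions), assuming the training data frame is heart.train with factor response HD:

```r
library(tree)
fit <- tree(HD ~ ., data = heart.train)
cv <- cv.tree(fit, FUN = prune.misclass, K = 10)  # a single 10-fold CV run
best.size <- cv$size[which.min(cv$dev)]           # 8 terminal nodes in our run
pruned <- prune.misclass(fit, best = best.size)
plot(pruned); text(pruned, pretty = 0)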
Question 2.2

The tree predicts heart disease (HD = Y) if any of the following paths holds:
- CP is Atypical / Non Anginal / Typical, SLOPE = Flat, and OLDPEAK ≥ 1.95.
- CP is Atypical / Non Anginal / Typical, SLOPE = Flat, OLDPEAK < 1.95, and CHOL ≥ 260.5.
- CP is Asymptomatic and CA ≥ 0.5 (at least one major vessel).
- CP is Asymptomatic, CA < 0.5, THAL ≠ Normal (Fixed or Reversible), and AGE ≥ 56.
Question 2.3
According to the CV-selected tree, the lowest probability of having heart disease occurs at the leftmost N leaf, reached by the path
CP in {Atypical, Non-Anginal, Typical} and SLOPE in {Down, Up}.
At this leaf the model's class proportion is roughly P(HD = Y) ≈ 0.087 (about 9%), the smallest among all leaves.
Question 2.4
- Code Output

- Using KIC stepwise selection (k = 3, direction = "both"), the final logistic model keeps CA, CHOL, CP, OLDPEAK, SEX, SLOPE and TRESTBPS. Compared with the CV tree, the overlap is CA, CHOL, CP, OLDPEAK and SLOPE; logistic-only variables are SEX and TRESTBPS, while tree-only variables are AGE and THAL. The most important predictor in the logistic model is CP, which has the largest LR chi-square (LRT = 35.75, p = 8.45e-08).
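A sketch of the selection, again assuming the training data frame is named heart.train:

```r
# KIC-style stepwise selection for the logistic model: penalty k = 3
full <- glm(HD ~ ., family = binomial, data = heart.train)
kic.fit <- step(full, direction = "both", k = 3, trace = 0)
summary(kic.fit)
```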
Question 2.5
- Code Output

- Using step-wise selection with KIC (direction = "both"), the final logistic model is the one shown in the code output above.
Reference categories: SEX = F (so SEXM = 1 means male),
CP = Asymptomatic, SLOPE = Down. (All other factors not shown were not selected.)
Question 2.6
- Code Output

- In the KIC-selected logistic model, holding all other predictors fixed, CA has a positive effect. Specifically, each additional affected vessel (CA + 1) multiplies the odds of heart disease by exp(β_CA). With our fitted model, β_CA ≈ 1.0748, so the odds multiply by about 2.93× per extra vessel; since CA ranges 0–3, going from 0 to 3 multiplies the odds by 2.93³ ≈ 25×.
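The arithmetic, assuming kic.fit is the model object from the Q2.4 sketch:

```r
# Odds multipliers implied by the CA coefficient
exp(coef(kic.fit)["CA"])      # ~ 2.93 per extra vessel
exp(3 * coef(kic.fit)["CA"])  # ~ 25, odds ratio for CA = 3 vs CA = 0
```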
Question 2.7
- Code Output

On the 92-patient test set using my.pred.stats(), the CV-tree achieves Accuracy 0.783, Sensitivity 0.641, Specificity 0.887, AUC 0.821, Log-loss 87.37. The KIC step-wise logistic model achieves Accuracy 0.826, Sensitivity 0.795, Specificity 0.849, AUC 0.885, Log-loss 39.44.
Because the logistic model has a higher AUC (+0.064) and a much lower log-loss (−47.93), both threshold-free measures, and also improves accuracy and sensitivity, it offers better generalization and probability calibration. The tree only wins on specificity (0.887 vs 0.849), i.e., slightly fewer false positives. If minimizing false positives were the priority one might prefer the tree, but under balanced criteria the logistic model is clearly better.
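For reference, the headline metrics can be computed directly from predicted probabilities, independently of the course's my.pred.stats() wrapper; prob (the vector of P(HD = Y) on the test set) and y (the 0/1 truth) are assumed names:

```r
# Threshold at 0.5 for the class-based metrics
pred <- as.integer(prob > 0.5)
acc  <- mean(pred == y)                               # accuracy
sens <- sum(pred == 1 & y == 1) / sum(y == 1)         # sensitivity
spec <- sum(pred == 0 & y == 0) / sum(y == 0)         # specificity
logloss <- -sum(y * log(prob) + (1 - y) * log(1 - prob))  # total log-loss
```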
Question 2.8
- Code Output

For test patient 10, using odds = p/(1−p), the CV-selected tree gives p = 0.929825 → odds = 13.25, while the KIC step-wise logistic model gives p = 0.999499 → odds = 1993.98. Thus the logistic model assigns odds about 150.5 times those of the tree. Both models indicate a very high risk; the logistic model is more extreme because its predicted probability is closer to 1.
Question 2.9
- Code Output

- Using the KIC-selected logistic model and 5,000 BCa bootstrap replications, the 95% CIs for the odds of heart disease are [5.21, 129.05] for patient #65 and [0.0299, 0.2992] for patient #66 (odds = p/(1−p)). The odds ratio OR(66/65) has a BCa 95% CI of [0.0004, 0.0356], which is well below 1. Therefore, there is strong evidence of a real difference in population odds: patient #66’s odds are only about 0.04%–3.56% of patient #65’s, i.e., #65 is high-risk while #66 is low-risk.
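A sketch of one such bootstrap CI with the boot package, assuming kic.form holds the KIC-selected formula and heart.train / heart.test are the data frames (all names are assumptions):

```r
library(boot)
# Bootstrap the predicted odds for one test patient by refitting on resamples
odds.stat <- function(data, idx, row) {
  fit <- glm(kic.form, family = binomial, data = data[idx, ])
  p <- predict(fit, newdata = heart.test[row, ], type = "response")
  p / (1 - p)
}
b <- boot(heart.train, odds.stat, R = 5000, row = 65)
boot.ci(b, conf = 0.95, type = "bca")
```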
Question 3.1
- Code Output


- Using kknn with kernel = "optimal", I trained on the noisy spectrum ms.measured.2025 (intensity ~ MZ) and predicted intensities at the MZ locations in ms.truth.2025. For each k = 1, …, 25 I computed the RMSE between the predictions and the true intensities and plotted RMSE versus k. The curve is U-shaped and reaches its minimum at k = 7 with RMSE = 1.492. This k balances variance at small k against bias at large k, so we will use k = 7 as the recommended smoothing window.
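A sketch of the RMSE-vs-k loop; the column names MZ and intensity are assumptions:

```r
library(kknn)
rmse <- sapply(1:25, function(k) {
  fit <- kknn(intensity ~ MZ, train = ms.measured.2025,
              test = ms.truth.2025, k = k, kernel = "optimal")
  sqrt(mean((ms.truth.2025$intensity - fitted(fit))^2))
})
plot(1:25, rmse, type = "b", xlab = "k", ylab = "RMSE")
which.min(rmse)  # k = 7 in our run
```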
Question 3.2
- Code Output

- I plotted four panels for k = 2, 6, 12, 25. Grey dots show the noisy training measurements from ms.measured.2025. The blue solid line is the true spectrum from ms.truth.2025, and the red dashed line is the k-NN estimate predicted at the truth MZ locations using kernel = "optimal". As k increases, noise is reduced but peaks are progressively flattened: k = 2 keeps sharp peaks but with more jitter, k = 6 is well balanced, k = 12 starts to underestimate peak heights, and k = 25 is clearly over-smoothed.
Question 3.3
- The RMSE–k curve is U-shaped: RMSE decreases as k grows from 1, reaches its minimum at k = 7 with RMSE = 1.492, then increases steadily as bias dominates. Visually, increasing k makes the estimate smoother: at k = 2 the curve follows the noise but preserves sharp peaks (high variance); at k = 6–7 it balances smoothness and peak preservation; at k = 12 peaks begin to be flattened and widened; at k = 25 the spectrum is over-smoothed and loses peak height. Therefore k = 7 gives the best bias–variance trade-off for this dataset.
Question 3.4
- Code Output

- Using the built-in cross-validation of kknn with kernel="optimal" on the training data ms.measured.2025, the method selects k = 7. This matches the oracle choice k = 7 found in Q3.1 from the true-RMSE curve (minimum RMSE = 1.492).
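A sketch of the built-in selection via train.kknn (which uses leave-one-out cross-validation over k):

```r
cv.fit <- train.kknn(intensity ~ MZ, data = ms.measured.2025,
                     kmax = 25, kernel = "optimal")
cv.fit$best.parameters$k  # 7 in our run
```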
Question 3.5
- Code output

- Using the CV-selected k (= 7) with kernel = "optimal", I fitted k-NN on ms.measured.2025 and predicted at the same training points. The residual standard deviation gives an in-sample estimate of the measurement noise SD: σ̂ ≈ 2.518.
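The corresponding computation, under the same column-name assumptions as before:

```r
# In-sample noise-SD estimate at the CV-selected k
fit <- kknn(intensity ~ MZ, train = ms.measured.2025,
            test = ms.measured.2025, k = 7, kernel = "optimal")
sd(ms.measured.2025$intensity - fitted(fit))  # ~ 2.518 in our run
```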
Question 3.6
- Code output

- From the Q3.2 panels (k = 2, 6, 12, 25) and the overlay above with the CV choice (k ≈ 7), no single k perfectly achieves both goals at once.
- k = 2: peaks are kept sharp, but the baseline is noisy.
- k = 12–25: the baseline is very smooth, but peaks are flattened/widened (heights underestimated).
- k = 6–7: the best compromise; background noise is clearly reduced and the main peaks are close to the truth, though the second peak is slightly attenuated by the averaging.
- Why k-NN cannot fully satisfy both: a single global k is a fixed bandwidth. Flat background prefers larger bandwidth to denoise, while sharp peaks require smaller bandwidth to preserve height. Distance weighting helps a bit, but k-NN still averages across the peak and lowers it. Hence we choose k ≈ 6–7 as the practical balance, but it does not exactly recover peak heights.
Question 3.7
- Code output

- Using the smoothed signal obtained with the CV-selected k (from Q3.4) and kernel = "optimal", I predict on the ms.truth.2025$MZ grid and take the global maximum of the smoothed curve. The peak MZ is ≈ 7963.3, with maximum estimated intensity 95.48.
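A sketch of the peak search on the truth grid:

```r
# Locate the global maximum of the smoothed spectrum
fit <- kknn(intensity ~ MZ, train = ms.measured.2025,
            test = ms.truth.2025, k = 7, kernel = "optimal")
i <- which.max(fitted(fit))
c(MZ = ms.truth.2025$MZ[i], intensity = fitted(fit)[i])  # ~ (7963.3, 95.48)
```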
Question 3.8
- Code output

At the peak location mz_0 found in Q3.7, I bootstrapped the k-NN intensity with R = 5000 resamples and reported BCa 95% CIs:
- k = 7 (CV): point = 95.48, 95% CI [89.92, 98.97], width 9.04
- k = 3: point = 97.97, 95% CI [92.62, 99.23], width 6.61
- k = 20: point = 83.03, 95% CI [70.50, 92.98], width 22.48
Interpretation: k = 3 and k = 7 give tighter intervals around the peak, while k = 20 is over-smoothed: its point estimate is much lower (the peak is flattened) and the bootstrap distribution is skewed, which makes the BCa interval the widest. Bootstrap CIs mainly reflect variance and do not correct bias; thus a large k can yield a narrow CI in some settings, but at a sharp peak the neighborhood spans both sides of the peak, causing skew and a wider BCa interval.
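A sketch of the peak-intensity bootstrap, with mz0 = 7963.3 carried over from Q3.7 (the helper name peak.stat is hypothetical):

```r
library(boot)
# Bootstrap the k-NN intensity estimate at the fixed peak location mz0
peak.stat <- function(data, idx, k, mz0) {
  fit <- kknn(intensity ~ MZ, train = data[idx, ],
              test = data.frame(MZ = mz0), k = k, kernel = "optimal")
  fitted(fit)
}
b <- boot(ms.measured.2025, peak.stat, R = 5000, k = 7, mz0 = 7963.3)
boot.ci(b, conf = 0.95, type = "bca")
```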