US Births 2018: Linear Regression- R2 0.378

4 min readAug 3, 2020

Introduction

I’ll be doing a short analysis on Baby Weight using a dataset cleaned by Amol Deshmukh: https://www.kaggle.com/des137/us-births-2018. With Linear Regression, I can find a few variables that may explain a positive or negative change in birth weight.

A low birth weight is a term used to describe babies who are born weighing 5 pounds or less. Under weight babies can be healthy, however, low-weight newborns can also have some serious problems. For this sake, my goal is to find the variables that can best explain the positive and negative changes of birth weight.

Outline:
1. Realize the Difference
2. Interpret The Difference
3. Address the Difference

Realize the Difference

The dataset contains around 3.8 million rows and 55 columns. Here is a distribution plot of Baby Weight.

Plotting 2 distributions of baby weights overlaying each other. The blue represents the mothers who smoked daily before inception and the orange are those who have not. Although they seem to be closely aligned, they are statistically different.

Plotting distributions via box plots separated by total months the mother received checkups. The 1 on the x axis represents one month of checkups, 2 as 2 months.. and on. Although these groups seem similar, they are statistically different.

Plotting a distributions via box plots separated by the “length of pregnancy”(in months). Please note the possible errors of having gestation periods of 12 months. The “pregnancy length” was feature engineered by subtracting the Month/Year of last normal menses from the Month/Year of the baby born.

A linear line can be drawn on the distribution plots with gestation periods. It seems, the longer the longer the gestation period, the heavier the baby gets. But what happens if I make it a little bit more complex by adding additional information.

The blue boxes represent the cases where mother had to undergo C-section and the blue boxes are those that did not. It’s a little harder to draw a best fit line to fully represent this. In order to compute a best fit line, I used OLS.

I created my first model with just a few variables; mother’s weight, mother’s height, and gender of baby. I continued to add select features that improved my R2 score. The R2 is a metric that represents how much the variance of the baby weight is being explained by the features. In the end, my model had 71 features with an R2 score of 0.378.

Interpret the Difference

Some interesting coefficients:

smoked_None: 31.81
If the mother had never smoked, the model increased the expected weight by 31 grams. It’s very possible that the mothers who were smokers before could have had other health conditions

PRECARE: 429.28(+429.28/standard deviation or +429.28/1.5 months)
For every 1.5 months of checkups, the model increased the expected weight by 429 grams. Pre-natal care is pretty expensive. Maybe, it has something to do with class or income background.

RDMETH_REC_3: -1389.83
If the mother received Cesearen Section, the model subtracted 1400 grams. Maybe the mothers that required C-section did so because of other baby complications.

Address the Difference

In a simple world, everything would be normal and even. But unfortunately, many things aren’t. Going back to the findings of Prenatal care, why does the number of monthly checkups determine baby weights? Is prenatal care easily available for all classes or is there some economic hurdle that is creating this difference? This study isn’t supposed to create predictions, but it’s to address findings so that we can eventually break predictors.

Links:
Many thanks to Amol Deshmuhk:
https://www.kaggle.com/des137/us-births-2018
My kernel:
https://www.kaggle.com/albertum/us-births-2018-linear-regression-r2-0-378

US Births 2018: Linear Regression- R2 0.378

Introduction

Realize the Difference

Interpret the Difference

Address the Difference

Written by Albert Um