Photo by Alejandro Escamilla on Unsplash

For this blog, I will import a CSV file into a MySQL server (using MAMP) to create a practice platform for SQL statements. More specifically, I will import the Titanic dataset (train.csv), which is available here (https://www.kaggle.com/c/titanic). This tutorial is broken down into three steps:

  1. Installing MAMP
    Installing and editing configuration parameters
  2. Creating Database & Table
    Creating a database and a table for the Titanic metadata
  3. Load Data Infile
    Populating the table with the data

You can download the MAMP software from here (https://www.mamp.info/en/downloads/). During installation, the setup may offer to install MAMP Pro as well; however, I opted not to…


Photo by Harshal S. Hirve on Unsplash

Pearson’s chi-squared test for independence tests whether there is an association between categorical variables by checking whether the observed counts differ significantly from the expected counts. The test uses the aggregated counts of the categorical variables, summarized in a table called a contingency table. Like other hypothesis tests, the chi-squared test works by disproof: it first assumes the expected counts are identical to the observed (actual) counts. …
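To make the expected-versus-observed comparison concrete, here is a minimal sketch in plain Python. The function name and the 2x2 table are made up for illustration and are not from the original post. The sketch computes the expected counts under independence, the chi-squared statistic, and the degrees of freedom. It stops short of a p-value, which would need the chi-squared distribution (for example, from a statistics library).

```python
def chi_squared_statistic(observed):
    """Return (statistic, degrees_of_freedom, expected) for a contingency table.

    `observed` is a list of rows of counts. Every row and column total is
    assumed to be positive, otherwise an expected count would be zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Sum of (observed - expected)^2 / expected over every cell.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical 2x2 table: rows are two groups, columns are two outcomes.
toy_table = [[20, 30], [30, 20]]
stat, dof, expected = chi_squared_statistic(toy_table)
print(stat, dof)  # 4.0 1
```

In this toy table every expected count is 25, so each of the four cells contributes (5^2)/25 = 1 to the statistic.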


Agile is an approach to project management that aims to always have a working product while continuously improving it in short increments. Instead of delivering a product only at the end, as in Waterfall, Agile looks to provide a minimum viable product (MVP) and improve on it iteratively based on customer feedback. Agile is less a management methodology than a set of principles and philosophies for completing projects for customers. For this blog, I will discuss the values of Agile (from the Agile Manifesto) as well as a commonly used Agile framework, Scrum.

The Agile Manifesto was written in 2001 by tech…


On March 23, 2021, a massive container ship, Ever Given, became stuck in the Suez Canal. The Suez Canal is an important trade route because it provides a water path between Europe and Asia without going around Africa. The ship blocked the canal for six days, delaying hundreds of other vessels from crossing.

The Suez Canal is a famous bottleneck due to its narrow waterways and high demand. Ships that want to cross are often required to wait in a queue before entering. I feared for supply chain managers after hearing the news of a complete blockage of the…


Photo by Tingey Injury Law Firm on Unsplash

For this blog, I will demonstrate three techniques for handling class imbalance using NYS PUMS (Public Use Microdata Sample) Census data. (You can find the dataset here.) Training classification models on imbalanced classes can bias the model toward predicting the majority class.

  • Undersampling
  • Oversampling
  • SMOTE-NC

The pseudo-objective is to classify whether New York residents have health insurance coverage. It’s a binary classification problem using the 2019 NYS PUMS Census data. I thinned out the features to seven independent variables and one dependent variable, ‘HICOV’ (Health Insurance Coverage).
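As a rough illustration of the first two techniques, here is a minimal sketch of random undersampling and random oversampling in plain Python. The function names, the toy rows, and the 'covered'/'uncovered' labels are hypothetical stand-ins, not the actual PUMS columns or HICOV codes. SMOTE-NC is not sketched here because it synthesizes new rows from nearest neighbors rather than resampling existing ones. The imbalanced-learn library ships an implementation as `SMOTENC`.

```python
import random
from collections import Counter, defaultdict


def group_by_label(rows, labels):
    """Collect the rows belonging to each class label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Shrink every class to the minority size by sampling without replacement."""
    rng = random.Random(seed)
    groups = group_by_label(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Grow every class to the majority size by duplicating random rows."""
    rng = random.Random(seed)
    groups = group_by_label(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extras = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extras:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: 8 majority rows and 2 minority rows (labels are illustrative only).
toy_rows = [[i] for i in range(10)]
toy_labels = ["covered"] * 8 + ["uncovered"] * 2

_, under_labels = random_undersample(toy_rows, toy_labels)
_, over_labels = random_oversample(toy_rows, toy_labels)
print(Counter(under_labels))  # 2 rows per class
print(Counter(over_labels))   # 8 rows per class
```

Resampling should only be applied to the training split, so that the evaluation data keeps the original class balance.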


Photo by Mel Poole on Unsplash

For this blog, I will run a hypothesis test to check whether the population count around a well affects its functionality. I will be using the dataset from the Tanzania Water Pump Challenge hosted by DrivenData and the WorldPop population estimates to feature engineer population counts within a 1 km, 5 km, and 10 km radius around the well.

  1. Pump it Up: Data Mining the Water Table
    - Introduction to challenge and dataset
  2. WorldPop Estimation
    - People per pixel (‘ppp’; raster image of counts)
  3. Population analysis
    - Kruskal-Wallis test for medians

I will use the dataset from the Pump it Up Water Challenge hosted by…


Photo by Paolo Chiabrando on Unsplash

Regularization in gradient boosted regression trees is applied to the leaf values, not to the feature coefficients as in lasso/ridge regression. For this blog, I will break down the explanation into three steps:

  1. Lasso & Ridge Regression
    - A brief recap of lasso and ridge regression
  2. Gradient Boosted Regression Trees
    - Leaf values in boosted regression trees
  3. L1 & L2 in XGBoost
    - Adding penalties for residual leaves

Recall that lasso and ridge regression apply an additional penalty term to the loss function. (A standard loss function for regression is the squared error, and I’ll be using this throughout the…
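To give a feel for what penalizing a leaf value means, here is a minimal sketch under stated assumptions. It assumes the loss is half the squared error, so each row in a leaf has a gradient equal to the negative residual and a hessian of 1. It uses the commonly cited closed form in which the L1 penalty soft-thresholds the sum of residuals and the L2 penalty is added to the denominator. The function name is hypothetical. `reg_alpha` and `reg_lambda` mirror XGBoost's parameter names, but this is not the library's code, and the learning rate is left out.

```python
def regularized_leaf_value(residuals, reg_lambda=0.0, reg_alpha=0.0):
    """Leaf value for one leaf under half squared-error loss (an assumption).

    With no penalties the value is the mean residual. `reg_alpha` (L1)
    soft-thresholds the residual sum toward zero, and `reg_lambda` (L2)
    inflates the denominator, shrinking the value.
    """
    total = sum(residuals)
    # L1: shrink the magnitude of the residual sum, but never past zero.
    shrunk = max(abs(total) - reg_alpha, 0.0)
    sign = 1.0 if total >= 0 else -1.0
    # L2: the hessian sum is the row count here; lambda is added to it.
    return sign * shrunk / (len(residuals) + reg_lambda)


toy_residuals = [1.0, 2.0, 3.0]
print(regularized_leaf_value(toy_residuals))                                # 2.0
print(regularized_leaf_value(toy_residuals, reg_lambda=1.0))                # 1.5
print(regularized_leaf_value(toy_residuals, reg_lambda=1.0, reg_alpha=1.0)) # 1.25
```

Unlike lasso and ridge, nothing here touches a feature coefficient: the penalties only pull the value predicted by the leaf toward zero.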


I recently came across an interesting problem that required some understanding of absorbing Markov chains. The objective is to calculate the long-run percentages of the ending states given an initial state. The input is a frequency table in which each row holds a state's transition counts, indexed by the destination state. To elaborate on the problem, I have created a visual explanation of the frequency table.
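One way to approach this kind of problem is sketched below. It is an illustration under stated assumptions, not necessarily the method used in the post. The function name and the toy table are hypothetical. It assumes that a row of all zeros marks an ending (absorbing) state and that every other state can eventually reach one. It normalizes each row of counts into probabilities, then solves (I - Q) B = R with exact fractions, where Q holds transient-to-transient probabilities and R holds transient-to-absorbing probabilities.

```python
from fractions import Fraction


def absorption_probabilities(counts, start=0):
    """Map each absorbing state to the probability of ending there from `start`."""
    n = len(counts)
    absorbing = [i for i in range(n) if sum(counts[i]) == 0]
    transient = [i for i in range(n) if sum(counts[i]) > 0]
    if start in absorbing:
        return {s: Fraction(int(s == start)) for s in absorbing}

    # Augmented matrix [I - Q | R], one row per transient state.
    aug = []
    for i in transient:
        total = sum(counts[i])
        left = [Fraction(int(i == j)) - Fraction(counts[i][j], total)
                for j in transient]
        right = [Fraction(counts[i][j], total) for j in absorbing]
        aug.append(left + right)

    # Gauss-Jordan elimination turns the left block into the identity,
    # leaving B = (I - Q)^-1 R in the right block.
    t = len(transient)
    for col in range(t):
        pivot = next(r for r in range(col, t) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(t):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]

    row = aug[transient.index(start)][t:]
    return dict(zip(absorbing, row))


# Toy frequency table: states 0 and 1 are transient, 2 and 3 are absorbing.
toy_counts = [
    [0, 1, 1, 0],  # state 0 moves to state 1 or state 2 equally often
    [1, 0, 0, 3],  # state 1 returns to 0 once for every three moves to 3
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(absorption_probabilities(toy_counts))
# {2: Fraction(4, 7), 3: Fraction(3, 7)}
```

Using `Fraction` keeps the result exact, which is convenient when the answer has to be reported as a ratio rather than a rounded decimal.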


Photo by Delaney Archer on Unsplash

GitHub is an online code hosting platform built on the version control functionality of Git. Version control is the practice of managing changes to a project over time. Git, a version control system, is a great tool for backing up your work and organizing experiments without changing the main codebase. A local version control system stores the changes solely on your computer, whereas a remote one stores the changes on a hosting service (e.g., GitHub).


https://www.touropia.com/regions-in-australia-map/

From July 2019 through March 2020, Australia experienced one of its worst wildfire seasons. To draw attention to the problem, the Call for Code: Australian Wildfires competition started in November 2020. The competition's objective is to forecast daily wildfires by region for February 2021. Contest information is provided here: https://community.ibm.com/community/user/datascience/blogs/susan-malaika/2020/11/10/call-for-code-spot-challenge-for-wildfires

The competition provides a stack of CSVs for the competitors: wildfires, vegetation, and weather. The hosts compiled these datasets by aggregating petabytes of raster and vector data. It’s essential to understand how the hosts compressed these datasets before using them for modeling.

  1. Historical Wildfires (dt: 2005 – Jan. 2021). The dataset contains daily statistics…

Albert Um

Hello! My name is Albert Um.
