Lecture 2: How Models Learn
Read the lecture transcript below to understand the mechanics of model training.
Lecture Transcript
Welcome back to our Introduction to Machine Learning series. In the previous lecture, we defined machine learning and explored its three main types. Today, we are going to look under the hood and understand how models actually learn from data. We will cover training data, model architecture, loss functions, gradient descent, and backpropagation.
Every machine learning model begins with training data. This is the dataset the model will learn from. The quality and quantity of training data are arguably the most important factors in determining how well a model performs. Training data is typically split into three subsets: a training set used to teach the model, a validation set used to tune the model's settings during development, and a test set used to evaluate the model's final performance on data it has never seen before. A common split ratio is 70% training, 15% validation, and 15% test data.
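The 70/15/15 split described above can be sketched in plain Python. The function name `split_dataset` and the fixed seed are illustrative choices, not part of the lecture; in practice a library utility would typically be used.

```python
import random

def split_dataset(data, train=0.70, val=0.15, seed=0):
    """Shuffle and split data into train / validation / test subsets.

    Assumes the common 70/15/15 ratio from the lecture; the seed makes
    the shuffle reproducible (an illustrative choice, not a requirement).
    """
    data = data[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(data)   # shuffle before splitting
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    train_set = data[:n_train]
    val_set = data[n_train:n_train + n_val]
    test_set = data[n_train + n_val:]   # remainder goes to the test set
    return train_set, val_set, test_set

# Example: split 100 samples 70/15/15.
train_set, val_set, test_set = split_dataset(list(range(100)))
```

Note that the test set is held out entirely: it is only touched once, at the very end, to estimate performance on unseen data.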
Model architecture refers to the structure and design of the learning system. In neural networks, this means the number of layers, the number of neurons in each layer, and how they are connected. A simple neural network consists of an input layer that receives the data, one or more hidden layers that process the data, and an output layer that produces the prediction. Each connection between neurons has an associated weight, and each neuron has a bias term. These weights and biases are the parameters that the model learns during training.
The loss function, also called the cost function or objective function, measures how far the model's predictions are from the actual correct values. For regression problems, a common loss function is Mean Squared Error, which calculates the average of the squared differences between predicted and actual values. For classification problems, Cross-Entropy Loss is widely used. The goal of training is to minimize the loss function, which means making the model's predictions as close to the true values as possible.
Gradient descent is the optimization algorithm used to minimize the loss function. It works by computing the gradient, which is the direction and magnitude of the steepest increase in the loss. The algorithm then takes a step in the opposite direction to reduce the loss. The size of each step is controlled by a parameter called the learning rate. If the learning rate is too large, the model may overshoot the minimum. If it is too small, training will be extremely slow. Stochastic gradient descent, or SGD, is a variant that computes the gradient using small random batches of data rather than the entire dataset, making training faster and more practical for large datasets.
Backpropagation is the key algorithm that makes training deep neural networks possible. It adjusts weights by propagating error gradients backward through the network, from the output layer back to the input layer. Using the chain rule of calculus, backpropagation efficiently computes how much each weight contributed to the overall error, and then updates each weight proportionally. Without backpropagation, training networks with many layers would be computationally infeasible.
Finally, we must address overfitting, which is one of the most common challenges in machine learning. Overfitting occurs when a model learns the training data too well, including its noise and random fluctuations, and fails to generalize to new, unseen data. Signs of overfitting include very high accuracy on training data but poor accuracy on validation or test data. Common techniques to prevent overfitting include regularization (which penalizes overly complex models), dropout (which randomly deactivates neurons during training), and early stopping (which halts training when validation performance stops improving).
Key Takeaways
Data is split into three subsets: training set (to teach the model), validation set (to tune settings), and test set (to evaluate final performance). A common split is 70/15/15.
The loss function measures how far predictions are from actual values. Mean Squared Error is used for regression, and Cross-Entropy Loss is used for classification. Training aims to minimize the loss.
Backpropagation adjusts weights by propagating error gradients backward through the network from the output layer to the input layer. It uses the chain rule of calculus to compute how much each weight contributed to the error.
Overfitting occurs when a model learns training data too well, including noise, and fails to generalize to new data. It is combated with regularization, dropout, and early stopping.