Intro to Predictive Analytics Using Python - University of Pennsylvania

 

Intro to Predictive Analytics Using Python

 - a course by the University of Pennsylvania 

My diary of the study journey:





 

Intro to Predictive Analytics Using Python - 3 Modules


  • There are 3 modules in this course

    "Introduction to Predictive Analytics and Advanced Predictive Analytics Using Python" is specially designed to enhance your skills in building, refining, and implementing predictive models using Python. This course serves as a comprehensive introduction to predictive analytics, beginning with the fundamentals of linear and logistic regression. These models are the cornerstone of predictive analytics, enabling you to forecast future events by learning from historical data. We cover a bit of the theory behind these models, but focus in particular on their application in real-world scenarios and the process of evaluating their performance to ensure accuracy and reliability.

    As the course progresses, we delve deeper into the realm of machine learning with a focus on decision trees and random forests. These techniques represent a more advanced aspect of supervised learning, offering powerful tools for both classification and regression tasks. Through practical examples and hands-on exercises, you'll learn how to build these models, understand their intricacies, and apply them to complex datasets to identify patterns and make predictions.

    Additionally, we introduce the concepts of unsupervised learning and clustering, broadening your analytics toolkit and providing you with the skills to tackle data without predefined labels or categories. By the end of this course, you'll not only have a thorough understanding of various predictive analytics techniques, but also be capable of applying these techniques to solve real-world problems, setting the stage for continued growth and exploration in the field of data analytics.



 
    • MOOC3: 

  • 03-Oct-2025: Module 1: Intro to Predictive Analytics Using Python




    • About the instructor: 

    • Supervised Machine-Learning: 




    •  




  • Oct-10-2025: Module 1 - Lesson 2: Supervised Predictive Models
    • Typical Machine Learning Pipeline: 





  • Oct-13-2025: Module 1 - Week 1 - Linear Regression










    • Script: >> Recall the ML pipeline that ultimately produces a model f. In Linear Regression, we make an assumption that the model is a linear model. In other words, linear regression assumes a linear relation between some given inputs and the target output. For example, imagine we want to predict the age of a customer based on the number of orders they've made for a certain product. In this case, the number of orders they've made is the input variable and the age of the customer is the output variable. To perform linear regression, we gather data on various customers, including their orders and ages. We plot this data on a graph with the x-axis representing the number of orders and the y-axis representing the age. 
      Each data point is represented by a blue circle on the graph. The goal of linear regression is to find a line that best fits the data points. This line is represented by a red line on the graph. The position and slope of this line are determined through a mathematical process that minimizes the distance between the line and the data points. Once we have this line, we can use it to make predictions. For example, if we have the number of orders for a new customer, we can use the line to estimate their age. The line represents the learned model that allows us to predict the output variable age based on the input variable number of orders. 
      In linear regression, the model parameters refer to the values that determine the specific characteristics of the linear model. These parameters define the slope and intercept of the line that represents the relationship between the input variable and the output variable. Let's go back to the example of predicting customer ages based on their number of orders. In this case, the parameters of the linear regression model are as follows. The slope represents the change in the output variable age for a unit change in the input variable number of orders. The intercept is the point where the line intersects the y-axis. How do we find the most optimal model parameters? 
      If the training dataset has a significant trend, we can easily draw out a line that looks optimal rather quickly. But for noisy real-life datasets, finding the most optimal model is not as straightforward. During the training process of linear regression, the algorithm learns the optimal values for these parameters by minimizing the difference between the predicted values of the model and the actual values from the training data. We need a metric that measures how well the model performs. For machine learning problems, a loss function is usually defined to evaluate the model. Intuitively, the loss function would have small values if the predicted output is close to the desired output and large values otherwise. A commonly used loss function is the Mean Squared Error or MSE Loss Function. 
      This loss represents the average squared difference between the predicted output and the desired output across all samples in the training dataset. The optimal linear regression model minimizes the MSE loss. In practice, we can use the mean squared error function in the scikit-learn machine learning library to calculate the MSE. A simple linear regression involves one independent variable and one dependent variable. A multivariate linear regression extends to include multiple independent variables. Overfitting occurs when the learned model fits the training dataset very well but fails to generalize to new examples. The model on the right has no loss on the training data but does not actually capture the patterns of the dataset. 
      Overfitting is a common issue that can occur when using complex models. When a model becomes too complex, it may start to capture noise or random fluctuations in the training data, leading to poor generalization to new unseen data. To detect overfitting, a train/test split protocol is typically used, where we use held-out test data to estimate the loss on new, unseen data. Here's what the process looks like. Step one, randomly shuffle the dataset and split it into a training set and a testing set. Step two, train the model on the training set. Step three, evaluate the learned model on both the training set and the testing set to obtain a training loss and a test loss, also called generalization loss. 
      Other metrics can also be used. A good model should have both low training loss, which indicates that it fits well to training data, and low test loss, which indicates that it generalizes well to unseen data. 
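    • The three-step protocol above can be sketched in plain Python. This is my own minimal illustration with a made-up orders/age dataset (not code from the course), fitting simple linear regression in closed form and computing the MSE loss by hand:

```python
import random

# Hypothetical tiny dataset: (number of orders, customer age)
data = [(1, 22), (2, 25), (3, 31), (4, 33), (5, 38), (6, 41), (7, 45), (8, 47)]

# Step 1: randomly shuffle, then split into a training set and a testing set
random.seed(0)
random.shuffle(data)
split = int(0.75 * len(data))
train, test = data[:split], data[split:]

# Step 2: train on the training set (closed-form simple linear regression)
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

# Step 3: evaluate on both sets -> training loss and test (generalization) loss
def mse(points):
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)

print(f"train MSE: {mse(train):.3f}, test MSE: {mse(test):.3f}")
```

A good model should show both losses low and close to each other; a large gap between training MSE and test MSE is the overfitting signal described above.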
    • The Mean Square Error (MSE) Loss Function: 


    • In practice, the function mean_squared_error can be found in scikit-learn


    • Two Types of Linear Regression: 


    • Overfitting & How to assess it: 




    • Types of Linear Regression
      Simple Linear Regression: Involves one independent variable and one dependent variable.
      Multivariate Linear Regression: Extends to include multiple independent variables.
    • Polynomial Linear Regression: Allows for non-linear relationships between the dependent and independent variables by incorporating polynomial terms, enabling the model to capture non-linear patterns in the data
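    • As a sketch of that idea (hypothetical quadratic data of my own, using NumPy's polyfit rather than any code from the course), a degree-2 polynomial fit captures curvature that a straight line misses:

```python
import numpy as np

# Hypothetical data following y = 2 + 0.5*x^2 plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = 2 + 0.5 * x**2 + rng.normal(0, 0.2, size=x.shape)

# Fit a straight line (degree 1) and a quadratic (degree 2)
lin_coeffs = np.polyfit(x, y, deg=1)
poly_coeffs = np.polyfit(x, y, deg=2)

def mse(coeffs):
    pred = np.polyval(coeffs, x)
    return float(np.mean((pred - y) ** 2))

print(f"linear MSE: {mse(lin_coeffs):.3f}, quadratic MSE: {mse(poly_coeffs):.3f}")
```

The quadratic's MSE should be far smaller, since the linear model cannot represent the symmetric curvature in the data.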


    • Regularized Linear Regression: A form of linear regression that addresses multicollinearity (high correlation between independent variables) and helps mitigate overfitting by adding a penalty term to the loss function that controls the complexity of the model.
      Some common regularized linear regression and corresponding loss functions are:
      • Ridge Regression: 

      • Lasso Regression:   

      • Elastic Net Regression:  

      where w represents the model parameters and r, 𝜆 are hyperparameters that determine the extent of penalization.
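    • For Ridge regression specifically, the penalized loss has a closed-form minimizer, w = (XᵀX + λI)⁻¹Xᵀy. Below is a minimal NumPy sketch on synthetic data of my own (all names and values are illustrative, not from the course) showing how a larger λ shrinks the weights:

```python
import numpy as np

# Synthetic data: y = X @ true_w + noise
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, size=50)

def ridge(X, y, lam):
    """Closed-form Ridge solution: w = (X^T X + lam*I)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_small = ridge(X, y, lam=0.01)   # light penalty -> close to ordinary least squares
w_large = ridge(X, y, lam=100.0)  # heavy penalty -> weights shrink toward zero

print("lam=0.01 :", np.round(w_small, 3))
print("lam=100  :", np.round(w_large, 3))
```

Lasso and Elastic Net have no closed form (the L1 penalty is not differentiable at zero) and are fitted iteratively, e.g. by coordinate descent in scikit-learn.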
    • My GitHub code demo downloads (please fork the repo and run it in your own GitHub Codespace, so you don't overwrite the original code).
    • How to solve it? Run on an earlier Python version. Here are some ideas from AI chatbots: 







    • My solution / my point of view: compatibility issues with the new Python 3.13.3: 



    • Nov-05-2025:
    • Jupyter Notebook:   Linear Regression Coding Demo.ipynb - Video Lecture
    •      /GitHub/Penn-Py-Intro_Predic_analysis-_Regress/Demo_Codes/Linear Regression/


















    • Terminal command history records (for reference): 







  • 2026-Jan-22: Ch1. Python Beginner: Deep Learning + PyTorch Intro | Deep Learning | Neural Network | Tutorial | Cantonese - By: kfsoft
    •  



    •  





    • Others (Deep Learning):
    • Lesson 4 - RNN, LSTM, GRU, seq2seq    • Python Beginner: Deep Learning, RNN recurrent neural networks, LSTM long short-term memory, GRU...
    • Lesson 5 - Transformer, attention mechanism    • Python Beginner: Deep Learning, Transformer, Attention, the attention mechanism...  
    •  (00:01:50)


    •  (00:01:53) Demo (details)


    • This is a long tutorial video on Python deep learning and PyTorch fundamentals (about 5 hours). The content is very solid, starting from the basic theory of deep learning and going all the way to PyTorch's core operations and hands-on practice.

      Below is a detailed list of the video's contents, organized with approximate timestamps:

      Part 1: Deep Learning Fundamentals (Deep Learning Concepts)

      • [00:02] Course introduction: the concepts of deep learning, PyTorch, and tensors.

      • [02:36] Machine learning overview

        • The relationship between AI vs Machine Learning vs Deep Learning.

        • Supervised Learning vs Unsupervised Learning.

        • The difference between Regression and Classification.

      • [07:30] Neural Networks

        • How neural networks work; layer structure (Input, Hidden, Output Layers).

        • The concept of a fully connected layer (Fully Connected Layer / Linear Layer).

        • How depth (Deep) and width (Wide) affect the model.

        • The training process: Forward Pass, Loss Calculation, Backward Pass (Gradient Descent).

      Part 2: Iris Classification Project Preview

      • [01:30:37] Hands-on case introduction: classification using the classic Iris dataset.

      • Data handling: mapping features (petal/sepal length and width) to labels (species).

      • Model architecture: designing a simple MLP (Multi-Layer Perceptron) for the classification problem.

      • Training workflow: introducing the concepts of epochs, batch size, and the optimizer (SGD).

      Part 3: PyTorch Tensor Basics

      • [01:52:20] Environment setup: how to install PyTorch (CPU vs GPU versions).

      • [01:53:41] Introduction to tensors: what is a tensor, and how it relates to a NumPy array.

      • [01:54:47] Creating tensors

        • Creating from a list with torch.tensor().

        • Data types (dtype): Float, Int, etc.

        • Tensor attributes: shape, device, dtype, requires_grad.

      • [02:12:35] Tensor dimensions and indexing (Indexing & Slicing)

        • Operating on 1D vectors and 2D matrices.

        • Accessing and modifying specific elements.

        • Slicing with [start:end:step].

      • [02:31:17] Creating special tensors

        • torch.empty(), torch.zeros(), torch.ones()

        • torch.arange(), torch.linspace()

        • torch.rand(), torch.randn() (normal distribution), torch.randint()

      Part 4: Tensor Shape Manipulation and Advanced Operations

      • [02:39:11] Copying and memory: clone(), detach()

      • [02:40:06] Shape transformations

        • view() vs reshape(): how they differ and how memory contiguity (Contiguous) matters.

        • transpose(): transposing a matrix.

      • [04:22:06] Dimension adjustments

        • squeeze() (remove size-1 dimensions) and unsqueeze() (add a dimension).

      • [04:26:06] Sorting and extrema

        • sort(): sorting.

        • topk(): taking the largest/smallest K values.

      Part 5: Math Operations and Broadcasting

      • [03:27:49] Element-wise operations: addition, subtraction, multiplication, division.

      • [03:38:35] Broadcasting

        • How PyTorch automatically expands dimensions when two tensors have different shapes.

        • Broadcasting rules and conditions (compatible dimensions).

      • [03:42:24] Matrix multiplication

        • torch.matmul(), torch.mm(), the @ operator.

        • Vector-matrix and matrix-matrix multiplication rules.

        • The dot product.

      Part 6: Comparison and Masking

      • [03:53:13] Comparison operations: eq (equal), gt (greater than), etc., returning a Boolean tensor.

      • [03:56:11] Logical operations: all(), any()

      • [04:08:13] Boolean masking: selecting or modifying specific values in a tensor using conditions (e.g., setting all values below 0 to 0).

      • [04:14:08] Fancy indexing: selecting data with an index array.

      Part 7: Device Management (CPU vs GPU)

      • [04:32:20] CPU and GPU

        • Checking whether a GPU is available (torch.cuda.is_available()).

        • .to(device): moving a tensor to the CPU or GPU (CUDA).

      • [04:35:08] Performance test: measuring the speed difference between CPU and GPU on large matrix operations.

      Part 8: Autograd (Automatic Differentiation)

      • [04:46:35] Core Autograd concepts

        • requires_grad=True: tracking the operation history so gradients can be computed.

        • The concept of the computational graph.

        • backward(): backpropagation to compute gradients.

        • .grad: inspecting the computed gradient values.

      • [05:05:30] Stopping gradient tracking

        • How to use detach().

        • When to use with torch.no_grad(): (e.g., while evaluating a model).

      • Gradient accumulation: why the training loop needs optimizer.zero_grad() (to clear gradients manually).

      This video works well as an introductory reference for PyTorch, explaining in detail the underlying logic and memory behavior of each function.
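      The broadcasting rules summarized in Part 5 can be sketched with NumPy, whose semantics PyTorch's broadcasting follows (this is my own small illustration; replace np with torch and the results are the same):

```python
import numpy as np  # PyTorch broadcasting follows the same rules as NumPy

a = np.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
b = np.array([10, 20, 30])       # shape (3,)  -> stretched across both rows
print(a + b)                     # [[10 21 32] [13 24 35]]

col = np.array([[100], [200]])   # shape (2, 1) -> stretched across all columns
print(a + col)                   # [[100 101 102] [203 204 205]]
```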

    • The video's content is very rich, covering everything from basic PyTorch operations to implementing a complete Iris classification model.

      Although I found the author's GitHub profile (kfsoft) in the search results, there is no code repository corresponding directly to this tutorial video. Based on the video content and transcript, two core Jupyter Notebook demo files were reconstructed; you can copy them directly into Colab or Jupyter Notebook and run them.

      File 1: PyTorch Basics and Autograd (00:00:00 - 01:29:00)

      This part corresponds to the first half of the video, covering tensor operations and automatic differentiation.

      Python
      # File name: 01_pytorch_basics.ipynb
      
      import torch
      import numpy as np
      
      # --- 1. Tensor creation & attributes ---
      print("=== Tensor Basics ===")
      # Create a tensor
      x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)
      print(f"Tensor x:\n{x}")
      print(f"Shape: {x.shape}")
      print(f"Type: {x.dtype}")
      print(f"Device: {x.device}")
      
      # Special tensors
      zeros = torch.zeros(2, 3)
      ones = torch.ones(2, 3)
      rand = torch.rand(2, 3)  # Uniform [0, 1)
      print(f"Random Tensor:\n{rand}")
      
      # --- 2. Shape manipulation (Reshape/View) ---
      print("\n=== Shape Manipulation ===")
      flat_x = x.view(6) # flatten to 1D
      reshaped_x = x.reshape(3, 2) # reshape to 3x2
      print(f"Reshaped (3x2):\n{reshaped_x}")

      # Add / remove dimensions
      unsqueezed = x.unsqueeze(0) # add a batch dimension -> shape [1, 2, 3]
      print(f"Unsqueezed shape: {unsqueezed.shape}")
      squeezed = unsqueezed.squeeze() # squeeze it back out -> shape [2, 3]
      print(f"Squeezed shape: {squeezed.shape}")
      
      # --- 3. Automatic differentiation (Autograd) ---
      print("\n=== Autograd Demo ===")
      # Set requires_grad=True to track gradients
      w = torch.tensor([1.0], requires_grad=True)
      x_in = torch.tensor([2.0])
      b = torch.tensor([3.0], requires_grad=True)
      
      # Define the computational graph: y = w * x + b
      y = w * x_in + b
      target = torch.tensor([10.0])
      
      # Loss function: L = (y - target)^2
      loss = (y - target)**2
      
      # Backpropagation
      loss.backward()
      
      print(f"Input x: {x_in.item()}")
      print(f"Weights w: {w.item()}, Bias b: {b.item()}")
      print(f"Loss: {loss.item()}")
      print(f"dL/dw (w.grad): {w.grad}") # should be 2 * (wx+b - target) * x
      print(f"dL/db (b.grad): {b.grad}")
      

      File 2: Iris Classification in Practice (01:30:00 - end)

      This part corresponds to the second half of the video: implementing a complete MLP neural network to classify iris species.

      Python
      # File name: 02_iris_classification.ipynb
      
      import torch
      import torch.nn as nn
      import torch.optim as optim
      from torch.utils.data import Dataset, DataLoader
      import pandas as pd
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import LabelEncoder
      
      # --- 1. Data preparation ---
      # Load the Iris dataset from the UCI repository (or upload iris.csv and read it locally)
      url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
      column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
      df = pd.read_csv(url, header=None, names=column_names)
      
      # Encode the text labels as integers (Iris-setosa -> 0, etc.)
      encoder = LabelEncoder()
      df['species'] = encoder.fit_transform(df['species'])
      
      X = df.drop('species', axis=1).values.astype('float32')
      y = df['species'].values.astype('int64')
      
      # Split into training and test sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
      
      # --- 2. Define the Dataset class ---
      class IrisDataset(Dataset):
          def __init__(self, features, labels):
              self.features = torch.tensor(features)
              self.labels = torch.tensor(labels)
              
          def __len__(self):
              return len(self.features)
          
          def __getitem__(self, idx):
              return self.features[idx], self.labels[idx]
      
      train_dataset = IrisDataset(X_train, y_train)
      test_dataset = IrisDataset(X_test, y_test)
      
      train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
      test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)
      
      # --- 3. Define the neural network model (MLP) ---
      class IrisNet(nn.Module):
          def __init__(self):
              super(IrisNet, self).__init__()
              # Input: 4 features, Hidden: 8 neurons, Output: 3 classes
              self.layer1 = nn.Linear(4, 8)
              self.relu = nn.ReLU()
              self.layer2 = nn.Linear(8, 3) 
              
          def forward(self, x):
              x = self.layer1(x)
              x = self.relu(x)
              x = self.layer2(x)
              return x
      
      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
      model = IrisNet().to(device)
      
      # --- 4. Training loop ---
      criterion = nn.CrossEntropyLoss()
      optimizer = optim.SGD(model.parameters(), lr=0.01)
      
      print("Starting training...")
      epochs = 100
      for epoch in range(epochs):
          model.train()
          running_loss = 0.0
          for inputs, labels in train_loader:
              inputs, labels = inputs.to(device), labels.to(device)
              
              optimizer.zero_grad()       # zero the gradients
              outputs = model(inputs)     # forward pass
              loss = criterion(outputs, labels) # compute the loss
              loss.backward()             # backward pass
              optimizer.step()            # update the weights
              
              running_loss += loss.item()
              
          if (epoch+1) % 10 == 0:
              print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss/len(train_loader):.4f}")
      
      # --- 5. Evaluate the model ---
      model.eval()
      correct = 0
      total = 0
      with torch.no_grad():
          for inputs, labels in test_loader:
              inputs, labels = inputs.to(device), labels.to(device)
              outputs = model(inputs)
              _, predicted = torch.max(outputs.data, 1)
              total += labels.size(0)
              correct += (predicted == labels).sum().item()
      
      accuracy = 100 * correct / total
      print(f"Test set accuracy: {accuracy:.2f}%")
      

      This code fully corresponds to the video's walkthrough of PyTorch tensor basics and building a neural-network classifier.

    •  (00:02:45) 


    • AI ⊃ ML (Machine Learning) ⊃ DL (Deep Learning)

    •  


    • (00:03:23) Machine Learning (Classifications)

    •  (00:04:29) Supervised Learning


    •  (00:07:25) Deep Learning  -->  NN (Neural Net)


    • Training Process  -->  Update the parameters (adjust weights) of NN
    • (00:10:45) Neural Network - Two-Layers




    •  (00:14:30) Expand


    •  (00:16:43) 


    • NN structure  >  Training  >  Prediction
    • (00:19:13)


    • Tensor - the data structure for neural networks: tensors hold both the input data and the output data.
    •  (00:22:09) The loss value is also a tensor.










