Intro to Predictive Analytics Using Python

 - a course by the University of Pennsylvania 

My diary about the journey of my studies:





 

Intro to Predictive Analytics Using Python - 3 Modules


  • There are 3 modules in this course

    "Introduction to Predictive Analytics and Advanced Predictive Analytics Using Python" is specially designed to enhance your skills in building, refining, and implementing predictive models using Python. This course serves as a comprehensive introduction to predictive analytics, beginning with the fundamentals of linear and logistic regression. These models are the cornerstone of predictive analytics, enabling you to forecast future events by learning from historical data. We cover a bit of the theory behind these models, but in particular, their application in real-world scenarios​ and the process of evaluating their performance​ to ensure accuracy and reliability.​ As the course progresses, we delve deeper​ into the realm of machine learning​ with a focus on decision trees and random forests.​ These techniques represent a more advanced aspect​ of supervised learning, offering powerful tools​ for both classification and regression tasks.​ Through practical examples and hands-on exercises,​ you'll learn how to build these models,​ understand their intricacies, and apply them​ to complex datasets to identify patterns​ and make predictions. Additionally, we introduce the concepts​ of unsupervised learning and clustering, broadening your analytics toolkit,​ and providing you with the skills to tackle data without predefined labels or categories.​ By the end of this course, you'll not only have a thorough understanding​ of various predictive analytics techniques,​ but also be capable of applying these techniques to solve real-world problems,​ setting the stage for continued growth​ and exploration in the field of data analytics.



    ***** -------------------------------------------------------------------------------------------------------- *****

 
    • MOOC3: 

  • 03-Oct-2025: Module 1: Intro to Predictive Analytics Using Python




    • About the instructor: 

    • Supervised Machine Learning: 








  • Oct-10-2025: Module 1 - Lesson 2: Supervised Predictive Models
    • Typical Machine Learning Pipeline: 
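      The pipeline diagram itself isn't reproduced in this text. As a rough sketch of the idea (my own illustration with toy data, assuming generic scikit-learn steps), the pipeline chains preprocessing and a learner, and fitting it on training data produces the model f:

```python
# A rough sketch of the typical ML pipeline: split the data, fit a
# model on the training portion, evaluate on held-out data.
# Toy data via make_regression; illustration only, not course data.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=150, n_features=3, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chaining preprocessing and a learner; fitting yields the model f
# that maps new inputs to predicted outputs.
f = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)
print("R^2 on held-out data:", f.score(X_test, y_test))
```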





  • Oct-13-2025: Module 1 - Week 1 - Linear Regression










    • Script: >> Recall the ML pipeline that ultimately produces a model f. In Linear Regression, we make an assumption that the model is a linear model. In other words, linear regression assumes a linear relation between some given inputs and the target output. For example, imagine we want to predict the age of a customer based on the number of orders they've made for a certain product. In this case, the number of orders they've made is the input variable and the age of the customer is the output variable. To perform linear regression, we gather data on various customers, including their orders and ages. We plot this data on a graph with the x-axis representing the number of orders and the y-axis representing the age. 
      Each data point is represented by a blue circle on the graph. The goal of linear regression is to find a line that best fits the data points. This line is represented by a red line on the graph. The position and slope of this line are determined through a mathematical process that minimizes the distance between the line and the data points. Once we have this line, we can use it to make predictions. For example, if we have the number of orders for a new customer, we can use the line to estimate their age. The line represents the learned model that allows us to predict the output variable age based on the input variable number of orders. 
      In linear regression, the model parameters refer to the values that determine the specific characteristics of the linear model. These parameters define the slope and intercept of the line that represents the relationship between the input variable and the output variable. Let's go back to the example of predicting customer ages based on their number of orders. In this case, the parameters of the linear regression model are as follows. The slope represents the change in the output variable age for a unit change in the input variable number of orders. The intercept is the point where the line intersects the y-axis. How do we find the most optimal model parameters? 
      If the training dataset has a significant trend, we can easily draw out a line that looks optimal rather quickly. But for noisy real-life datasets, finding the most optimal model is not as straightforward. During the training process of linear regression, the algorithm learns the optimal values for these parameters by minimizing the difference between the predicted values of the model and the actual values from the training data. We need a metric that measures how well the model performs. For machine learning problems, a loss function is usually defined to evaluate the model. Intuitively, the loss function would have small values if the predicted output is close to the desired output and large values otherwise. A commonly used loss function is the Mean Squared Error or MSE Loss Function. 
      This loss represents the average squared difference between the predicted output and the desired output across all samples in the training dataset. The optimal linear regression model minimizes the MSE loss. In practice, we can use the mean squared error function in the scikit-learn machine learning library to calculate the MSE. A simple linear regression involves one independent variable and one dependent variable. A multivariate linear regression extends to include multiple independent variables. Overfitting occurs when the learned model fits the training dataset very well but fails to generalize to new examples. The model on the right has no loss on the training data but does not actually capture the patterns of the dataset. 
      Overfitting is a common issue that can occur when using complex models. When a model becomes too complex, it may start to capture noise or random fluctuations in the training data, leading to poor generalization to new unseen data. To detect overfitting, a train/test split protocol is typically used, where we use held-out test data to estimate the loss on new unseen data. Here's what the process looks like. Step one, randomly shuffle the dataset and split it into a training set and a testing set. Step two, train the model on the training set. Step three, evaluate the learned model on both the training set and the testing set to obtain a training loss and a test loss, also called generalization loss. 
      Other metrics can also be used. A good model should have both low training loss, which indicates that it fits well to training data, and low test loss, which indicates that it generalizes well to unseen data. 
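      A minimal sketch of that three-step protocol with scikit-learn, using made-up orders → age data (the dataset and numbers are my own illustration, not from the course):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: age roughly linear in the number of orders, plus noise.
rng = np.random.default_rng(42)
orders = rng.integers(1, 40, size=300).reshape(-1, 1)     # input variable
age = 18 + 1.2 * orders.ravel() + rng.normal(0, 5, 300)   # output variable

# Step 1: randomly shuffle the dataset and split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    orders, age, test_size=0.2, shuffle=True, random_state=0)

# Step 2: train the model on the training set.
model = LinearRegression().fit(X_train, y_train)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Step 3: evaluate on both sets; a large gap between the two losses
# would be a sign of overfitting.
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"training MSE: {train_mse:.2f}, test MSE: {test_mse:.2f}")
```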
    • The Mean Squared Error (MSE) Loss Function: 
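      (The formula image isn't reproduced in this text; the standard definition is MSE = (1/n) · Σ (yᵢ − ŷᵢ)², the average of the squared difference between the prediction ŷᵢ and the true value yᵢ over all n training samples.)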


    • In practice, the function mean_squared_error can be found in scikit-learn.


    • Two Types of Linear Regression: 


    • Overfitting & How to assess it: 



    • Overfitting & How to assess it:  

    • Types of Linear Regression
      Simple Linear Regression: Involves one independent variable and one dependent variable.
      Multivariate Linear Regression: Extends to include multiple independent variables.
    • Polynomial Linear Regression: Allows for non-linear relationships between the dependent and independent variables by incorporating polynomial terms, enabling the model to capture non-linear patterns in the data.
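      A short sketch of polynomial regression via scikit-learn's PolynomialFeatures (synthetic quadratic data, my own illustration): the feature expansion lets a model that is still linear in its parameters fit a curve in x.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic data with a quadratic trend plus noise.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = 0.5 * X.ravel() ** 2 - X.ravel() + rng.normal(0, 0.5, 100)

# Degree-2 features turn x into [1, x, x^2]; the regression stays
# linear in the parameters but the fitted curve is non-linear in x.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print("prediction at x = 2.0:", poly_model.predict([[2.0]]))
```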


    • Regularized Linear Regression: A form of linear regression that addresses multicollinearity (high correlation between independent variables) and helps mitigate overfitting by adding a penalty term to the loss function that controls the complexity of the model.
      Some common regularized linear regression variants and their corresponding loss functions (in one standard formulation, as used by scikit-learn; the original formula images aren't in this text) are:
      • Ridge Regression: J(w) = MSE(w) + λ · Σ wᵢ²  (an L2 penalty)

      • Lasso Regression: J(w) = MSE(w) + λ · Σ |wᵢ|  (an L1 penalty)

      • Elastic Net Regression: J(w) = MSE(w) + r · λ · Σ |wᵢ| + ((1 − r) / 2) · λ · Σ wᵢ²  (a mix of both)

      where w represents the model parameters and r, λ are hyperparameters that determine the extent of penalization.
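      A minimal sketch of the three variants in scikit-learn (synthetic data, illustration only); note that the library calls the penalty strength alpha (the λ above) and the Elastic Net mixing ratio l1_ratio (the r above):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic regression data with noisy features.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty, drives weights to zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

# Lasso and Elastic Net tend to produce sparse weight vectors.
for name, m in [("ridge", ridge), ("lasso", lasso), ("elastic net", enet)]:
    print(name, "non-zero weights:", int((m.coef_ != 0).sum()))
```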
    • My GitHub code demo downloads (please fork the repo and run it in your own GitHub Codespace, so you don't overwrite the original code).
    • How to solve it? Run it on an earlier Python version. Here are some ideas from AI chatbots: 







    • My solution / my point of view: these are compatibility issues with the new Python version, 3.13.3: 
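      Since the exact error isn't reproduced here, a quick diagnostic sketch (my own addition, not from the course) to confirm which interpreter and library versions are actually running before downgrading:

```python
import sys
import numpy as np
import sklearn

# Compatibility problems often come from a Python release that is newer
# than what the installed libraries support; check the versions first.
print("Python:", sys.version)
print("NumPy:", np.__version__)
print("scikit-learn:", sklearn.__version__)
```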










