Intro to Predictive Analytics Using Python - University of Pennsylvania

 

Intro to Predictive Analytics Using Python

 - a course by the University of Pennsylvania 

My diary of the study journey:





 

Intro to Predictive Analytics Using Python - 3 Modules


  • There are 3 modules in this course

    "Introduction to Predictive Analytics and Advanced Predictive Analytics Using Python" is specially designed to enhance your skills in building, refining, and implementing predictive models using Python. This course serves as a comprehensive introduction to predictive analytics, beginning with the fundamentals of linear and logistic regression. These models are the cornerstone of predictive analytics, enabling you to forecast future events by learning from historical data. We cover a bit of the theory behind these models, but focus in particular on their application in real-world scenarios and the process of evaluating their performance to ensure accuracy and reliability.

    As the course progresses, we delve deeper into the realm of machine learning with a focus on decision trees and random forests. These techniques represent a more advanced aspect of supervised learning, offering powerful tools for both classification and regression tasks. Through practical examples and hands-on exercises, you'll learn how to build these models, understand their intricacies, and apply them to complex datasets to identify patterns and make predictions.

    Additionally, we introduce the concepts of unsupervised learning and clustering, broadening your analytics toolkit and providing you with the skills to tackle data without predefined labels or categories. By the end of this course, you'll not only have a thorough understanding of various predictive analytics techniques, but also be capable of applying these techniques to solve real-world problems, setting the stage for continued growth and exploration in the field of data analytics.



 
    • MOOC3: 

  • 03-Oct-2025: Module 1: Intro to Predictive Analytics Using Python




    • About the instructor: 

    • Supervised Machine-Learning: 




    •  




  • Oct-10-2025: Module 1 - Lesson 2: Supervised Predictive Models
    • Typical Machine Learning Pipeline: 





  • Oct-13-2025: Module 1 - Week 1 - Linear Regression










    • Script: >> Recall the ML pipeline that ultimately produces a model f. In Linear Regression, we make an assumption that the model is a linear model. In other words, linear regression assumes a linear relation between some given inputs and the target output. For example, imagine we want to predict the age of a customer based on the number of orders they've made for a certain product. In this case, the number of orders they've made is the input variable and the age of the customer is the output variable. To perform linear regression, we gather data on various customers, including their orders and ages. We plot this data on a graph with the x-axis representing the number of orders and the y-axis representing the age. 
      Each data point is represented by a blue circle on the graph. The goal of linear regression is to find a line that best fits the data points. This line is represented by a red line on the graph. The position and slope of this line are determined through a mathematical process that minimizes the distance between the line and the data points. Once we have this line, we can use it to make predictions. For example, if we have the number of orders for a new customer, we can use the line to estimate their age. The line represents the learned model that allows us to predict the output variable age based on the input variable number of orders. 
      In linear regression, the model parameters refer to the values that determine the specific characteristics of the linear model. These parameters define the slope and intercept of the line that represents the relationship between the input variable and the output variable. Let's go back to the example of predicting customer ages based on their number of orders. In this case, the parameters of the linear regression model are as follows. The slope represents the change in the output variable age for a unit change in the input variable number of orders. The intercept is the point where the line intersects the y-axis. How do we find the most optimal model parameters? 
      If the training dataset has a significant trend, we can easily draw out a line that looks optimal rather quickly. But for noisy real-life datasets, finding the most optimal model is not as straightforward. During the training process of linear regression, the algorithm learns the optimal values for these parameters by minimizing the difference between the predicted values of the model and the actual values from the training data. We need a metric that measures how well the model performs. For machine learning problems, a loss function is usually defined to evaluate the model. Intuitively, the loss function would have small values if the predicted output is close to the desired output and large values otherwise. A commonly used loss function is the Mean Squared Error or MSE Loss Function. 
      This loss represents the average squared difference between the predicted output and the desired output across all samples in the training dataset. The optimal linear regression model minimizes the MSE loss. In practice, we can use the mean squared error function in the scikit-learn machine learning library to calculate the MSE. A simple linear regression involves one independent variable and one dependent variable. A multivariate linear regression extends to include multiple independent variables. Overfitting occurs when the learned model fits the training dataset very well but fails to generalize to new examples. The model on the right has no loss on the training data but does not actually capture the patterns of the dataset. 
      Overfitting is a common issue that can occur when using complex models. When a model becomes too complex, it may start to capture noise or random fluctuations in the training data, leading to poor generalization to new unseen data. To detect overfitting, a train/test split protocol is typically used, where we use held-out test data to estimate the loss on new, unseen data. Here's what the process looks like. Step one, randomly shuffle the dataset and split it into a training set and a testing set. Step two, train the model on the training set. Step three, evaluate the learned model on both the training set and the testing set to obtain a training loss and a test loss, also called generalization loss. 
      Other metrics can also be used. A good model should have both low training loss, which indicates that it fits well to training data, and low test loss, which indicates that it generalizes well to unseen data. 
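    • The three-step protocol above can be sketched in plain Python. This is my own minimal illustration with a made-up orders/age dataset (not code from the course), fitting simple linear regression in closed form and computing the MSE loss by hand:

```python
import random

# Hypothetical tiny dataset: (number of orders, customer age)
data = [(1, 22), (2, 25), (3, 31), (4, 33), (5, 38), (6, 41), (7, 45), (8, 47)]

# Step 1: randomly shuffle, then split into a training set and a testing set
random.seed(0)
random.shuffle(data)
split = int(0.75 * len(data))
train, test = data[:split], data[split:]

# Step 2: train on the training set (closed-form simple linear regression)
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

# Step 3: evaluate on both sets -> training loss and test (generalization) loss
def mse(points):
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)

print(f"train MSE: {mse(train):.3f}, test MSE: {mse(test):.3f}")
```

A good model should show both losses low and close to each other; a large gap between training MSE and test MSE is the overfitting signal described above.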
    • The Mean Square Error (MSE) Loss Function: 


    • In practice, the function mean_squared_error can be found in scikit-learn


    • Two Types of Linear Regression: 


    • Overfitting & How to assess it: 




    • Types of Linear Regression
      Simple Linear Regression: Involves one independent variable and one dependent variable.
      Multivariate Linear Regression: Extends to include multiple independent variables.
    • Polynomial Linear Regression: Allows for non-linear relationships between the dependent and independent variables by incorporating polynomial terms, enabling the model to capture non-linear patterns in the data
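    • As a sketch of that idea (hypothetical quadratic data of my own, using NumPy's polyfit rather than any code from the course), a degree-2 polynomial fit captures curvature that a straight line misses:

```python
import numpy as np

# Hypothetical data following y = 2 + 0.5*x^2 plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = 2 + 0.5 * x**2 + rng.normal(0, 0.2, size=x.shape)

# Fit a straight line (degree 1) and a quadratic (degree 2)
lin_coeffs = np.polyfit(x, y, deg=1)
poly_coeffs = np.polyfit(x, y, deg=2)

def mse(coeffs):
    pred = np.polyval(coeffs, x)
    return float(np.mean((pred - y) ** 2))

print(f"linear MSE: {mse(lin_coeffs):.3f}, quadratic MSE: {mse(poly_coeffs):.3f}")
```

The quadratic's MSE should be far smaller, since the linear model cannot represent the symmetric curvature in the data.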


    • Regularized Linear Regression: A form of linear regression that addresses multicollinearity (high correlation between independent variables) and helps mitigate overfitting by adding a penalty term to the loss function that controls the complexity of the model.
      Some common regularized linear regression and corresponding loss functions are:
      • Ridge Regression: 

      • Lasso Regression:   

      • Elastic Net Regression:  

      where w represents the model parameters and r, 𝜆 are hyperparameters that determine the extent of penalization.
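    • For Ridge regression specifically, the penalized loss has a closed-form minimizer, w = (XᵀX + λI)⁻¹Xᵀy. Below is a minimal NumPy sketch on synthetic data of my own (all names and values are illustrative, not from the course) showing how a larger λ shrinks the weights:

```python
import numpy as np

# Synthetic data: y = X @ true_w + noise
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, size=50)

def ridge(X, y, lam):
    """Closed-form Ridge solution: w = (X^T X + lam*I)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_small = ridge(X, y, lam=0.01)   # light penalty -> close to ordinary least squares
w_large = ridge(X, y, lam=100.0)  # heavy penalty -> weights shrink toward zero

print("lam=0.01 :", np.round(w_small, 3))
print("lam=100  :", np.round(w_large, 3))
```

Lasso and Elastic Net have no closed form (the L1 penalty is not differentiable at zero) and are fitted iteratively, e.g. by coordinate descent in scikit-learn.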
    • My GitHub code demo downloads (please fork the repo and run it in your own GitHub Codespace, so you don't overwrite the original code).
    • How to solve it? Run on an earlier Python version. Here are some ideas from AI chatbots: 







    • My solution / my point of view: compatibility issues with the new Python 3.13.3: 



    • Nov-05-2025:
    • Jupyter Notebook:   Linear Regression Coding Demo.ipynb - Video Lecture
    •      /GitHub/Penn-Py-Intro_Predic_analysis-_Regress/Demo_Codes/Linear Regression/


















    • Terminal command history records (for reference): 







  • 2026-Jan-22: Ch1. Python Beginner: Deep Learning + PyTorch Intro | Deep Learning | Neural Network | Tutorial | Cantonese - By: kfsoft
    •  



    •  





    • Others (Deep Learning):
    • Lesson 4 - RNN, LSTM, GRU, seq2seq    • Python Beginner: Deep Learning, RNN recurrent neural networks, LSTM long short-term memory, GRU...
    • Lesson 5 - Transformer, attention mechanism    • Python Beginner: Deep Learning, Transformer, Attention, the attention mechanism...  
    •  (00:01:50)


    •  (00:01:53) Demo (details)


    • This is a long tutorial video on Python deep learning and PyTorch fundamentals (about 5 hours). The content is very solid, starting from the basic theory of deep learning and going all the way to PyTorch's core operations and hands-on practice.

      Below is a detailed list of the video's contents, organized with approximate timestamps:

      Part 1: Deep Learning Fundamentals (Deep Learning Concepts)

      • [00:02] Course introduction: the concepts of deep learning, PyTorch, and tensors.

      • [02:36] Machine learning overview

        • The relationship between AI vs Machine Learning vs Deep Learning.

        • Supervised Learning vs Unsupervised Learning.

        • The difference between Regression and Classification.

      • [07:30] Neural Networks

        • How neural networks work; layer structure (Input, Hidden, Output Layers).

        • The concept of a fully connected layer (Fully Connected Layer / Linear Layer).

        • How depth (Deep) and width (Wide) affect the model.

        • The training process: Forward Pass, Loss Calculation, Backward Pass (Gradient Descent).

      Part 2: Iris Classification Project Preview

      • [01:30:37] Hands-on case introduction: classification using the classic Iris dataset.

      • Data handling: mapping features (petal/sepal length and width) to labels (species).

      • Model architecture: designing a simple MLP (Multi-Layer Perceptron) for the classification problem.

      • Training workflow: introducing the concepts of epochs, batch size, and the optimizer (SGD).

      Part 3: PyTorch Tensor Basics

      • [01:52:20] Environment setup: how to install PyTorch (CPU vs GPU versions).

      • [01:53:41] Introduction to tensors: what is a tensor, and how it relates to a NumPy array.

      • [01:54:47] Creating tensors

        • Creating from a list with torch.tensor().

        • Data types (dtype): Float, Int, etc.

        • Tensor attributes: shape, device, dtype, requires_grad.

      • [02:12:35] Tensor dimensions and indexing (Indexing & Slicing)

        • Operating on 1D vectors and 2D matrices.

        • Accessing and modifying specific elements.

        • Slicing with [start:end:step].

      • [02:31:17] Creating special tensors

        • torch.empty(), torch.zeros(), torch.ones()

        • torch.arange(), torch.linspace()

        • torch.rand(), torch.randn() (normal distribution), torch.randint()

      Part 4: Tensor Shape Manipulation and Advanced Operations

      • [02:39:11] Copying and memory: clone(), detach()

      • [02:40:06] Shape transformations

        • view() vs reshape(): how they differ and how memory contiguity (Contiguous) matters.

        • transpose(): transposing a matrix.

      • [04:22:06] Dimension adjustments

        • squeeze() (remove size-1 dimensions) and unsqueeze() (add a dimension).

      • [04:26:06] Sorting and extrema

        • sort(): sorting.

        • topk(): taking the largest/smallest K values.

      Part 5: Math Operations and Broadcasting

      • [03:27:49] Element-wise operations: addition, subtraction, multiplication, division.

      • [03:38:35] Broadcasting

        • How PyTorch automatically expands dimensions when two tensors have different shapes.

        • Broadcasting rules and conditions (compatible dimensions).

      • [03:42:24] Matrix multiplication

        • torch.matmul(), torch.mm(), the @ operator.

        • Vector-matrix and matrix-matrix multiplication rules.

        • The dot product.

      Part 6: Comparison and Masking

      • [03:53:13] Comparison operations: eq (equal), gt (greater than), etc., returning a Boolean tensor.

      • [03:56:11] Logical operations: all(), any()

      • [04:08:13] Boolean masking: selecting or modifying specific values in a tensor using conditions (e.g., setting all values below 0 to 0).

      • [04:14:08] Fancy indexing: selecting data with an index array.

      Part 7: Device Management (CPU vs GPU)

      • [04:32:20] CPU and GPU

        • Checking whether a GPU is available (torch.cuda.is_available()).

        • .to(device): moving a tensor to the CPU or GPU (CUDA).

      • [04:35:08] Performance test: measuring the speed difference between CPU and GPU on large matrix operations.

      Part 8: Autograd (Automatic Differentiation)

      • [04:46:35] Core Autograd concepts

        • requires_grad=True: tracking the operation history so gradients can be computed.

        • The concept of the computational graph.

        • backward(): backpropagation to compute gradients.

        • .grad: inspecting the computed gradient values.

      • [05:05:30] Stopping gradient tracking

        • How to use detach().

        • When to use with torch.no_grad(): (e.g., while evaluating a model).

      • Gradient accumulation: why the training loop needs optimizer.zero_grad() (to clear gradients manually).

      This video works well as an introductory reference for PyTorch, explaining in detail the underlying logic and memory behavior of each function.
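      The broadcasting rules summarized in Part 5 can be sketched with NumPy, whose semantics PyTorch's broadcasting follows (this is my own small illustration; replace np with torch and the results are the same):

```python
import numpy as np  # PyTorch broadcasting follows the same rules as NumPy

a = np.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
b = np.array([10, 20, 30])       # shape (3,)  -> stretched across both rows
print(a + b)                     # [[10 21 32] [13 24 35]]

col = np.array([[100], [200]])   # shape (2, 1) -> stretched across all columns
print(a + col)                   # [[100 101 102] [203 204 205]]
```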

    • The video's content is very rich, covering everything from basic PyTorch operations to implementing a complete Iris classification model.

      Although I found the author's GitHub profile (kfsoft) in the search results, there is no code repository corresponding directly to this tutorial video. Based on the video content and transcript, two core Jupyter Notebook demo files were reconstructed; you can copy them directly into Colab or Jupyter Notebook and run them.

      File 1: PyTorch Basics and Autograd (00:00:00 - 01:29:00)

      This part corresponds to the first half of the video, covering tensor operations and automatic differentiation.

      Python
      # File name: 01_pytorch_basics.ipynb
      
      import torch
      import numpy as np
      
      # --- 1. Tensor creation & attributes ---
      print("=== Tensor Basics ===")
      # Create a tensor
      x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)
      print(f"Tensor x:\n{x}")
      print(f"Shape: {x.shape}")
      print(f"Type: {x.dtype}")
      print(f"Device: {x.device}")
      
      # Special tensors
      zeros = torch.zeros(2, 3)
      ones = torch.ones(2, 3)
      rand = torch.rand(2, 3)  # Uniform [0, 1)
      print(f"Random Tensor:\n{rand}")
      
      # --- 2. Shape manipulation (Reshape/View) ---
      print("\n=== Shape Manipulation ===")
      flat_x = x.view(6) # flatten to 1D
      reshaped_x = x.reshape(3, 2) # reshape to 3x2
      print(f"Reshaped (3x2):\n{reshaped_x}")

      # Add / remove dimensions
      unsqueezed = x.unsqueeze(0) # add a batch dimension -> shape [1, 2, 3]
      print(f"Unsqueezed shape: {unsqueezed.shape}")
      squeezed = unsqueezed.squeeze() # squeeze it back out -> shape [2, 3]
      print(f"Squeezed shape: {squeezed.shape}")
      
      # --- 3. Automatic differentiation (Autograd) ---
      print("\n=== Autograd Demo ===")
      # Set requires_grad=True to track gradients
      w = torch.tensor([1.0], requires_grad=True)
      x_in = torch.tensor([2.0])
      b = torch.tensor([3.0], requires_grad=True)
      
      # Define the computational graph: y = w * x + b
      y = w * x_in + b
      target = torch.tensor([10.0])
      
      # Loss function: L = (y - target)^2
      loss = (y - target)**2
      
      # Backpropagation
      loss.backward()
      
      print(f"Input x: {x_in.item()}")
      print(f"Weights w: {w.item()}, Bias b: {b.item()}")
      print(f"Loss: {loss.item()}")
      print(f"dL/dw (w.grad): {w.grad}") # should be 2 * (wx+b - target) * x
      print(f"dL/db (b.grad): {b.grad}")
      

      File 2: Iris Classification in Practice (01:30:00 - end)

      This part corresponds to the second half of the video: implementing a complete MLP neural network to classify iris species.

      Python
      # File name: 02_iris_classification.ipynb
      
      import torch
      import torch.nn as nn
      import torch.optim as optim
      from torch.utils.data import Dataset, DataLoader
      import pandas as pd
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import LabelEncoder
      
      # --- 1. Data preparation ---
      # Load the Iris dataset from the UCI repository (or upload iris.csv and read it locally)
      url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
      column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
      df = pd.read_csv(url, header=None, names=column_names)
      
      # Encode the text labels as integers (Iris-setosa -> 0, etc.)
      encoder = LabelEncoder()
      df['species'] = encoder.fit_transform(df['species'])
      
      X = df.drop('species', axis=1).values.astype('float32')
      y = df['species'].values.astype('int64')
      
      # Split into training and test sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
      
      # --- 2. Define the Dataset class ---
      class IrisDataset(Dataset):
          def __init__(self, features, labels):
              self.features = torch.tensor(features)
              self.labels = torch.tensor(labels)
              
          def __len__(self):
              return len(self.features)
          
          def __getitem__(self, idx):
              return self.features[idx], self.labels[idx]
      
      train_dataset = IrisDataset(X_train, y_train)
      test_dataset = IrisDataset(X_test, y_test)
      
      train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
      test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)
      
      # --- 3. Define the neural network model (MLP) ---
      class IrisNet(nn.Module):
          def __init__(self):
              super(IrisNet, self).__init__()
              # Input: 4 features, Hidden: 8 neurons, Output: 3 classes
              self.layer1 = nn.Linear(4, 8)
              self.relu = nn.ReLU()
              self.layer2 = nn.Linear(8, 3) 
              
          def forward(self, x):
              x = self.layer1(x)
              x = self.relu(x)
              x = self.layer2(x)
              return x
      
      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
      model = IrisNet().to(device)
      
      # --- 4. Training loop ---
      criterion = nn.CrossEntropyLoss()
      optimizer = optim.SGD(model.parameters(), lr=0.01)
      
      print("Starting training...")
      epochs = 100
      for epoch in range(epochs):
          model.train()
          running_loss = 0.0
          for inputs, labels in train_loader:
              inputs, labels = inputs.to(device), labels.to(device)
              
              optimizer.zero_grad()       # zero the gradients
              outputs = model(inputs)     # forward pass
              loss = criterion(outputs, labels) # compute the loss
              loss.backward()             # backward pass
              optimizer.step()            # update the weights
              
              running_loss += loss.item()
              
          if (epoch+1) % 10 == 0:
              print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss/len(train_loader):.4f}")
      
      # --- 5. Evaluate the model ---
      model.eval()
      correct = 0
      total = 0
      with torch.no_grad():
          for inputs, labels in test_loader:
              inputs, labels = inputs.to(device), labels.to(device)
              outputs = model(inputs)
              _, predicted = torch.max(outputs.data, 1)
              total += labels.size(0)
              correct += (predicted == labels).sum().item()
      
      accuracy = 100 * correct / total
      print(f"Test set accuracy: {accuracy:.2f}%")
      

      This code fully corresponds to the video's walkthrough of PyTorch tensor basics and building a neural-network classifier.

    •  (00:02:45) 


    • AI ⊃ ML (Machine Learning) ⊃ DL (Deep Learning)

    •  


    • (00:03:23) Machine Learning (Classifications)

    •  (00:04:29) Supervised Learning


    •  (00:07:25) Deep Learning  -->  NN (Neural Net)


    • Training Process  -->  Update the parameters (adjust weights) of NN
    • (00:10:45) Neural Network - Two-Layers




    •  (00:14:30) Expand


    •  (00:16:43) 


    • NN structure  >  Training  >  Prediction
    • (00:19:13)


    • Tensor - the data structure for neural networks: tensors hold both the input data and the output data.
    •  (00:22:09) The loss value is also a tensor.










