Intro to Predictive Analytics Using Python - University of Pennsylvania
- A Coursera course by the University of Pennsylvania
- Instructor: Brandon Krakowsky
There are 3 modules in this course.
"Introduction to Predictive Analytics and Advanced Predictive Analytics Using Python" is specially designed to enhance your skills in building, refining, and implementing predictive models using Python. This course serves as a comprehensive introduction to predictive analytics, beginning with the fundamentals of linear and logistic regression. These models are the cornerstone of predictive analytics, enabling you to forecast future events by learning from historical data. We cover some of the theory behind these models, but focus in particular on their application in real-world scenarios and on evaluating their performance to ensure accuracy and reliability. As the course progresses, we delve deeper into the realm of machine learning with a focus on decision trees and random forests. These techniques represent a more advanced aspect of supervised learning, offering powerful tools for both classification and regression tasks. Through practical examples and hands-on exercises, you'll learn how to build these models, understand their intricacies, and apply them to complex datasets to identify patterns and make predictions. Additionally, we introduce the concepts of unsupervised learning and clustering, broadening your analytics toolkit and providing you with the skills to tackle data without predefined labels or categories. By the end of this course, you'll not only have a thorough understanding of various predictive analytics techniques, but also be capable of applying these techniques to solve real-world problems, setting the stage for continued growth and exploration in the field of data analytics.
- 03-Oct-2025: Module 1: Intro to Predictive Analytics Using Python
- Oct-10-2025: Module 1 - Lesson 2: Supervised Predictive Models
- Typical Machine Learning Pipeline:
- Oct-13-2025: Module 1 - Week 1 - Linear Regression
- Script: >> Recall the ML pipeline that ultimately produces a model f. In linear regression, we assume the model is linear. In other words, linear regression assumes a linear relation between some given inputs and the target output. For example, imagine we want to predict the age of a customer based on the number of orders they've made for a certain product. In this case, the number of orders is the input variable and the age of the customer is the output variable. To perform linear regression, we gather data on various customers, including their orders and ages. We plot this data on a graph with the x-axis representing the number of orders and the y-axis representing the age. Each data point is represented by a blue circle on the graph.

The goal of linear regression is to find a line that best fits the data points, shown as a red line on the graph. The position and slope of this line are determined through a mathematical process that minimizes the distance between the line and the data points. Once we have this line, we can use it to make predictions: given the number of orders for a new customer, we can use the line to estimate their age. The line represents the learned model that predicts the output variable (age) from the input variable (number of orders).

In linear regression, the model parameters are the values that determine the specific characteristics of the linear model: the slope and intercept of the line that represents the relationship between the input and output variables. Returning to the example of predicting customer ages from their number of orders, the parameters are as follows: the slope represents the change in the output variable (age) for a unit change in the input variable (number of orders), and the intercept is the point where the line intersects the y-axis.
How do we find the optimal model parameters? If the training dataset has a clear trend, we can draw a line that looks optimal fairly quickly. But for noisy real-life datasets, finding the optimal model is not as straightforward. During training, the linear regression algorithm learns the optimal parameter values by minimizing the difference between the model's predicted values and the actual values from the training data. We need a metric that measures how well the model performs. For machine learning problems, a loss function is usually defined to evaluate the model. Intuitively, the loss function has small values if the predicted output is close to the desired output and large values otherwise. A commonly used loss function is the Mean Squared Error (MSE): the average squared difference between the predicted output and the desired output across all samples in the training dataset. The optimal linear regression model minimizes the MSE loss. In practice, we can use the mean_squared_error function in the scikit-learn machine learning library to calculate the MSE. A simple linear regression involves one independent variable and one dependent variable; a multivariate linear regression extends to include multiple independent variables. Overfitting occurs when the learned model fits the training dataset very well but fails to generalize to new examples. The model on the right (in the lecture slide) has no loss on the training data but does not actually capture the patterns of the dataset. Overfitting is a common issue with complex models: when a model becomes too complex, it may start to capture noise or random fluctuations in the training data, leading to poor generalization to new, unseen data. To detect overfitting, a train/test split protocol is typically used, where we hold out test data to estimate the loss on new, unseen data. Here's what the process looks like.
Step one: randomly shuffle the dataset and split it into a training set and a testing set. Step two: train the model on the training set. Step three: evaluate the learned model on both the training set and the testing set to obtain a training loss and a test loss (also called generalization loss); other metrics can also be used. A good model should have both low training loss, which indicates that it fits the training data well, and low test loss, which indicates that it generalizes well to unseen data.
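The steps above can be sketched in code. This is a minimal illustration (assuming scikit-learn, and using made-up order/age data rather than any dataset from the course) of fitting a simple linear regression, reading off its slope and intercept, and evaluating it with the train/test split protocol and MSE loss:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: number of orders (input x) vs. customer age (output y)
rng = np.random.default_rng(42)
orders = rng.integers(1, 20, size=(200, 1)).astype(float)
ages = 3.0 * orders[:, 0] + 19.0 + rng.normal(0, 2.0, size=200)  # linear trend + noise

# Step 1: randomly shuffle and split into a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(
    orders, ages, test_size=0.2, random_state=0)

# Step 2: train the model on the training set (learns slope and intercept)
model = LinearRegression().fit(X_train, y_train)
print(f"slope:     {model.coef_[0]:.2f}")    # change in age per extra order (~3)
print(f"intercept: {model.intercept_:.2f}")  # where the line crosses the y-axis (~19)

# Step 3: evaluate on BOTH sets with the MSE loss
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"training MSE: {train_mse:.2f}")  # both should sit near the noise variance
print(f"test MSE:     {test_mse:.2f}")

# Use the learned line to predict the age of a new customer with 4 orders
print(model.predict([[4.0]]))
```

Comparable train and test MSE values indicate the model generalizes; a test MSE far above the training MSE would signal overfitting.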
- Types of Linear Regression:
  - Simple Linear Regression: involves one independent variable and one dependent variable.
  - Multivariate Linear Regression: extends to include multiple independent variables.
  - Polynomial Linear Regression: allows for non-linear relationships between the dependent and independent variables by incorporating polynomial terms, enabling the model to capture non-linear patterns in the data.
  - Regularized Linear Regression: a form of linear regression that addresses multicollinearity (high correlation between independent variables) and helps mitigate overfitting by adding a penalty term to the loss function that controls the complexity of the model. Some common regularized linear regressions and their corresponding loss functions are:
    - Ridge (L2): MSE + 𝜆‖w‖₂²
    - Lasso (L1): MSE + 𝜆‖w‖₁
    - Elastic Net: MSE + 𝜆(r‖w‖₁ + (1 − r)‖w‖₂²)
    where w represents the model parameters and r, 𝜆 are hyperparameters that determine the extent of penalization.
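A rough sketch of the last two variants, assuming scikit-learn and toy data invented for illustration (not from the course):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# --- Polynomial regression: fit a quadratic y = x^2 (toy data) ---
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X[:, 0] ** 2
# degree=2 adds an x^2 column, so the model stays linear in its parameters
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[2.0]]))  # close to 4

# --- Regularized regression on two collinear inputs (toy data) ---
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # nearly a copy of x1 -> multicollinearity
X2 = np.column_stack([x1, x2])
y2 = 2.0 * x1 + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X2, y2)  # L2 penalty shrinks and stabilizes coefficients
lasso = Lasso(alpha=0.1).fit(X2, y2)  # L1 penalty can drive coefficients to zero
print("Ridge coefs:", ridge.coef_)    # the true weight ~2 is shared between the twins
print("Lasso coefs:", lasso.coef_)
```

`alpha` here plays the role of 𝜆 in the loss functions above; larger values penalize model complexity more heavily.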
- My GitHub code demo downloads: please fork the repo before running it in a GitHub Codespace, to avoid overwriting the original code.
- Nov-05-2025:
- Jupyter Notebook: Linear Regression Coding Demo.ipynb - Video Lecture
- /GitHub/Penn-Py-Intro_Predic_analysis-_Regress/Demo_Codes/Linear Regression/
- Module 2 - Introduction to Advanced Predictive Analytics: Decision Trees and Random Forests
- =================================================================
Others:
- 14-Oct-2025: YouTube 木子AI研究所: 哪個AI做PPT最強?結果震驚了我!Skywork/Manus/Kimi/Gamma PPT 能力測評
- Prompt (translated): "I'm a high-school Chinese-language teacher and I need to teach《赤壁賦》. I now need to create a course PPT; the attachment is my lesson plan. Based on this information, please generate a PPT deck for me." <test-mode notes>
- The lesson plan (教案) was also generated by AI before the test.
- Skywork: claimed to be the only AI PPT tool worldwide that can produce Deep Research slides.
- Nov-18-2025: NoteBookLM - 2025最佳AI學習神器|語音/影片摘要/會議報告/學習卡/心智圖 By MrBoris科技站
- Nov-21st-2025: YouTube: 免費用NotebookLM,在10分鐘內,100%掌握任何最新知識 - by: 漢克蔡 | Ai 集
- E.g. To learn "Agent Builder"
- Go to YouTube and search for the most popular videos about "Agent Builder":
- Nov-20th-2025: YouTube - NotebookLM免费版来源太少?教你从50干到500+!- by: 爱听书的程序员阿超
- With AI knowledge bases getting more and more popular, tools like NotebookLM can greatly improve learning efficiency. 🎉
- When using the free tier of NotebookLM, have you hit the painful 50-sources-per-notebook limit? 😵💫
- This video shows a one-step workaround: merge your files.
- Note #1. Single file: stay under 500,000 English words (roughly 1,000,000 Chinese characters) and under 200 MB, or the upload will fail.
- Note #2. Merge by topic: keep the same channel or the same series in one file; mixing unrelated sources can make the answers drift off-topic.
- Note #3. Updating: when a new video comes out, append its subtitles to `合并文件 A.txt` and re-upload to overwrite, remembering to re-check the word count and file size.
- YouTube: AI界最强CP曝光!NotebookLM×Gemini组合拳,让你1人干出小团队战斗力 - by: 嗯哌 Npie
- Is the NotebookLM × Gemini combination the most powerful AI tool pairing?
- NLP = Natural Language Processing
- Bloomberg NLP: The rise of NLP enhanced solutions for financial professionals
- Natural Language Processing (NLP) technology embedded into the Instant Bloomberg (IB) messaging service is automating the real-time recognition and deployment of actionable insights contained in messages that enter trading professionals’ inboxes. This is enabling them to work with more counterparts, increase business generation, further streamline their workflows and reduce operational risks.
- 2025-Dec-03:
- Use SVG animation to create an animated video explaining the principles of charging and discharging Li-ion batteries:
- 2025-Dec-05: AI 危機逼近?電力唔夠、晶片缺貨、數據用盡 — 全球AI大崩盤會唔會「突然發生」?| 電力、晶片、錢、監管,哪個會先爆?#36 Henry 平行偉論 - By 【Henry平行偉論】
- 2026-Jan-22: Ch1. Python 初級:深度學習 + PyTorch 入門|Deep Learning|Neural Network|教學|廣東話 - By: kfsoft
- Others: (Deep Learning):
- Lesson 4 - RNN, LSTM, GRU, seq2seq • "Python 初級:深度學習,RNN 循環神經網絡,LSTM 長短期記憶模型,GRU..."
- Lesson 5 - Transformer, attention mechanism • "Python 初級:深度學習,Transformer,Attention,注意力機制..."
This is a long-form tutorial video (about 5 hours) on deep learning with Python and an introduction to PyTorch. The content is very solid, starting from deep-learning fundamentals and going all the way to core PyTorch operations and hands-on practice.
Below is a detailed list of the video's contents, organized with approximate timestamps:
Part 1: Deep Learning Concepts
- [00:02] Course intro: deep learning, PyTorch, and the concept of a Tensor.
- [02:36] Machine learning overview: the relationship between AI, Machine Learning, and Deep Learning; Supervised Learning vs. Unsupervised Learning; the difference between Regression and Classification.
- [07:30] Neural Networks: how a neural network operates and its layer structure (Input, Hidden, Output layers); the Fully Connected Layer (Linear Layer); how depth vs. width affects the model; the Training Process: Forward Pass, Loss Calculation, Backward Pass (Gradient Descent).
Part 2: Iris Classification Preview
- [01:30:37] Hands-on case intro: classification with the classic Iris dataset; data handling: mapping features (petal/sepal length and width) to the label (species); model architecture: a simple MLP (Multi-Layer Perceptron) for the classification task; training flow: the concepts of Epochs, Batch Size, and the Optimizer (SGD).
Part 3: PyTorch Tensor Basics
- [01:52:20] Environment setup: how to install PyTorch (CPU vs. GPU builds).
- [01:53:41] Tensor intro: what a Tensor is and how it relates to a NumPy array.
- [01:54:47] Creating tensors: torch.tensor() from a list; data types (dtype): float, int, etc.; tensor attributes: shape, device, dtype, requires_grad.
- [02:12:35] Dimensions, indexing & slicing: operating on 1D vectors and 2D matrices; accessing and modifying specific elements; slicing with [start:end:step].
- [02:31:17] Creating special tensors: torch.empty(), torch.zeros(), torch.ones(); torch.arange(), torch.linspace(); torch.rand(), torch.randn() (normal distribution), torch.randint().
Part 4: Tensor Shape Manipulation & Advanced Operations
- [02:39:11] Copies and memory: clone(), detach().
- [02:40:06] Reshaping: view() vs. reshape() — their differences and the role of memory contiguity (contiguous); transpose() for transposing matrices.
- [04:22:06] Dimension adjustment: squeeze() (remove size-1 dimensions) and unsqueeze() (add dimensions).
- [04:26:06] Sorting and extremes: sort(); topk() for the largest/smallest K values.
Part 5: Math & Broadcasting
- [03:27:49] Element-wise operations: addition, subtraction, multiplication, division.
- [03:38:35] Broadcasting: how PyTorch automatically expands dimensions when two tensors have different shapes; the broadcasting rules and conditions (compatible dimensions).
- [03:42:24] Matrix multiplication: torch.matmul(), torch.mm(), the @ operator; vector-matrix and matrix-matrix multiplication rules; the Dot Product.
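The broadcasting and multiplication rules above can be sketched as (a minimal PyTorch example, not code from the video):

```python
import torch

# Broadcasting: a [3, 1] tensor and a [1, 4] tensor expand to [3, 4]
a = torch.arange(3.0).reshape(3, 1)   # shape [3, 1]
b = torch.arange(4.0).reshape(1, 4)   # shape [1, 4]
print((a + b).shape)                  # torch.Size([3, 4])

# Matrix multiplication: [2, 3] @ [3, 2] -> [2, 2]
m1 = torch.ones(2, 3)
m2 = torch.ones(3, 2)
print(torch.matmul(m1, m2))           # every entry is 3.0 (row of ones · column of ones)
print((m1 @ m2).shape)                # the @ operator gives the same result

# Dot product of two vectors
v = torch.tensor([1.0, 2.0, 3.0])
print(torch.matmul(v, v))             # 1 + 4 + 9 = 14.0
```

Two dimensions are broadcast-compatible when they are equal or one of them is 1; incompatible shapes raise a runtime error instead of silently expanding.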
Part 6: Comparison & Masking
- [03:53:13] Comparison operations: eq (equal), gt (greater than), etc., returning Boolean tensors.
- [03:56:11] Logical operations: all(), any().
- [04:08:13] Boolean masking: using conditions to filter or modify specific values in a tensor (e.g. setting all values below 0 to 0).
- [04:14:08] Fancy indexing: selecting data with an index array.
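A minimal sketch of the comparison, masking, and fancy-indexing operations listed above (not code from the video):

```python
import torch

t = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

# Comparison returns a Boolean tensor
mask = t.gt(0)                 # same as t > 0
print(mask)                    # tensor([False, False, False,  True,  True])
print(mask.any(), mask.all())  # tensor(True) tensor(False)

# Boolean masking: set all values below 0 to 0 (the video's example)
t2 = t.clone()
t2[t2 < 0] = 0.0
print(t2)                      # tensor([0., 0., 0., 1., 2.])

# Fancy indexing: select elements with an index tensor
idx = torch.tensor([0, 4])
print(t[idx])                  # tensor([-2.,  2.])
```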
Part 7: Device Management (CPU vs GPU)
- [04:32:20] CPU vs. GPU: checking GPU availability with torch.cuda.is_available(); moving tensors to the CPU or GPU (CUDA) with .to(device).
- [04:35:08] Performance test: measuring the speed difference between the CPU and GPU on large-scale matrix operations.
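A minimal sketch of the CPU-vs-GPU timing idea (not the video's exact benchmark): the GPU branch only runs when CUDA is available, and torch.cuda.synchronize() is needed because GPU kernels launch asynchronously.

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.rand(2000, 2000)
b = torch.rand(2000, 2000)

# Time a large matrix multiplication on the CPU
start = time.perf_counter()
c = a @ b
cpu_time = time.perf_counter() - start
print(f"CPU: {cpu_time:.4f}s")

if device.type == "cuda":
    a_gpu, b_gpu = a.to(device), b.to(device)
    torch.cuda.synchronize()            # wait for the transfer to finish
    start = time.perf_counter()
    _ = a_gpu @ b_gpu
    torch.cuda.synchronize()            # wait for the kernel before stopping the clock
    gpu_time = time.perf_counter() - start
    print(f"GPU: {gpu_time:.4f}s")
```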
Part 8: Autograd (Automatic Differentiation)
- [04:46:35] Autograd core concepts: requires_grad=True tracks the operation history for gradient computation; the Computational Graph; backward() backpropagates and computes gradients; .grad inspects the computed gradient values.
- [05:05:30] Stopping gradient tracking: uses of detach(); when to use with torch.no_grad(): (e.g. during model evaluation); gradient accumulation, and why the training loop needs optimizer.zero_grad() to clear gradients manually.
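A minimal sketch of why gradients must be cleared between iterations (not code from the video); here the clearing is done manually with .grad.zero_(), which is what optimizer.zero_grad() does for every parameter it manages:

```python
import torch

# Gradients ACCUMULATE across backward() calls unless cleared
w = torch.tensor([1.0], requires_grad=True)

loss = (w * 2).sum()
loss.backward()
print(w.grad)        # tensor([2.])

loss = (w * 2).sum()
loss.backward()      # without clearing, the new gradient is ADDED to the old one
print(w.grad)        # tensor([4.])  <- stale + new

w.grad.zero_()       # clear, as optimizer.zero_grad() would
loss = (w * 2).sum()
loss.backward()
print(w.grad)        # tensor([2.])  <- fresh, correct gradient
```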
This video works very well as an introductory PyTorch reference, explaining the underlying logic and memory behavior of each function in detail.
Its content is very rich, covering everything from basic PyTorch operations to implementing a complete Iris classification model.
Although the author's GitHub profile (kfsoft) turns up in search results, there is no specific repository matching this tutorial video. So, based on the video content and transcript, I reconstructed and generated two core Jupyter Notebook demos; you can copy them directly into Colab or Jupyter Notebook and run them.

File 1: PyTorch basics and Autograd (00:00:00 - 01:29:00)
This corresponds to the first half of the video, covering tensor operations and automatic differentiation.
```python
# File: 01_pytorch_basics.ipynb
import torch

# --- 1. Tensor creation and attributes ---
print("=== Tensor Basics ===")
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)
print(f"Tensor x:\n{x}")
print(f"Shape: {x.shape}")
print(f"Type: {x.dtype}")
print(f"Device: {x.device}")

# Special tensors
zeros = torch.zeros(2, 3)
ones = torch.ones(2, 3)
rand = torch.rand(2, 3)  # uniform [0, 1)
print(f"Random Tensor:\n{rand}")

# --- 2. Shape manipulation (reshape/view) ---
print("\n=== Shape Manipulation ===")
flat_x = x.view(6)            # flatten to 1D
reshaped_x = x.reshape(3, 2)  # reshape to 3x2
print(f"Reshaped (3x2):\n{reshaped_x}")

# Add / remove dimensions
unsqueezed = x.unsqueeze(0)   # add a batch dimension -> shape [1, 2, 3]
print(f"Unsqueezed shape: {unsqueezed.shape}")
squeezed = unsqueezed.squeeze()  # squeeze back to shape [2, 3]
print(f"Squeezed shape: {squeezed.shape}")

# --- 3. Automatic differentiation (Autograd) ---
print("\n=== Autograd Demo ===")
# Set requires_grad=True to track gradients
w = torch.tensor([1.0], requires_grad=True)
x_in = torch.tensor([2.0])
b = torch.tensor([3.0], requires_grad=True)

# Define the computational graph: y = w * x + b
y = w * x_in + b
target = torch.tensor([10.0])

# Loss function: L = (y - target)^2
loss = (y - target) ** 2

# Backpropagation
loss.backward()

print(f"Input x: {x_in.item()}")
print(f"Weights w: {w.item()}, Bias b: {b.item()}")
print(f"Loss: {loss.item()}")
print(f"dL/dw (w.grad): {w.grad}")  # should be 2 * (wx + b - target) * x
print(f"dL/db (b.grad): {b.grad}")
```

File 2: Iris Classification (01:30:00 - end)
This corresponds to the second half of the video: implementing a complete MLP neural network to classify Iris species.
```python
# File: 02_iris_classification.ipynb
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# --- 1. Data preparation ---
# Iris data from the UCI repository (in practice you can upload iris.csv instead)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
df = pd.read_csv(url, header=None, names=column_names)

# Convert the text labels to numbers (Iris-setosa -> 0, etc.)
encoder = LabelEncoder()
df['species'] = encoder.fit_transform(df['species'])

X = df.drop('species', axis=1).values.astype('float32')
y = df['species'].values.astype('int64')

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 2. Dataset class ---
class IrisDataset(Dataset):
    def __init__(self, features, labels):
        self.features = torch.tensor(features)
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

train_dataset = IrisDataset(X_train, y_train)
test_dataset = IrisDataset(X_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# --- 3. Neural network model (MLP) ---
class IrisNet(nn.Module):
    def __init__(self):
        super(IrisNet, self).__init__()
        # Input: 4 features, Hidden: 8 neurons, Output: 3 classes
        self.layer1 = nn.Linear(4, 8)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(8, 3)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = IrisNet().to(device)

# --- 4. Training loop ---
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

print("Starting training...")
epochs = 100
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()                  # clear accumulated gradients
        outputs = model(inputs)                # forward pass
        loss = criterion(outputs, labels)      # compute the loss
        loss.backward()                        # backward pass
        optimizer.step()                       # update weights
        running_loss += loss.item()
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss/len(train_loader):.4f}")

# --- 5. Evaluation ---
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Test accuracy: {accuracy:.2f}%")
```

This code follows the video's walkthrough of PyTorch tensor basics and building a neural-network classifier.
- AI <--- ML (Machine Learning) <--- DL (Deep Learning)
- (00:03:23) Machine Learning (Classifications)
- Training Process --> Update the parameters (adjust weights) of NN
- NN structure > Training > Prediction
- (00:19:13)
- Tensor - the data structure of neural networks: tensors hold both the input data and the output data.