Using Python to access web data by AI - Recording Notes 01a

AI Course to learn: Using Python to access the website data



Ref: https://www.coursera.org/learn/python-network-data/supplement/2N3oS/python-textbook

The Python Textbook



Printed copies of "Python for Everybody: Exploring Data In Python 3" are available from Amazon and on Kindle.

Free copies of the book are also available in various formats.

You can download all of the sample Python code from the book as well as licensed course materials.

All of the book materials are available under a Creative Commons Attribution-NonCommercial 3.0 Unported License. The slides, audio, assignments, auto grader and all course materials other than the book are available from the PY4E website under the more flexible Creative Commons Attribution 3.0 Unported License. If you are curious as to why the "NC" variant of Creative Commons was used, see Appendix D of the textbook or search through my blog posts for the string "copyright".


Ref.: https://www.coursera.org/learn/python-network-data/lecture/5LN6R/11-2-extracting-data

   re.findall()





With greedy matching, the pattern pushes outward and goes as far as it can. Here you do want it to be greedy, so you get the whole address. If you made the pattern non-greedy, the lecture says you would get d@u. Comparing the two results helps you understand how greedy and non-greedy matching behave.
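A minimal sketch of the difference, using the colon example from Quiz 1 rather than the lecture's e-mail line. The variable names `greedy` and `non_greedy` are only illustrative.

```python
import re

x = 'From: Using the : character'

# Greedy: .+ expands as far as it can, so the match ends at the LAST colon
greedy = re.findall('^F.+:', x)

# Non-greedy: .+? stops as soon as the rest of the pattern can match,
# so the match ends at the FIRST colon
non_greedy = re.findall('^F.+?:', x)

print(greedy)      # ['From: Using the :']
print(non_greedy)  # ['From:']
```

Adding `?` after `+` or `*` is what switches the quantifier from greedy to non-greedy.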


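Python code to extract the host address from a string:
data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'   # sample line
atpos = data.find('@')           # index of the '@'
sppos = data.find(' ', atpos)    # index of the first space after the '@'
host = data[atpos + 1:sppos]     # the text in between is the host: uct.ac.za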







Match non-blank character:


Even Cooler Regex Version:
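The pictures for these two regex versions did not survive, so here is a small sketch of what they could look like. The sample line and variable names are assumptions in the style of the course examples.

```python
import re

# Sample line (assumed, in the style of the course mailbox data)
x = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# [^ ] matches one non-blank character; * repeats it.
# The parentheses mark the part to extract: everything after the '@'.
host = re.findall('@([^ ]*)', x)

# Stricter version: only consider lines that start with "From "
host_strict = re.findall('^From .*@([^ ]*)', x)

print(host)         # ['uct.ac.za']
print(host_strict)  # ['uct.ac.za']
```

Both patterns return a list, so the host itself is `host[0]` when a match is found.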


Assignment: Spam Confidence (12:11 / 15:39)
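A rough sketch of the spam-confidence idea: pull the number out of each `X-DSPAM-Confidence:` line and find the maximum. The inline `lines` list stands in for the course mailbox file, and its values are made up for illustration.

```python
import re

# Hypothetical sample lines standing in for a mailbox file
lines = [
    'X-DSPAM-Result: Innocent',
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.6178',
]

numlist = []
for line in lines:
    line = line.rstrip()
    # Extract only the digits and dots that follow the header name
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue
    numlist.append(float(stuff[0]))

print('Maximum:', max(numlist))  # Maximum: 0.8475
```

To run it against a real file, replace `lines` with an open file handle; the loop body stays the same.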


Escape Character (14:35 / 15:39)
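A short sketch of the escape character: `$` normally means "end of line" in a regex, so it needs a backslash to match a real dollar sign. The sample sentence is an assumption.

```python
import re

x = 'We just received $10.00 for cookies.'

# \$ matches a literal dollar sign; [0-9.]+ matches the digits and dot after it
y = re.findall(r'\$[0-9.]+', x)

print(y)  # ['$10.00']
```

The `r` prefix makes it a raw string, so Python passes the backslash through to the regex unchanged.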


 Documentation Ref.: https://docs.python.org/3/howto/regex.html

 

===== Quiz 1 =====
   1) Which of the following best describes "Regular Expressions"?
  • a) A small programming language unto itself
  • b) The way Python handles and recovers from errors that would otherwise cause a traceback
  • c) A way to solve Algebra formulas for the unknown value
  • d) A way to calculate mathematical values paying attention to operator precedence
    • The answer is (a).
    • Regular expressions (regex) are a domain-specific language for defining text search patterns (e.g., matching, extracting, or replacing strings). They have their own syntax and rules, independent of the host programming language (such as Python).
    • b) refers to error handling (e.g., try/except blocks), not regex.
    • c) describes solving algebraic equations, which is unrelated to regex.
    • d) refers to operator precedence in arithmetic calculations, not regex.
    • Regular expressions are a tool for text processing, not for error handling, algebra, or math calculations.
   2) Which of the following is the way we match the "start of a line" in a regular expression?
  • a) ^
  • b) str.startswith()
  • c) \linestart
  • d) String.startsWith()
  • e) variable[0:1]
    • The answer is (a)

   3) What would the following mean in a regular expression? [a-z0-9]
  • a) Match a lowercase letter or a digit
  • b) Match any text that is surrounded by square braces
  • c) Match an entire line as long as it is lowercase letters or digits
  • d) Match any number of lowercase letters followed by any number of digits
  • e) Match anything but a lowercase letter or digit
    • The answer is (a) --- the square brackets indicate a character class, meaning it will match any single character that is either a lowercase letter (a to z) or a digit (0 to 9).
   4) What is the type of the return value of the re.findall() method?
  • a) An integer
  • b) A boolean
  • c) A string
  • d) A single character
  • e) A list of strings
    • The answer is (e).
   5) What is the "wild card" character in a regular expression (i.e., the character that matches any character)?
  • a) .
  • b) *
  • c) $
  • d) ^
  • e) +
  • f) ?
    • The answer is (a).

   6) What is the difference between the "+" and "*" character in regular expressions?
  • a) The "+" matches at least one character and the "*" matches zero or more characters
  • b) The "+" matches upper case characters and the "*" matches lowercase characters
  • c) The "+" matches the beginning of a line and the "*" matches the end of a line
  • d) The "+" matches the actual plus character and the "*" matches any character
  • e) The "+" indicates "start of extraction" and the "*" indicates the "end of extraction"
    • The answer is (a). The "+" matches at least one character and the "*" matches zero or more characters.

   7) What does the "[0-9]+" match in a regular expression?
  • a) Several digits followed by a plus sign
  • b) Any number of digits at the beginning of a line
  • c) Any mathematical expression
  • d) Zero or more digits
  • e) One or more digits
    • The answer is (e). One or more digits.

   8) What does the following Python sequence print out?
x = 'From: Using the : character'
y = re.findall('^F.+:', x)
print(y)
  • a) :
  • b) ^F.+:
  • c) ['From:']
  • d) From:
  • e) ['From: Using the :']
    • The answer is (e).
    • ^F: the ^ anchors the match at the beginning of the string, and F matches a literal 'F'.
    • .+: this matches one or more of any character after the 'F'. Because + is greedy, it extends as far as it can while still letting the rest of the pattern match.
    • :: this matches a colon. With the greedy .+ in front of it, that is the last colon in the string.
===== ===== =====

Ch12, 12.1 Network Data - Networking Technology

 - Python to access web pages


Ch12 Quiz on Python Network Data
   Q1) What do we call it when a browser uses the HTTP protocol to load a file or page from a server and display it in the browser?
  • a) SMTP
  • b) Internet Protocol (IP)
  • c) IMAP
  • d) The Request/Response Cycle
  • e) DECNET
    • The answer is (d) - The Request/Response Cycle.
  • Q2) Which ... most similar to a TCP port number? Ans.: A telephone extension (the last one).

  • Q3) What must you do in Python before opening a socket? Ans.: import socket

  • Q4) In a client-server application... come up first? Ans.: The server.

  • Q5) Which... is most like an open socket in an application? Ans.: An "in-progress" phone conversation.


  • Q7) What is an important aspect of an Application Layer protocol like HTTP? Ans.: Which application talks first, the client or the server?

  • Q8) What are the three parts of this URL (Uniform Resource Locator)? ... Ans.: Protocol, host, and document --- the protocol is http, the host is www.dr-chuck.com, and the document is page1.htm.

  • Q9) When you click on an anchor tag... what HTTP request is sent to the server? Ans.: GET
  • Q10) Which organization publishes Internet...? Ans.: IETF.
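The three URL parts from Q8 can be checked with the standard library. This is a small sketch; the full URL is assumed from the host and document named in the answer, since the quiz text above elides it.

```python
from urllib.parse import urlparse

# URL assumed from the host and document named in the Q8 answer
parts = urlparse('http://www.dr-chuck.com/page1.htm')

print(parts.scheme)  # http              (protocol)
print(parts.netloc)  # www.dr-chuck.com  (host)
print(parts.path)    # /page1.htm        (document)
```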

Assigned Lab


It can be run in JupyterLab by clicking the triangle "Run" button.



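import socket

# Open a TCP socket and connect to the web server on port 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
# Build the HTTP GET request; the blank line (\r\n\r\n) ends the request
cmd = 'GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

# Read the response in blocks of up to 512 bytes until the server closes the connection
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(),end='')

mysock.close()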




-----   -----   ------

Python Network Data Part 3 - Unicode Characters & Strings

Module 4 - chapter 12.3 ascii

  • The Python function ord() gives the numeric value of a simple ASCII character.
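A quick sketch of ord(), with chr() shown as its inverse. The characters chosen are just examples.

```python
# ord() maps a single character to its numeric code
codes = [ord(c) for c in 'He\n']
print(codes)    # [72, 101, 10]

# chr() goes the other way, from a code back to the character
letter = chr(72)
print(letter)   # H
```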





  • In Python 3, all strings are Unicode.
  • Type vs class

  • Python types: bytes vs str vs unicode (unicode is a separate type only in Python 2; in Python 3, str is Unicode)


  • Python 3 and Unicode: internally, all strings are Unicode.
  • Talking to the network requires encoding and decoding data (usually as UTF-8).

  • Python strings to/from bytes: data.decode() converts bytes to a Unicode string.

  • HTTP request in Python: encode() converts the Unicode string to bytes (UTF-8 by default) before the data is sent over the network.
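The encode/decode round trip above can be sketched as follows. The sample text is an assumption, written with escapes so the two accented characters are unambiguous.

```python
# str: a sequence of Unicode characters (here with e-acute and o-umlaut)
text = 'H\u00e9llo w\u00f6rld'

raw = text.encode()    # str -> bytes, UTF-8 by default
back = raw.decode()    # bytes -> str, UTF-8 by default

print(type(text).__name__, len(text))  # str 11
print(type(raw).__name__, len(raw))    # bytes 13
print(back == text)                    # True
```

The byte count is larger than the character count because each accented character takes two bytes in UTF-8.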


See the Python documentation for encode() and decode() at https://docs.python.org





The steps of a Python network program: socket  >  connect  >  send / receive

-----   -----   -----

Python Network Data Part 4 - Using urllib in Python

Module 4 - chapter 12.4 urllib





  • Treating a web page like a file:

  • Reading web pages (HTML files):


  • The first lines of Python code at Google? (tiny version)
  •    ==>  A Python web crawler using a database
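The "like a file" idea above can be sketched as below. The helper names `read_lines` and `read_url` are hypothetical, and an in-memory `io.BytesIO` stands in for the web response so the sketch runs without a network connection.

```python
import io
import urllib.request

def read_lines(fhand):
    """Return the lines of a bytes file-like handle as stripped strings."""
    return [line.decode().strip() for line in fhand]

def read_url(url):
    """Open a URL with urllib and read it the same way as a file."""
    with urllib.request.urlopen(url) as fhand:
        return read_lines(fhand)

# Offline stand-in for a web response (assumed sample text)
fake = io.BytesIO(b'But soft what light through yonder window breaks\n'
                  b'It is the east and Juliet is the sun\n')
lines = read_lines(fake)
print(lines)
```

With network access, something like `read_url('http://data.pr4e.org/romeo.txt')` should behave the same way, assuming that course sample file is still online. The handle from `urlopen()` yields bytes, which is why each line is decoded.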


----   -----   -----

Worked Example: Using urllib (Ch12)

  • A for loop reads the page line by line:


  • Using urlwords.py
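A sketch in the spirit of urlwords.py: loop over the lines, split each into words, and count them in a dictionary. The function name `count_words` and the sample bytes are assumptions; the handle could equally be the result of `urllib.request.urlopen(url)`.

```python
import io

def count_words(fhand):
    """Count words from a file-like handle that yields lines of bytes."""
    counts = dict()
    for line in fhand:
        words = line.decode().split()
        for word in words:
            # get() returns 0 the first time a word is seen
            counts[word] = counts.get(word, 0) + 1
    return counts

# In-memory stand-in for a web page (assumed sample text)
sample = io.BytesIO(b'the east and the sun\nthe moon\n')
counts = count_words(sample)
print(counts)  # {'the': 3, 'east': 1, 'and': 1, 'sun': 1, 'moon': 1}
```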





-----   -----   -----

Python Network Data Part 5 - Parsing web pages urllinks.py & BeautifulSoup

Module 4 - chapter 12.5 urllinks.py & BeautifulSoup



  • The easy way to go: BeautifulSoup
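A hedged sketch of both approaches to pulling links out of a page. The regex part follows the urllinks.py idea and uses only the standard library. The BeautifulSoup part runs only if the third-party `bs4` package happens to be installed. The HTML snippet is an assumed sample.

```python
import re

# Assumed sample page, as bytes (what a web request would return)
html = b'''<h1>The First Page</h1>
<p>If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.</p>
<p><a href="about.htm">About</a></p>'''

# Regex approach: absolute http/https links inside href="..."
links = [m.decode() for m in re.findall(b'href="(http[s]?://.*?)"', html)]
print(links)  # ['http://www.dr-chuck.com/page2.htm']

# BeautifulSoup approach (optional, needs the bs4 package)
try:
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None

if BeautifulSoup is not None:
    soup = BeautifulSoup(html, 'html.parser')
    # soup('a') returns every anchor tag; get() reads the href attribute
    soup_links = [tag.get('href', None) for tag in soup('a')]
    print(soup_links)
```

The regex only catches absolute links, while the BeautifulSoup loop sees every anchor tag, including the relative `about.htm`. That difference is much of why it counts as the easy way.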













13.1-13.3 Parsing XML with Python

Ref.: Python for Everybody, Chapter 13.1-13.3
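A minimal sketch of parsing XML with the standard library's ElementTree. The `<person>` document is an assumed sample in the style of the book's examples.

```python
import xml.etree.ElementTree as ET

data = '''<person>
  <name>Chuck</name>
  <phone type="intl">+1 734 303 4456</phone>
  <email hide="yes"/>
</person>'''

tree = ET.fromstring(data)              # parse the string into a tree of elements
name = tree.find('name').text           # text inside <name>...</name>
hide = tree.find('email').get('hide')   # value of the hide attribute

print('Name:', name)  # Name: Chuck
print('Attr:', hide)  # Attr: yes
```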





13.5 JSON (JavaScript Object Notation)

Ref.: Python for Everybody, Chapter 13.5
Below is a demo Python program that reads JSON, json1.py from the book. The top-level elements of its tree are name, phone and email.

Working example demo of json1.py:

Running json1.py:
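Since the screenshots are missing, here is a sketch in the spirit of json1.py, with name, phone and email as the top-level elements. The sample values are assumptions.

```python
import json

data = '''{
  "name": "Chuck",
  "phone": {"type": "intl", "number": "+1 734 303 4456"},
  "email": {"hide": "yes"}
}'''

info = json.loads(data)   # a JSON object becomes a Python dict

print('Name:', info['name'])           # Name: Chuck
print('Hide:', info['email']['hide'])  # Hide: yes
```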


Working example demo of json2.py:


Running json2.py:
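A sketch in the spirit of json2.py, where the top level is a list of objects rather than a single object. The records shown are made-up sample data.

```python
import json

data = '''[
  {"id": "001", "x": "2", "name": "Chuck"},
  {"id": "009", "x": "7", "name": "Brent"}
]'''

info = json.loads(data)   # a JSON array becomes a Python list of dicts
print('User count:', len(info))

names = []
for item in info:
    print('Name', item['name'], 'Id', item['id'], 'Attribute', item['x'])
    names.append(item['name'])
```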



13.6 Service Oriented Approach

Ref.: Python for Everybody, Chapter 13.6 - Integrating multiple systems





13.7 Using APIs Application Programming Interfaces


An example is the open-source Geoapify Location Platform (Geo API 5 geocoding): https://www.geoapify.com

Accessing py4e data via Cloudflare (indirect access):
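Calling a web API usually starts with building a URL that carries the parameters. A small sketch follows. The service URL, the parameter names and the key are hypothetical placeholders, not the real Geoapify or py4e endpoints.

```python
import urllib.parse

# Hypothetical endpoint and parameters, for illustration only
serviceurl = 'https://api.example.com/geocode?'
params = {'address': 'Ann Arbor, MI', 'key': 'YOUR_API_KEY'}

# urlencode() escapes spaces and punctuation so the URL is valid
url = serviceurl + urllib.parse.urlencode(params)
print(url)
# https://api.example.com/geocode?address=Ann+Arbor%2C+MI&key=YOUR_API_KEY
```

The resulting URL would then be opened with `urllib.request.urlopen()`, and the response body is typically parsed with `json.loads()`.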








XML and JSON are serialization formats:
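Serialization means turning an in-memory structure into text that can travel over the network, and de-serialization turns it back. A minimal JSON sketch with an assumed sample record:

```python
import json

record = {'name': 'Chuck', 'phone': {'type': 'intl'}}

wire = json.dumps(record)     # Python dict -> JSON text (serialize)
restored = json.loads(wire)   # JSON text -> Python dict (de-serialize)

print(wire)                   # {"name": "Chuck", "phone": {"type": "intl"}}
print(restored == record)     # True
```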



Quiz:
Q6: If the following JSON were parsed and put into the variable x,

{
    "users": [
        {
            "status": {
                "text": "@jazzychad I just bought one .__.",
             },
             "location": "San Francisco, California",
             "screen_name": "leahculver",
             "name": "Leah Culver",
         },
   ...

Which Python code would extract "Leah Culver" from the JSON?

    • a)       x->name
    • b)      x["users"][0]["name"]
    • c)       x[0]["name"]
    • d)      x["users"]["name"]
    • e)      x["name"]
  • The correct Python code to extract "Leah Culver" from the JSON is:  b) x["users"][0]["name"]

This accesses the first user in the "users" list and retrieves the "name" associated with that user.
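The answer can be checked with a trimmed-down version of the quiz data. This sketch keeps only the first user and drops the trailing commas, because `json.loads()` rejects them.

```python
import json

# Trimmed, valid version of the quiz JSON (one user only)
text = '''{
  "users": [
    {
      "status": {"text": "@jazzychad I just bought one .__."},
      "location": "San Francisco, California",
      "screen_name": "leahculver",
      "name": "Leah Culver"
    }
  ]
}'''

x = json.loads(text)

# "users" is a list, so index it with [0] before asking for "name"
name = x["users"][0]["name"]
print(name)  # Leah Culver
```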












---

----- ----- -----
Some free library resources on the PolyU campus: DataCamp


Career certification is divided into:
  1. Data Analyst
  2. Data Scientist
  3. Data Engineer
  4. AI Engineer for Data Scientists
Start DataCamp with Introduction to Python: https://campus.datacamp.com/courses/intro-to-python-for-data-science/chapter-1-python-basics?ex=1



-----
Python Short Videos




Python Classes in 50 seconds








