AI Course to Learn: Using Python to Access Web Data

Ref: https://www.coursera.org/learn/python-network-data/supplement/2N3oS/python-textbook

The Python Textbook

Printed copies of "Python for Everybody: Exploring Data In Python 3" are available from Amazon and on Kindle:

Amazon / Kindle

Free copies of the book are available in various formats.

You can download all of the sample Python code from the book as well as licensed course materials.

All of the book materials are available under a Creative Commons Attribution-NonCommercial 3.0 Unported License. The slides, audio, assignments, autograder and all course materials other than the book are available from the PY4E website under the more flexible Creative Commons Attribution 3.0 Unported License. If you are curious as to why the "NC" variant of Creative Commons was used, see Appendix D of the textbook or search through my blog posts for the string "copyright".

Python YouTube Playlist for all
The Python study materials - Python for Everybody from the University of Michigan
Video Lectures
Audio Lectures
- iTunes Audio
- Google Play Audio
Lecture Slides and Handouts
Sample Codes ZIPPED (Individual Files)
Free Textbook eBook
The course content and autograder software for the Python book are available on GitHub (under a Creative Commons or Apache 2.0 license).

Ref.: https://www.coursera.org/learn/python-network-data/lecture/5LN6R/11-2-extracting-data

re.findall()


With greedy matching, the pattern pushes outward and matches as far as it can, which is the result you want here. If you made it non-greedy, it would match as little as possible; the lecture's illustration is d@u. Comparing the two results helps show how greedy and non-greedy matching behave.
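The contrast can be tried directly in Python. This is a minimal sketch that assumes the pattern under discussion is \S+@\S+ and uses a sample mail-log line that is not part of these notes. Note that Python's engine still starts at the leftmost possible position, so in this sketch the non-greedy form shortens only the right-hand side of the match.

```python
import re

# Sample line (an assumption for illustration; not taken from these notes).
line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# Greedy: each \S+ grabs as many non-blank characters as it can.
greedy = re.findall(r'\S+@\S+', line)

# Non-greedy: each \S+? settles for as few characters as still allow a match.
lazy = re.findall(r'\S+?@\S+?', line)

print(greedy)
print(lazy)
```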

Python code to extract the host address from a string:

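atpos = data.find('@')           # position of the '@' sign
sppos = data.find(' ', atpos)    # position of the first space after '@'
host = data[atpos+1:sppos]       # text between the '@' and that space
print(host)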

Match non-blank character:

Even Cooler Regex Version:
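The regex versions of the host extraction might look like the sketch below. The sample line and the variable names are assumptions for illustration; the patterns follow the approach of matching non-blank characters after the '@' sign.

```python
import re

# Sample line (an assumption; not taken from these notes).
data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# '@' anchors the match; ([^ ]*) captures the non-blank characters after it.
host = re.findall(r'@([^ ]*)', data)

# Stricter variant: only lines that start with 'From ' are considered.
host_strict = re.findall(r'^From .*@([^ ]*)', data)

print(host)
print(host_strict)
```

Only the text inside the parentheses is returned, which is why the '@' itself does not appear in the result.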

Assignment: Spam Confidence (12:11 / 15:39)

Escape Character (14:35 / 15:39)
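A short sketch of the escape character in use: a backslash in front of a special regex character makes it match literally. The sample sentence is an assumption for illustration.

```python
import re

# Sample text (an assumption for illustration).
text = 'We just received $10.00 for cookies.'

# '\$' matches a real dollar sign instead of meaning "end of line";
# '[0-9.]+' then matches one or more digits or periods.
prices = re.findall(r'\$[0-9.]+', text)

print(prices)
```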

Documentation Ref.: https://docs.python.org/3/howto/regex.html

===== Quiz 1 =====

1) Which of the following best describes "Regular Expressions"?

a) A small programming language unto itself
b) The way Python handles and recovers from errors that would otherwise cause a traceback
c) A way to solve Algebra formulas for the unknown value
d) A way to calculate mathematical values paying attention to operator precedence

The correct answer is: (a)

Regular expressions (regex) are a domain-specific language used to define search patterns for text manipulation (e.g., matching, extracting, or replacing strings). They have their own syntax and rules, functioning independently of the host programming language (like Python).

b) Refers to error handling (e.g., try/except blocks), not regex.
c) Describes solving algebraic equations, unrelated to regex.
d) Refers to operator precedence in arithmetic calculations, not regex.

Regular expressions are a powerful tool for text processing, not error handling, algebra, or math calculations.

2) Which of the following is the way we match the "start of a line" in a regular expression?

a) ^
b) str.startswith()
c) \linestart
d) String.startsWith()
e) variable[0:1]

The answer is (a)

3) What would the following mean in a regular expression? [a-z0-9]

a) Match a lowercase letter or a digit
b) Match any text that is surrounded by square braces
c) Match an entire line as long as it is lowercase letters or digits
d) Match any number of lowercase letters followed by any number of digits
e) Match anything but a lowercase letter or digit

The answer is (a) --- the square brackets indicate a character class, meaning it will match any single character that is either a lowercase letter (a to z) or a digit (0 to 9).
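A quick check of the character class, using a made-up sample string: each match is exactly one character, and uppercase letters and punctuation are skipped.

```python
import re

# Made-up sample string for illustration.
sample = 'Ab3-x'

# [a-z0-9] matches one character at a time: a lowercase letter or a digit.
matches = re.findall('[a-z0-9]', sample)

print(matches)
```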

4) What is the type of the return value of the re.findall() method?

a) An integer
b) A boolean
c) A string
d) A single character
e) A list of strings

The answer is (e).

5) What is the "wild card" character in a regular expression (i.e., the character that matches any character)?

a) .
b) *
c) $
d) ^
e) +
f) ?

The answer is (a).

6) What is the difference between the "+" and "*" character in regular expressions?

a) The "+" matches at least one character and the "*" matches zero or more characters
b) The "+" matches upper case characters and the "*" matches lowercase characters
c) The "+" matches the beginning of a line and the "*" matches the end of a line
d) The "+" matches the actual plus character and the "*" matches any character
e) The "+" indicates "start of extraction" and the "*" indicates the "end of extraction"

The answer is (a). The "+" matches at least one character and the "*" matches zero or more characters.
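The difference shows up when the repeated character is absent. A small sketch with a made-up string, where "b" may appear zero, one, or two times after "a":

```python
import re

# Made-up sample string for illustration.
text = 'a ab abb'

# 'b*' allows zero b's, so a bare 'a' also matches.
star = re.findall('ab*', text)

# 'b+' needs at least one b, so the bare 'a' is skipped.
plus = re.findall('ab+', text)

print(star)
print(plus)
```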

7) What does the "[0-9]+" match in a regular expression?

a) Several digits followed by a plus sign
b) Any number of digits at the beginning of a line
c) Any mathematical expression
d) Zero or more digits
e) One or more digits

The answer is (e). One or more digits.
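As a small illustration with an assumed sample sentence, each run of consecutive digits comes back as one string:

```python
import re

# Sample sentence (an assumption for illustration).
text = 'My 2 favorite numbers are 19 and 42'

# [0-9]+ matches one or more digits in a row.
numbers = re.findall('[0-9]+', text)

print(numbers)
```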

8) What does the following Python sequence print out?

x = 'From: Using the : character'
y = re.findall('^F.+:', x)
print(y)

a) :
b) ^F.+:
c) ['From:']
d) From:
e) ['From: Using the :']

The answer is (e).
'^F' matches an "F" at the beginning of the string.
'.+' matches one or more of any character. Because it is greedy, it extends as far as it can, so the match runs to the last colon in the string rather than the first.
':' matches a colon.
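The quiz snippet can be run as-is once re is imported. The sketch below also adds a non-greedy variant (not part of the quiz) to show that the greedy '.+' is what carries the match to the last colon.

```python
import re

x = 'From: Using the : character'

# Greedy, as in the quiz: '.+' runs to the last colon.
y = re.findall('^F.+:', x)

# Non-greedy variant (added for comparison): '.+?' stops at the first colon.
y_lazy = re.findall('^F.+?:', x)

print(y)
print(y_lazy)
```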

===== ===== =====

Ch12, 12.1 Network Data - Networking Technology

- Python to access web pages

Ch12 Quiz on Python Network Data

Q1) What do we call it when a browser uses the HTTP protocol to load a file or page from a server and display it in the browser?

a) SMTP
b) Internet Protocol (IP)
c) IMAP
d) The Request/Response Cycle
e) DECNET

The answer is (d) - The Request/Response Cycle.

Q2) Which ... most similar to a TCP port number? A telephone extension (the last one).

Q3) What must you do in Python before opening a socket? ans.: import socket

Q4) In a client-server application... come up first? The server.

Q5) Which... is most like an open socket in an application? Ans.: An "in-progress" phone conversation.

Q7) What is an important aspect of an Application Layer protocol like HTTP? The answer is: Which application talks first? The client or server?

Q8) What are the three parts of this URL (Uniform Resource Locator)? ... The answer is: Protocol, host, and document. The protocol is http, the host is www.dr-chuck.com, and the document is page1.htm.

Q9) When you click on an anchor tag... what HTTP request is sent to the server? GET

Q10) Which organization publishes Internet...? IETF.
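The three URL parts from Q8 can be checked with the standard library. The full URL below is an assumption, pieced together from the protocol, host, and document named in the answer.

```python
from urllib.parse import urlparse

# URL assumed from the parts named in the Q8 answer.
parts = urlparse('http://www.dr-chuck.com/page1.htm')

print('Protocol:', parts.scheme)
print('Host:', parts.netloc)
print('Document:', parts.path)
```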

Assigned Lab

It can be run in JupyterLab. Click the triangle "Run" button:


import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(),end='')

mysock.close()

----- ----- ------

Python Network Data Part 3 - Unicode Characters & Strings

Module 4 - chapter 12.3 ascii

The Python function ord() gives the numeric value of a simple ASCII character.
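A few sample calls, as a minimal sketch; chr() is shown as the inverse for comparison and is not mentioned in the notes above.

```python
# ord() maps a single character to its numeric code.
h_code = ord('H')
e_code = ord('e')
newline_code = ord('\n')

# chr() goes the other way: from a code back to the character.
back = chr(h_code)

print(h_code, e_code, newline_code, back)
```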

In Python 3, all strings are Unicode.
Type vs Class

Python types: bytes vs str vs unicode

Python 3 and Unicode: all strings are Unicode internally.
Talking to the network requires encoding and decoding data (usually as UTF-8).

Python strings to/from bytes: data.decode() converts bytes to a Unicode string.

HTTP request in Python: encode() converts the Unicode string to bytes (UTF-8 by default) before the data is sent over the network.
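The round trip can be seen without opening a socket. A minimal sketch, reusing the request string from the lab above; the non-ASCII word is an added example.

```python
# A str is Unicode text; encode() turns it into bytes for the network.
request = 'GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n'
raw = request.encode()          # UTF-8 is the default encoding

# decode() turns received bytes back into a str.
text = raw.decode()

# Non-ASCII characters take more than one byte in UTF-8.
cafe_bytes = 'café'.encode()

print(type(request), type(raw))
print(text == request)
print(cafe_bytes, len(cafe_bytes))
```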

See the Python documentation for encode() and decode(): https://docs.python.org


Python network programs: socket > connect > send / receive

----- ----- -----

Python Network Data Part 4 - Using urllib in Python

Module 4 - chapter 12.4 urllib

Managing this like a file:

Reading Web Pages Html files:
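A sketch of the "like a file" idea: the handle returned by urllib.request.urlopen() can be looped over line by line, and each line arrives as bytes. The helper names and the URL in the comment are assumptions; an in-memory stand-in is used so the example runs without a network connection.

```python
import io
import urllib.request

def read_lines(fhand):
    """Return decoded, stripped lines from any file-like object yielding bytes."""
    return [raw.decode().strip() for raw in fhand]

def fetch_lines(url):
    """Open a URL and read it exactly like a file (needs network access)."""
    with urllib.request.urlopen(url) as fhand:
        return read_lines(fhand)

# Offline stand-in for a web page, so the loop can be tried without a network.
fake_page = io.BytesIO(b'But soft what light\nIt is the east\n')
lines = read_lines(fake_page)
print(lines)

# With a connection, something like this should behave the same way:
# fetch_lines('http://data.pr4e.org/romeo.txt')
```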

The first lines of Python code @ Google? (tiny version)
==> Python Web Crawler using a database

---- ----- -----

Worked Example Using urllib (ch12)

for loop reads line by line:

Using urlwords.py

Python demo run:
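The word-counting loop can be sketched roughly as follows. This is not the exact urlwords.py from the course; count_words is a hypothetical helper, and a list of byte strings stands in for the lines a urlopen() handle would yield.

```python
def count_words(lines):
    """Count words across an iterable of byte lines (as urlopen() yields)."""
    counts = {}
    for raw in lines:
        for word in raw.decode().split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# Stand-in data; with a network, pass urllib.request.urlopen(url) instead.
sample = [b'the sun and the moon\n', b'the east\n']
counts = count_words(sample)
print(counts)
```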

----- ----- -----

Python Network Data Part 5 - Parsing web pages urllinks.py & BeautifulSoup

Module 4 - chapter 12.5 urllinks.py & BeautifulSoup
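A regex-based link extractor in the spirit of urllinks.py might look like this sketch. The HTML snippet is a made-up stand-in for a downloaded page, and the pattern is one reasonable choice rather than the only one.

```python
import re

# Stand-in for page bytes returned by urlopen(url).read() (made-up content).
html = (b'<h1>The First Page</h1>\n'
        b'<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>\n')

# Capture everything between href=" and the next double quote,
# for http or https links only; '.*?' is non-greedy so it stops early.
links = re.findall(b'href="(http[s]?://.*?)"', html)

for link in links:
    print(link.decode())
```

Regular expressions work for well-formed pages, but real-world HTML is often broken, which is the motivation for a parser such as BeautifulSoup.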

The easy way to go: BeautifulSoup

13.1-13.3 Parsing XML with Python

Ref.: Textbook, Chapter 13.1-13.3

https://www.coursera.org/learn/python-network-data/lecture/xVcE1/13-4-parsing-xml
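A minimal sketch of parsing XML with the standard library's ElementTree. The XML document below is a made-up sample used only for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample document for illustration.
data = '''<person>
  <name>Chuck</name>
  <phone type="intl">+1 734 303 4456</phone>
  <email hide="yes" />
</person>'''

tree = ET.fromstring(data)              # parse the string into an element tree
name = tree.find('name').text           # text inside the <name> tag
hide = tree.find('email').get('hide')   # value of the hide attribute

print('Name:', name)
print('Attr:', hide)
```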

13.5 JSON (JavaScript Object Notation)

Ref.: Textbook, Chapter 13.5

URL: https://www.coursera.org/learn/python-network-data/lecture/BZD1x/13-5-javascript-object-notation-json

Below is a demo of reading JSON in Python with json1.py from the book. The JSON tree has three first-level elements: name, phone, and email.

Working example demo of json1.py:

Running json1.py:
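A sketch along the lines of json1.py, with the three first-level elements name, phone, and email. The values are made-up sample data, not a copy of the course file.

```python
import json

# Made-up sample data with first-level keys name, phone and email.
data = '''{
  "name": "Chuck",
  "phone": {"type": "intl", "number": "+1 734 303 4456"},
  "email": {"hide": "yes"}
}'''

info = json.loads(data)    # a JSON object becomes a Python dictionary

print('Name:', info['name'])
print('Hide:', info['email']['hide'])
```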

Working example demo of json2.py:

Running json2.py:
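A sketch along the lines of json2.py, where the outermost JSON structure is a list rather than an object. The records are made-up sample data.

```python
import json

# Made-up sample data: a JSON array of objects.
data = '''[
  {"id": "001", "x": "2", "name": "Chuck"},
  {"id": "009", "x": "7", "name": "Brent"}
]'''

info = json.loads(data)    # a JSON array becomes a Python list

print('User count:', len(info))
for item in info:
    print('Name', item['name'], 'Id', item['id'], 'Attribute', item['x'])
```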

13.6 Service Oriented Approach

Ref.: Textbook, Chapter 13.6 - Integrating multiple systems

URL: https://www.coursera.org/learn/python-network-data/lecture/xj3Rb/13-6-service-oriented-approach

13.7 Using APIs Application Programming Interfaces

Ref.: Textbook, Chapter 13.7

URL: https://www.coursera.org/learn/python-network-data/lecture/ez6wT/13-7-using-application-programming-interfaces

An example of Open-source Geoapify Location Platform (Geo API 5 geocoding): https://www.geoapify.com

Via Cloudflare to access py4e data: (indirect access)
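Calling a web API usually comes down to two steps: build a URL with properly encoded parameters, then parse the JSON that comes back. The sketch below uses a hypothetical endpoint and a made-up response body, so it shows the mechanics only and does not reflect any real service's API.

```python
import json
import urllib.parse

# Hypothetical endpoint, used only to show how the query string is built.
SERVICE_URL = 'https://api.example.com/geocode?'

def build_url(address):
    """Append a URL-encoded address parameter to the service URL."""
    return SERVICE_URL + urllib.parse.urlencode({'address': address})

# Made-up response body standing in for what such a service might return.
sample_response = '{"status": "OK", "results": [{"lat": 42.28, "lng": -83.74}]}'

url = build_url('Ann Arbor, MI')
info = json.loads(sample_response)

print(url)
print(info['results'][0]['lat'], info['results'][0]['lng'])
```

urlencode() takes care of spaces and commas, which is why the address can be passed as ordinary text.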

XML and JSON are serialization formats:

Quiz:

Q6: If the following JSON were parsed and put into the variable x,
{
    "users": [
        {
            "status": {
                "text": "@jazzychad I just bought one .__.",
             },
             "location": "San Francisco, California",
             "screen_name": "leahculver",
             "name": "Leah Culver",
         },
   ...
Which Python code would extract "Leah Culver" from the JSON?
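One way to check an answer is to parse a trimmed copy of the data. The JSON below is a valid subset of the quiz snippet (trailing commas and the elided part removed), so it is an approximation of the original, not the full document.

```python
import json

# Trimmed, valid subset of the quiz JSON (the elided part is left out).
data = '''{
  "users": [
    {
      "status": {"text": "@jazzychad I just bought one .__."},
      "location": "San Francisco, California",
      "screen_name": "leahculver",
      "name": "Leah Culver"
    }
  ]
}'''

x = json.loads(data)

# "users" is a list, so index the first entry, then look up its "name" key.
name = x["users"][0]["name"]
print(name)
```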