Using Python to access web data by AI - Recording Notes 01a
AI Course to learn: Using Python to access the website data
Ref: https://www.coursera.org/learn/python-network-data/supplement/2N3oS/python-textbook
The Python Textbook
Printed copies of "Python for Everybody: Exploring Data In Python 3" are available from Amazon and on Kindle:
Here are free copies of the book in various formats available.
You can download all of the sample Python code from the book as well as licensed course materials.
All of the book materials are available under a Creative Commons Attribution-NonCommercial 3.0 Unported License. The slides, audio, assignments, auto grader and all course materials other than the book are available from the PY4E website under the more flexible Creative Commons Attribution 3.0 Unported License. If you are curious as to why the "NC" variant of Creative Commons was used, see Appendix D of the textbook or search through my blog posts
for the string "copyright".
- Python YouTube Playlist for all
- The Python study materials - python for everybody from the university of Michigan
- Video Lectures
- Audio Lectures
- Lecture Slides and Handouts
- Sample Codes ZIPPED (Individual Files)
- Free Textbook eBook
- The course content and autograder software of the Python book is available on Github (under a Creative Commons or Apache 2.0 license).
Ref.: https://www.coursera.org/learn/python-network-data/lecture/5LN6R/11-2-extracting-data
find.all()
But with greediness, it pushes outward, and so it goes as far as it can unless when you do want to be greedy so you get this. If you made this Non-Greedy, you would get d@u. So, that also kind of helps you understand how greediness and Non-Greedy wants.
atpos = data.find('@')data.find(' ',atpos)
Match non-blank character:
Even Cooler Regex Version:
Assignment: Spam Confidence (12:11 / 15:39)
Escape Character (14:35 / 15:39)
Documentation Docs Ref.: https://docs.python.org/3/howto/regex.html
- a) A small programming language unto itself
- b) The way Python handles and recovers from errors that would otherwise cause a traceback
- c) A way to solve Algebra formulas for the unknown value
- d) A way to calculate mathematical values paying attention to operator precedence
The correct answer is: (a)
- Regular expressions (regex) are a domain-specific language used to define search patterns for text manipulation (e.g., matching, extracting, or replacing strings). They have their own syntax and rules, functioning independently of the host programming language (like Python).
b) Refers to error handling (e.g., try/except blocks), not regex.c) Describes solving algebraic equations, unrelated to regex.d) Refers to operator precedence in arithmetic calculations, not regex.
- Regular expressions are a powerful tool for text processing, not error handling, algebra, or math calculations.
- a) ^
- b) str.startswith()
- c) \linestart
- d) String.startsWith()
- e) variable[0:1]
- The answer is (a)
- a) Match a lowercase letter or a digit
- b) Match any text that is surrounded by square braces
- c) Match an entire line as long as it is lowercase letters or digits
- d) Match any number of lowercase letters followed by any number of digits
- e) Match anything but a lowercase letter or digit
- The answer is (a) --- the square brackets indicate a character class, meaning it will match any single character that is either a lowercase letter (a to z) or a digit (0 to 9).
- a) An integer
- b) A boolean
- c) A string
- d) A single character
- e) A list of strings
- The answer is (e).
- a) .
- b) *
- c) $
- d) ^
- e) +
- f) ?
- The answer is (a).
- a) The "+" matches at least one character and the "*" matches zero or more characters
- b) The "+" matches upper case characters and the "*" matches lowercase characters
- c) The "+" matches the beginning of a line and the "*" matches the end of a line
- d) The "+" matches the actual plus character and the "*" matches any character
- e) The "+" indicates "start of extraction" and the "*" indicates the "end of extraction"
- The answer is (a). The "+" matches at least one character and the "*" matches zero or more characters.
- a) Several digits followed by a plus sign
- b) Any number of digits at the beginning of a line
- c) Any mathematical expression
- d) Zero or more digits
- e) One or more digits
- The answer is (e). One or more digits.
x = 'From: Using the : character'y = re.findall('^F.+:', x)print(y)
- a) :
- b) ^F.+:
- c) ['From:']
- d) From:
- e) ['From: Using the :']
- The answer is (e).
- ^F: This part matches the beginning of the string, specifically looking for 'F'.
- .+: This matches one or more of any character that follows 'F'. It will continue to match characters until it encounters the last part of the regex.
- :: This matches a colon.
Ch12, 12.1 Network Data - Networking Technology
- a) SMTP
- b) Internet Protocol (IP)
- c) IMAP
- d) The Request/Response Cycle
- e) DECNET
- The answer is (d) - The Request/Response Cycle.
- Q2) Which ... most similar to a TCP port number? A telephone extension (the last one).
- Q3) What must you do in Python before opening a socket? ans.: import socket
- Q4) In a client-server application... come up first? The server.
- Q5) Which... is most like an open socket in an application? Ans.: an "-in-progress" phone conversation.
- Q7) What is an important aspect of an Application Layer protocol like HTTP? The answer is: Which application talks first? The client or server?
- Q8) What are the three parts of this URL (Uniform Resource Locator)? ... The answer is: Protocol, host, and document --- Protocol is http. The host address is www.dr-chuck.com/page1.htm, and the web page is page1.htm.
- Q9) When you click on an anchor tag... what HTTP request is sent to the server? GET
- Q10) Which organization publishes Internet...? IETF.
Assigned Lab
import socketmysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)mysock.connect(('data.pr4e.org', 80))cmd = 'GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n'.encode()mysock.send(cmd)while True:data = mysock.recv(512)if len(data) < 1:breakprint(data.decode(),end='')mysock.close()
Python Network Data Part 3 - Unicode Characters & Strings
- The python function ord() gives the numeric value of a simple ASCII character.
- Python3 and unicode: all strings internally are unicode.
- Talking to networks needs to encode & decode data (usually to UTF-8).
- Python strings to/from bytes: data.decode( ) makes data from bytes to unicode.
- Http request in Python: encode( ) converts the unicode UTF-8 to bytes before sending the data via network.
Python Network Data Part 4 - Using urllib in Python
Worked Example Using Urllib( ) (ch12)
Python Network Data Part 5 - Parsing web pages urllinks.py & BeautifulSoup
13.5 JSON (JavaScript Object Notation)
13.6 Service Oriented Approach
An example of Open-source Geoapify Location Platform (Geo API 5 geocoding): https://www.geoapify.com
Quiz:
Q6: If the following JSON were parsed and put into the variable x,{
"users": [
{
"status": {
"text": "@jazzychad I just bought one .__.",
},
"location": "San Francisco, California",
"screen_name": "leahculver",
"name": "Leah Culver",
},
...
Which what Python code would extract "Leah Culver" from the JSON?
- a) x->name
- b) x["users"][0]["name"]
- c) x[0]["name"]
- d) x["users"]["name"]
- e) x["name"]
- The correct Python code to extract "Leah Culver" from the JSON is: b) x["users"][0]["name"]
This accesses the first user in the "users" list and retrieves the "name" associated with that user.
- Data Analyst
- Data Scientist
- Data Engineer
- AI Engineer to Data Scientists
留言
張貼留言