004 - HTML Scraping with Beautiful Soup

Stream Our Mistakes EP 004


In this episode, Matt walks us through HTML/web scraping using the popular Python library Beautiful Soup.



Here's the code snippet from the session and links:

# Created for Stream Our Mistakes 
# https://streamourmistakes.blogspot.com/

# Reference:
# https://docs.python.org/3/library/urllib.request.html
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/

from bs4 import BeautifulSoup
import urllib.request

''' 
# Local html to play with, from the documentation. Remove the surrounding triple single quotes to enable.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
'''

# Get the html from the web; the with block closes the connection when done.
with urllib.request.urlopen('https://en.wikiquote.org/wiki/Aristotle') as f:
    # Load the html into the parser.
    soup = BeautifulSoup(f.read(), 'html.parser')

# Show the whole parsed document, pretty-printed.
# print(soup.prettify())

# Access a single element.
# print(soup.title)

# Find all a tags in the html doc and print some information.
links = soup.find_all('a')

for link in links:
    print(link.get('href'))

print(len(links))
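
The loop above prints every href exactly as it appears in the page, so relative links (like /wiki/Plato) stay relative and a tags without an href print as None. Here is a rough sketch of one way to tidy that up, run against a small local document so it works offline. The extract_links helper, the trimmed-down sample html and the example.com base url are made up for illustration and were not part of the session.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Small sample document, loosely adapted from the Beautiful Soup docs.
# The relative link and the anchor without an href are added for illustration.
HTML_DOC = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="/lacie" class="sister" id="link2">Lacie</a> and
<a name="top">no href here</a>
</p>
</body></html>
"""


def extract_links(html, base_url):
    """Return absolute urls for every a tag that has an href (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for link in soup.find_all("a"):
        href = link.get("href")  # None when the attribute is missing
        if href is None:
            continue
        # urljoin leaves absolute urls alone and resolves relative ones.
        urls.append(urljoin(base_url, href))
    return urls


if __name__ == "__main__":
    for url in extract_links(HTML_DOC, "http://example.com/"):
        print(url)
```

To try it on the live page, pass f.read() and 'https://en.wikiquote.org/wiki/Aristotle' as the two arguments instead.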

links:
https://docs.python.org/3/library/urllib.request.html
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Subscribe to the podcast on Apple Podcasts, Google Play, Stitcher

matt
site: http://octon.io/
github: https://github.com/mmdempsey

eddyizm
site: http://eddyizm.com
twitter: http://twitter.com/eddyizm
github: https://github.com/eddyizm

perry
github: https://github.com/apk29

---
**youtube live broadcast:**
https://youtube.com/user/eddyizm/live

Subscribe to our channel and follow my twitter feed to be notified of our next live broadcast and feel free to leave us comments and suggestions on what you want to see.
