Skip to main content

004 - HTML Scraping with Beautiful Soup

Stream Our Mistakes screenshotStream Our Mistakes EP 004


In this episode, Matt walks us through html/web scraping using the popular python library, Beautiful Soup.



Here's the code snippet from the session and links:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
# Created for Stream Our Mistakes 
# https://streamourmistakes.blogspot.com/

# Reference:
# https://docs.python.org/3/library/urllib.request.html
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/

from bs4 import BeautifulSoup
import urllib.request

''' 
# local html to play with from documentation Uncomment to enable 
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
'''

# Get the html from the web.
f = urllib.request.urlopen('https://en.wikiquote.org/wiki/Aristotle')

# Load the html into the parser.
soup = BeautifulSoup(f.read(), 'html.parser')

# Show the whole raw 
# print(soup.prettify())

# Access a single element.
# print(soup.title)

# Find all a tags in the html doc and print some information.
links = soup.find_all('a')

for link in links:
    print(link.get('href'))

print(len(links))

links:
https://docs.python.org/3/library/urllib.request.html
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Subscribe to the podcast on apple podcastsgoogle play,  stitcher

matt
site: http://octon.io/
github: https://github.com/mmdempsey

eddyizm
site: http://eddyizm.com
twitter: http://twitter.com/eddyizm
github: https://github.com/eddyizm

perry
github: https://github.com/apk29

---
**youtube live broadcast:**
https://youtube.com/user/eddyizm/live

Subscribe to our channel and follow my twitter feed to be notified of our next live broadcast and feel free to leave us comments and suggestions on what you want to see.

Comments

Popular posts from this blog

Insta_Delete

Earlier this year I began playing with an open source instagram bot, InstaPy, after hearing the creator talk about web automation with selenium. I was very impressed and frankly amazed with what was possible with selenium and wanted to see how I could incorporate it in both my personal projects and professional career.

Since 2010, I had posted more images on IG than Flickr, 8,200+ with a good chunk of them "throw away" images. I already backed up my photos so now came the challenge to clean/delete my feed. Spring cleaning, pruning the hedges, whatever you want to call it, IG doesn't make it easy.

Enter Insta_Delete (no relation to the app that kept popping up in my google searches).

I decided to build myself a bot glorified script to first scroll as far back as possible on my feed, then scrape the page for URL's, parse and find the href links, save them to a file, log in with a mobile emulated browser and delete those old posts.







I wrote a script that is working now, …

Data Visualization with Python

Scatter plots with Matplotlib  I'm in the middle of taking a 6 week Data Visualization course at Code Academy so I guess you might call this a midterm project. In this jupyter notebook project, we have use real world space data (celestial star location ) for the Orion constellation and output a 3D scatter plot. This was fun but because it is an intro course, the project didn't even get to labeling the actual star which I thought was bunk. 
At the end of the project, they offer you a link to some star data and challenge you to plot some local stars. So I decided to publish my results that I will end up turning in for the extra credit portion of the project. (code below)
I picked a few stars, starting with some familiar ones, like Sirius, and started plotting it out. It took a while to get the labels on correctly, for some reason, I thought it was going to be easy but it definitely took some searching as the 3D portion of it made finding the examples far more challenging than t…