The Problem
Today, if I want to know when my garbage or recycling will be picked up, I have to open a web browser, navigate to the city of Milwaukee's recycling website, type in my address, and hit submit. While the garbage pickup is fairly regular each week, the recycling pickup is on a schedule I do not understand. Those 60 seconds of work nearly every week are not just annoying; they're not a good use of my time. (My desire to automate the boring stuff in my life is another post.)
Ultimately, I want to be able to ask Alexa when the recycling (or garbage) will be picked up and get a response. The first step is to get the relevant information from the city's website. That's what I'll be demonstrating today.
The Setup
If you're not familiar with Python, you should start to dabble. It's a great data analysis and data science tool, but also helps you automate those tedious tasks you hate. Oh, and it's free! And it integrates with lots of awesome tools, such as Tableau and Alteryx.
- Install Anaconda (this installs the Python language and lots of great packages)
- Install the pandas, bs4, and selenium packages (some of them might already come with Anaconda); example install commands are below
- Download Chrome driver (this will be needed for Selenium)
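For the packages, something like the following should work from the Anaconda Prompt (or any terminal). Treat these commands as a sketch rather than gospel: the package that provides bs4 is actually called beautifulsoup4, and your setup may prefer conda over pip (or vice versa).

# with conda (beautifulsoup4 is the package that provides bs4)
conda install pandas beautifulsoup4 selenium

# or with pip
pip install pandas beautifulsoup4 selenium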
The Code
I'm still pretty new to Python, and I've learned that while everyone structures their code differently, there are certain best practices. Here's my caveat: I'm 100% sure I don't follow all of them. But I organize my code in a way that is helpful for me.
Import Packages
Selenium replicates what a human would do: opening a web browser, navigating to the page, entering an address, and hitting the submit button. Automating those actions requires Selenium, and Selenium in turn needs the Chrome web driver. BeautifulSoup parses the information the website returns. Finally, Pandas is used to store the resulting information; it's possible Pandas won't be needed in the future.
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd
Set Variables
The URL is the website I'm interested in getting information from, and the driver points to where you saved the Chrome web driver. You'll also need to set your street address; the address fields correspond directly to the form on the website I'm using.
# what's the url and where is your Selenium Chrome driver saved?
url = "https://city.milwaukee.gov/sanitation/GarbageRecyclingSchedules"
driver = webdriver.Chrome('C:/Users/bbeals/Selenium/chromedriver')
driver.implicitly_wait(30)
driver.get(url)

# what's your street address? save it here!
address = '3536'
direction = 'W'
street = 'FOND DU LAC'
streettype = 'AV'
Identify Form Elements
You'll figure out pretty fast that web scraping relies heavily on a website's format and structure. If that structure changes, you'll have to update your code. Fun stuff. This code segment finds the elements of the form on the website so we can refer to them when we enter the address. A computer doesn't see things the way humans do; instead, it sees the code that lives behind the scenes of a website. You can find this code by hitting F12 in Chrome. Alternatively, you can right-click on the element you're interested in and select 'Inspect'. In this case, I'm finding each form element by its name.
# this particular website has an embedded iframe so select that iframe first
iframe = driver.find_element(By.CSS_SELECTOR, 'iframe')
driver.switch_to.frame(iframe)

# now find the various elements where you need to enter your address
address_element = driver.find_element(By.NAME, 'laddr')
direction_element = driver.find_element(By.NAME, 'sdir')
street_element = driver.find_element(By.NAME, 'sname')
streettype_element = driver.find_element(By.NAME, 'stype')
Enter Address
At this point, we have set variables with the information to be entered into the form and identified the elements of the form, but we have yet to actually do anything. Here's where the magic happens. For each form element, I will select the element by simulating a click and then enter the appropriate information. Finally, I will submit the form to retrieve the results.
# street number first
address_element.click()
address_element.send_keys(address)

# then street direction info
direction_element.click()
direction_element.send_keys(direction)

# then what street you live on
street_element.click()
street_element.send_keys(street)

# then fill in the street type
streettype_element.click()
streettype_element.send_keys(streettype)

# finally, find and click the Submit button
submit = driver.find_element(By.NAME, 'Submit')
submit.click()
Scrape Results
Once the form has been submitted, the website displays the results, and this is where BeautifulSoup comes in. The displayed page needs to be captured and parsed. Given the format of the website, I know the type of pickup (garbage or recycling) is a header (called h2 in HTML-speak). Similarly, the pickup dates are in bold, which is one of the only differentiators I have found that lets me identify that info. In the code segment below, I'm saving a screenshot just because, capturing all the HTML, and finding all h2 and bold (also called strong) elements.
# save a screenshot (for funsies)
driver.save_screenshot('image.png')

# get the html to parse later
html = driver.page_source

# create soup object
soup = BeautifulSoup(html, 'html.parser')

# save header text to list
category = []
for text in soup.find_all('h2'):
    category.append(text.get_text())

# save bold text to list
date = []
for text in soup.find_all('strong'):
    date.append(text.get_text())
Format Results
Once the details I'm interested in have been downloaded, I want to format the results into a table.
# keep 2nd and 4th item in the list
# these correspond to the 1st and 3rd 0-indexed items
# so keep items, stepping by 2
date = date[1:4:2]

df = pd.DataFrame(list(zip(category, date)), columns=['Category', 'Date'])
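If that slice looks cryptic, here's the same pattern on a toy list (the letters are made up purely to show which positions survive):

# toy stand-in for the scraped bold text
sample = ['a', 'b', 'c', 'd']

# start at index 1, stop before index 4, step by 2 -- keeps the 2nd and 4th items
print(sample[1:4:2])  # ['b', 'd']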
Quit
# close the browser and end the Selenium session
driver.quit()
The Results
After all that code, I'm left with a 2x2 table: one row each for garbage and recycling, paired with its next pickup date.
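Since the end goal is to ask Alexa for one of these dates, here's a minimal sketch of how I might pull a single answer back out of that DataFrame. It assumes the scraped h2 headers contain the words 'garbage' and 'recycling' (my reading of the page, not something the city's site guarantees), and the next_pickup helper is just something I made up for illustration.

# hypothetical helper: look up the date for whichever pickup type I ask about
def next_pickup(df, pickup_type):
    # match the category header case-insensitively (assumes it mentions 'garbage' or 'recycling')
    match = df[df['Category'].str.contains(pickup_type, case=False)]
    if match.empty:
        return f"Sorry, I couldn't find a {pickup_type} pickup date."
    return f"The next {pickup_type} pickup is {match['Date'].iloc[0]}."

print(next_pickup(df, 'recycling'))
print(next_pickup(df, 'garbage'))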
The Future
This code was a fun little project and does a decent job at retrieving the information I need. So, what's next?
- Modify the code to handle complexities such as the garbage being picked up on the same day the code is run (I've found that when this happens, extra bolding is included)
- Modify the code to handle leaf pickup during the autumn months (maybe I want to know about this)
- Set up an Alexa skill to comprehend what I'm asking for (garbage or recycling)
- Create an AWS Lambda function to run this Python script on demand
- Leverage AWS services to read the results to me