Web Scraping Tutorial | Texting Yourself With Code: How to Automate a Web Scraping Script and Get Notified via SMS

Siena Duplan
4 min readFeb 17, 2021

--

“”It’s like when you finally get a text back and it’s just your [code]” — Fletcher

Perhaps the only thing more embarrassing than the thrill of a new notification only to find it’s a text from your mom is the false excitement over a text notification from your own code.

This tutorial will teach you how to integrate software like cron jobs, Twilio’s REST API, and Selenium WebDrivers to send yourself notifications of updates on whatever websites you may be keeping tabs on whenever you want.

You will need

Scrape

I regularly listen to the Snacks podcast and noticed the hosts always disclose ownership of shares in companies they discuss on the show. I wanted a way to corral all those disclosures into a list and get notified when a new episode or newsletter airs with a new disclosure.

The code below uses Selenium and a chromedriver to visit the first few pages of Snacks newsletters and search for the paragraph element in the html where disclosures are… disclosed. I am using a lot of try/except statements because after all, I won’t be running my code in a terminal and debugging, but getting a text notification so it’s important any failures are communicated via text. And with constantly-changing and dynamically-updating elements across the web, web-scraping scripts usually have a short lifespan before they need to be repaired.

from bs4 import BeautifulSoup as bs
import requests
from selenium import webdriver # may not need if bs4 sufficient
from twilio.rest import Client
article_links_list = []
disclosures = []
text_to_send = ''
apple = False
apple_string = ''
# get list of URLs to visit and plunder!
try:
for page in range(1,5):
url = 'https://snacks.robinhood.com/newsletters/page/%s/'%page
r = requests.get(url)
soup = bs(r.content, 'html.parser')
for a in soup.find_all('a',attrs={'class':'css-k4v2t0'}):
print('grabbing ',url.split('/newsletters')[0]+a['href'])
article_links_list.append(url.split('/newsletters')[0]+a['href'])
except:
text_to_send = '🚨 Failed on bs4 execution'
# visit URLs and dig for gold!
try:
for url in article_links_list:
driver = webdriver.Chrome(executable_path='path/to/chromedriver')
driver.get(url)
p_element = (driver.find_element_by_xpath('//*[@id="__next"]/div[1]/div/div/div[10]/p[1]')).text
driver.quit()
if 'own' in p_element.lower().split(' '):
companies = p_element.lower().split(':')[1]
disclosures.append(companies)
except:
text_to_send = '🚨 Failed on chromedriver execution'

# ugly but efficient way to clean the list in a one-liner
disclosures_clean
= sorted(list(set([item for i in disclosures for item in i.replace(',','').split(' ') if item != 'and' and item != ''])))
print(disclosures_clean)if 'apple' in disclosures_clean: apple = True
if apple: apple_string = 'Found 🍎 in list!\n\n'
text_to_send = "Hello from Siena's bot 🤖 \n\n" +
'Here is the latest list of companies the Snacks hosts are talking about and own shares of 💰\n\n' +
apple_string + '\n'.join(disclosures_clean) + '\n\nThis script used to generate this message runs every Monday-Friday at 8:08AM PST. Cool.'

There are many ways to grab elements from html; in this example, I found using the xpath to be the most direct way to get the text I was looking for on the page. I found the xpath by selecting the text on the page with Chrome’s Dev Tools > right-clicking on the corresponding html tag > Copy > XPath.

The next step is to connect to your Twilio account and instantiate a message. Depending on your Twilio credentials, you may have to pay a very small fee per message (like $0.0075/message as of this writing), so you may want to use them sparingly in testing.

account_sid = "[insert account ID here]"
auth_token = "[insert auth token here]"
try:
client = Client(account_sid, auth_token)
message = client.messages.create(body=text_to_send, to='+your mobile #', from_='+your twilio #')
except:
print('🚨 Error connecting to Twilio client')

Automate

Next, create a cron job on your computer scheduled to run the python script above on a regular cadence. I want the script to run every Monday-Friday morning so using cron nomenclature, I inserted 8 8 * * 1-5 python3 /path/to/script.py 1>/tmp/stdout.log 2>/tmp/stderr.log using vim. In my terminal, I ran crontab -l to list any current jobs I had running, and crontab -e to edit a new job.

The keyboard commands in my terminal looked something like this:

crontab -ei8 8 * * 1-5 python3 /path/to/script.py 1>/tmp/stdout.log 2>/tmp/stderr.logesc:wqcrontab -l
https://crontab.guru/ if you are confused

For my own sake, I added the log outputs in case there were any complications — and of course, there were. At this point, the most likely obstacle you will have is that your terminal has not been granted full disk access. I granted my terminal with full disk access by following the steps outlined here.

That’s it. That’s all you have to do to automate texting yourself when you want to receive updates about anything on the world wide 🕸.🤘.

--

--

Siena Duplan

data scientist • statistics, philosophy, design, humor, technology, data • www.siena.io