Automating the boring stuff is part of my manifesto, so I’ve decided to automate as much of my daily work as possible with Python. Have you ever tracked how much time you lose checking the same items at online shops over and over, hoping to spot a good deal? For me it is 10 minutes per day. That’s not much, but over a month it adds up to 300 minutes. Spending around 5 hours per month on a routine action is silly when you know how to code.
Why Python for automation?
Straightforward – I know a little bit of Python, and the language comes with modules and libraries that can automate your daily processes.
And, most important, Python is simple.
1. Grab a page’s data with the requests library and store it in a variable. The variable now contains the raw HTML+JS code of the page.
2. Parse that data with the bs4 library (the BeautifulSoup module).
3. Parsing is the process of picking the HTML elements and attributes you need out of that raw data. Once you have the data, you can write it to a text file, send it to a database, etc. – or, as I did, send it to my Telegram messenger. A minimal sketch of steps 1–3 follows this list.
4. And the “cherry on top of the cake” – run your Python scraper on a server (a VPS, for example), schedule it as a cron job and get messages anytime, anywhere, as often as you want. I have set up a cron job to run my “Shop automation” Python code 3 times per day, every day of the week. It is very convenient now: I don’t need to visit those websites and spend my precious time. Recently I wrote a post about how to properly schedule a cron job on an Ubuntu server, please take a look before we go to the next step.
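Here is a minimal sketch of steps 1–3 against a static page (example.com is just a placeholder URL; the shops I scrape below need a real browser, as explained in the Workaround section):

import requests
from bs4 import BeautifulSoup

# 1. grab the page's data with requests
response = requests.get("https://example.com/")
# 2. parse the raw HTML with BeautifulSoup
page = BeautifulSoup(response.text, "html.parser")
# 3. find the elements you need, e.g. the page title and all link targets
print(page.title.text)
for link in page.select("a"):
    print(link.get("href"))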
Object of automation
Each day I check the same 2 websites (local shops) for good deals on Python books (seriously, I’m in love with tech and programming literature). I bought one of my very first programming books there and it lit the fire I needed – I started this blog because of it.
Workaround
Setting up the browser. Websites have their own rules for bots, crawlers and robots, so it’s not enough to just run a requests.get command and grab the content. You need a “real” browser to programmatically grab the information from the web. Selenium with geckodriver is the right choice here. I run my code on both Windows and Linux machines, and geckodriver is cross-platform (it supports Mac as well).
Download geckodriver from the official releases page and unpack it into the Python project’s folder.
If you’re on Linux -> Add the driver to your PATH so other tools can find it: export PATH=$PATH:/path-to-extracted-file/
And please don’t forget to take a look at the final code: you must declare the module imports and install the libraries from requirements.txt.
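For reference, the imports the snippets below rely on look roughly like this (a sketch based only on the code shown in this post; the exact list lives in the full code and requirements.txt):

import datetime
import platform
import time

import requests
from bs4 import BeautifulSoup as Soup
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.firefox.service import Service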
1) Perform the OS check:

if "Linux" in platform.system():
    path = "geckodriver"
else:
    path = "geko/geckodriver.exe"
2) I prefer the Firefox geckodriver. Add options for the browser startup if you want to run it silently (which I recommend):

headless = True
options = webdriver.FirefoxOptions()
service = Service(executable_path=path)
options.headless = True
if headless:
    options.add_argument('-headless')
    options.add_argument("-disable-dev-shm-usage")
    options.add_argument("-no-sandbox")
    options.add_argument('-disable-gpu')
No GPU needed, headless (runs in the background), and it can run on low RAM.
3) When you request an HTML page from your browser, you also send data about your system and browser. This is called the user-agent and travels in the request {headers}. We could write the headers into a variable and pass it as a parameter to the driver object, but that would not be enough: each time we start the web driver, the user-agent should be regenerated. Python has a module for that, fake-useragent; we will use its UserAgent().firefox attribute, so the headers match the web driver.

profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override", useragent)
driver = webdriver.Firefox(firefox_profile=profile, service=service, keep_alive=False, options=options)
driver.implicitly_wait(30)
Browser function summary
def startbrowser():
    if "Linux" in platform.system():
        path = "geckodriver"
    else:
        path = "geko/geckodriver.exe"
    headless = True
    options = webdriver.FirefoxOptions()
    service = Service(executable_path=path)
    options.headless = True
    if headless:
        options.add_argument('-headless')
        options.add_argument("-disable-dev-shm-usage")
        options.add_argument("-no-sandbox")
        options.add_argument('-disable-gpu')
    useragent = UserAgent().firefox
    print(useragent)
    profile = webdriver.FirefoxProfile()
    profile.set_preference("general.useragent.override", useragent)
    driver = webdriver.Firefox(firefox_profile=profile, service=service,
                               keep_alive=False, options=options)
    driver.implicitly_wait(30)
    return driver
The website scraper and data parser
Declare the URL of the website you want to grab data from:

atb_url = "https://zakaz.atbmarket.com/catalog/1016/411/"
Start the scraper function by invoking the browser and sending a get request to the desired URL:

driver = startbrowser()
try:
    driver.get(url)
    time.sleep(3)
except:
    print("can't run webdriver atb")
    if driver:
        driver.quit()
    pass
After a successful driver.get(url), the driver object contains all the raw HTML+JS data of the web page. The next thing you want to do is parse that data:
source = driver.page_source
driver.quit()
page = Soup(source, features='html.parser')
articles = page.select("article")
products = len(articles)
books_str = ""
now = datetime.datetime.now().replace(microsecond=0) + datetime.timedelta(hours=2)
The now variable is the timestamp of the moment the code runs (shifted by +2 hours to my local time).
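Side note: the +2 hours is a hardcoded offset added to a naive datetime. If the intent is simply “current time in UTC+2”, a timezone-aware version (a sketch, assuming UTC+2 really is the target offset) does not depend on the server’s configured timezone:

# build the timestamp directly in UTC+2 instead of shifting a naive local datetime
tz = datetime.timezone(datetime.timedelta(hours=2))
now = datetime.datetime.now(tz).replace(microsecond=0)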
Parse the data with a for loop:
for _ in range(products):
    try:
        title = articles[_].select(".catalog-list .catalog-item__title")
        title_text = title[0].text.strip()
        product_link = title[0].a["href"]
        price = articles[_].select(".catalog-list .catalog-item__bottom .product-price__top")
        price_text = price[0].attrs["value"].strip()
        price_text_str = f'<a href="https://zakaz.atbmarket.com{product_link}"><i>{_+1}){title_text}:{price_text} грн.</i></a>\n'
        books_str = books_str + f'{price_text_str}'
    except:
        pass
    if _ == products:
        break
atb_str = f"<pre>Книги АТБ від: {now}</pre>\n{books_str}"
print(atb_str)
return atb_str
The atb_str variable now contains all the data I want to collect (book name, price and product link), already in the right format (Telegram HTML markup).
Optional:
3) Create a bot and send the updates via Telegram. You need a Telegram bot, and the bot must be in a group or channel together with you. You must have the bot’s token and the group/channel id.
def telegram_api(telegram_message):
    token = "XXX"
    chat_id = "-XXX"
    url_req = "https://api.telegram.org/bot" + token + "/sendMessage" + "?chat_id=" + chat_id + "&parse_mode=HTML" + "&disable_web_page_preview=true" + "&text=" + telegram_message
    try:
        results = requests.get(url_req)
        print(results.json())
    except:
        print("can't run requests get query")
        pass
    print("grab done")
Take a look at the url_req variable: the data that the bot will send is contained in the telegram_message variable.
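One caveat: building url_req by plain string concatenation can break if the message contains characters like & or #. A safer variant of the same sendMessage call (a sketch, not the code from my repo) lets requests URL-encode the parameters:

params = {
    "chat_id": chat_id,
    "parse_mode": "HTML",
    "disable_web_page_preview": "true",
    "text": telegram_message,
}
# requests builds and URL-encodes the query string, so special characters in the message are safe
results = requests.get("https://api.telegram.org/bot" + token + "/sendMessage", params=params)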
4) Run the scraper and telegram bot functions:

# run the atb scraper:
atb_message = atb_grabber(atb_url)
# run telegram_api with the configured telegram message
telegram_api(atb_message)
Suggestions
Edit and run this code with your own link – it’s suitable for most web pages. And don’t forget: do not spam with it, website firewalls can block you. Use it carefully.
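A simple way to stay polite is to pause for a random few seconds between page loads, for example (a sketch; the 5–15 second range is arbitrary):

import random
import time

# wait a random 5-15 seconds before hitting the next page
time.sleep(random.uniform(5, 15))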
Results
I’m getting beautifully formatted Telegram messages three times per day; the code is scheduled to run via cron on my Ubuntu VPS – in case you wonder which one, the cheapest one from VULTR (my ref link).
Resources
full code on GitHub
Python project’s requirements.txt
geckodriver official