Automating the boring stuff is part of my manifesto, so I’ve decided to automate as much of my daily work as possible with Python. Have you ever tracked how much time you lose checking the same items at online shops over and over, hoping to spot a good deal? For me it is 10 minutes per day. That’s not much, but over a month it adds up to 300 minutes. Spending around 5 hours per month on a routine action is silly when you know how to code.
Why Python for automation?
Straightforward – I know a little bit of Python, and the language comes with modules and libraries that can automate your daily processes.
And, most important, Python is simple.
1. Grab a page’s data with the requests library and store it in a variable. The variable now contains the raw HTML+JS code of the page.
2. Parse that data with the bs4 library (the BeautifulSoup module).
3. Parsing is the process of picking the HTML elements and attributes you need out of that raw data. Once you have the data, you can write it to a text file, send it to a database, etc. – or, as I did, send it to my Telegram messenger. A minimal sketch of steps 1–3 follows this list.
4. And the “cherry on top of the cake” – run your Python scraper on a server (a VPS, for example), schedule it as a cron job and get messages anytime, anywhere, as often as you want. I have set up a cron job to run my “Shop automation” Python code 3 times per day, every day of the week. It is very convenient now: I don’t need to visit those websites and spend my precious time. Recently I wrote a post about how to properly schedule a cron job on an Ubuntu server, please take a look before we go to the next step.
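Here is a minimal sketch of steps 1–3 against a static page (example.com is just a placeholder URL; the shops I scrape below need a real browser, as explained in the Workaround section):

import requests
from bs4 import BeautifulSoup

# 1. grab the page's data with requests
response = requests.get("https://example.com/")
# 2. parse the raw HTML with BeautifulSoup
page = BeautifulSoup(response.text, "html.parser")
# 3. find the elements you need, e.g. the page title and all link targets
print(page.title.text)
for link in page.select("a"):
    print(link.get("href"))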
Object of automation
Each day I check the same 2 websites (local shops) for good deals on Python books (seriously, I’m in love with tech and programming literature). I bought one of my very first programming books there and it lit the fire I needed – I started this blog because of it.
Workaround
Setting up the browser. Websites have their own rules for bots, crawlers and robots, so it’s not enough to just run a requests.get command and grab the content. You need a “real” browser to programmatically grab the information from the web. Selenium with geckodriver is the right choice here. I run my code on both Windows and Linux machines, and geckodriver is cross-platform (it supports Mac as well).
Download geckodriver from the official releases page and unpack it into the Python project’s folder.
If you’re on Linux -> Add the driver to your PATH so other tools can find it: export PATH=$PATH:/path-to-extracted-file/
And please don’t forget to take a look at the final code: you must declare the module imports and install the libraries from requirements.txt.
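For reference, the imports the snippets below rely on look roughly like this (a sketch based only on the code shown in this post; the exact list lives in the full code and requirements.txt):

import datetime
import platform
import time

import requests
from bs4 import BeautifulSoup as Soup
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.firefox.service import Service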
1) Perform the OS check:

if "Linux" in platform.system():
    path = "geckodriver"
else:
    path = "geko/geckodriver.exe"
2) I prefer the Firefox geckodriver. Add options for the browser startup if you want to run it silently (which I recommend):

headless = True
options = webdriver.FirefoxOptions()
service = Service(executable_path=path)
options.headless = True
if headless:
    options.add_argument('-headless')
    options.add_argument("-disable-dev-shm-usage")
    options.add_argument("-no-sandbox")
    options.add_argument('-disable-gpu')
No GPU needed, headless (runs in the background), and it can run on low RAM.
3) When you request an HTML page from your browser, you also send data about your system and browser. This is called the user-agent and travels in the request {headers}. We could write the headers into a variable and pass it as a parameter to the driver object, but that would not be enough: each time we start the web driver, the user-agent should be regenerated. Python has a module for that, fake-useragent; we will use its UserAgent().firefox attribute, so the headers match the web driver.

profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override", useragent)
driver = webdriver.Firefox(firefox_profile=profile, service=service, keep_alive=False, options=options)
driver.implicitly_wait(30)
Browser function summary
def startbrowser():
    if "Linux" in platform.system():
        path = "geckodriver"
    else:
        path = "geko/geckodriver.exe"
    headless = True
    options = webdriver.FirefoxOptions()
    service = Service(executable_path=path)
    options.headless = True
    if headless:
        options.add_argument('-headless')
        options.add_argument("-disable-dev-shm-usage")
        options.add_argument("-no-sandbox")
        options.add_argument('-disable-gpu')
    useragent = UserAgent().firefox
    print(useragent)
    profile = webdriver.FirefoxProfile()
    profile.set_preference("general.useragent.override", useragent)
    driver = webdriver.Firefox(firefox_profile=profile, service=service,
                               keep_alive=False, options=options)
    driver.implicitly_wait(30)
    return driver
The website scraper and data parser
Declare the URL of the website you want to grab data from:

atb_url = "https://zakaz.atbmarket.com/catalog/1016/411/"
Start the scraper function by invoking the browser and sending a get request to the desired URL:

driver = startbrowser()
try:
    driver.get(url)
    time.sleep(3)
except:
    print("can't run webdriver atb")
    if driver:
        driver.quit()
    pass
After a successful driver.get(url), the driver object contains all the raw HTML+JS data of the web page. The next thing you want to do is parse that data:
source = driver.page_source
driver.quit()
page = Soup(source, features='html.parser')
articles = page.select("article")
products = len(articles)
books_str = ""
now = datetime.datetime.now().replace(microsecond=0) + datetime.timedelta(hours=2)
The now variable is the timestamp of the moment the code runs (shifted by +2 hours to my local time).
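Side note: the +2 hours is a hardcoded offset added to a naive datetime. If the intent is simply “current time in UTC+2”, a timezone-aware version (a sketch, assuming UTC+2 really is the target offset) does not depend on the server’s configured timezone:

# build the timestamp directly in UTC+2 instead of shifting a naive local datetime
tz = datetime.timezone(datetime.timedelta(hours=2))
now = datetime.datetime.now(tz).replace(microsecond=0)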
Parse the data with a for loop:
for _ in range(products):
    try:
        title = articles[_].select(".catalog-list .catalog-item__title")
        title_text = title[0].text.strip()
        product_link = title[0].a["href"]
        price = articles[_].select(".catalog-list .catalog-item__bottom .product-price__top")
        price_text = price[0].attrs["value"].strip()
        price_text_str = f'<a href="https://zakaz.atbmarket.com{product_link}"><i>{_+1}){title_text}:{price_text} грн.</i></a>\n'
        books_str = books_str + f'{price_text_str}'
    except:
        pass
    if _ == products:
        break
atb_str = f"<pre>Книги АТБ від: {now}</pre>\n{books_str}"
print(atb_str)
return atb_str
The atb_str variable now contains all the data I want to collect (book name, price and product link), already in the right format (Telegram HTML markup).
Optional:
3) Create a bot and send the updates via Telegram. You need a Telegram bot, and the bot must be in a group or channel together with you. You must have the bot’s token and the group/channel id.
def telegram_api(telegram_message):
    token = "XXX"
    chat_id = "-XXX"
    url_req = "https://api.telegram.org/bot" + token + "/sendMessage" + "?chat_id=" + chat_id + "&parse_mode=HTML" + "&disable_web_page_preview=true" + "&text=" + telegram_message
    try:
        results = requests.get(url_req)
        print(results.json())
    except:
        print("can't run requests get query")
        pass
    print("grab done")
Take a look at the url_req variable: the data that the bot will send is contained in the telegram_message variable.
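One caveat: building url_req by plain string concatenation can break if the message contains characters like & or #. A safer variant of the same sendMessage call (a sketch, not the code from my repo) lets requests URL-encode the parameters:

params = {
    "chat_id": chat_id,
    "parse_mode": "HTML",
    "disable_web_page_preview": "true",
    "text": telegram_message,
}
# requests builds and URL-encodes the query string, so special characters in the message are safe
results = requests.get("https://api.telegram.org/bot" + token + "/sendMessage", params=params)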
4) Run the scraper and telegram bot functions:

# run the atb scraper:
atb_message = atb_grabber(atb_url)
# run telegram_api with the configured telegram message
telegram_api(atb_message)
Suggestions
Edit and run this code with your own link – it’s suitable for most web pages. And don’t forget: do not spam with it, website firewalls can block you. Use it carefully.
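A simple way to stay polite is to pause for a random few seconds between page loads, for example (a sketch; the 5–15 second range is arbitrary):

import random
import time

# wait a random 5-15 seconds before hitting the next page
time.sleep(random.uniform(5, 15))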
Results
I’m getting beautifully formatted Telegram messages three times per day; the code is scheduled to run via cron on my Ubuntu VPS – in case you wonder which one, the cheapest one from VULTR (my ref link).
Resources
full code on GitHub
Python project’s requirements.txt
geckodriver official