Self-Hosting a Web Scraping Farm

Raspberry Pi Cluster DevOps Project Idea

TJ. Podobnik, @dorkamotorka
8 min read · Sep 2, 2022

Over the last six months I worked on a fairly data-intensive project, where the goal was to scrape data from the web, preprocess it, train a deep learning model on it, and deploy that model to production. These are hefty tasks to do repeatedly by hand, so the goal was to automate the process as much as possible.

Just to make things clear: a web scraper is a program that crawls through a web page and extracts data from it. It parses the underlying HTML and can then replicate the page's content elsewhere. Scraping is used in a variety of digital businesses that rely on data harvesting, like:

  • Scraping data off of a site, analyzing its content, and then ranking it
  • Auto-fetching prices and product descriptions for allied seller websites
  • Pulling data from forums and social media for purposes of sentiment analysis in marketing
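At its core, a scraper just parses HTML and pulls out the pieces it cares about. Here is a minimal, stdlib-only Python sketch; the `<h2>` target and the markup are hypothetical, and a real scraper would first fetch the page with urllib or requests:

```python
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collects the text of every <h2> heading on a page."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # Only keep text that appears inside an <h2> element
        if self.in_h2 and data.strip():
            self.titles.append(data.strip())

def scrape_titles(html: str) -> list[str]:
    """Return all <h2> headings found in an HTML string."""
    parser = TitleScraper()
    parser.feed(html)
    return parser.titles
```

In a real deployment, the same idea scales up: fetch the page, parse out the fields you need (prices, posts, descriptions), and write them to storage.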

That’s all fine and well, and most of you probably already know these things. But the primary goal of this post is to give you a high-level overview of the architecture that enables all of this to be automated. I’m talking about containerizing your code, reliably running your web-scraping script, easily deploying new software versions, keeping track of patches, maintaining your self-hosted farm, and much more.

Complete architecture

Physical setup

It’s no secret that there are many great articles on the web about building your own Raspberry Pi cluster instead of purchasing a VM in the cloud or something similar. Not only do you end up with a cluster of cool machines and all those cool LEDs blinking in your room, you also learn a lot building it. This does come with its downsides, but that’s where the fun part begins: you need to be a little bit creative to make your life easier.

If you already have your cluster set up, that’s all fine and well. My cluster comprises the following components:

  • An 8-Port PoE switch
  • 6x CAT5 cables
  • 5x Raspberry Pi 4 with 2GB RAM
  • 5x PoE Raspberry Pi Hats

As long as you have a functional cluster with more than two Raspberry Pis, you’ll get along well with this tutorial. You can also continue with fewer, but then keep in mind there is no such thing as a High-Availability cluster for you, while you can still simulate the rest of the tasks.

But why so many (Raspberry) Pis? 🤔

Actually, I did initially start with only one Raspberry Pi, connected to the web and running a cronjob. But the project I was working on had strict requirements. Not only could I not afford to lose any data in case of a Raspberry Pi failure, but each consecutive file also needed to be generated at fixed intervals and stored in ascending order. And not only did I encounter many Raspberry Pi failures during this project, but from experience I also knew a Pi can go down on average every 2–3 weeks or so. To give you a little bit of context, I think the problem actually lies in the SD card, and I could potentially use a more reliable drive, but let’s leave this mystery for another time.

K3s High-Availability (HA) Cluster

So, we are operating with multiple Pis, and to make them work as one, I deployed a K3s HA cluster on top of them. This constructs a hierarchy of multiple masters and slaves, where the masters delegate the tasks and the slaves are usually the ones that execute them. I said usually because masters are also capable of executing tasks, which can be adjusted in the Kubernetes configuration. It’s important to realize that in case all masters are down, the cluster is dysfunctional, because nobody can delegate tasks. That’s also the reason why I deployed multiple of them, and why it’s called High Availability.
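For reference, a K3s HA cluster with embedded etcd can be bootstrapped roughly like this (commands as documented by the K3s project; `pi-master-1` and `<token>` are placeholders for your own first master and its join token):

```shell
# On the first master: initialize the embedded etcd cluster
curl -sfL https://get.k3s.io | sh -s - server --cluster-init

# Read the join token off the first master
sudo cat /var/lib/rancher/k3s/server/node-token

# On each additional master: join as another control-plane node
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://pi-master-1:6443 --token <token>

# On each worker node: join as an agent
curl -sfL https://get.k3s.io | sh -s - agent \
  --server https://pi-master-1:6443 --token <token>
```

Embedded etcd wants an odd number of server nodes, so with five Pis a 3-master/2-agent split works out nicely.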

Scraping never felt easier

Sooo, where was I… Web scraping, right? Usually, the scripts that scrape data off the web are triggered by an event or at a fixed time interval. One thing to emphasize here is that you don’t want to run the script too often, since you might run out of storage space, but running it too rarely is probably also something you don’t want. In the end, it all boils down to the nature of the problem.

Besides that, it’s important to package your script inside a container. Not only can you then execute it on all of the machines that support this container primitive, you also make sure your code is packaged with all of its dependencies. I wrapped the script in a Docker container, which the masters execute on the available slaves every 3 minutes.
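In Kubernetes terms, "execute the container every 3 minutes" maps naturally onto a CronJob. A minimal manifest might look like this (the image name is a placeholder for your own DockerHub repository):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: web-scraper
spec:
  schedule: "*/3 * * * *"      # every 3 minutes
  concurrencyPolicy: Forbid    # don't start a new run while one is still active
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scraper
              image: your-dockerhub-user/web-scraper:latest
          restartPolicy: OnFailure
```

The scheduler then picks a node with free capacity for each run, which is exactly the "masters execute it on the available slaves" behavior.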

Now comes the fun part…

  • How can I deploy a new version of the web-scraping script to the Kubernetes cluster in “no” time, with almost zero hassle?
  • How can I figure out which Kubernetes node is down and restart it, potentially flashing a fresh image onto the SD card and configuring it to run the required software right off the bat?
  • I’m nowhere near the cluster, but I want to test whether the cluster is functional and whether tasks are successfully executed
  • What if a new software patch breaks the latest Docker image running on the cluster? How do I revert the changes? How can I prevent this from affecting my Kubernetes cluster’s performance?

Automation at its finest

With great things come great challenges. One of the semi-simpler ones I already talked about extensively in one of my previous posts. It’s the basic problem of determining which Raspberry Pi doesn’t work. When you have one Raspberry Pi, that’s a silly question to ask yourself, but when you have multiple, it’s a bit trickier than you might think. Imagine that, either through the Kubernetes cluster or through some other monitoring tool, you’re informed that the Raspberry Pi with IP X.X.X.X is down. You cannot SSH into it to restart it, so you have to do it either manually or through the PoE switch, if that’s something you can do. Here comes my question:

How do you visually know which Raspberry Pi has IP X.X.X.X? 👀

There are a couple of options: you can set a static IP for each Pi and stick a label with the IP on top of it, or you can color-code them in some way and configure the hostnames accordingly (e.g. bluepi, redpi, etc.). But there’s an even easier thing that you can do: have a look at it.

Another one I want to briefly mention is related to automating the configuration of newly added nodes, or reconfiguring stalled ones, so one can get things up and running just by plugging the power cable into the Raspberry Pi and connecting it to the network. For that, I created an Ansible playbook which, besides setting the hostname and installing some apt packages, also configures the NFS for K3s.
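A rough sketch of what such a playbook can look like (the module names are real Ansible modules; the NFS server address, mount path, and package list are placeholders for your own setup):

```yaml
- hosts: pis
  become: true
  tasks:
    - name: Set the hostname from the inventory
      ansible.builtin.hostname:
        name: "{{ inventory_hostname }}"

    - name: Install required apt packages
      ansible.builtin.apt:
        name: [nfs-common, curl]
        update_cache: true

    - name: Mount the NFS share used by K3s
      ansible.posix.mount:
        src: "nas.local:/export/k3s"   # placeholder NFS server
        path: /mnt/k3s
        fstype: nfs
        state: mounted
```

With the inventory listing every Pi, a freshly flashed node only needs power and a network cable before `ansible-playbook` brings it into line with the rest.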

And because I’m really, really lazy, that was not it. I really wanted to reach a state where I only need to push commits to my GitHub repository, and within the next couple of minutes I should already see the new version of the software running on my Kubernetes cluster. Sounds good, right?

Well, that can be done, and it’s actually not that complicated. The solution consists of two steps:

  • GitHub Actions runs unit tests, builds a Docker image, and pushes it on DockerHub
  • ArgoCD checks for new Docker images and deploys them to the cluster
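The first step can be sketched as a GitHub Actions workflow like the one below. This assumes a Python project; the DockerHub repository name and secrets are placeholders, and the QEMU/buildx steps are there because the Pis need arm64 images:

```yaml
name: ci
on:
  push:
    branches: [main]

jobs:
  test-build-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run unit tests
        run: |
          pip install -r requirements.txt
          python -m pytest

      # Cross-build for the Raspberry Pis' arm64 CPUs
      - uses: docker/setup-qemu-action@v3
      - uses: docker/setup-buildx-action@v3

      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - uses: docker/build-push-action@v6
        with:
          push: true
          platforms: linux/arm64
          tags: your-dockerhub-user/web-scraper:${{ github.sha }}
```

ArgoCD then watches for the new tag and syncs the cluster to it, closing the loop.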

In general terms, this is usually referred to as the Continuous Integration and Continuous Delivery (CI/CD) model. We’re obviously just scratching the surface here, but you get the idea.

In the beginning, I also mentioned it’s possible for a new Docker image to fail (no shit, Sherlock). To counter that, we need to utilize the power of Kubernetes, namely the Blue/Green Deployment: the new software version is brought up alongside the old one and traffic is switched over to it, and in case container execution fails, the traffic gets easily switched back to the old version.

Source: Octopus Deploy
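In plain Kubernetes, a simple way to get Blue/Green behavior is to run two Deployments (blue and green) and point a Service at one of them via its label selector: flipping the selector cuts traffic over, and flipping it back is the rollback. A sketch, with placeholder names and ports:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: scraper-api
spec:
  selector:
    app: scraper
    version: blue   # flip to "green" to cut over, back to "blue" to roll back
  ports:
    - port: 80
      targetPort: 8080
```

Tools like ArgoCD Rollouts can automate this flip, but the label-selector trick is the underlying mechanism.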

Access it like it’s next door

I don’t think I have to emphasize that, as a human being, you should move your ass around once in a while, and you’re not always going to be present when something goes wrong with your cluster. You might also just be curious about what’s going on with your web scraping tasks and want to log into your cluster. Basically, you need to set up remote access, be it a VPN or the port forwarding method: something that will enable you to debug what’s going on with your farm and act accordingly.

Is there more?

I’m actually not done yet, but I’ll call it THE END here. There are quite a few more things I could write about, just to name a few:

  • Run GitHub Actions on your self-hosted Kubernetes in case you’re running out of free-tier minutes
  • Stop using SD cards and Netboot your Raspberry Pi Cluster
  • Restart the Raspberry Pi remotely in case of failures when you’re not able to SSH into it, which is possible if you can access your PoE switch configuration
  • Debugging your Raspberry Pi using Serial Console in case you have boot-related issues
  • Easily apply changes to your Raspberry Pi Base image by emulating Raspberry Pi on your PC

Thanks for reading! 😎 If you enjoyed this article, hit that clap button below 👏

It would mean a lot to me, and it helps other people see the story. Say hello on LinkedIn | Twitter

Do you want to start reading exclusive stories on Medium? Use this referral link 🔗

If you liked my post you can buy me a Hot dog 🌭

Are you an enthusiastic engineer who lacks the ability to compile compelling and inspiring technical content about your work? Hire me on Upwork 🛠️

Check out the rest of my content at Teodor J. Podobnik, @dorkamotorka and follow me for more, cheers!