Web Scraping with Firebase Cloud Functions and Cloud Firestore

Patric Steiner · November 12, 2019

In the previous post we saw how to build a script that scrapes the website of a local supermarket for beer promotions and returns the data as a clean JSON array.

Remember, the goal is to build a system that lets people set up a subscription so they are notified about current promotions and can profit from cheap beer prices.

TODOs:

  • Build a script that scrapes the supermarket's website and returns all promotions as a JSON array (covered in the previous post)
  • Deploy the script, schedule it to run in the cloud and persist the data (topic of this post)
  • Build a frontend that lets users subscribe to promotions (we will look at this in the next post)

Preparing to use Firebase 🔥☁️

Alright, let's get to it. To make our script available to the world, we could build a web server that hosts the script and executes it whenever a user visits the page. But hey, it's 2019, we don't necessarily need a web server for this simple use case. We can just deploy our script as a cloud function, so we don't have to worry about any web server ourselves. I personally like to use Firebase for this, not only because of its simplicity, but also for its generous free tier. In addition, Firebase also offers a database called Firestore that we can use to persist the scraped data - perfect!

After creating a Firebase project, we can set up our environment locally and initialize the project. We need to install the firebase-tools package using npm install -g firebase-tools and then log in to our account with firebase login.

Writing a cloud function λ

Now we are ready to set up the project locally: mkdir myAwesomeProject && cd myAwesomeProject && firebase init functions. We can write the function in either JavaScript or TypeScript - I prefer TypeScript. There are various kinds of functions; for now we will just use a plain HTTPS function.

In index.ts, which was generated by the firebase init command, we need to import firebase-functions and also firebase-admin to interact with the database:

import * as admin from 'firebase-admin';
import * as functions from 'firebase-functions';

admin.initializeApp(); // needed to initialize the admin sdk

Let's also import the scrapePromotions function that we created in the previous post (assuming it's exported from a file called scrape.ts). Since our cloud function will be exported under the same name, we import it under the alias scrape to avoid a naming collision:

import { scrapePromotions as scrape } from './scrape';

export const scrapePromotions = functions
    .runWith({ timeoutSeconds: 30, memory: "1GB" }) // ensure we have enough resources
    .region('europe-west1') // select a region that is close to your target audience
    .https.onRequest(async (req, res) => {
        const promotions = await scrape();

        // TODO persist the promotions

        res.json(promotions); // send the JSON array as the body of the HTTPS response
    });
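
The snippet above assumes the scrape always succeeds. In practice, scraping can fail for many reasons (the site changes its markup, the headless browser times out), so you may want to wrap the call in a try/catch and report the failure. Here is a minimal sketch of such a defensive variant (scrapePromotionsSafe is just an illustrative name, imports as above):

export const scrapePromotionsSafe = functions
    .runWith({ timeoutSeconds: 30, memory: "1GB" })
    .region('europe-west1')
    .https.onRequest(async (req, res) => {
        try {
            const promotions = await scrape();
            res.json(promotions);
        } catch (error) {
            console.error('Scraping failed', error); // visible in the Cloud Functions logs
            res.status(500).send('Scraping failed');
        }
    });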

Deployment 🚀

This is all the code we need for our function to work! Let’s deploy it using the command line:

firebase deploy --only functions

When we now navigate to https://europe-west1-myAwesomeProject.cloudfunctions.net/scrapePromotions, it takes a couple of seconds (because the scraping function needs to control a headless browser and wait until all content is loaded), but after that we receive our desired output:

[
    {
        "imageUrl": "//contentimages.coop.ch/aktionenimages/images/6.336.920_Anker_Lager_Bier_15x33cl_ZTGWPS_252943_XL_DE.png",
        "oldPrice": "23.90",
        "price": "11.95",
        "title": "Anker Lagerbier, 2 x 15 x 33 cl (100 cl = 1.21)"
    },
    // ...
]
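
For reference, every entry in the array has the same four string fields. If you use TypeScript throughout, a small interface describing this shape makes the scraper's return type explicit (Promotion is a hypothetical name; the fields are taken from the sample above):

// Hypothetical interface matching the JSON sample above.
export interface Promotion {
    imageUrl: string; // protocol-relative URL of the product image
    oldPrice: string; // regular price, as displayed on the site
    price: string;    // promotional price
    title: string;    // product name and packaging information
}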

Storing the promotions in Firestore 💾

Almost done 😍! The only part that's left is persisting (or rather caching) the information in the Firestore database, so that we have a faster way of accessing the data.

You won’t believe how easy that is:

await admin.firestore().doc('shops/mySupermarket').set({
    updatedAt: new Date(),
    promotions
}, { merge: true });

Note that shops is used as the collection name and mySupermarket is the document ID we chose for our supermarket. Firestore is a schemaless, document-oriented database, so we are free to use whatever names we want. We also store an updatedAt field with the current date, so that we always know when the last scrape took place.
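
To verify the write (and as a preview of how we will read the data later), here is a minimal sketch that fetches the document back with the Admin SDK; readPromotions is a hypothetical helper, and the document path is the one we just wrote to:

// Minimal sketch: read the cached promotions back from Firestore.
async function readPromotions() {
    const snapshot = await admin.firestore().doc('shops/mySupermarket').get();
    if (!snapshot.exists) {
        return []; // nothing scraped yet
    }
    const data = snapshot.data()!;
    console.log('Last scraped at', data.updatedAt.toDate()); // Dates come back as Firestore Timestamps
    return data.promotions;
}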

Scheduling the function with Cloud Pub/Sub

Since we already set up a function that stores the scraped data in a database, we might as well set up a job that regularly fetches the newest promotions. We can do that with Cloud Pub/Sub: we simply change our HTTPS function into a Pub/Sub function with a schedule. Note that the schedule parameter accepts not only a cron expression, but even plain English to declare the interval!

export const scrapePromotions = functions
    .runWith({ timeoutSeconds: 30, memory: "1GB" })
    .region('europe-west1')
    .pubsub
    .schedule('every 4 hours').onRun(async context => {        
        // function body...
    });
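
If you prefer cron syntax, the same interval can be expressed like this (scrapePromotionsCron is just an illustrative name; the optional timeZone call pins the schedule to a specific time zone, otherwise it defaults to America/Los_Angeles):

export const scrapePromotionsCron = functions
    .region('europe-west1')
    .pubsub
    .schedule('0 */4 * * *') // at minute 0 of every 4th hour, same effect as 'every 4 hours'
    .timeZone('Europe/Zurich') // optional
    .onRun(async context => {
        // function body...
    });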

Deploy again with firebase deploy --only functions… Aaaand that's a wrap! Our scraping function now runs in the cloud every four hours and keeps the data in Firestore up to date for everyone to access. The last part will be to build a nice frontend that keeps users informed about current promotions. See you soon 👋😃!

Full code of the Firebase cloud function:

import * as admin from 'firebase-admin';
import * as functions from 'firebase-functions';
import { scrapePromotions as scrape } from './scrape';

admin.initializeApp(); // needed to initialize the admin sdk

export const scrapePromotions = functions
    .runWith({ timeoutSeconds: 30, memory: "1GB" }) // ensure we have enough resources
    .region('europe-west1') // select a region that is close to your target audience
    .pubsub
    .schedule('every 4 hours').onRun(async context => {
        const promotions = await scrape();

        // write the data to firestore
        await admin.firestore().doc('shops/mySupermarket').set({
            updatedAt: new Date(),
            promotions
        }, { merge: true });
    });