Web Scraping with Firebase Cloud Functions and Cloud Firestore

Patric Steiner · November 12, 2019

In the previous post we saw how to build a script that scrapes the website of a local supermarket for beer promotions and returns the data as a clean JSON array.

Remember, the goal is to build a system that lets people set up a subscription so they are notified about current promotions and can profit from cheap beer prices.

TODOs:

  • Build a script that scrapes the supermarket's website and returns all promotions as a JSON array (covered in the previous post)
  • Deploy the script, schedule it to run in the cloud and persist the data (topic of this post)
  • Build a frontend that lets users subscribe to promotions (we will look at this in the next post)

Preparing to use Firebase 🔥☁️

Alright, let's get to it. To make our script available to the world, we could build a web server that hosts the script and executes it whenever a user visits the page. But hey, it's 2019, we don't necessarily need a web server for this simple use case. We can just deploy our script as a cloud function, so we don't have to worry about any web server ourselves. I personally like to use Firebase for this, not only because of its simplicity, but also for its generous free tier. In addition, Firebase also offers a database called Firestore that we can use to persist the scraped data - perfect!

After creating a Firebase project, we can set up our environment locally and initialize the project. We need to install the firebase-tools package using npm install -g firebase-tools and then log in to our account with firebase login.

Writing a cloud function λ

Now we are ready to set up the project locally: mkdir myAwesomeProject && cd myAwesomeProject && firebase init functions. We can write the function in either JavaScript or TypeScript - I prefer TypeScript. There are various kinds of functions; for now we will just use a plain HTTPS function.

In index.ts, which was generated by the firebase init command, we need to import firebase-functions and also firebase-admin to interact with the database:

import * as admin from 'firebase-admin';
import * as functions from 'firebase-functions';

admin.initializeApp(); // needed to initialize the admin sdk

Let's also import the scrapePromotions function that we created in the previous post (assuming it's exported from a file called scrape.ts). Since our cloud function will be exported under the same name, we import it under the alias scrape to avoid a naming collision:

import { scrapePromotions as scrape } from './scrape';

export const scrapePromotions = functions
    .runWith({ timeoutSeconds: 30, memory: "1GB" }) // ensure we have enough resources
    .region('europe-west1') // select a region that is close to your target audience
    .https.onRequest(async (req, res) => {
        const promotions = await scrape();

        // TODO persist the promotions

        res.json(promotions); // send the JSON array as the body of the HTTPS response
    });
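
The snippet above assumes the scrape always succeeds. In practice, scraping can fail for many reasons (the site changes its markup, the headless browser times out), so you may want to wrap the call in a try/catch and report the failure. Here is a minimal sketch of such a defensive variant (scrapePromotionsSafe is just an illustrative name, imports as above):

export const scrapePromotionsSafe = functions
    .runWith({ timeoutSeconds: 30, memory: "1GB" })
    .region('europe-west1')
    .https.onRequest(async (req, res) => {
        try {
            const promotions = await scrape();
            res.json(promotions);
        } catch (error) {
            console.error('Scraping failed', error); // visible in the Cloud Functions logs
            res.status(500).send('Scraping failed');
        }
    });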

Deployment 🚀

This is all the code we need for our function to work! Let’s deploy it using the command line:

firebase deploy --only functions

When we now navigate to https://europe-west1-myAwesomeProject.cloudfunctions.net/scrapePromotions, it takes a couple of seconds (because the scraping function needs to control a headless browser and wait until all content is loaded), but after that we receive our desired output:

[
    {
        "imageUrl": "//contentimages.coop.ch/aktionenimages/images/6.336.920_Anker_Lager_Bier_15x33cl_ZTGWPS_252943_XL_DE.png",
        "oldPrice": "23.90",
        "price": "11.95",
        "title": "Anker Lagerbier, 2 x 15 x 33 cl (100 cl = 1.21)"
    },
    // ...
]
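
For reference, every entry in the array has the same four string fields. If you use TypeScript throughout, a small interface describing this shape makes the scraper's return type explicit (Promotion is a hypothetical name; the fields are taken from the sample above):

// Hypothetical interface matching the JSON sample above.
export interface Promotion {
    imageUrl: string; // protocol-relative URL of the product image
    oldPrice: string; // regular price, as displayed on the site
    price: string;    // promotional price
    title: string;    // product name and packaging information
}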

Storing the promotions in Firestore 💾

Almost done 😍! The only part that's left is persisting (or rather caching) the information in the Firestore database, so that we have a faster way of accessing the data.

You won’t believe how easy that is:

await admin.firestore().doc('shops/mySupermarket').set({
    updatedAt: new Date(),
    promotions
}, { merge: true });

Note that shops is used as the collection name and mySupermarket is the document ID we chose for our supermarket. Firestore is a schemaless, document-oriented database, so we are free to use whatever names we want. We also store an updatedAt field with the current date, so that we always know when the last scrape took place.
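
To verify the write (and as a preview of how we will read the data later), here is a minimal sketch that fetches the document back with the Admin SDK; readPromotions is a hypothetical helper, and the document path is the one we just wrote to:

// Minimal sketch: read the cached promotions back from Firestore.
async function readPromotions() {
    const snapshot = await admin.firestore().doc('shops/mySupermarket').get();
    if (!snapshot.exists) {
        return []; // nothing scraped yet
    }
    const data = snapshot.data()!;
    console.log('Last scraped at', data.updatedAt.toDate()); // Dates come back as Firestore Timestamps
    return data.promotions;
}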

Scheduling the function with Cloud Pub/Sub

Since we already set up a function that stores the scraped data in a database, we might as well set up a job that regularly fetches the newest promotions. We can do that with Cloud Pub/Sub: we simply change our HTTPS function into a Pub/Sub function with a schedule. Note that the schedule parameter accepts not only a cron expression, but even plain English to declare the interval!

export const scrapePromotions = functions
    .runWith({ timeoutSeconds: 30, memory: "1GB" })
    .region('europe-west1')
    .pubsub
    .schedule('every 4 hours').onRun(async context => {        
        // function body...
    });
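
If you prefer cron syntax, the same interval can be expressed like this (scrapePromotionsCron is just an illustrative name; the optional timeZone call pins the schedule to a specific time zone, otherwise it defaults to America/Los_Angeles):

export const scrapePromotionsCron = functions
    .region('europe-west1')
    .pubsub
    .schedule('0 */4 * * *') // at minute 0 of every 4th hour, same effect as 'every 4 hours'
    .timeZone('Europe/Zurich') // optional
    .onRun(async context => {
        // function body...
    });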

Deploy again with firebase deploy --only functions… Aaaand that's a wrap! Our scraping function now runs in the cloud every four hours and keeps the data in Firestore up to date for everyone to access. The last part will be to build a nice frontend that keeps users informed about current promotions. See you soon 👋😃!

Full code of the Firebase cloud function:

import * as admin from 'firebase-admin';
import * as functions from 'firebase-functions';
import { scrapePromotions as scrape } from './scrape';

admin.initializeApp(); // needed to initialize the admin sdk

export const scrapePromotions = functions
    .runWith({ timeoutSeconds: 30, memory: "1GB" }) // ensure we have enough resources
    .region('europe-west1') // select a region that is close to your target audience
    .pubsub
    .schedule('every 4 hours').onRun(async context => {
        const promotions = await scrape();

        // write the data to firestore
        await admin.firestore().doc('shops/mySupermarket').set({
            updatedAt: new Date(),
            promotions
        }, { merge: true });
    });