Replacing Firefox live bookmarks

metadata

keywords:
- software
- Firefox
- Python
- lxml
- PyGuymer3
- Requests
- Atom
- RSS
published: 2018-12-11
updated: 2024-02-10

I use Firefox as my web browser and I use its live bookmarks feature to subscribe to RSS (and Atom) feeds across the internet. When Firefox 64 was released, Mozilla announced that they would remove the live bookmarks feature rather than continue to support it (apparently it used its own XML parser).

I realise that RSS (and Atom) feeds are dying a slow death on the internet (because content providers are slowly realising that feeds let users access content whilst side-stepping tracking cookies and advertising), but I feel that Mozilla could have put a little bit more effort into maintaining this feature. Mozilla should not care about how other websites publish their content: by removing the feature from Firefox they will contribute to the downfall of RSS (and Atom) feeds, as fewer people will use them. Anyway, Mozilla did produce an article about how to migrate your data to a different app or a Firefox add-on: “What happened to my live bookmarks?”.

Personally, I decided that this wasn’t the path for me. I was already inconvenienced by the Firefox app on Android/iOS not supporting live bookmarks, so I decided to write a Python script to check my feeds and email me about new articles. This way I would always get notified no matter what device I was using (as long as I could access my email). Additionally, the Python script keeps a record of which articles it has already emailed me about, so that it doesn’t email me about them all again just because it was run again (this means that the Python script can be run as part of a cron job).


#!/usr/bin/env python3

# Use the proper idiom in the main module ...
# NOTE: See https://docs.python.org/3.11/library/multiprocessing.html#the-spawn-and-forkserver-start-methods
if __name__ == "__main__":
    # Import standard modules ...
    import email
    import email.message
    import html
    import json
    import mimetypes
    import shutil
    import subprocess
    import sys
    import time

    # Import special modules ...
    try:
        import lxml
        import lxml.etree
    except ImportError:
        raise Exception("\"lxml\" is not installed; run \"pip install --user lxml\"") from None

    # Import my modules ...
    try:
        import pyguymer3
    except ImportError:
        raise Exception("\"pyguymer3\" is not installed; you need to have the Python module from https://github.com/Guymer/PyGuymer3 located somewhere in your $PYTHONPATH") from None

    # Check that "ssmtp" is installed ...
    if shutil.which("ssmtp") is None:
        raise Exception("\"ssmtp\" is not installed") from None

    # Define settings ...
    path = "/path/to/rss_checker.json"

    # Define function ...
    def construct_email(emailIn, feedTitleIn, postTitleIn, dateIn, linkIn, contentIn, thumbnailIn, sessIn, /):
        # Check inputs ...
        if feedTitleIn is None:
            print("WARNING: \"feedTitleIn\" is None")
            return False
        if postTitleIn is None:
            print("WARNING: \"postTitleIn\" is None")
            return False
        if dateIn is None:
            print("WARNING: \"dateIn\" is None")
            return False

        # Create email ...
        message = email.message.EmailMessage()
        message["To"] = emailIn
        message["Subject"] = f"New post in \"{feedTitleIn.text.strip()}\" feed"
        message["From"] = "you@example.com"

        # Create content ...
        contentOut = f"Post Title: {postTitleIn.text.strip()}\n"
        contentOut += f"Post Date: {dateIn.text.strip()}\n"
        contentOut += f"Post Link: {linkIn}\n"

        # Check if there is an article description ...
        if contentIn is not None:
            # Add the article description ...
            contentOut += f"Post Description:\n{html.unescape(contentIn.text.strip())}\n"

        # Set content ...
        message.set_content(contentOut)

        # Check if there is an article thumbnail ...
        if thumbnailIn is not None:
            # Obtain the thumbnail URL ...
            url = thumbnailIn.attrib.get("url", "ERROR")

            # Check that there is a thumbnail URL ...
            if url != "ERROR":
                # Download the thumbnail ...
                cont = pyguymer3.download_stream(sessIn, url)

                # Check that the download worked (the main loop below treats
                # a return value of False as a failed download) ...
                if cont is not False:
                    # Determine MIME type ...
                    ftype = mimetypes.guess_type(url, strict = True)[0]
                    if ftype is None:
                        ftype = "image/jpeg"

                    # Create short-hands ...
                    maintype, subtype = ftype.split("/")

                    # Add the article thumbnail ...
                    message.add_attachment(
                        cont,
                        maintype = maintype,
                         subtype = subtype,
                        filename = f"thumbnail.{subtype}",
                    )

        # Return the answer ...
        return message

    # Load data file as JSON ...
    with open(path, "rt", encoding = "utf-8") as fObj:
        data = json.load(fObj)

    # Initialize counter and set limit ...
    n = 0                                                                       # [#]
    nlim = 30                                                                   # [#]

    # Start session ...
    with pyguymer3.start_session() as sess:
        # Loop over feeds ...
        for feed in data:
            print(f"Processing \"{feed}\" ...")

            # Download Atom/RSS (as a byte stream) ...
            src = pyguymer3.download_stream(sess, feed)
            if src is False:
                print("WARNING: Failed to download the Atom/RSS feed.")
                continue
            if len(src) == 0:
                print("WARNING: The Atom/RSS feed is empty.")
                continue

            # Parse Atom/RSS as XML with error handling ...
            # NOTE: Atom/RSS feeds have a habit of being illegal XML. For
            #       example:
            #           <title>Cinthie's 'Soul, Strings & Samples' Mini Mix</title>
            #       ... should be:
            #           <title>Cinthie&apos;s &apos;Soul, Strings &amp; Samples&apos; Mini Mix</title>
            #       ... (strictly, only the bare "&" is illegal there; escaping
            #       the apostrophes is optional). Therefore, I no longer use
            #       "xml.etree.ElementTree" but rather "lxml.etree" as it
            #       supports recovery of illegally specified characters.
            root = lxml.etree.fromstring(src, parser = lxml.etree.XMLParser(recover = True))

            # Determine the feed format ...
            if root.tag == "{http://www.w3.org/2005/Atom}feed":
                print("  It is an Atom feed")

                # Loop over all entry tags in the feed ...
                for entry in root.findall("{http://www.w3.org/2005/Atom}entry"):
                    # Find the link to the article ...
                    post = entry.find("{http://www.w3.org/2005/Atom}id").text.strip()
                    if not post.startswith("http"):
                        post = entry.find("{http://www.w3.org/2005/Atom}link").get("href").strip()
                        if not post.startswith("http"):
                            raise Exception("cannot find a post that starts with \"http\"") from None

                    # Correct for common bugs ...
                    post = post.replace("www.FreeBSD.org", "www.freebsd.org")
                    post = post.replace("www.freebsd.org//", "www.freebsd.org/")

                    # Skip this article if it has already been emailed ...
                    if post in data[feed]["posts"]:
                        continue

                    # Construct email ...
                    inp = construct_email(
                        data[feed]["email"],
                        root.find("{http://www.w3.org/2005/Atom}title"),
                        entry.find("{http://www.w3.org/2005/Atom}title"),
                        entry.find("{http://www.w3.org/2005/Atom}updated"),
                        post,
                        entry.find("{http://www.w3.org/2005/Atom}summary"),
                        entry.find("{http://search.yahoo.com/mrss/}thumbnail"),
                        sess,
                    )
                    if inp is False:
                        continue

                    # Send email and increment counter ...
                    subprocess.run(
                        ["ssmtp", data[feed]["email"]],
                           check = True,
                        encoding = "utf-8",
                           input = inp.as_string(),
                         timeout = 60.0,
                    )
                    n += 1                                                      # [#]

                    print(f"  Sent email about \"{post}\"")

                    # Save article so that it is not sent again ...
                    data[feed]["posts"] = sorted(list(set(data[feed]["posts"] + [post])))
                    with open(path, "wt", encoding = "utf-8") as fObj:
                        json.dump(
                            data,
                            fObj,
                            ensure_ascii = False,
                                  indent = 4,
                               sort_keys = True,
                        )

                    # Stop sending emails or wait so that this script does not
                    # spam the server ...
                    if n >= nlim:
                        print("Finishing cleanly; sent too many emails.")
                        sys.exit()
                    time.sleep(2.0)
            elif root.tag == "rss":
                print("  It is an RSS feed")

                # Loop over all item tags in the first channel tag of the feed ...
                for item in root.find("channel").findall("item"):
                    # Find the link to the article ...
                    post = item.find("link").text.strip()
                    if not post.startswith("http"):
                        raise Exception("cannot find a post that starts with \"http\"") from None

                    # Correct for common bugs ...
                    post = post.replace("www.FreeBSD.org", "www.freebsd.org")
                    post = post.replace("www.freebsd.org//", "www.freebsd.org/")

                    # Skip this article if it has already been emailed ...
                    if post in data[feed]["posts"]:
                        continue

                    # Construct email ...
                    inp = construct_email(
                        data[feed]["email"],
                        root.find("channel").find("title"),
                        item.find("title"),
                        item.find("pubDate"),
                        post,
                        item.find("description"),
                        item.find("{http://search.yahoo.com/mrss/}thumbnail"),
                        sess,
                    )
                    if inp is False:
                        continue

                    # Send email and increment counter ...
                    subprocess.run(
                        ["ssmtp", data[feed]["email"]],
                           check = True,
                        encoding = "utf-8",
                           input = inp.as_string(),
                         timeout = 60.0,
                    )
                    n += 1                                                      # [#]

                    print(f"  Sent email about \"{post}\"")

                    # Save article so that it is not sent again ...
                    data[feed]["posts"] = sorted(list(set(data[feed]["posts"] + [post])))
                    with open(path, "wt", encoding = "utf-8") as fObj:
                        json.dump(
                            data,
                            fObj,
                            ensure_ascii = False,
                                  indent = 4,
                               sort_keys = True,
                        )

                    # Stop sending emails or wait so that this script does not
                    # spam the server ...
                    if n >= nlim:
                        print("Finishing cleanly; sent too many emails.")
                        sys.exit()
                    time.sleep(2.0)
            else:
                raise Exception(f"\"{root.tag}\" is an unrecognized feed format") from None

You may also download “rss_checker.py” directly or view “rss_checker.py” on GitHub Gist (you may need to manually checkout the “main” branch).
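
The NOTE in the script about illegal XML can be reproduced with a small sketch. It assumes a made-up feed fragment (“BAD_XML”) and two hypothetical helpers, “strict_parse_ok” and “recovered_title”, neither of which is part of the script; the second one only does anything if lxml is installed. It is meant as an illustration of why a recovering parser helps, not as a definitive test of either library.

```python
import xml.etree.ElementTree as ET

# A made-up feed fragment with an unescaped ampersand (illegal XML).
BAD_XML = b"<rss><channel><title>Soul, Strings & Samples</title></channel></rss>"

def strict_parse_ok(src):
    """Return True if the standard library's strict parser accepts src."""
    try:
        ET.fromstring(src)
    except ET.ParseError:
        return False
    return True

def recovered_title(src):
    """Return the title that lxml recovers, or None if that is not possible."""
    try:
        import lxml.etree
    except ImportError:
        return None
    parser = lxml.etree.XMLParser(recover = True)
    root = lxml.etree.fromstring(src, parser = parser)
    if root is None:
        return None
    title = root.find("channel/title")
    return None if title is None else title.text

# The strict parser rejects the fragment; lxml (if present) still yields a title.
print(strict_parse_ok(BAD_XML))
print(recovered_title(BAD_XML))
```

The exact text that lxml recovers depends on how it repairs the bad character, so the sketch prints it rather than asserting a value.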

You will see that the script has a few neat features, such as:

- it waits between sending emails (so as to avoid spamming the server);
- it limits how many emails it sends in one go (so as to avoid spamming the server);
- it saves a list of which articles have been emailed after every email so that it doesn’t duplicate output even if it crashes; and
- it can handle both RSS and Atom feeds.
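
The first three features boil down to one loop pattern, which can be sketched in isolation. The function “send_new_posts” and its “send” and “save” callbacks are hypothetical names invented for this illustration; the real script calls “ssmtp” and “json.dump” inline, and it exits the whole program when the limit is reached rather than just leaving the loop.

```python
import time

def send_new_posts(posts, seen, send, save, *, limit = 30, pause = 2.0):
    """Send each unseen post and save the state after every send.

    Returns the number of posts that were sent.
    """
    sent = 0
    for post in posts:
        # Skip anything that was emailed on an earlier run ...
        if post in seen:
            continue

        # Send, then persist immediately so that a crash cannot cause the
        # same post to be sent twice ...
        send(post)
        seen.append(post)
        save(seen)
        sent += 1

        # Stop once the cap is hit, otherwise wait before the next send ...
        if sent >= limit:
            break
        time.sleep(pause)
    return sent

# Usage with in-memory stand-ins for the mailer and the JSON file.
seen = ["a"]
outbox = []
snapshots = []
n = send_new_posts(
    ["a", "b", "c", "d"],
    seen,
    outbox.append,
    lambda state: snapshots.append(list(state)),
    limit = 2,
    pause = 0.0,
)
print(n, outbox, seen)
```

Saving after every send costs one file write per email, which seems a fair price for not repeating emails after a crash.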

The JSON file that it uses as a database is shown below.


{
    "http://updating.kojevnikov.com/atom/ports": {
        "email" : "you@example.com",
        "posts" : [
            "http://updating.kojevnikov.com/entry/8e1ed6e2b38be47eab3e4d433073b80477fb9190d77192e6b62e411feb70d790",
            "http://updating.kojevnikov.com/entry/9b3c43c14a569550fef365b746a01f38d6c93d7bf98adfd02514dfa2a04ab3cf",
            "http://updating.kojevnikov.com/entry/75aaf4e5d7e7d88ebaa1786841cfda27f18c1e12517816f8b405ecbc07aa0d23",
            "http://updating.kojevnikov.com/entry/847d0f9a75de85d2207b02963a701ebfa56dbd7bb9bd86c384a5caf94da3f760",
            "http://updating.kojevnikov.com/entry/3c5d0cf2c3b838278980e3b1fbce4ac70171814e897211636ee273c59beb28ef",
            "http://updating.kojevnikov.com/entry/be1f0245edfaf5a53c6c49d889584dd637426d4109e6e19c84f3864212dbb196",
            "http://updating.kojevnikov.com/entry/9092af7b0b3ebe89eb374cef69debb511d6a3f6ff1e93bb6af98183275e1bdcc",
            "http://updating.kojevnikov.com/entry/13ad26498c2d4b2d9a3f94ccb58117f7ddfa65d8a38b950e805f079cc20de3f1",
            "http://updating.kojevnikov.com/entry/abb15a178d2dce63c2abc6995e7ebba1df1f80ba96bdc1cda1281d8fc171682b",
            "http://updating.kojevnikov.com/entry/d29c267b899c7f18381b8393d2a1e0a883fcb6dc4c5982ca60fdb131e8ee659c"
        ]
    },
    "https://what-if.xkcd.com/feed.atom": {
        "email" : "you@example.com",
        "posts" : [
            "https://what-if.xkcd.com/157/",
            "https://what-if.xkcd.com/156/",
            "https://what-if.xkcd.com/155/",
            "https://what-if.xkcd.com/154/",
            "https://what-if.xkcd.com/153/"
        ]
    },
    "https://xkcd.com/atom.xml": {
        "email" : "you@example.com",
        "posts" : [
            "https://xkcd.com/2089/",
            "https://xkcd.com/2088/",
            "https://xkcd.com/2087/",
            "https://xkcd.com/2086/"
        ]
    },
    "https://blog.xkcd.com/feed/atom/": {
        "email" : "you@example.com",
        "posts" : [
            "https://blog.xkcd.com/?p=847",
            "https://blog.xkcd.com/?p=840",
            "https://blog.xkcd.com/?p=823",
            "https://blog.xkcd.com/?p=805",
            "https://blog.xkcd.com/?p=801",
            "https://blog.xkcd.com/?p=797",
            "https://blog.xkcd.com/?p=774",
            "https://blog.xkcd.com/?p=728",
            "https://blog.xkcd.com/?p=768",
            "https://blog.xkcd.com/?p=746"
        ]
    },
    "https://bodhi.fedoraproject.org/rss/updates/?type=security": {
        "email" : "you@example.com",
        "posts" : [
            "https://bodhi.fedoraproject.org/updates/tinc-1.0.35-1.fc28",
            "https://bodhi.fedoraproject.org/updates/vcftools-0.1.16-1.fc28",
            "https://bodhi.fedoraproject.org/updates/vcftools-0.1.16-1.el7",
            "https://bodhi.fedoraproject.org/updates/leptonica-1.77.0-1.fc28%20mingw-leptonica-1.77.0-1.fc28",
            "https://bodhi.fedoraproject.org/updates/leptonica-1.77.0-1.fc29%20mingw-leptonica-1.77.0-1.fc29",
            "https://bodhi.fedoraproject.org/updates/mingw-podofo-0.9.6-5.fc29%20podofo-0.9.6-3.fc29",
            "https://bodhi.fedoraproject.org/updates/wordpress-5.0.2-1.fc29",
            "https://bodhi.fedoraproject.org/updates/wordpress-5.0.2-1.el7",
            "https://bodhi.fedoraproject.org/updates/wordpress-5.0.2-1.el6",
            "https://bodhi.fedoraproject.org/updates/wordpress-5.0.2-1.fc28",
            "https://bodhi.fedoraproject.org/updates/openjpeg2-2.3.0-10.fc29%20mingw-openjpeg2-2.3.0-6.fc29",
            "https://bodhi.fedoraproject.org/updates/openjpeg2-2.3.0-10.fc28%20mingw-openjpeg2-2.3.0-6.fc28",
            "https://bodhi.fedoraproject.org/updates/mingw-poppler-0.67.0-2.fc29",
            "https://bodhi.fedoraproject.org/updates/mingw-poppler-0.62.0-2.fc28",
            "https://bodhi.fedoraproject.org/updates/php-pear-1.10.7-2.fc29",
            "https://bodhi.fedoraproject.org/updates/php-pear-1.10.7-2.fc28",
            "https://bodhi.fedoraproject.org/updates/krb5-1.16.1-22.fc28",
            "https://bodhi.fedoraproject.org/updates/krb5-1.16.1-22.fc29",
            "https://bodhi.fedoraproject.org/updates/terminology-1.3.2-1.fc29",
            "https://bodhi.fedoraproject.org/updates/terminology-1.3.2-1.fc28"
        ]
    },
    "https://fedoramagazine.org/feed/": {
        "email" : "you@example.com",
        "posts" : [
            "https://fedoramagazine.org/best-2018-fedora-system-administrators/",
            "https://fedoramagazine.org/how-to-build-a-netboot-server-part-3/",
            "https://fedoramagazine.org/best-2018-articles-command-line/",
            "https://fedoramagazine.org/4-try-copr-december-2018/",
            "https://fedoramagazine.org/best-2018-articles-desktop-users/",
            "https://fedoramagazine.org/how-to-build-a-netboot-server-part-2/",
            "https://fedoramagazine.org/dash-dock-extenstion/",
            "https://fedoramagazine.org/fedora-classroom-containers-101-podman/",
            "https://fedoramagazine.org/secure-nfs-home-directories-kerberos/",
            "https://fedoramagazine.org/fedora-27-end-of-life/"
        ]
    },
    "https://vuxml.freebsd.org/freebsd/rss.xml": {
        "email" : "you@example.com",
        "posts" : [
            "https://www.vuxml.org/freebsd/70b774a8-05bc-11e9-87ad-001b217b3468.html",
            "https://www.vuxml.org/freebsd/b80f039d-579e-4b82-95ad-b534a709f220.html",
            "https://www.vuxml.org/freebsd/4f8665d0-0465-11e9-b77a-6cc21735f730.html",
            "https://www.vuxml.org/freebsd/fa6a4a69-03d1-11e9-be12-a4badb2f4699.html"
        ]
    },
    "https://www.freebsd.org/news/rss.xml": {
        "email" : "you@example.com",
        "posts" : [
            "https://www.FreeBSD.org/news/newsflash.html#event20181224:01",
            "https://www.FreeBSD.org/news/newsflash.html#event20181211:01",
            "https://www.FreeBSD.org/news/newsflash.html#event20181211:02",
            "https://www.FreeBSD.org/news/newsflash.html#event20181201:01",
            "https://www.FreeBSD.org/news/newsflash.html#event20181125:01",
            "https://www.FreeBSD.org/news/newsflash.html#event20181117:01",
            "https://www.FreeBSD.org/news/newsflash.html#event20181110:01",
            "https://www.FreeBSD.org/news/newsflash.html#event20181103:01",
            "https://www.FreeBSD.org/news/newsflash.html#event20181027:01",
            "https://www.FreeBSD.org/news/newsflash.html#event20181020:01"
        ]
    },
    "https://www.freebsd.org/security/rss.xml": {
        "email" : "you@example.com",
        "posts" : [
            "https://security.FreeBSD.org/advisories/FreeBSD-SA-18:15.bootpd.asc",
            "https://security.FreeBSD.org/advisories/FreeBSD-SA-18:14.bhyve.asc",
            "https://security.FreeBSD.org/advisories/FreeBSD-SA-18:13.nfs.asc",
            "https://security.FreeBSD.org/advisories/FreeBSD-SA-18:12.elf.asc",
            "https://security.FreeBSD.org/advisories/FreeBSD-SA-18:11.hostapd.asc",
            "https://security.FreeBSD.org/advisories/FreeBSD-SA-18:10.ip.asc",
            "https://security.FreeBSD.org/advisories/FreeBSD-SA-18:09.l1tf.asc",
            "https://security.FreeBSD.org/advisories/FreeBSD-SA-18:08.tcp.asc",
            "https://security.FreeBSD.org/advisories/FreeBSD-SA-18:07.lazyfpu.asc",
            "https://security.FreeBSD.org/advisories/FreeBSD-SA-18:06.debugreg.asc"
        ]
    }
}

You may also download “rss_checker.json” directly or view “rss_checker.json” on GitHub Gist (you may need to manually checkout the “main” branch).

You can see that it is a simple dictionary (or associative array) where each key is the URL of the RSS (or Atom) feed and each value is itself a dictionary holding the email address to notify (“email”) and a list of the article URLs that have been emailed to date (“posts”). If you want to add a new RSS (or Atom) feed to the script then you can just define the key with an email address and an empty “posts” list and run the Python script again.
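
Adding a feed by hand is easy enough, but it can also be sketched in a few lines of Python. Going by the entries in the JSON file above, each feed needs an “email” address and a “posts” list. The helper name “add_feed” is hypothetical and not part of “rss_checker.py”; the sketch assumes the file already exists and contains a JSON dictionary.

```python
import json
import os
import tempfile

def add_feed(path, url, address):
    """Register a feed URL in the JSON database if it is not already there."""
    # Load the existing database ...
    with open(path, "rt", encoding = "utf-8") as fObj:
        data = json.load(fObj)

    # Only add the feed if it is new, so that existing history is kept ...
    if url not in data:
        data[url] = {"email" : address, "posts" : []}

    # Save the database using the same options as the script ...
    with open(path, "wt", encoding = "utf-8") as fObj:
        json.dump(data, fObj, ensure_ascii = False, indent = 4, sort_keys = True)
    return data

# Usage: start from an empty database in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "rss_checker.json")
    with open(demo, "wt", encoding = "utf-8") as fObj:
        fObj.write("{}")
    result = add_feed(demo, "https://xkcd.com/atom.xml", "you@example.com")
    print(result)
```

Note that the first run after adding a feed would treat every article in it as new, subject to the script's limit on emails per run.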