🧹 Cleaning Up Job Descriptions in Django: Removing Tags and URLs

On a job listing platform, clean data is key to a consistent, professional user experience. One recurring issue on ours was job descriptions (stored in Job.rewrite) cluttered with unnecessary anchor (<a>) tags and external URLs, often copied directly from third-party sources.

This post walks through the Django management command we used to clean up these descriptions. It also reflects on what went well and what could be improved.

🛠 The Problem

Some job descriptions included:

  • <a> tags linking to external sites (sometimes broken).
  • Raw URLs pasted directly in the content.
  • Malformed or missing data in Job.rewrite.

We needed a simple, repeatable way to sanitize these fields without manually editing thousands of records.

🔄 The Approach

We built a custom Django management command that:

  1. Loops over all jobs with status = 0.
  2. Removes all <a> tags, together with their link text, using BeautifulSoup.
  3. Strips out URLs using a regular expression.
  4. Skips edge cases such as None or empty fields.
  5. Prints the traceback if something goes wrong, then moves on to the next job.

Here’s the command script we used:


import re
import traceback
from bs4 import BeautifulSoup

from django.core.management.base import BaseCommand
from job.models import Job


class Command(BaseCommand):
    help = "Cleans job descriptions by removing <a> tags and URLs."

    def handle(self, *args, **options):
        jobs = Job.objects.filter(status__exact=0)
        for job in jobs:
            try:
                self.clean_job_description(job)
            except Exception as e:
                print(f'Error processing job {job.id} - {job.title}')
                traceback.print_exc()

    def clean_job_description(self, job):
        if not job.rewrite:
            print(f"Skipping job {job.id} - empty rewrite field")
            return

        try:
            text_without_tags = self.remove_tags(job.rewrite)
        except Exception as e:
            print(f"Error in remove_tags() for job {job.id}")
            traceback.print_exc()
            return

        try:
            text_without_urls = self.remove_urls(text_without_tags)
        except Exception as e:
            print(f"Error in remove_urls() for job {job.id}")
            traceback.print_exc()
            return

        job.rewrite = text_without_urls
        job.save()
        print(f"Cleaned job ID {job.id}")

    def remove_tags(self, html):
        if html is None:
            raise ValueError("HTML content is None")
        soup = BeautifulSoup(html, 'html.parser')
        for a_tag in soup.find_all('a'):
            # decompose() deletes the tag together with its contents,
            # so the link text is removed as well.
            a_tag.decompose()
        return str(soup)

    def remove_urls(self, text):
        url_pattern = r'https?://\S+|www\.\S+'
        return re.sub(url_pattern, '', text)

Alternatively, remove_urls() can use a broader pattern that also catches ftp:// links and bare domains:

    def remove_urls(self, text):
        # This pattern matches:
        # - http://, https://, ftp://
        # - www.example.com
        # - example.com or sub.example.co.uk
        # - domains with paths, query strings, fragments, or ports
        # Caveat: the bare-domain branch also matches dotted words that
        # are not URLs, such as "Node.js".
        url_pattern = r"""(?xi)
            \b                                # Word boundary
            (                                 # Capture group
              (?:http|https|ftp)://           # Match http, https, or ftp
              [\w.-]+                         # Domain or IP
              (?:\:\d+)?                      # Optional port
              (?:/[^\s]*)?                    # Optional path
            |
              www\.[\w.-]+(?:/[^\s]*)?        # www.example.com with optional path
            |
              [\w.-]+\.(?:[a-z]{2,})(?:/[^\s]*)? # example.com, sub.example.org/path
            )
        """

        return re.sub(url_pattern, '', text)
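
To get a feel for how the two remove_urls() patterns differ, here is a minimal standalone sketch that runs both on one sample string. The sample text and the names SIMPLE_URL and BROAD_URL are illustrative assumptions, not part of the original command.

```python
import re

# Narrow pattern: explicit http(s) schemes and www. prefixes only.
SIMPLE_URL = re.compile(r"https?://\S+|www\.\S+")

# Broad pattern: also matches ftp:// links and bare domains.
BROAD_URL = re.compile(r"""(?xi)
    \b
    (
      (?:http|https|ftp)://[\w.-]+(?:\:\d+)?(?:/[^\s]*)?   # scheme, host, port, path
    |
      www\.[\w.-]+(?:/[^\s]*)?                             # www.example.com/path
    |
      [\w.-]+\.(?:[a-z]{2,})(?:/[^\s]*)?                   # bare domain, e.g. example.org
    )
""")

# Made-up sample: one full URL, one bare domain, one dotted technology name.
sample = (
    "Apply at https://example.com/jobs or visit example.org. "
    "Experience with Node.js required."
)

simple_result = SIMPLE_URL.sub("", sample)
broad_result = BROAD_URL.sub("", sample)

print(simple_result)
print(broad_result)
```

With this sample, the narrow pattern leaves example.org and Node.js in place, while the broad pattern removes both. Whether that trade-off is acceptable depends on how often the descriptions mention dotted technology names or file names.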

  

✅ What Went Well

  • Separation of logic: Each task (tag removal, URL removal) lives in its own function.
  • Resilience: We added try/except blocks and traceback.print_exc() for detailed error logs.
  • Simplicity: This command is easy to run on demand without extra setup.

🔍 What Could Be Better

  • Logging: Use Django's logging instead of print for production-readiness.
  • Unit Testing: Add tests for remove_tags() and remove_urls().
  • Performance: For large datasets, consider batch processing or asynchronous execution with Celery.
  • Better Filtering: Exclude empty or null descriptions directly in the query:
    Job.objects.filter(status=0).exclude(rewrite__isnull=True).exclude(rewrite__exact='')
  • HTML Tidy-up: Add post-cleanup formatting like whitespace normalization or HTML validation.
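
As a rough sketch of three of the suggestions above (logging, batching, and whitespace tidy-up), the helpers below use only the standard library. The names normalize_whitespace, chunked, and clean_in_batches, and the batch size of 500, are hypothetical. The sketch works on plain strings; inside the command, each batch would instead be a list of Job objects, and Django's bulk_update() could replace the per-row save().

```python
import logging
import re
from itertools import islice

# Hypothetical logger; in a Django project it would be configured
# through the LOGGING setting rather than here.
logger = logging.getLogger(__name__)


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs and drop spaces left before punctuation."""
    text = re.sub(r"[ \t]+", " ", text)          # "at  or" -> "at or"
    text = re.sub(r" +([.,;:!?])", r"\1", text)  # "visit ." -> "visit."
    return text.strip()


def chunked(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def clean_in_batches(texts, clean, size=500):
    """Apply `clean` and whitespace normalization to texts, one batch at a time.

    `clean` is any callable taking and returning a string, for example a
    function that chains tag removal and URL removal.
    """
    for batch in chunked(texts, size):
        cleaned = [normalize_whitespace(clean(text)) for text in batch]
        # In the management command, this is roughly where
        # Job.objects.bulk_update(batch, ["rewrite"]) could go.
        logger.info("Cleaned a batch of %d descriptions", len(cleaned))
        yield cleaned
```

Because these helpers do not touch the database, they can be covered by ordinary unit tests without any fixtures, which also addresses the testing point above.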

📎 Final Thoughts

Management commands in Django are well suited to this kind of maintenance task. If you work with user-generated content, don’t wait for a manual cleanup: automate it, track it, and iterate. A clean database makes everyone’s job easier, from search indexing to frontend rendering.
