In maintaining a job listing platform, data cleanliness is key to providing a consistent and professional user experience.
One recurring issue on our platform involved job descriptions (Job.rewrite) filled with unnecessary anchor (<a>) tags and external URLs, often copied directly from third-party sources.
This post walks through the Django management command we used to clean up these descriptions. It also reflects on what went well and what could be improved.
🛠 The Problem
Some job descriptions included:
- <a> tags linking to external sites (sometimes broken).
- Raw URLs pasted directly in the content.
- Malformed or missing data in Job.rewrite.
We needed a simple, repeatable way to sanitize these fields without manually editing thousands of records.
🔄 The Approach
We built a custom Django management command that:
- Loops over all jobs with status = 0.
- Removes all <a> tags using BeautifulSoup.
- Strips out URLs using a regular expression.
- Handles edge cases like None or empty fields.
- Logs any traceback if something goes wrong.
Here’s the command script we used:
import re
import traceback

from bs4 import BeautifulSoup
from django.core.management.base import BaseCommand

from job.models import Job


class Command(BaseCommand):
    help = "Cleans job descriptions by removing <a> tags and URLs."

    def handle(self, *args, **options):
        jobs = Job.objects.filter(status__exact=0)
        for job in jobs:
            try:
                self.clean_job_description(job)
            except Exception:
                # Keep going so one bad record does not stop the whole run.
                print(f'Error processing job {job.id} - {job.title}')
                traceback.print_exc()

    def clean_job_description(self, job):
        if not job.rewrite:
            print(f"Skipping job {job.id} - empty rewrite field")
            return
        try:
            text_without_tags = self.remove_tags(job.rewrite)
        except Exception:
            print(f"Error in remove_tags() for job {job.id}")
            traceback.print_exc()
            return
        try:
            text_without_urls = self.remove_urls(text_without_tags)
        except Exception:
            print(f"Error in remove_urls() for job {job.id}")
            traceback.print_exc()
            return
        job.rewrite = text_without_urls
        job.save()
        print(f"Cleaned job ID {job.id}")

    def remove_tags(self, html):
        if html is None:
            raise ValueError("HTML content is None")
        soup = BeautifulSoup(html, 'html.parser')
        for a_tag in soup.find_all('a'):
            # decompose() removes each <a> element together with its link text.
            a_tag.decompose()
        return str(soup)

    def remove_urls(self, text):
        # Matches http:// and https:// URLs, plus anything starting with "www."
        url_pattern = r'https?://\S+|www\.\S+'
        return re.sub(url_pattern, '', text)
Alternatively, a broader remove_urls() that also catches ftp:// links and bare domains:
    def remove_urls(self, text):
        # This pattern matches:
        # - http://, https://, ftp://
        # - www.example.com
        # - example.com or sub.example.co.uk
        # - domain with paths, query strings, fragments, or ports
        url_pattern = r"""(?xi)
            \b                                      # Word boundary
            (                                       # Capture group
                (?:http|https|ftp)://               # Match http, https, or ftp
                [\w.-]+                             # Domain or IP
                (?:\:\d+)?                          # Optional port
                (?:/[^\s]*)?                        # Optional path
              |
                www\.[\w.-]+(?:/[^\s]*)?            # www.example.com with optional path
              |
                [\w.-]+\.(?:[a-z]{2,})(?:/[^\s]*)?  # example.com, sub.example.org/path
            )
        """
        return re.sub(url_pattern, '', text)
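To see how the two patterns differ in practice, here is a small sketch that runs both against one sample sentence. The sample text and the names SIMPLE_URL and BROAD_URL are made up for illustration; the patterns themselves are copied from the two remove_urls() variants above.

```python
import re

# Simple pattern from the first remove_urls() variant.
SIMPLE_URL = re.compile(r'https?://\S+|www\.\S+')

# Broader verbose pattern from the second variant.
BROAD_URL = re.compile(r"""(?xi)
    \b
    (
        (?:http|https|ftp)://
        [\w.-]+
        (?:\:\d+)?
        (?:/[^\s]*)?
      |
        www\.[\w.-]+(?:/[^\s]*)?
      |
        [\w.-]+\.(?:[a-z]{2,})(?:/[^\s]*)?
    )
""")

# Hypothetical job description fragment.
sample = "Apply at https://example.com/jobs or careers.example.org. Node.js required."

simple_result = SIMPLE_URL.sub('', sample)
broad_result = BROAD_URL.sub('', sample)

print(repr(simple_result))
print(repr(broad_result))
```

With this sample, the simple pattern leaves the bare domain careers.example.org in place, while the broad pattern removes it but also strips "Node.js", because any word.word token with a suffix of two or more letters looks like a domain. Both leave stray spaces behind. Which trade-off is acceptable depends on the content, so it is worth trying either pattern on a sample of real descriptions before running it against the whole table.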
✅ What Went Well
- Separation of logic: Each task (tag removal, URL removal) lives in its own function.
- Resilience: We added try/except blocks and traceback.print_exc() for detailed error logs.
- Simplicity: This command is easy to run on demand without extra setup.
🔍 What Could Be Better
- Logging: Use Django's logging instead of print for production-readiness.
- Unit Testing: Add tests for remove_tags() and remove_urls().
- Performance: For large datasets, consider batch processing or asynchronous execution with Celery.
- Better Filtering: Exclude empty or null descriptions directly in the query:
  Job.objects.filter(status=0).exclude(rewrite__isnull=True).exclude(rewrite__exact='')
- HTML Tidy-up: Add post-cleanup formatting like whitespace normalization or HTML validation.
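Several of these suggestions can be sketched without Django at all. The snippet below is a minimal illustration, not the code we ran: strip_urls mirrors the simpler remove_urls() pattern as a plain function so it can be tested in isolation, normalize_whitespace is an assumed helper for the tidy-up step, clean_text is a hypothetical wrapper, and the logger name "job.cleanup" is invented for the example.

```python
import logging
import re
import unittest

# Hypothetical logger name; in a Django project the handlers and levels
# would come from the LOGGING setting.
logger = logging.getLogger("job.cleanup")

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')


def strip_urls(text):
    """Remove http(s):// and www. URLs, as the simpler remove_urls() does."""
    return URL_PATTERN.sub('', text)


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs and trim each line (assumed tidy-up step)."""
    lines = (re.sub(r'[ \t]+', ' ', line).strip() for line in text.splitlines())
    return '\n'.join(lines)


def clean_text(text, job_id=None):
    """Clean one description, logging instead of printing."""
    if not text:
        logger.info("Skipping job %s: empty rewrite field", job_id)
        return text
    cleaned = normalize_whitespace(strip_urls(text))
    logger.debug("Cleaned job %s", job_id)
    return cleaned


class CleanTextTests(unittest.TestCase):
    def test_strips_http_and_www_urls(self):
        self.assertEqual(strip_urls("see https://a.example/x now"), "see  now")
        self.assertEqual(strip_urls("visit www.example.com today"), "visit  today")

    def test_normalizes_whitespace_after_url_removal(self):
        self.assertEqual(clean_text("see https://a.example/x now"), "see now")

    def test_empty_input_is_returned_unchanged(self):
        self.assertIsNone(clean_text(None))
        self.assertEqual(clean_text(""), "")
```

Saved as a module, the tests run with python -m unittest. Keeping the cleaning functions free of model access is what makes them this easy to test; the management command would then only need to load each job, call the function, and save the result.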
📎 Final Thoughts
Management commands in Django are perfect for these types of maintenance tasks. If you work with user-generated content, don’t wait for a manual cleanup—automate it, track it, and iterate. A clean database makes everyone’s job easier—from search indexing to frontend rendering.