Document comparison is a critical task in many business workflows, from legal contract reviews to version control in content management. Manually comparing documents is time-consuming, error-prone, and inefficient. Fortunately, Python offers libraries that can automate this process, saving hours of manual review and making the results reproducible.
In this comprehensive guide, we'll explore how to automate document comparison in Python using practical code examples that you can implement immediately.
Why Automate Document Comparison?
Before diving into the code, let's understand why automation matters. Document comparison automation helps organizations track changes between file versions, identify discrepancies in contracts, maintain compliance documentation, and streamline collaboration workflows. Whether you're working with text files, Word documents, or PDFs, Python provides the tools to handle these tasks efficiently.
Getting Started with Python Document Comparison
The first step in automating document comparison is choosing the right libraries. Python's ecosystem offers several excellent options depending on your document format and comparison requirements.
Comparing Plain Text Files
For basic text file comparison, Python's built-in difflib module provides a robust solution without requiring external dependencies. Here's a practical example:
python
import difflib

def compare_text_files(file1_path, file2_path):
    # Read the contents of both files
    with open(file1_path, 'r', encoding='utf-8') as f1:
        file1_lines = f1.readlines()
    with open(file2_path, 'r', encoding='utf-8') as f2:
        file2_lines = f2.readlines()
    # Create a Differ object
    differ = difflib.Differ()
    # Compare the files
    diff = list(differ.compare(file1_lines, file2_lines))
    # Display the differences (strip the trailing newline kept by readlines)
    for line in diff:
        if line.startswith('+ '):
            print(f"Added: {line[2:].rstrip(chr(10))}")
        elif line.startswith('- '):
            print(f"Removed: {line[2:].rstrip(chr(10))}")
        elif line.startswith('? '):
            # '? ' lines are Differ's guide lines, not document content:
            # they mark which characters changed in the line above
            print(f"Hint: {line[2:].rstrip(chr(10))}")
    return diff

# Usage
compare_text_files('document_v1.txt', 'document_v2.txt')
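This function reads two text files and prints line-by-line differences. Lines prefixed with '+ ' appear only in the second file and lines prefixed with '- ' appear only in the first. Lines that Differ prefixes with '? ' are not document content; they are guide lines pointing at the characters that differ within the preceding line. The difflib module can also generate unified diffs similar to those produced by version control systems.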
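The unified diff format mentioned above can be produced with difflib.unified_diff. The following is a minimal sketch rather than a definitive implementation: the function name unified_diff_text, its default file names, and the sample strings are illustrative assumptions, and it takes strings instead of file paths so it can be tried without any files on disk.

```python
import difflib


def unified_diff_text(old_text, new_text, old_name='document_v1.txt',
                      new_name='document_v2.txt', context_lines=3):
    """Return a unified diff of two strings as a single string.

    The function name and defaults are illustrative, not part of difflib.
    """
    # keepends=True preserves line endings so the diff lines join cleanly
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=old_name,
        tofile=new_name,
        n=context_lines,  # unchanged lines shown around each change
    )
    return ''.join(diff)


if __name__ == '__main__':
    old = "Payment is due in 30 days.\nLate fees apply.\n"
    new = "Payment is due in 45 days.\nLate fees apply.\n"
    print(unified_diff_text(old, new))
```

When the two inputs are identical, the generator yields nothing and the function returns an empty string, which makes it easy to test whether anything changed at all.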
Advanced Text Comparison with HTML Output
For more sophisticated comparisons with visual output, you can generate HTML diff reports:
python
import difflib

def generate_html_diff(file1_path, file2_path, output_path='diff_report.html'):
    # Read file contents
    with open(file1_path, 'r', encoding='utf-8') as f1:
        file1_content = f1.readlines()
    with open(file2_path, 'r', encoding='utf-8') as f2:
        file2_content = f2.readlines()
    # Generate HTML diff
    html_diff = difflib.HtmlDiff()
    html_output = html_diff.make_file(
        file1_content,
        file2_content,
        fromdesc='Original Document',
        todesc='Modified Document'
    )
    # Save to HTML file
    with open(output_path, 'w', encoding='utf-8') as output:
        output.write(html_output)
    print(f"HTML diff report generated: {output_path}")

# Usage
generate_html_diff('contract_v1.txt', 'contract_v2.txt')
This creates a color-coded HTML report that makes it easy to visualize changes, perfect for sharing comparison results with team members or clients.
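For long documents, a report that lists every line can be hard to scan. HtmlDiff.make_file accepts context and numlines arguments that restrict the report to changed lines plus a few surrounding ones. The sketch below illustrates this under some assumptions: the function name html_diff_changes_only, the wrap width of 80, and the generated sample clauses are all made up for the example.

```python
import difflib


def html_diff_changes_only(old_lines, new_lines, context_lines=2):
    """Build an HTML diff that shows only changed lines plus nearby context.

    Function and parameter names are illustrative assumptions.
    """
    html_diff = difflib.HtmlDiff(wrapcolumn=80)  # wrap long lines at 80 chars
    return html_diff.make_file(
        old_lines,
        new_lines,
        fromdesc='Original Document',
        todesc='Modified Document',
        context=True,            # hide unchanged regions
        numlines=context_lines,  # unchanged lines kept around each change
    )


if __name__ == '__main__':
    # Made-up sample: 50 clauses with one amended in the new version
    old = [f"Clause {i}\n" for i in range(1, 51)]
    new = list(old)
    new[24] = "Clause 25 (amended)\n"
    report = html_diff_changes_only(old, new)
    print(f"Report size: {len(report)} characters")
```

The returned string is a complete HTML page, so it can be written to a file in the same way as in generate_html_diff.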
Comparing Word Documents
For Word document comparison, the python-docx library enables you to extract and compare content from DOCX files:
python
# Install with: pip install python-docx
import difflib

from docx import Document

def compare_word_documents(doc1_path, doc2_path):
    # Load both documents
    doc1 = Document(doc1_path)
    doc2 = Document(doc2_path)
    # Extract text from paragraphs
    doc1_text = [para.text for para in doc1.paragraphs]
    doc2_text = [para.text for para in doc2.paragraphs]
    # Use difflib for comparison
    differ = difflib.Differ()
    diff = list(differ.compare(doc1_text, doc2_text))
    # Count changes
    additions = sum(1 for line in diff if line.startswith('+ '))
    deletions = sum(1 for line in diff if line.startswith('- '))
    print(f"Total additions: {additions}")
    print(f"Total deletions: {deletions}")
    # Display differences
    for line in diff:
        if line.startswith('+ ') or line.startswith('- '):
            print(line)
    return diff

# Usage
compare_word_documents('report_v1.docx', 'report_v2.docx')
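Paragraph lists extracted from Word files often contain empty paragraphs and incidental spacing differences, which show up as noise in the diff. One possible refinement, sketched here as an assumption rather than as part of python-docx, is to normalize the paragraph text before comparing. The helper names normalize_paragraphs and diff_paragraphs are hypothetical, and they operate on plain lists of strings such as doc1_text and doc2_text, so the example runs without any DOCX files.

```python
import difflib


def normalize_paragraphs(paragraphs):
    """Collapse runs of whitespace and drop empty paragraphs."""
    cleaned = []
    for text in paragraphs:
        collapsed = ' '.join(text.split())  # also trims leading/trailing spaces
        if collapsed:
            cleaned.append(collapsed)
    return cleaned


def diff_paragraphs(old_paragraphs, new_paragraphs):
    """Return only the added and removed paragraphs after normalization."""
    differ = difflib.Differ()
    diff = differ.compare(
        normalize_paragraphs(old_paragraphs),
        normalize_paragraphs(new_paragraphs),
    )
    # Keep '+ ' and '- ' lines; skip unchanged lines and '? ' guide lines
    return [line for line in diff if line.startswith(('+ ', '- '))]


if __name__ == '__main__':
    # Made-up paragraph lists standing in for text extracted from two files
    old = ["Quarterly  Report", "", "Revenue grew 5%."]
    new = ["Quarterly Report", "Revenue grew 7%.", "   "]
    for line in diff_paragraphs(old, new):
        print(line)
```

Whether whitespace differences should be ignored depends on the use case; for a legal review they may matter, in which case the normalization step can be dropped.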
Calculating Similarity Ratios
Sometimes you need to know how similar two documents are rather than seeing every change. The SequenceMatcher class provides similarity metrics:
python
import difflib

def calculate_document_similarity(file1_path, file2_path):
    # Read file contents
    with open(file1_path, 'r', encoding='utf-8') as f1:
        content1 = f1.read()
    with open(file2_path, 'r', encoding='utf-8') as f2:
        content2 = f2.read()
    # Calculate similarity ratio
    sequence_matcher = difflib.SequenceMatcher(None, content1, content2)
    similarity_ratio = sequence_matcher.ratio()
    print(f"Similarity: {similarity_ratio * 100:.2f}%")
    print(f"Difference: {(1 - similarity_ratio) * 100:.2f}%")
    return similarity_ratio

# Usage
calculate_document_similarity('policy_v1.txt', 'policy_v2.txt')
This function returns a value between 0 and 1, where 1 means the documents are identical. This is useful for quick assessments or filtering documents that require detailed review.
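The ratio is computed as 2 * M / T, where M is the number of matching elements and T is the total number of elements in both sequences. Comparing whole files character by character can be slow for large documents, so one alternative is to compare lists of lines instead. The sketch below shows that variant together with a simple review filter; the names line_similarity and needs_review, the 0.9 threshold, and the sample text are illustrative assumptions.

```python
import difflib


def line_similarity(text1, text2):
    """Similarity ratio computed over lines instead of characters."""
    lines1 = text1.splitlines()
    lines2 = text2.splitlines()
    return difflib.SequenceMatcher(None, lines1, lines2).ratio()


def needs_review(text1, text2, threshold=0.9):
    """Flag a pair whose line-level similarity falls below the threshold.

    The 0.9 default is an arbitrary illustrative value.
    """
    return line_similarity(text1, text2) < threshold


if __name__ == '__main__':
    v1 = "Scope\nTerm: 12 months\nFees\nTermination\n"
    v2 = "Scope\nTerm: 24 months\nFees\nTermination\n"
    # 3 of the 4 lines match in each version: 2 * 3 / (4 + 4) = 0.75
    print(line_similarity(v1, v2))
    print(needs_review(v1, v2))
```

Line-level comparison is coarser than character-level comparison: a single changed character makes the whole line count as a mismatch, so the two approaches can give noticeably different ratios for the same pair of files.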
Automating Batch Comparisons
For processing multiple document pairs, create a batch comparison function:
python
import os
import difflib

def batch_compare_documents(directory, file_pairs):
    results = []
    for file1, file2 in file_pairs:
        file1_path = os.path.join(directory, file1)
        file2_path = os.path.join(directory, file2)
        with open(file1_path, 'r', encoding='utf-8') as f1:
            content1 = f1.read()
        with open(file2_path, 'r', encoding='utf-8') as f2:
            content2 = f2.read()
        similarity = difflib.SequenceMatcher(None, content1, content2).ratio()
        results.append({
            'pair': (file1, file2),
            'similarity': similarity
        })
    return results

# Usage
pairs = [('doc1_v1.txt', 'doc1_v2.txt'), ('doc2_v1.txt', 'doc2_v2.txt')]
comparison_results = batch_compare_documents('/path/to/docs', pairs)
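The list returned by batch_compare_documents can then be used to decide which pairs deserve a closer look. The following sketch sorts the results and keeps only those below a cutoff; the function name flag_low_similarity, the 0.95 threshold, and the sample similarity values are assumptions made up for illustration.

```python
def flag_low_similarity(results, threshold=0.95):
    """Return results below the threshold, least similar first.

    Expects dicts with 'pair' and 'similarity' keys, as produced by
    batch_compare_documents. The 0.95 threshold is an illustrative value.
    """
    flagged = [r for r in results if r['similarity'] < threshold]
    return sorted(flagged, key=lambda r: r['similarity'])


if __name__ == '__main__':
    # Made-up results in the same shape batch_compare_documents returns
    sample_results = [
        {'pair': ('doc1_v1.txt', 'doc1_v2.txt'), 'similarity': 0.99},
        {'pair': ('doc2_v1.txt', 'doc2_v2.txt'), 'similarity': 0.62},
        {'pair': ('doc3_v1.txt', 'doc3_v2.txt'), 'similarity': 0.91},
    ]
    for result in flag_low_similarity(sample_results):
        old_name, new_name = result['pair']
        print(f"{old_name} -> {new_name}: {result['similarity']:.0%}")
```

Pairs that pass the filter could then be handed to one of the detailed comparison functions, such as generate_html_diff, for a full report.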
Conclusion
Automating document comparison in Python is straightforward and highly customizable. Whether you're comparing simple text files or Word documents, or generating visual diff reports, Python's libraries provide the tools you need. Start with the built-in difflib module for basic comparisons, then expand to specialized libraries like python-docx for specific document formats. By implementing these automation solutions, you'll save time, reduce errors, and improve your document management workflow.