Content Similarity Checker

v1.0 documentation

Detects duplicate or near-duplicate content across pages, using MinHash/LSH for fast approximate matching and cosine similarity for more precise scoring.
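To make the two measures concrete, here is a minimal sketch using only the Python standard library. It is not the tool's actual implementation: the function names, the 3-word shingle size, and the 64 hash functions are all assumptions chosen for illustration, and the LSH bucketing step is omitted for brevity.

```python
import hashlib
import math
import re
from collections import Counter


def shingles(text, k=3):
    """Return the set of k-word shingles in text (k=3 is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash of the set under num_perm salted hash functions."""
    if not items:
        return []
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, which approximates Jaccard similarity."""
    if not sig_a or len(sig_a) != len(sig_b):
        return 0.0
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    tf_a = Counter(re.findall(r"\w+", text_a.lower()))
    tf_b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm = (math.sqrt(sum(v * v for v in tf_a.values()))
            * math.sqrt(sum(v * v for v in tf_b.values())))
    return dot / norm if norm else 0.0


# Hypothetical page texts, for illustration only.
page_a = "Our guide explains how to bake sourdough bread at home step by step"
page_b = "This guide explains how to bake sourdough bread at home in simple steps"

fast = estimate_jaccard(minhash_signature(shingles(page_a)),
                        minhash_signature(shingles(page_b)))
precise = cosine_similarity(page_a, page_b)
print(f"minhash estimate: {fast:.2f}, cosine: {precise:.2f}")
```

The MinHash estimate only compares short fixed-length signatures, so it stays cheap across many pages. The cosine score works on full term counts, so it costs more per pair but reflects word frequencies as well as overlap.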

URL input | File input | XLSX export
content_similarity_checker.py | 120 lines | 5 params | Python 3.8+

Quick start
1. Install

    pip install -r requirements.txt

2. Run

    python content_similarity_checker.py --urls https://a.com https://b.com https://c.com --threshold 0.7

    python content_similarity_checker.py --files *.html --output duplicates.xlsx

3. Export

Add --output report.xlsx to save results as a spreadsheet.

Parameters
Flag         Description
--urls       URLs to compare. Multiple values allowed.
--files      Local files to compare. Multiple values allowed.
--threshold  Similarity threshold (decimal, e.g. 0.7).
--method     Comparison method. Options: minhash, cosine, both.
--output     Save results as XLSX.

To list all options:

    python content_similarity_checker.py --help
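For readers who want to see how the five flags fit together, the following is a hedged sketch of an equivalent argparse setup. It is not taken from the script: the `build_parser` helper, the help strings, and the defaults for `--threshold` and `--method` are assumptions, and the real script may differ.

```python
import argparse


def build_parser():
    """Build a parser mirroring the five documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        prog="content_similarity_checker.py",
        description="Detect duplicate or near-duplicate content across pages.",
    )
    parser.add_argument("--urls", nargs="+", default=[],
                        help="URLs to compare; multiple values allowed")
    parser.add_argument("--files", nargs="+", default=[],
                        help="Local files to compare; multiple values allowed")
    # 0.7 is the value used in the quick start; using it as the default is an assumption.
    parser.add_argument("--threshold", type=float, default=0.7,
                        help="Similarity threshold as a decimal")
    # Defaulting to "both" is an assumption; the documented options are the three choices.
    parser.add_argument("--method", choices=["minhash", "cosine", "both"],
                        default="both", help="Comparison method")
    parser.add_argument("--output", default=None,
                        help="Path of the XLSX file to write")
    return parser


# Parse the first quick-start command line as an example.
args = build_parser().parse_args(
    ["--urls", "https://a.com", "https://b.com", "https://c.com",
     "--threshold", "0.7"]
)
print(args.urls, args.threshold, args.method)
```

Using `nargs="+"` is what lets `--urls` and `--files` accept several values in one invocation, including the file list a shell produces when it expands `*.html`.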
Use cases

Content audit
Analyze existing content to find what needs updating, merging, or removing. Export results and create a content maintenance plan.

Pre-publish check
Run before publishing new content to ensure it meets quality thresholds. Fix issues before they go live.

Competitive analysis
Compare your content against top-ranking competitors. Identify gaps and opportunities to improve.

Dependencies

Requires: beautifulsoup4, numpy, pandas, requests, scikit-learn. All included in requirements.txt.

AAIO Inc — aaioinc.com/tools/content_similarity_checker/