CYBERSIREN — URL ANALYSIS ML MODEL SPECIFICATION
Document: ML-SPEC-v1.1
Date: 2025-07-15
Scope: SVC-03 URL Analysis — Low-Latency Phishing Detection Model
Classification: Internal / Graduation Project
Context: This document specifies the machine learning model deployed inside SVC-03 (URL Analysis Service) as defined in ARCH-SPEC-v2.1. The model runs as a Python subprocess invoked by the Go service. It receives a raw URL string, extracts 28 structural features, and returns a phishing probability. No network calls or page downloads occur at inference time. The model file is 0.8 MB (joblib-compressed LightGBM).
Champion Model: LightGBM | Test Accuracy: 99.824% | Test MCC: 0.99645 | Test AUC-ROC: 0.99929 | False Positive Rate: 0.099% | False Negative Rate: 0.240% | Single-Inference Latency: 1,582 μs
1. Literature Foundation
Five peer-reviewed papers informed feature selection, model choice, and evaluation methodology. Every design decision traces to one or more of these sources.
| Ref | Citation | Year | Dataset | Contribution to This Work |
| [1] | K.L. Chiew et al. "A new hybrid ensemble feature selection framework for ML-based phishing detection." Information Sciences 484, 153–166. | 2019 | 10,000 URLs, 48 features | HEFS framework. CDF-g algorithm for feature cut-off. Identified 10 baseline features from 48 candidates. RF consensus best classifier. Features NumDash, NumNumericChars, NumSensitiveWords selected as HEFS baseline. |
| [2] | M.A. Tamal et al. "Dataset of suspicious phishing URL detection." Frontiers in Computer Science 6:1308634. | 2024 | 247,950 URLs, 42 features | 10 novel features: Shannon entropy of URL/domain, repeated digits, subdomain statistics. URL-only design. IQR outlier removal methodology. |
| [3] | R.M. Mohammad et al. "An Assessment of Features Related to Phishing Websites using an Automated Technique." ICITST-2012, IEEE. | 2012 | 2,500 phishing URLs | Rule-based extraction thresholds. URL length ≥54 = suspicious (48.8%). HTTPS absence = 92.8%. IP address in URL = 22.8%. Feature frequency weights. |
| [4] | R.S. Potpelwar et al. "LegitPhish: A large-scale annotated dataset for URL-based phishing detection." Data in Brief 63, 111972. | 2025 | 101,219 URLs, 17 features | Manually verified dataset. Source for DS3 (LegitPhish). Features: tld_popularity, suspicious_file_extension, percentage_numeric_chars. |
| [5] | A. Prasad, S. Chandra. "PhiUSIIL: A diverse security profile empowered phishing URL detection framework." Computers & Security 136, 103545. | 2024 | 235,795 URLs, 54 features | Source for DS2 (PhiUSIIL). Novel derived features: URLCharProb (Eq. 1), CharContinuationRate, TLDLegitimateProb. LightGBM 99.99% in Table 3. MCC as primary metric (Eq. 7). Incremental learning architecture. Security profiles. |
2. Data Pipeline
2.1 Source Datasets
Two datasets selected. Criteria: raw URL strings present, continuous feature encoding, no pre-discretization.
| ID | Name | Rows | Cols | Source Paper | URL Column | Label Column | Native Encoding |
| DS2 | PhiUSIIL Phishing URL Dataset | 235,795 | 56 | [5] Prasad & Chandra (2024) | URL | label | 1=legit, 0=phish |
| DS3 | LegitPhish | 101,219 | 18 | [4] Potpelwar et al. (2025) | URL | ClassLabel | 1=legit, 0=phish |
2.2 Excluded Datasets (with reason)
| ID | Name | Rows | Reason for Exclusion |
| DS1 | Chiew (2019) Feature Evaluation | 10,000 | No raw URL column. 6 ternary-encoded features. Incompatible with continuous pipeline. |
| DS4 | Tamal (2024) Phishing Detection | 247,950 | No raw URL column. 118,172 exact duplicate rows (47.6%). |
| DS6 | UCI Phishing Websites (ARFF) | 11,055 | Ternary (-1/0/1) encoding. 5,206 duplicate rows. Pre-discretized — original values lost. |
| DS7 | UCI Phishing Websites (legacy) | 2,456 | Same ternary encoding as DS6. 740 duplicate rows. |
2.3 Label Unification
Both source datasets encode legitimate=1, phishing=0. CyberSiren unifies to standard convention: 0=legitimate, 1=phishing (positive class). Applied via map({1: 0, 0: 1}). This aligns with [5] Algorithm 3 where "prediction 1 means legitimate" and [4] Table 1 where "Phishing=0, Legitimate=1."
2.4 Quality Steps
| Step | Action | Rows Affected |
| 1 | Drop rows with null URL or null label | 1 |
| 2 | Drop URLs <10 characters or containing no dots | 1 |
| 3 | Strip whitespace from URL strings | — |
| 4 | Case-insensitive deduplication. On conflict: keep PhiUSIIL (larger, more recent) | 37,706 |
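The label unification of 2.3 and the four quality steps above can be sketched in plain Python. This is a minimal illustration, not the project's pipeline code: the function name clean_and_merge and the (url, label) tuple layout are assumptions, and the real pipeline presumably uses pandas. Steps run in the table's order, so the length check sees the unstripped string.

```python
def clean_and_merge(phiusiil, legitphish):
    """Apply quality steps 1-4 to two lists of (url, label) pairs.

    Labels arrive in the native encoding (1=legit, 0=phish) and leave
    in the unified encoding (0=legit, 1=phish).
    Hypothetical helper; the tuple layout is an assumption.
    """
    flip = {1: 0, 0: 1}
    seen = {}
    # PhiUSIIL is processed first, so it wins any deduplication conflict.
    for source, rows in (("PhiUSIIL", phiusiil), ("LegitPhish", legitphish)):
        for url, label in rows:
            if url is None or label is None:        # step 1: null URL or label
                continue
            if len(url) < 10 or "." not in url:     # step 2: too short / no dots
                continue
            url = url.strip()                       # step 3: strip whitespace
            key = url.lower()                       # step 4: case-insensitive key
            if key not in seen:
                seen[key] = (url, flip[label], source)
    return list(seen.values())
```

Because Python dicts preserve insertion order, the first-seen record for each lowercase key is the one kept, which gives PhiUSIIL priority without an explicit conflict check.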
2.5 Final Dataset — PullDD
| Metric | Value |
| Total rows | 299,306 |
| Legitimate (label=0) | 135,295 (45.2%) |
| Phishing (label=1) | 164,011 (54.8%) |
| From PhiUSIIL | 235,370 |
| From LegitPhish | 63,936 |
| Metric | Value |
| Features extracted | 30 |
| Features used in model | 28 (2 pruned — zero importance) |
| Null cells | 0 |
| Infinite values | 0 |
| Output file | cybersiren_lowlatency_dataset.csv (43.8 MB) |
3. Lookup Tables
Three pre-computed tables are loaded at startup. CHAR_PROB_TABLE and TLD_LEGIT_PROB are built from the Cisco Umbrella top-1M domains list; SENSITIVE_WORDS extends the keyword list of [1].
| Table | Source | Paper | Build Method | Size |
| CHAR_PROB_TABLE | Cisco Umbrella top-1M | [5] PhiUSIIL §3.1.4, Eq. 1 | For each char a–z and 0–9: count(char) / total_alphanumeric_chars. Corpus: 21,593,440 chars. Chars a,c,e,o,r,t higher in legitimate. Chars b,f,q,v,w,x,y,z and all digits higher in phishing. | 36 entries |
| TLD_LEGIT_PROB | Cisco Umbrella top-1M | [5] PhiUSIIL §3.1.4 | For each TLD: count(TLD) / total_TLDs. Parsed via tldextract. Top: com=0.617, net=0.141, org=0.026, io=0.023, co.uk=0.010. | 1,319 entries |
| SENSITIVE_WORDS | [1] Chiew (2019) feature #25 | [1] §Appendix, Feature 25 | Extended from Chiew base list: secure, account, webscr, login, ebayisapi, signin, banking, confirm, update, verify, password, suspend, paypal, authenticate, wallet, credential. | 16 words |
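The CHAR_PROB_TABLE build method (count(char) / total_alphanumeric_chars) can be illustrated as follows. This is a sketch under assumptions: the function name build_char_prob_table and the in-memory list of domains are hypothetical, and the real build reads the Cisco Umbrella file.

```python
from collections import Counter
import string

# The 36 table keys: a-z followed by 0-9.
ALPHANUMERIC = string.ascii_lowercase + string.digits

def build_char_prob_table(domains):
    """Return {char: relative frequency} over all alphanumeric characters.

    Hypothetical helper; `domains` is any iterable of domain strings.
    """
    counts = Counter()
    for domain in domains:
        counts.update(ch for ch in domain.lower() if ch in ALPHANUMERIC)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("corpus contains no alphanumeric characters")
    # Characters never seen in the corpus get probability 0.0.
    return {ch: counts[ch] / total for ch in ALPHANUMERIC}
```

The resulting probabilities sum to 1 over the 36 keys, so a per-URL average of table values (feature F13) stays on a comparable scale regardless of URL length.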
4. Feature Specification — 30 Defined / 28 Active
Extraction: All features extracted from the raw URL string at inference time. No network calls. No page downloads. No third-party API lookups. Feature extraction is vectorized using pandas string operations for batch processing (~60,000 URLs/sec). Single-URL extraction uses per-row functions.
Pruned features: has_ip_address (F04) and double_slash_in_path (F24) contributed 0 splits in the champion LightGBM model and have been removed from the active feature set. Both are defined below for completeness but are not passed to the model at inference time.
4.1 Tier 1 — Strongest Features (5/5 or 4/5 paper support, HEFS-validated)
| # | Feature | Type | Extraction | Papers | Evidence |
| F01 | url_length | int | len(url) | 5/5 | [3]: threshold ≥54, 48.8% of phishing. [1] feature #4. [2] F1. [4]. [5]. |
| F02 | num_dots | int | url.count('.') | 5/5 | [1] feature #1. [2] F2. [3] subdomain rule. [4] dot_count. [5]. |
| F03 | num_subdomains | int | Count parts in subdomain split by '.' | 5/5 | [3]: 3 dots in domain = suspicious, 44.4%. [2] F24. [5] NoOfSubDomain. [1] #2. |
| F04 | has_ip_address | bin | PRUNED — Anchored regex on parsed hostname: ^(\d{1,3}\.){3}\d{1,3}$ or hex IP. Not on full URL string. | 4/5 | [3]: 22.8%. [1] #17. [5] IsDomainIP. Removed: 0 splits in LightGBM. |
| F05 | num_hyphens_url | int | url.count('-') | 5/5 | [1]: HEFS baseline feature. [3]: 26.4%. [2] F6. |
| F06 | num_hyphens_hostname | int | hostname.count('-') | 5/5 | [1] feature #6 NumDashInHostname. |
| F07 | https_flag | bin | scheme == 'https' | 4/5 | [3]: 92.8% phishing lacks HTTPS. [5] IsHTTPS. |
| F08 | entropy_url | float | Shannon entropy: E = -Σ(p_i × log₂(p_i)) where p_i = freq(char)/len(url) | 3/5 | [2] F40. [4] url_entropy. [5] via URLCharProb. |
| F09 | num_numeric_chars | int | Count digits 0–9 in URL | 4/5 | [1]: HEFS baseline feature. [2] F4. [5]. |
| F10 | num_sensitive_words | int | Σ url.lower().count(w) for each word in 16-word list. Counts total occurrences, not just presence. | 1/5 | [1]: HEFS baseline feature. Selected by CDF-g + ensemble from 48 candidates. |
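Two of the Tier 1 extractions, entropy_url (F08) and num_sensitive_words (F10), can be sketched directly from the formulas in the table. The function names are illustrative assumptions; the word list is the 16-word list from Section 3.

```python
import math
from collections import Counter

SENSITIVE_WORDS = [
    "secure", "account", "webscr", "login", "ebayisapi", "signin",
    "banking", "confirm", "update", "verify", "password", "suspend",
    "paypal", "authenticate", "wallet", "credential",
]

def shannon_entropy(text):
    """E = -sum(p_i * log2(p_i)) with p_i = freq(char) / len(text)."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())

def num_sensitive_words(url):
    """Total occurrences of each listed word, not just presence."""
    lowered = url.lower()
    return sum(lowered.count(word) for word in SENSITIVE_WORDS)
```

The same shannon_entropy function applied to the domain string alone would give entropy_domain (F16).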
4.2 Tier 2 — Strong Features (2–3 papers, novel derived features)
| # | Feature | Type | Extraction | Papers | Evidence |
| F11 | hostname_length | int | len(parsed_hostname) | 3/5 | [1] #21. [4] domain_name_length. |
| F12 | path_length | int | len(parsed_path) | 3/5 | [1] #22. [2] F36. [4]. |
| F13 | url_char_prob | float | Σ prob_table[char_i] / n for alphanumeric chars. Prob table is CHAR_PROB_TABLE (Section 3), built from the Cisco Umbrella top-1M list. | 1/5 | [5] Eq. 1. Rank #1 in feature importance (2,697 splits, 15.5% of total). |
| F14 | char_continuation_rate | float | (max_alpha_seq + max_digit_seq + max_special_seq) / len(url) | 1/5 | [5] §3.1.4. Lower rate = more randomized = suspicious. |
| F15 | tld_legit_prob | float | Lookup TLD in frequency table built from top-1M sites. | 2/5 | [5] §3.1.4. [4] tld_popularity. Rank #4 in feature importance (2,081 splits). |
| F16 | entropy_domain | float | Shannon entropy on domain string only. | 2/5 | [2] F41. Novel feature. |
| F17 | num_query_params | int | len(query.split('&')) if query else 0. Counts '&'-separated components, not unique keys. | 4/5 | [1] #11, #12. [2] F9, F10. Raw count catches ?id=1&id=2&id=3 as 3, not 1. |
| F18 | num_special_chars | int | Count of !@#$%^&*~`|\<>{} in URL. | 2/5 | [2] F5, F20. [5] NoOfObfuscatedChar. |
| F19 | at_symbol_present | bin | '@' in url | 4/5 | [3]: 3.6% frequency. Browser ignores everything before @. [1] #7. |
| F20 | pct_numeric_chars | float | num_numeric_chars / max(len(url), 1) | 1/5 | [4] percentage_numeric_chars. Ratio-normalized F09. |
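Three Tier 2 extractions can be sketched as below. Several details are assumptions that the table does not pin down: "special" is taken to mean any non-alphanumeric character, n in url_char_prob is taken to be the number of alphanumeric characters found in the table, and all function names are illustrative.

```python
import re
from urllib.parse import urlsplit

def _longest_run(pattern, text):
    """Length of the longest match of `pattern` in `text` (0 if none)."""
    return max((len(m) for m in re.findall(pattern, text)), default=0)

def char_continuation_rate(url):
    """(max_alpha_seq + max_digit_seq + max_special_seq) / len(url)."""
    if not url:
        return 0.0
    longest_alpha = _longest_run(r"[A-Za-z]+", url)
    longest_digit = _longest_run(r"[0-9]+", url)
    # Assumption: "special" means anything that is not a letter or digit.
    longest_special = _longest_run(r"[^A-Za-z0-9]+", url)
    return (longest_alpha + longest_digit + longest_special) / len(url)

def url_char_prob(url, prob_table):
    """Mean table probability over the URL's alphanumeric characters."""
    chars = [ch for ch in url.lower() if ch in prob_table]
    if not chars:
        return 0.0
    return sum(prob_table[ch] for ch in chars) / len(chars)

def num_query_params(url):
    """Number of '&'-separated components in the query string."""
    query = urlsplit(url).query
    return len(query.split("&")) if query else 0
```

For example, num_query_params keeps ?id=1&id=2&id=3 at 3, whereas grouping by key would collapse it to 1.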
4.3 Tier 3 — Useful Features (1–2 papers, lower importance)
| # | Feature | Type | Extraction | Papers |
| F21 | suspicious_file_ext | bin | Path ends in .exe/.zip/.rar/.scr/.bat/.cmd/.msi/.dll/.vbs/.js/.jar/.ps1/.wsf/.lnk/.7z/.cab | [4] |
| F22 | path_depth | int | path.count('/') - 1, min 0 | [1] #3. [2] F8. |
| F23 | num_underscores | int | url.count('_') | [1] #9. [2] F7. |
| F24 | double_slash_in_path | bin | PRUNED — '//' in path (after protocol) | [1] #24. Removed: 0 splits in LightGBM. |
| F25 | query_length | int | len(query_string) | [1] #23. |
| F26 | has_fragment | bin | Fragment component is non-empty. | [2] F38. [1] #13. |
| F27 | has_repeated_digits | bin | Regex: (\d)\1{2,} | [2] F3. Novel. |
| F28 | avg_subdomain_length | float | Mean length of subdomain parts. | [2] F27. Novel. |
| F29 | tld_length | int | len(tld_string) | [4]. |
| F30 | token_count | int | Split URL by /?.&=-_:@#+~%, count non-empty tokens. | [4]. |
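A few Tier 3 extractions, sketched from the table definitions. Function names are illustrative assumptions, and urlsplit from the standard library stands in for whatever parser the service actually uses.

```python
import re
from urllib.parse import urlsplit

# F27: any digit followed by at least two more copies of itself.
REPEATED_DIGITS = re.compile(r"(\d)\1{2,}")
# F30: delimiter set from the table, written as a character class.
TOKEN_DELIMITERS = re.compile(r"[/?.&=\-_:@#+~%]")

def has_repeated_digits(url):
    """1 if some digit appears three or more times in a row, else 0."""
    return int(REPEATED_DIGITS.search(url) is not None)

def path_depth(url):
    """path.count('/') - 1, floored at 0."""
    return max(urlsplit(url).path.count("/") - 1, 0)

def token_count(url):
    """Number of non-empty tokens after splitting on the delimiter set."""
    return sum(1 for token in TOKEN_DELIMITERS.split(url) if token)
```

Consecutive delimiters such as "://" produce empty strings when split, which is why token_count filters on non-empty tokens.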
5. Feature Importance — Measured from Champion Model
Method: LightGBM feature_importances_ attribute. Values are split counts across all trees. Top 4 features account for 53.6% of total model importance. Two of four are novel derived features from [5]; two are Shannon entropy variants from [2].
| Rank | Feature | Splits | Cumul. % | Tier | Source Paper |
| 1 | url_char_prob | 2,697 | 15.5% | T2 | [5] Eq. 1 |
| 2 | entropy_domain | 2,293 | 28.7% | T2 | [2] F41 |
| 3 | entropy_url | 2,244 | 41.6% | T1 | [2] F40 |
| 4 | tld_legit_prob | 2,081 | 53.6% | T2 | [5] §3.1.4 |
| 5 | char_continuation_rate | 1,659 | 63.1% | T2 | [5] §3.1.4 |
| 6 | url_length | 1,171 | 69.9% | T1 | [1][2][3][4][5] |
| 7 | hostname_length | 844 | 74.7% | T2 | [1][4] |
| 8 | path_length | 708 | 78.8% | T2 | [1][2][4] |
| 9 | avg_subdomain_length | 692 | 82.8% | T3 | [2] F27 |
| 10 | pct_numeric_chars | 413 | 85.2% | T2 | [4] |
| ... remaining 18 active features account for 14.8% of importance. has_ip_address and double_slash_in_path contributed 0 splits and have been pruned. |
Observation: Features traditionally considered essential — IpAddress, AtSymbol, DoubleSlash — contributed zero or near-zero importance. This aligns with [1] §5.5: "some frequently promoted features in existing phishing detection studies are not chosen as baseline features... phishers are employing new schemes to evade detection."
6. Model Selection
6.1 Candidates Tested
10 models evaluated. Selection driven by paper evidence. Data split: 70/15/15 stratified ([1] §5.2 uses 70/30; we carve validation from the 30%).
| Model | Paper Rationale | Val MCC | Val Acc | Train Time | Single Latency |
| LightGBM | [5] Table 3: 99.99%, fastest training | 0.99564 | 0.99784 | 3.8s | 1,582 μs |
| VotingEnsemble | [5] §4.2.2: multi-model consensus | 0.99555 | 0.99779 | 61.3s | 62,462 μs |
| StackingEnsemble | [5] Table 3: 99.979% | 0.99564 | 0.99784 | 163.1s | 42,260 μs |
| XGBoost | [5] Table 3: 99.993% — highest in paper | 0.99542 | 0.99773 | 3.7s | 4,385 μs |
| CatBoost | [5] Table 3: 99.987% | 0.99528 | 0.99766 | 10.0s | 2,006 μs |
| RandomForest | [1] §5.3: consensus best. [5]: 99.982% | 0.99506 | 0.99755 | 28.7s | 55,746 μs |
| ExtraTrees | RF variant, random splits | 0.99447 | 0.99726 | 16.6s | 57,950 μs |
| DecisionTree | [1]: C4.5 at 94.37% | 0.99349 | 0.99677 | 2.0s | 1,323 μs |
| AdaBoost | [5] Table 3: 99.981% | 0.98131 | 0.99069 | 39.0s | 34,359 μs |
| LogisticRegression | [5] Table 3: 99.654% | 0.96525 | 0.98260 | 2.8s | 975 μs |
Ranking metric: MCC (Matthews Correlation Coefficient). [5] uses MCC as primary metric (Eq. 7). MCC accounts for all four confusion matrix quadrants. F1 ignores true negatives. Accuracy treats all errors equally. With 45.2/54.8 class split, MCC is the most informative single metric. Range: −1 to +1.
6.2 Why LightGBM over XGBoost
XGBoost achieved the highest accuracy in [5] Table 3 (99.993%). In our benchmark, LightGBM ranks #1 by validation MCC (0.99564 vs 0.99542) and is about 2.8× faster at single inference (1,582 μs vs 4,385 μs). Training time is effectively equal (3.8s vs 3.7s). With higher MCC and lower inference latency, LightGBM is the better fit for a low-latency production service.
7. Final Results — Held-Out Test Set
Protocol: Test set touched exactly once. 44,896 URLs. Stratified. Positive rate: 0.5480. These are the official reported numbers.
7.1 Top 3 Models
| Model | Accuracy | F1 | MCC | AUC-ROC | Log Loss | FPR | FNR |
| LightGBM | 0.99824 | 0.99839 | 0.99645 | 0.99929 | 0.01026 | 0.00099 | 0.00240 |
| VotingEnsemble | 0.99820 | 0.99835 | 0.99636 | 0.99931 | 0.01069 | 0.00103 | 0.00244 |
| StackingEnsemble | 0.99817 | 0.99833 | 0.99631 | 0.99936 | 0.01090 | 0.00118 | 0.00236 |
7.2 Champion: LightGBM
| Accuracy | 99.824% | 79 errors out of 44,896 |
| Precision | 99.919% | Of flagged phishing, 99.919% correct |
| Recall | 99.760% | Of actual phishing, 99.760% caught |
| F1 Score | 99.839% | Harmonic mean of precision and recall |
| MCC | 0.99645 | Near-perfect across all quadrants |
| AUC-ROC | 0.99929 | Near-perfect class separation |
| Log Loss | 0.01026 | Low loss, consistent with well-calibrated probability estimates |
| FPR | 0.099% | 20 legit URLs blocked / 20,295 |
| FNR | 0.240% | 59 phishing URLs missed / 24,601 |
| Model size | 0.8 MB | Compressed joblib |
| Training time | 3.8 s | Kaggle 2-core CPU |
| Batch latency | 7.6 μs/URL | Vectorized batch prediction |
| Single latency | 1,582 μs/URL | Production API estimate |
| Feature extraction | ~60,000 URLs/sec | Vectorized pandas ops |
7.3 Confusion Matrix (Test Set)
| | Predicted Legit | Predicted Phish |
| Actual Legit | 20,275 (TN) | 20 (FP) |
| Actual Phish | 59 (FN) | 24,542 (TP) |
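The headline metrics in 7.2 follow arithmetically from these four counts. The sketch below recomputes them as a consistency check; the function name metrics_from_confusion is an assumption, and the formulas are the standard definitions.

```python
import math

def metrics_from_confusion(tn, fp, fn, tp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }

# Counts taken from the test-set confusion matrix above.
m = metrics_from_confusion(tn=20275, fp=20, fn=59, tp=24542)
for name, value in m.items():
    print(f"{name}: {value:.5f}")
```

Rounded to five decimals, the values agree with the LightGBM row of 7.1 (accuracy 0.99824, F1 0.99839, MCC 0.99645, FPR 0.00099, FNR 0.00240).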
8. Production Integration — SVC-03 Interface
8.1 Inference Flow
8.2 Confidence-Gated Routing
Follows [5] security profile design (§4.2.2). Different thresholds accommodate different risk tolerances. URLs in the uncertainty band are candidates for SVC-03's enrichment path (WHOIS, SSL, DNS) before final scoring.
| Probability | Risk Level | SVC-03 Action |
| 0.85 – 1.00 | DANGEROUS | Score = ml_score. Skip enrichment. Emit immediately. |
| 0.50 – 0.85 | SUSPICIOUS | Route to enrichment (WHOIS/SSL/DNS). Score = enriched model or ml_score. |
| 0.30 – 0.50 | UNCERTAIN | Route to enrichment (WHOIS/SSL/DNS). Score = enriched model or ml_score. |
| 0.00 – 0.30 | SAFE | Score = ml_score. Skip enrichment. Emit immediately. |
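The routing table can be expressed as a small function. The table's bands share their boundary values (0.30, 0.50, 0.85), so this sketch assumes each band includes its lower bound; that choice, the function name route, and the (risk_level, needs_enrichment) return shape are all assumptions rather than the service's actual interface.

```python
def route(ml_score):
    """Map a phishing probability to (risk_level, needs_enrichment).

    Assumption: each band includes its lower bound, so 0.85 is
    DANGEROUS, 0.50 is SUSPICIOUS, and 0.30 is UNCERTAIN.
    """
    if not 0.0 <= ml_score <= 1.0:
        raise ValueError(f"ml_score must be in [0, 1], got {ml_score}")
    if ml_score >= 0.85:
        return ("DANGEROUS", False)   # emit immediately
    if ml_score >= 0.50:
        return ("SUSPICIOUS", True)   # WHOIS/SSL/DNS enrichment
    if ml_score >= 0.30:
        return ("UNCERTAIN", True)    # WHOIS/SSL/DNS enrichment
    return ("SAFE", False)            # emit immediately
```

Only the two middle bands incur enrichment latency; scores at either extreme are emitted on the model output alone.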
8.3 Exported Artifacts
| File | Size | Contents |
| model.joblib | 0.8 MB | Trained LightGBM classifier (compress=3) |
| config.json | ~50 KB | feature_names, char_prob_table, tld_legit_prob, sensitive_words |
| metrics.json | <1 KB | Test set performance metrics |
9. Design Justifications
| Decision | Rationale |
| URL-only features | [2] deliberately excluded content features "to optimize speed and responsiveness." [5] §3.1.2: URL features have high potential but cannot be sole defense — hence two-tier architecture. Content features reserved for enrichment path. |
| LightGBM over XGBoost | LightGBM ranks #1 by MCC on both validation (0.99564) and test (0.99645). Single-inference: 1,582 μs vs 4,385 μs. Train: 3.8s vs 3.7s. Faster at inference with higher MCC. |
| MCC for ranking | [5] uses MCC as primary metric (Eq. 7). Accounts for all four confusion matrix quadrants. F1 ignores TN. Accuracy treats all errors equally. With 45/55 class split, MCC is most informative. |
| Strict hostname IP detection | Original regex matched against full URL string. False positive on http://192.168.1.1.example.com. Fix: anchored regex on parsed hostname only. Also guards TLD/domain/subdomain fallback parsers from treating IP octets as TLD components. |
| Raw delimiter query counting | parse_qs groups duplicate keys. ?id=1&id=2&id=3 returns length 1. [1] feature #11 counts total components. Raw '&' counting preserves this intent. |
| Occurrence-based sensitive words | url.count(w) not w in url. URL containing "login" twice (e.g. login.site.com/login) scores 2, not 1. Captures keyword stuffing. |
| Two source datasets only | DS2 (PhiUSIIL) and DS3 (LegitPhish) are the only datasets with raw URL strings and continuous encoding. Raw URLs required for consistent feature extraction. Pre-discretized datasets (DS1, DS6, DS7) cannot be combined. DS4 has no raw URLs and 47.6% duplicates. |
| 70/15/15 split | [1] §5.2 uses 70/30. We carve validation from the 30% for model comparison. Test set touched exactly once. |
| tldextract dependency | Fallback TLD parser failed on bare domains from Cisco Umbrella file (urlparse requires scheme prefix). Installing tldextract resolved this. Feature tld_legit_prob then rose to rank #4 in importance (2,081 splits in the champion model). |
10. Limitations
| # | Limitation | Impact | Mitigation |
| 1 | Training data from 2015–2023. Phishing tactics evolve. | Concept drift over time. | Periodic retraining. [5] incremental learning approach. |
| 2 | URL structure only. No page content analysis. | Sophisticated mimicry URLs evade detection. | Two-tier architecture. Enrichment path handles ambiguous cases. |
| 3 | English/Latin-script URL bias. | Non-Latin IDN URLs underrepresented. | Future dataset expansion. [4] acknowledges this. |
| 4 | has_ip_address and double_slash_in_path contributed 0 splits in the champion model. | Two features consumed memory and inference time without contributing signal. | Both pruned from active feature set in Phase 0 of the benchmark pipeline. 28 features now used at inference. |
| 5 | 45.2/54.8 class split does not reflect real-world traffic. | Production prevalence is ~1% phishing. Threshold calibration needed. | Adjust decision threshold post-deployment using production data. |
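Limitation 5 can be made concrete with a short calculation. Precision depends on class prevalence even when the true positive rate and false positive rate stay fixed. The sketch below applies Bayes' rule under the assumption that the test-set TPR and FPR carry over unchanged to production, which drift may well violate; the function name is illustrative.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Expected precision for given TPR, FPR, and positive-class prevalence."""
    true_alerts = tpr * prevalence
    false_alerts = fpr * (1.0 - prevalence)
    return true_alerts / (true_alerts + false_alerts)

# Rates taken from the test-set confusion matrix in 7.3.
TPR = 24542 / 24601   # recall
FPR = 20 / 20295

test_split = precision_at_prevalence(TPR, FPR, prevalence=24601 / 44896)
production = precision_at_prevalence(TPR, FPR, prevalence=0.01)
print(f"precision at test prevalence: {test_split:.4f}")
print(f"precision at 1% prevalence:   {production:.4f}")
```

At the test-set prevalence this reproduces the reported 99.919% precision. At 1% prevalence the same rates give a precision of roughly 91%, meaning about one flagged URL in eleven would be legitimate, which is why the decision threshold needs recalibration on production data.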
11. References
[1] K.L. Chiew, C.L. Tan, K. Wong, K.S.C. Yong, W.K. Tiong. "A new hybrid ensemble feature selection framework for machine learning-based phishing detection system." Information Sciences 484 (2019) 153–166. doi:10.1016/j.ins.2019.01.064
[2] M.A. Tamal, M.K. Islam, T. Bhuiyan, A. Sattar. "Dataset of suspicious phishing URL detection." Frontiers in Computer Science 6:1308634 (2024). doi:10.3389/fcomp.2024.1308634
[3] R.M. Mohammad, F. Thabtah, L. McCluskey. "An Assessment of Features Related to Phishing Websites using an Automated Technique." ICITST-2012, IEEE (2012).
[4] R.S. Potpelwar, U.V. Kulkarni, J.M. Waghmare. "LegitPhish: A large-scale annotated dataset for URL-based phishing detection." Data in Brief 63 (2025) 111972. doi:10.1016/j.dib.2025.111972
[5] A. Prasad, S. Chandra. "PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning." Computers & Security 136 (2024) 103545. doi:10.1016/j.cose.2023.103545