CYBERSIREN — URL ANALYSIS ML MODEL SPECIFICATION

Document: ML-SPEC-v1.1  |  Date: 2025-07-15  |  Scope: SVC-03 URL Analysis (Low-Latency Phishing Detection Model)  |  Classification: Internal / Graduation Project
Context: This document specifies the machine learning model deployed inside SVC-03 (URL Analysis Service) as defined in ARCH-SPEC-v2.1. The model runs as a Python subprocess invoked by the Go service. It receives a raw URL string, extracts 28 structural features, and returns a phishing probability. No network calls or page downloads occur at inference time. The model file is 0.8 MB (joblib-compressed LightGBM).
Champion Model: LightGBM  |  Test Accuracy: 99.824%  |  Test MCC: 0.99645  |  Test AUC-ROC: 0.99929  |  False Positive Rate: 0.099%  |  False Negative Rate: 0.240%  |  Single-Inference Latency: 1,582 μs

1. Literature Foundation

Five peer-reviewed papers informed feature selection, model choice, and evaluation methodology. Every design decision traces to one or more of these sources.

Ref  |  Citation  |  Year  |  Dataset  |  Contribution to This Work
[1]  |  K.L. Chiew et al. "A new hybrid ensemble feature selection framework for ML-based phishing detection." Information Sciences 484, 153–166.  |  2019  |  10,000 URLs, 48 features  |  HEFS framework. CDF-g algorithm for feature cut-off. Identified 10 baseline features from 48 candidates. RF consensus best classifier. Features NumDash, NumNumericChars, NumSensitiveWords selected as HEFS baseline.
[2]  |  M.A. Tamal et al. "Dataset of suspicious phishing URL detection." Frontiers in Computer Science 6:1308634.  |  2024  |  247,950 URLs, 42 features  |  10 novel features: Shannon entropy of URL/domain, repeated digits, subdomain statistics. URL-only design. IQR outlier removal methodology.
[3]  |  R.M. Mohammad et al. "An Assessment of Features Related to Phishing Websites using an Automated Technique." ICITST-2012, IEEE.  |  2012  |  2,500 phishing URLs  |  Rule-based extraction thresholds. URL length ≥54 = suspicious (48.8%). HTTPS absence = 92.8%. IP address in URL = 22.8%. Feature frequency weights.
[4]  |  R.S. Potpelwar et al. "LegitPhish: A large-scale annotated dataset for URL-based phishing detection." Data in Brief 63, 111972.  |  2025  |  101,219 URLs, 17 features  |  Manually verified dataset. Source for DS3 (LegitPhish). Features: tld_popularity, suspicious_file_extension, percentage_numeric_chars.
[5]  |  A. Prasad, S. Chandra. "PhiUSIIL: A diverse security profile empowered phishing URL detection framework." Computers & Security 136, 103545.  |  2024  |  235,795 URLs, 54 features  |  Source for DS2 (PhiUSIIL). Novel derived features: URLCharProb (Eq. 1), CharContinuationRate, TLDLegitimateProb. LightGBM 99.99% in Table 3. MCC as primary metric (Eq. 7). Incremental learning architecture. Security profiles.

2. Data Pipeline

2.1 Source Datasets

Two datasets selected. Criteria: raw URL strings present, continuous feature encoding, no pre-discretization.

ID  |  Name  |  Rows  |  Cols  |  Source Paper  |  URL Column  |  Label Column  |  Native Encoding
DS2  |  PhiUSIIL Phishing URL Dataset  |  235,795  |  56  |  [5] Prasad & Chandra (2024)  |  URL  |  label  |  1=legit, 0=phish
DS3  |  LegitPhish  |  101,219  |  18  |  [4] Potpelwar et al. (2025)  |  URL  |  ClassLabel  |  1=legit, 0=phish

2.2 Excluded Datasets (with reason)

ID  |  Name  |  Rows  |  Reason for Exclusion
DS1  |  Chiew (2019) Feature Evaluation  |  10,000  |  No raw URL column. 6 ternary-encoded features. Incompatible with continuous pipeline.
DS4  |  Tamal (2024) Phishing Detection  |  247,950  |  No raw URL column. 118,172 exact duplicate rows (47.6%).
DS6  |  UCI Phishing Websites (ARFF)  |  11,055  |  Ternary (-1/0/1) encoding. 5,206 duplicate rows. Pre-discretized; original values lost.
DS7  |  UCI Phishing Websites (legacy)  |  2,456  |  Same ternary encoding as DS6. 740 duplicate rows.

2.3 Label Unification

Both source datasets encode legitimate=1, phishing=0, matching [5] Algorithm 3 ("prediction 1 means legitimate") and [4] Table 1 ("Phishing=0, Legitimate=1"). CyberSiren inverts this to the standard convention, 0=legitimate and 1=phishing (positive class), via map({1: 0, 0: 1}).

2.4 Quality Steps

Step  |  Action  |  Rows Affected
1  |  Drop rows with null URL or null label  |  1
2  |  Drop URLs <10 characters or containing no dots  |  1
3  |  Strip whitespace from URL strings  |  not reported
4  |  Case-insensitive deduplication. On conflict: keep PhiUSIIL (larger, more recent)  |  37,706
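
The label unification in 2.3 and the four quality steps above can be sketched in plain Python as follows. This is an illustrative sketch only: the function name clean_rows, the row keys (url, label, source), and the choice to flip labels before deduplication are assumptions, and the real pipeline presumably works on pandas DataFrames rather than lists of dicts.

```python
FLIP = {1: 0, 0: 1}  # source encoding (1=legit) -> CyberSiren encoding (1=phish)

def clean_rows(rows):
    """Apply steps 1-4 to an iterable of dicts with 'url', 'label', 'source'.

    'source' is a hypothetical field holding "PhiUSIIL" or "LegitPhish".
    """
    kept = {}
    for row in rows:
        url, label = row.get("url"), row.get("label")
        # Step 1: drop rows with a null URL or null label.
        if url is None or label is None:
            continue
        # Step 2: drop very short URLs and URLs without a dot.
        if len(url) < 10 or "." not in url:
            continue
        # Step 3: strip surrounding whitespace.
        url = url.strip()
        # Step 4: case-insensitive dedup; a PhiUSIIL row wins over a LegitPhish row.
        key = url.lower()
        candidate = {"url": url, "label": FLIP[label], "source": row["source"]}
        existing = kept.get(key)
        if existing is None or (
            existing["source"] != "PhiUSIIL" and candidate["source"] == "PhiUSIIL"
        ):
            kept[key] = candidate
    return list(kept.values())

sample = [
    {"url": "https://Example.com/login", "label": 1, "source": "LegitPhish"},
    {"url": "https://example.com/LOGIN ", "label": 0, "source": "PhiUSIIL"},
    {"url": None, "label": 1, "source": "PhiUSIIL"},
    {"url": "short.io", "label": 0, "source": "LegitPhish"},
]
cleaned = clean_rows(sample)
```

In this toy input the two example.com rows collide after lowercasing, so only the PhiUSIIL copy survives, with its label flipped from 0 to 1.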

2.5 Final Dataset — PullDD

Metric  |  Value
Total rows  |  299,306
Legitimate (label=0)  |  135,295 (45.2%)
Phishing (label=1)  |  164,011 (54.8%)
From PhiUSIIL  |  235,370
From LegitPhish  |  63,936
Features extracted  |  30
Features used in model  |  28 (2 pruned, zero importance)
Null cells  |  0
Infinite values  |  0
Output file  |  cybersiren_lowlatency_dataset.csv (43.8 MB)

3. Lookup Tables

Three pre-computed tables are loaded at startup. CHAR_PROB_TABLE and TLD_LEGIT_PROB are built from the Cisco Umbrella top-1M domains list; SENSITIVE_WORDS extends the keyword list from [1].

Table  |  Source  |  Paper  |  Build Method  |  Size
CHAR_PROB_TABLE  |  Cisco Umbrella top-1M  |  [5] PhiUSIIL §3.1.4, Eq. 1  |  For each char a–z and 0–9: count(char) / total_alphanumeric_chars. Corpus: 21,593,440 chars. Chars a,c,e,o,r,t higher in legitimate. Chars b,f,q,v,w,x,y,z and all digits higher in phishing.  |  36 entries
TLD_LEGIT_PROB  |  Cisco Umbrella top-1M  |  [5] PhiUSIIL §3.1.4  |  For each TLD: count(TLD) / total_TLDs. Parsed via tldextract. Top: com=0.617, net=0.141, org=0.026, io=0.023, co.uk=0.010.  |  1,319 entries
SENSITIVE_WORDS  |  [1] Chiew (2019) feature #25  |  [1] §Appendix, Feature 25  |  Extended from Chiew base list: secure, account, webscr, login, ebayisapi, signin, banking, confirm, update, verify, password, suspend, paypal, authenticate, wallet, credential.  |  16 words
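
The two frequency tables follow the same count-and-normalize pattern, which can be sketched as below. The function names and the tiny input lists are hypothetical; the TLD sketch takes already-extracted TLD strings as input (the document extracts them with tldextract, which is not reproduced here).

```python
import string
from collections import Counter

ALNUM = string.digits + string.ascii_lowercase  # the 36 tracked characters

def build_char_prob_table(domains):
    """count(char) / total_alphanumeric_chars over a corpus of domain strings."""
    counts = Counter(ch for d in domains for ch in d.lower() if ch in ALNUM)
    total = sum(counts.values())
    # Every tracked character gets an entry, even if it never occurs.
    return {ch: counts.get(ch, 0) / total for ch in ALNUM}

def build_tld_prob_table(tlds):
    """count(TLD) / total_TLDs over a list of already-extracted TLD strings."""
    counts = Counter(tlds)
    total = sum(counts.values())
    return {tld: n / total for tld, n in counts.items()}

# Toy corpus standing in for the top-1M list.
char_table = build_char_prob_table(["example.com", "test123.net"])
tld_table = build_tld_prob_table(["com", "com", "net", "co.uk"])
```

With a real corpus, the character table always has 36 entries, while the TLD table has one entry per distinct TLD seen.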

4. Feature Specification — 30 Defined / 28 Active

Extraction: All features extracted from the raw URL string at inference time. No network calls. No page downloads. No third-party API lookups. Feature extraction is vectorized using pandas string operations for batch processing (~60,000 URLs/sec). Single-URL extraction uses per-row functions.
Pruned features: has_ip_address (F04) and double_slash_in_path (F24) contributed 0 splits in the champion LightGBM model and have been removed from the active feature set. Both are defined below for completeness but are not passed to the model at inference time.

4.1 Tier 1 — Strongest Features (5/5 or 4/5 paper support, HEFS-validated)

#  |  Feature  |  Type  |  Extraction  |  Papers  |  Evidence
F01  |  url_length  |  int  |  len(url)  |  5/5  |  [3]: threshold ≥54, 48.8% of phishing. [1] feature #4. [2] F1. [4]. [5].
F02  |  num_dots  |  int  |  url.count('.')  |  5/5  |  [1] feature #1. [2] F2. [3] subdomain rule. [4] dot_count. [5].
F03  |  num_subdomains  |  int  |  Count parts in subdomain split by '.'  |  5/5  |  [3]: 3 dots in domain = suspicious, 44.4%. [2] F24. [5] NoOfSubDomain. [1] #2.
F04  |  has_ip_address  |  bin  |  PRUNED. Anchored regex on parsed hostname: ^(\d{1,3}\.){3}\d{1,3}$ or hex IP. Not on full URL string.  |  4/5  |  [3]: 22.8%. [1] #17. [5] IsDomainIP. Removed: 0 splits in LightGBM.
F05  |  num_hyphens_url  |  int  |  url.count('-')  |  5/5  |  [1]: HEFS baseline feature. [3]: 26.4%. [2] F6.
F06  |  num_hyphens_hostname  |  int  |  hostname.count('-')  |  5/5  |  [1] feature #6 NumDashInHostname.
F07  |  https_flag  |  bin  |  scheme == 'https'  |  4/5  |  [3]: 92.8% phishing lacks HTTPS. [5] IsHTTPS.
F08  |  entropy_url  |  float  |  Shannon entropy: E = -Σ(p_i × log₂(p_i)) where p_i = freq(char)/len(url)  |  3/5  |  [2] F40. [4] url_entropy. [5] via URLCharProb.
F09  |  num_numeric_chars  |  int  |  Count digits 0–9 in URL  |  4/5  |  [1]: HEFS baseline feature. [2] F4. [5].
F10  |  num_sensitive_words  |  int  |  Σ url.lower().count(w) for each word in 16-word list. Counts total occurrences, not just presence.  |  1/5  |  [1]: HEFS baseline feature. Selected by CDF-g + ensemble from 48 candidates.
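
A few of the Tier 1 extractions can be sketched as single-URL functions. This is a minimal sketch, assuming the standard library's urllib.parse.urlsplit as the URL parser (the production extractor may parse differently); num_subdomains is left out because it depends on TLD-aware parsing, and the function names are illustrative.

```python
import math
from collections import Counter
from urllib.parse import urlsplit

SENSITIVE_WORDS = [
    "secure", "account", "webscr", "login", "ebayisapi", "signin", "banking",
    "confirm", "update", "verify", "password", "suspend", "paypal",
    "authenticate", "wallet", "credential",
]

def shannon_entropy(text):
    """E = -sum(p_i * log2(p_i)) with p_i = freq(char) / len(text)."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())

def tier1_features(url):
    parts = urlsplit(url)
    hostname = parts.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_hyphens_url": url.count("-"),
        "num_hyphens_hostname": hostname.count("-"),
        "https_flag": int(parts.scheme == "https"),
        "entropy_url": shannon_entropy(url),
        "num_numeric_chars": sum(ch in "0123456789" for ch in url),
        # Total occurrences across all words, not a presence flag.
        "num_sensitive_words": sum(lowered.count(w) for w in SENSITIVE_WORDS),
    }

features = tier1_features("https://secure-login.example.com/verify?token=abc123")
```

For the example URL, the sensitive-word count picks up "secure", "login", and "verify" once each.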

4.2 Tier 2 — Strong Features (2–3 papers, novel derived features)

#  |  Feature  |  Type  |  Extraction  |  Papers  |  Evidence
F11  |  hostname_length  |  int  |  len(parsed_hostname)  |  3/5  |  [1] #21. [4] domain_name_length.
F12  |  path_length  |  int  |  len(parsed_path)  |  3/5  |  [1] #22. [2] F36. [4].
F13  |  url_char_prob  |  float  |  Σ prob_table[char_i] / n for alphanumeric chars. Prob table from 1M legit URLs.  |  1/5  |  [5] Eq. 1. Rank #1 in feature importance (2,697 splits, 15.5% cumulative).
F14  |  char_continuation_rate  |  float  |  (max_alpha_seq + max_digit_seq + max_special_seq) / len(url)  |  1/5  |  [5] §3.1.4. Lower rate = more randomized = suspicious.
F15  |  tld_legit_prob  |  float  |  Lookup TLD in frequency table built from top-1M sites.  |  2/5  |  [5] §3.1.4. [4] tld_popularity. Rank #4 in feature importance (2,081 splits).
F16  |  entropy_domain  |  float  |  Shannon entropy on domain string only.  |  2/5  |  [2] F41. Novel feature.
F17  |  num_query_params  |  int  |  len(query.split('&')) if query else 0. Counts raw delimiters, not unique keys.  |  4/5  |  [1] #11, #12. [2] F9, F10. Raw count catches ?id=1&id=2&id=3 as 3, not 1.
F18  |  num_special_chars  |  int  |  Count of !@#$%^&*~`|\<>{} in URL.  |  2/5  |  [2] F5, F20. [5] NoOfObfuscatedChar.
F19  |  at_symbol_present  |  bin  |  '@' in url  |  4/5  |  [3]: 3.6% frequency. Browser ignores everything before @. [1] #7.
F20  |  pct_numeric_chars  |  float  |  num_numeric_chars / max(len(url), 1)  |  1/5  |  [4] percentage_numeric_chars. Ratio-normalized F09.
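
The derived Tier 2 features F13, F14, and F17 are the least obvious to implement, so a hedged sketch follows. It assumes that n in F13 is the number of alphanumeric characters found in the table, that "special" in F14 means any non-alphanumeric character, and that urllib.parse.urlsplit is an acceptable parser; none of these details are confirmed by the specification.

```python
import re
from urllib.parse import urlsplit

def url_char_prob(url, prob_table):
    """Mean table probability over the URL's alphanumeric characters."""
    chars = [ch for ch in url.lower() if ch in prob_table]
    if not chars:
        return 0.0
    return sum(prob_table[ch] for ch in chars) / len(chars)

def longest_run(text, pattern):
    """Length of the longest substring matching the given run pattern."""
    return max((len(m) for m in re.findall(pattern, text)), default=0)

def char_continuation_rate(url):
    """(longest alpha run + longest digit run + longest special run) / len(url)."""
    if not url:
        return 0.0
    alpha = longest_run(url, r"[A-Za-z]+")
    digit = longest_run(url, r"[0-9]+")
    special = longest_run(url, r"[^A-Za-z0-9]+")
    return (alpha + digit + special) / len(url)

def num_query_params(url):
    """Raw '&'-delimited component count, so duplicate keys are all counted."""
    query = urlsplit(url).query
    return len(query.split("&")) if query else 0
```

A URL made of a few long runs scores close to 1.0 on char_continuation_rate, while a URL that alternates character classes frequently scores lower, which matches the "lower rate = more randomized" reading in the table.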

4.3 Tier 3 — Useful Features (1–2 papers, lower importance)

#  |  Feature  |  Type  |  Extraction  |  Papers
F21  |  suspicious_file_ext  |  bin  |  Path ends in .exe/.zip/.rar/.scr/.bat/.cmd/.msi/.dll/.vbs/.js/.jar/.ps1/.wsf/.lnk/.7z/.cab  |  [4]
F22  |  path_depth  |  int  |  path.count('/') - 1, min 0  |  [1] #3. [2] F8.
F23  |  num_underscores  |  int  |  url.count('_')  |  [1] #9. [2] F7.
F24  |  double_slash_in_path  |  bin  |  PRUNED. '//' in path (after protocol)  |  [1] #24. Removed: 0 splits in LightGBM.
F25  |  query_length  |  int  |  len(query_string)  |  [1] #23.
F26  |  has_fragment  |  bin  |  Fragment component is non-empty.  |  [2] F38. [1] #13.
F27  |  has_repeated_digits  |  bin  |  Regex: (\d)\1{2,}  |  [2] F3. Novel.
F28  |  avg_subdomain_length  |  float  |  Mean length of subdomain parts.  |  [2] F27. Novel.
F29  |  tld_length  |  int  |  len(tld_string)  |  [4].
F30  |  token_count  |  int  |  Split URL by /?.&=-_.:@#+~%, count non-empty tokens.  |  [4].
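
Most Tier 3 features reduce to one-line string or regex checks. The sketch below covers the ones that need no TLD-aware parsing; it assumes urllib.parse.urlsplit for path, query, and fragment, and the function name is illustrative.

```python
import re
from urllib.parse import urlsplit

SUSPICIOUS_EXTS = (
    ".exe", ".zip", ".rar", ".scr", ".bat", ".cmd", ".msi", ".dll",
    ".vbs", ".js", ".jar", ".ps1", ".wsf", ".lnk", ".7z", ".cab",
)
TOKEN_SPLIT = re.compile(r"[/?.&=\-_:@#+~%]")   # delimiter set from F30
REPEATED_DIGITS = re.compile(r"(\d)\1{2,}")      # same digit 3+ times in a row

def tier3_features(url):
    parts = urlsplit(url)
    path = parts.path
    return {
        "suspicious_file_ext": int(path.lower().endswith(SUSPICIOUS_EXTS)),
        "path_depth": max(path.count("/") - 1, 0),
        "num_underscores": url.count("_"),
        "query_length": len(parts.query),
        "has_fragment": int(bool(parts.fragment)),
        "has_repeated_digits": int(bool(REPEATED_DIGITS.search(url))),
        "token_count": sum(1 for tok in TOKEN_SPLIT.split(url) if tok),
    }

t3 = tier3_features("http://example.com/a/b/setup.exe?x=1111#top")
```

Here the path /a/b/setup.exe has three slashes, so path_depth is 2, and the run 1111 triggers has_repeated_digits.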

5. Feature Importance — Measured from Champion Model

Method: LightGBM feature_importances_ attribute. Values are split counts across all trees. Top 4 features account for 53.6% of total model importance. Two of four are novel derived features from [5]; two are Shannon entropy variants from [2].
Rank  |  Feature  |  Splits  |  Cumul. %  |  Tier  |  Source Paper
1  |  url_char_prob  |  2,697  |  15.5%  |  T2  |  [5] Eq. 1
2  |  entropy_domain  |  2,293  |  28.7%  |  T2  |  [2] F41
3  |  entropy_url  |  2,244  |  41.6%  |  T1  |  [2] F40
4  |  tld_legit_prob  |  2,081  |  53.6%  |  T2  |  [5] §3.1.4
5  |  char_continuation_rate  |  1,659  |  63.1%  |  T2  |  [5] §3.1.4
6  |  url_length  |  1,171  |  69.9%  |  T1  |  [1][2][3][4][5]
7  |  hostname_length  |  844  |  74.7%  |  T2  |  [1][4]
8  |  path_length  |  708  |  78.8%  |  T2  |  [1][2][4]
9  |  avg_subdomain_length  |  692  |  82.8%  |  T3  |  [2] F27
10  |  pct_numeric_chars  |  413  |  85.2%  |  T2  |  [4]
... remaining 18 active features account for 14.8% of importance. has_ip_address and double_slash_in_path contributed 0 splits and have been pruned.
Observation: Features traditionally considered essential — IpAddress, AtSymbol, DoubleSlash — contributed zero or near-zero importance. This aligns with [1] §5.5: "some frequently promoted features in existing phishing detection studies are not chosen as baseline features... phishers are employing new schemes to evade detection."
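
The ranking, the cumulative percentages, and the zero-split pruning rule can be reproduced from any mapping of feature name to split count. The sketch below uses made-up counts, not the champion model's; with a trained LightGBM classifier the mapping would presumably be built by pairing the feature names with feature_importances_.

```python
def rank_by_splits(splits):
    """Return (rows, pruned): rows are (feature, splits, cumulative %) sorted
    by split count; pruned lists features that were never used in a split."""
    total = sum(splits.values())
    rows, running = [], 0
    for name, count in sorted(splits.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((name, count, round(100 * running / total, 1)))
    pruned = [name for name, count in splits.items() if count == 0]
    return rows, pruned

# Hypothetical split counts for illustration only.
rows, pruned = rank_by_splits(
    {"feat_a": 50, "feat_b": 30, "feat_c": 20, "feat_unused": 0}
)
```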

6. Model Selection

6.1 Candidates Tested

10 models evaluated. Selection driven by paper evidence. Data split: 70/15/15 stratified ([1] §5.2 uses 70/30; we carve validation from the 30%).
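
A stratified 70/15/15 split keeps the class ratio the same in all three partitions. The following standard-library sketch illustrates the idea on index lists; the function name and seed are assumptions, and the benchmark itself may well use a library routine such as scikit-learn's train_test_split instead.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split row indices into train/val/test, preserving the class ratio."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy labels with a 40/60 class split.
labels = [0] * 200 + [1] * 300
train_idx, val_idx, test_idx = stratified_split(labels)
```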

Model  |  Paper Rationale  |  Val MCC  |  Val Acc  |  Train Time  |  Single Latency
LightGBM  |  [5] Table 3: 99.99%, fastest training  |  0.99564  |  0.99784  |  3.8s  |  1,582 μs
VotingEnsemble  |  [5] §4.2.2: multi-model consensus  |  0.99555  |  0.99779  |  61.3s  |  62,462 μs
StackingEnsemble  |  [5] Table 3: 99.979%  |  0.99564  |  0.99784  |  163.1s  |  42,260 μs
XGBoost  |  [5] Table 3: 99.993%, highest in paper  |  0.99542  |  0.99773  |  3.7s  |  4,385 μs
CatBoost  |  [5] Table 3: 99.987%  |  0.99528  |  0.99766  |  10.0s  |  2,006 μs
RandomForest  |  [1] §5.3: consensus best. [5]: 99.982%  |  0.99506  |  0.99755  |  28.7s  |  55,746 μs
ExtraTrees  |  RF variant, random splits  |  0.99447  |  0.99726  |  16.6s  |  57,950 μs
DecisionTree  |  [1]: C4.5 at 94.37%  |  0.99349  |  0.99677  |  2.0s  |  1,323 μs
AdaBoost  |  [5] Table 3: 99.981%  |  0.98131  |  0.99069  |  39.0s  |  34,359 μs
LogisticRegression  |  [5] Table 3: 99.654%  |  0.96525  |  0.98260  |  2.8s  |  975 μs
Ranking metric: MCC (Matthews Correlation Coefficient). [5] uses MCC as primary metric (Eq. 7). MCC accounts for all four confusion matrix quadrants. F1 ignores true negatives. Accuracy treats all errors equally. With 45.2/54.8 class split, MCC is the most informative single metric. Range: −1 to +1.

6.2 Why LightGBM over XGBoost

XGBoost achieved the highest accuracy in [5] Table 3 (99.993%). In our benchmark, LightGBM ranks #1 by validation MCC (0.99564 vs 0.99542) and is about 2.8× faster at single inference (1,582 μs vs 4,385 μs). Training time is effectively equal (3.8s vs 3.7s). For a low-latency production service, lower inference latency at equal or better MCC makes LightGBM the better fit.

7. Final Results — Held-Out Test Set

Protocol: Test set touched exactly once. 44,896 URLs. Stratified. Positive rate: 0.5480. These are the official reported numbers.

7.1 Top 3 Models

Model  |  Accuracy  |  F1  |  MCC  |  AUC-ROC  |  Log Loss  |  FPR  |  FNR
LightGBM  |  0.99824  |  0.99839  |  0.99645  |  0.99929  |  0.01026  |  0.00099  |  0.00240
VotingEnsemble  |  0.99820  |  0.99835  |  0.99636  |  0.99931  |  0.01069  |  0.00103  |  0.00244
StackingEnsemble  |  0.99817  |  0.99833  |  0.99631  |  0.99936  |  0.01090  |  0.00118  |  0.00236

7.2 Champion: LightGBM

Classification Metrics
Accuracy  |  99.824%  |  79 errors out of 44,896
Precision  |  99.919%  |  Of flagged phishing, 99.919% correct
Recall  |  99.760%  |  Of actual phishing, 99.760% caught
F1 Score  |  99.839%  |  Harmonic mean of precision and recall
MCC  |  0.99645  |  Near-perfect across all quadrants
AUC-ROC  |  0.99929  |  Near-perfect class separation
Log Loss  |  0.01026  |  Calibrated probability estimates
Operational Metrics
FPR  |  0.099%  |  20 legit URLs blocked / 20,295
FNR  |  0.240%  |  59 phishing URLs missed / 24,601
Model size  |  0.8 MB  |  Compressed joblib
Training time  |  3.8 s  |  Kaggle 2-core CPU
Batch latency  |  7.6 μs/URL  |  Vectorized batch prediction
Single latency  |  1,582 μs/URL  |  Production API estimate
Feature extraction  |  ~60,000 URLs/sec  |  Vectorized pandas ops

7.3 Confusion Matrix (Test Set)

Actual \ Predicted  |  Predicted Legit  |  Predicted Phish
Actual Legit  |  20,275 (TN)  |  20 (FP)
Actual Phish  |  59 (FN)  |  24,542 (TP)
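
As a consistency check, the threshold-based metrics in 7.2 can be recomputed from these four confusion matrix counts using the standard definitions. The function name is illustrative; AUC-ROC and log loss need per-URL probabilities and cannot be derived this way.

```python
import math

def confusion_metrics(tn, fp, fn, tp):
    """Standard binary classification metrics from confusion matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_den,
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }

m = confusion_metrics(tn=20_275, fp=20, fn=59, tp=24_542)
for name, value in m.items():
    print(f"{name}: {value:.5f}")
```

Unlike F1, the MCC numerator includes the true-negative count, which is why it responds to errors on both classes.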

8. Production Integration — SVC-03 Interface

8.1 Inference Flow

1. INPUT: Raw URL string from analysis.urls Kafka message. Example: https://secure-login.example.com/verify?token=abc123
2. FEATURE EXTRACTION: Extract 28 active features from URL string (30 defined; has_ip_address and double_slash_in_path excluded). No network calls. Uses pre-loaded lookup tables (CHAR_PROB_TABLE, TLD_LEGIT_PROB, SENSITIVE_WORDS). Parse URL components via regex. Compute entropy, char probabilities, continuation rate.
3. PREDICTION: Pass 28-feature vector to LightGBM. Returns predict_proba[:,1] as phishing probability (float 0.0–1.0). Binary label: probability ≥ 0.5 → phishing.
4. OUTPUT: ml_score: int 0–100 (probability × 100, rounded). Included in scores.url message per ARCH-SPEC-v2.1 Step 3a schema.
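
Steps 3 and 4 amount to two small conversions on the model's phishing probability, sketched below. The function names are hypothetical, and the use of Python's built-in round (which rounds exact halves to the nearest even integer) is an assumption, since the specification only says "rounded".

```python
def to_label(probability, threshold=0.5):
    """Binary label: 1 (phishing) when probability >= threshold, else 0."""
    return int(probability >= threshold)

def to_ml_score(probability):
    """Map a phishing probability in [0.0, 1.0] to an integer score 0-100."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability out of range: {probability}")
    # Assumption: built-in round(); exact .5 ties go to the even neighbour.
    return round(probability * 100)
```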

8.2 Confidence-Gated Routing

Follows [5] security profile design (§4.2.2). Different thresholds accommodate different risk tolerances. URLs in the uncertainty band are candidates for SVC-03's enrichment path (WHOIS, SSL, DNS) before final scoring.
Probability  |  Risk Level  |  SVC-03 Action
0.85 – 1.00  |  DANGEROUS  |  Score = ml_score. Skip enrichment. Emit immediately.
0.50 – 0.85  |  SUSPICIOUS  |  Route to enrichment (WHOIS/SSL/DNS). Score = enriched model or ml_score.
0.30 – 0.50  |  UNCERTAIN  |  Route to enrichment (WHOIS/SSL/DNS). Score = enriched model or ml_score.
0.00 – 0.30  |  SAFE  |  Score = ml_score. Skip enrichment. Emit immediately.
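
The routing table can be expressed as a chain of threshold checks. Because adjacent bands share their boundary values, this sketch assumes each band includes its lower bound (consistent with the "probability ≥ 0.5 → phishing" rule in 8.1); the function name and the returned tuple shape are illustrative.

```python
def route(probability):
    """Return (risk_level, needs_enrichment) for a phishing probability.

    Assumption: each band includes its lower bound, so 0.85 is DANGEROUS,
    0.50 is SUSPICIOUS, and 0.30 is UNCERTAIN.
    """
    if probability >= 0.85:
        return ("DANGEROUS", False)
    if probability >= 0.50:
        return ("SUSPICIOUS", True)
    if probability >= 0.30:
        return ("UNCERTAIN", True)
    return ("SAFE", False)
```

Only the two middle bands are sent to the enrichment path; both extremes are emitted immediately with the ML score.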

8.3 Exported Artifacts

File  |  Size  |  Contents
model.joblib  |  0.8 MB  |  Trained LightGBM classifier (compress=3)
config.json  |  ~50 KB  |  feature_names, char_prob_table, tld_legit_prob, sensitive_words
metrics.json  |  <1 KB  |  Test set performance metrics

9. Design Justifications

Decision  |  Rationale
URL-only features  |  [2] deliberately excluded content features "to optimize speed and responsiveness." [5] §3.1.2: URL features have high potential but cannot be the sole defense, hence the two-tier architecture. Content features reserved for enrichment path.
LightGBM over XGBoost  |  LightGBM ranks #1 by MCC on both validation (0.99564) and test (0.99645). Single-inference: 1,582 μs vs 4,385 μs. Train: 3.8s vs 3.7s. Faster at inference with higher MCC.
MCC for ranking  |  [5] uses MCC as primary metric (Eq. 7). Accounts for all four confusion matrix quadrants. F1 ignores TN. Accuracy treats all errors equally. With 45/55 class split, MCC is most informative.
Strict hostname IP detection  |  Original regex matched against full URL string. False positive on http://192.168.1.1.example.com. Fix: anchored regex on parsed hostname only. Also guards TLD/domain/subdomain fallback parsers from treating IP octets as TLD components.
Raw delimiter query counting  |  parse_qs groups duplicate keys. ?id=1&id=2&id=3 returns length 1. [1] feature #11 counts total components. Raw '&' counting preserves this intent.
Occurrence-based sensitive words  |  url.count(w), not w in url. URL containing "login" twice (e.g. login.site.com/login) scores 2, not 1. Captures keyword stuffing.
Two source datasets only  |  DS2 (PhiUSIIL) and DS3 (LegitPhish) are the only datasets with raw URL strings and continuous encoding. Raw URLs required for consistent feature extraction. Pre-discretized datasets (DS1, DS6, DS7) cannot be combined. DS4 has no raw URLs and 47.6% duplicates.
70/15/15 split  |  [1] §5.2 uses 70/30. We carve validation from the 30% for model comparison. Test set touched exactly once.
tldextract dependency  |  Fallback TLD parser failed on bare domains from Cisco Umbrella file (urlparse requires scheme prefix). Installing tldextract resolved this. Feature tld_legit_prob jumped to rank #4 in importance (2,081 splits in the champion model, per Section 5).
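
Two of the justifications above are easy to demonstrate with the standard library: the difference between parse_qs and raw '&' counting, and the false positive that an unanchored IPv4 pattern produces on the full URL string. The sketch covers only the dotted-decimal case (the hex IP branch is omitted), and the pattern and function names are illustrative.

```python
import re
from urllib.parse import parse_qs, urlsplit

LOOSE_IPV4 = re.compile(r"(\d{1,3}\.){3}\d{1,3}")       # unanchored, full URL
STRICT_IPV4 = re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")    # anchored, hostname only

def has_ip_address(url):
    """1 if the parsed hostname is a dotted-decimal IPv4 address, else 0."""
    hostname = urlsplit(url).hostname or ""
    return int(bool(STRICT_IPV4.match(hostname)))

tricky = "http://192.168.1.1.example.com"
loose_hit = bool(LOOSE_IPV4.search(tricky))   # the old behaviour: matches
strict_hit = has_ip_address(tricky)           # the fix: hostname is not an IP

query = "id=1&id=2&id=3"
grouped = len(parse_qs(query))      # duplicate keys collapse into one entry
raw = len(query.split("&"))         # raw delimiter count keeps all three
```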

10. Limitations

#  |  Limitation  |  Impact  |  Mitigation
1  |  Training data from 2015–2023. Phishing tactics evolve.  |  Concept drift over time.  |  Periodic retraining. [5] incremental learning approach.
2  |  URL structure only. No page content analysis.  |  Sophisticated mimicry URLs evade detection.  |  Two-tier architecture. Enrichment path handles ambiguous cases.
3  |  English/Latin-script URL bias.  |  Non-Latin IDN URLs underrepresented.  |  Future dataset expansion. [4] acknowledges this.
4  |  has_ip_address and double_slash_in_path contributed 0 splits in the champion model.  |  Two features consumed memory and inference time without contributing signal.  |  Both pruned from active feature set in Phase 0 of the benchmark pipeline. 28 features now used at inference.
5  |  45.2/54.8 class split does not reflect real-world traffic.  |  Production prevalence is ~1% phishing. Threshold calibration needed.  |  Adjust decision threshold post-deployment using production data.
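
Limitation 5 can be made concrete with Bayes' rule: precision depends on class prevalence even when the true-positive and false-positive rates stay fixed. The sketch below assumes, purely for illustration, that the test-set rates (59 misses in 24,601 phishing URLs, 20 false alarms in 20,295 legitimate URLs) carry over unchanged to production traffic, which is not guaranteed.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Expected precision (PPV) when a fraction `prevalence` of traffic is phishing."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

TPR = 24_542 / 24_601   # recall on the held-out test set
FPR = 20 / 20_295       # false positive rate on the held-out test set

ppv_test = precision_at_prevalence(TPR, FPR, prevalence=0.548)
ppv_prod = precision_at_prevalence(TPR, FPR, prevalence=0.01)
print(f"precision at 54.8% prevalence: {ppv_test:.4f}")
print(f"precision at 1% prevalence:    {ppv_prod:.4f}")
```

Under these assumptions, precision drops from about 99.9% on the balanced test set to roughly 91% at 1% prevalence, which is the motivation for recalibrating the decision threshold on production data.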

11. References

  1. K.L. Chiew, C.L. Tan, K. Wong, K.S.C. Yong, W.K. Tiong. "A new hybrid ensemble feature selection framework for machine learning-based phishing detection system." Information Sciences 484 (2019) 153–166. doi:10.1016/j.ins.2019.01.064
  2. M.A. Tamal, M.K. Islam, T. Bhuiyan, A. Sattar. "Dataset of suspicious phishing URL detection." Frontiers in Computer Science 6:1308634 (2024). doi:10.3389/fcomp.2024.1308634
  3. R.M. Mohammad, F. Thabtah, L. McCluskey. "An Assessment of Features Related to Phishing Websites using an Automated Technique." ICITST-2012, IEEE (2012).
  4. R.S. Potpelwar, U.V. Kulkarni, J.M. Waghmare. "LegitPhish: A large-scale annotated dataset for URL-based phishing detection." Data in Brief 63 (2025) 111972. doi:10.1016/j.dib.2025.111972
  5. A. Prasad, S. Chandra. "PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning." Computers & Security 136 (2024) 103545. doi:10.1016/j.cose.2023.103545