CYBERSIREN — URL ANALYSIS ML MODEL SPECIFICATION

Document: ML-SPEC-v1.1
Date: 2025-07-15
Scope: SVC-03 URL Analysis, Low-Latency Phishing Detection Model
Classification: Internal / Graduation Project

Context: This document specifies the machine learning model deployed inside SVC-03 (URL Analysis Service) as defined in ARCH-SPEC-v2.1. The model runs as a Python subprocess invoked by the Go service. It receives a raw URL string, extracts 28 structural features, and returns a phishing probability. No network calls or page downloads occur at inference time. The model file is 0.8 MB (joblib-compressed LightGBM).

1. Literature Foundation

Five peer-reviewed papers informed feature selection, model choice, and evaluation methodology. Every design decision traces to one or more of these sources.

Ref	Citation	Year	Dataset	Contribution to This Work
[1]	K.L. Chiew et al. "A new hybrid ensemble feature selection framework for ML-based phishing detection." Information Sciences 484, 153–166.	2019	10,000 URLs, 48 features	HEFS framework. CDF-g algorithm for feature cut-off. Identified 10 baseline features from 48 candidates. RF consensus best classifier. Features NumDash, NumNumericChars, NumSensitiveWords selected as HEFS baseline.
[2]	M.A. Tamal et al. "Dataset of suspicious phishing URL detection." Frontiers in Computer Science 6:1308634.	2024	247,950 URLs, 42 features	10 novel features: Shannon entropy of URL/domain, repeated digits, subdomain statistics. URL-only design. IQR outlier removal methodology.
[3]	R.M. Mohammad et al. "An Assessment of Features Related to Phishing Websites using an Automated Technique." ICITST-2012, IEEE.	2012	2,500 phishing URLs	Rule-based extraction thresholds. URL length ≥54 = suspicious (48.8%). HTTPS absence = 92.8%. IP address in URL = 22.8%. Feature frequency weights.
[4]	R.S. Potpelwar et al. "LegitPhish: A large-scale annotated dataset for URL-based phishing detection." Data in Brief 63, 111972.	2025	101,219 URLs, 17 features	Manually verified dataset. Source for DS3 (LegitPhish). Features: tld_popularity, suspicious_file_extension, percentage_numeric_chars.
[5]	A. Prasad, S. Chandra. "PhiUSIIL: A diverse security profile empowered phishing URL detection framework." Computers & Security 136, 103545.	2024	235,795 URLs, 54 features	Source for DS2 (PhiUSIIL). Novel derived features: URLCharProb (Eq. 1), CharContinuationRate, TLDLegitimateProb. LightGBM 99.99% in Table 3. MCC as primary metric (Eq. 7). Incremental learning architecture. Security profiles.

2. Data Pipeline

2.1 Source Datasets

Two datasets selected. Criteria: raw URL strings present, continuous feature encoding, no pre-discretization.

ID	Name	Rows	Cols	Source Paper	URL Column	Label Column	Native Encoding
DS2	PhiUSIIL Phishing URL Dataset	235,795	56	[5] Prasad & Chandra (2024)	`URL`	`label`	1=legit, 0=phish
DS3	LegitPhish	101,219	18	[4] Potpelwar et al. (2025)	`URL`	`ClassLabel`	1=legit, 0=phish

2.2 Excluded Datasets (with reason)

ID	Name	Rows	Reason for Exclusion
DS1	Chiew (2019) Feature Evaluation	10,000	No raw URL column. 6 ternary-encoded features. Incompatible with continuous pipeline.
DS4	Tamal (2024) Phishing Detection	247,950	No raw URL column. 118,172 exact duplicate rows (47.6%).
DS6	UCI Phishing Websites (ARFF)	11,055	Ternary (-1/0/1) encoding. 5,206 duplicate rows. Pre-discretized — original values lost.
DS7	UCI Phishing Websites (legacy)	2,456	Same ternary encoding as DS6. 740 duplicate rows.

2.3 Label Unification

Both source datasets encode legitimate=1, phishing=0, as documented in [5] Algorithm 3 ("prediction 1 means legitimate") and [4] Table 1 ("Phishing=0, Legitimate=1"). CyberSiren inverts this to the standard convention in which phishing is the positive class: 0=legitimate, 1=phishing. The inversion is applied via map({1: 0, 0: 1}).

2.4 Quality Steps

Step	Action	Rows Affected
1	Drop rows with null URL or null label	1
2	Drop URLs <10 characters or containing no dots	1
3	Strip whitespace from URL strings	—
4	Case-insensitive deduplication. On conflict: keep PhiUSIIL (larger, more recent)	37,706
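The label unification and the four quality steps above can be sketched in plain Python. This is a minimal illustration, not the project's pipeline code: the `(url, native_label, source)` record shape, the source name strings, and the `clean` function are assumptions made for the example. The steps run in the order listed in the table, so the length check is applied before whitespace is stripped.

```python
from typing import Iterable, Optional

# Hypothetical record shape: (url, native_label, source).
# native_label uses the datasets' own encoding (1=legit, 0=phish).
Row = tuple[Optional[str], Optional[int], str]

FLIP = {1: 0, 0: 1}  # native encoding -> CyberSiren convention (1=phishing)


def clean(rows: Iterable[Row]) -> list[tuple[str, int, str]]:
    """Apply quality steps 1-4, then unify labels."""
    seen: set[str] = set()
    out: list[tuple[str, int, str]] = []
    # Visit PhiUSIIL rows first so they win deduplication conflicts.
    ordered = sorted(rows, key=lambda r: 0 if r[2] == "PhiUSIIL" else 1)
    for url, label, source in ordered:
        if url is None or label is None:      # step 1: null URL or label
            continue
        if len(url) < 10 or "." not in url:   # step 2: too short or no dots
            continue
        url = url.strip()                     # step 3: strip whitespace
        key = url.lower()                     # step 4: case-insensitive key
        if key in seen:
            continue
        seen.add(key)
        out.append((url, FLIP[label], source))
    return out


sample = [
    ("HTTP://Example.com/login", 1, "LegitPhish"),
    ("http://example.com/login", 0, "PhiUSIIL"),
    (None, 1, "PhiUSIIL"),
    ("short.io", 0, "LegitPhish"),
    ("  http://phish.example.net/a ", 0, "LegitPhish"),
]
cleaned = clean(sample)
```

In the sample, the LegitPhish copy of `example.com/login` is dropped in favour of the PhiUSIIL row, and the surviving labels are flipped so that 1 means phishing.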

2.5 Final Dataset — PullDD

Metric	Value
Total rows	299,306
Legitimate (label=0)	135,295 (45.2%)
Phishing (label=1)	164,011 (54.8%)
From PhiUSIIL	235,370
From LegitPhish	63,936

Metric	Value
Features extracted	30
Features used in model	28 (2 pruned — zero importance)
Null cells	0
Infinite values	0
Output file	`cybersiren_lowlatency_dataset.csv` (43.8 MB)

3. Lookup Tables

Three pre-computed tables are loaded at startup. CHAR_PROB_TABLE and TLD_LEGIT_PROB are built from the Cisco Umbrella top-1M domains list; SENSITIVE_WORDS is a fixed word list extended from [1].

Table	Source	Paper	Build Method	Size
CHAR_PROB_TABLE	Cisco Umbrella top-1M	[5] PhiUSIIL §3.1.4, Eq. 1	For each char a–z and 0–9: `count(char) / total_alphanumeric_chars`. Corpus: 21,593,440 chars. Chars a,c,e,o,r,t higher in legitimate. Chars b,f,q,v,w,x,y,z and all digits higher in phishing.	36 entries
TLD_LEGIT_PROB	Cisco Umbrella top-1M	[5] PhiUSIIL §3.1.4	For each TLD: `count(TLD) / total_TLDs`. Parsed via `tldextract`. Top: com=0.617, net=0.141, org=0.026, io=0.023, co.uk=0.010.	1,319 entries
SENSITIVE_WORDS	[1] Chiew (2019) feature #25	[1] §Appendix, Feature 25	Extended from Chiew base list: secure, account, webscr, login, ebayisapi, signin, banking, confirm, update, verify, password, suspend, paypal, authenticate, wallet, credential.	16 words
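The two frequency tables follow the same counting pattern, which the sketch below illustrates. It assumes the domain list has already been read into memory and that TLD strings have already been extracted (the specification uses `tldextract` for that step, which is not reproduced here). The function names are hypothetical.

```python
from collections import Counter
import string

ALPHANUMERIC = string.ascii_lowercase + string.digits  # a-z and 0-9: 36 symbols


def build_char_prob_table(domains: list[str]) -> dict[str, float]:
    """count(char) / total_alphanumeric_chars for each of the 36 symbols."""
    counts = Counter(
        ch for domain in domains for ch in domain.lower() if ch in ALPHANUMERIC
    )
    total = sum(counts.values())
    return {ch: (counts[ch] / total if total else 0.0) for ch in ALPHANUMERIC}


def build_tld_prob_table(tlds: list[str]) -> dict[str, float]:
    """count(TLD) / total_TLDs, given one pre-extracted TLD per domain."""
    counts = Counter(tld.lower() for tld in tlds)
    total = sum(counts.values())
    return {tld: n / total for tld, n in counts.items()}


# Toy corpus for illustration only; the real tables use the top-1M list.
char_table = build_char_prob_table(["abc.com", "a1.net"])
tld_table = build_tld_prob_table(["com", "com", "net", "co.uk"])
```

Characters that never occur in the corpus keep an entry with probability 0.0, so the table always has 36 entries regardless of corpus content.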

4. Feature Specification — 30 Defined / 28 Active

Extraction: All features extracted from the raw URL string at inference time. No network calls. No page downloads. No third-party API lookups. Feature extraction is vectorized using pandas string operations for batch processing (~60,000 URLs/sec). Single-URL extraction uses per-row functions.

Pruned features: has_ip_address (F04) and double_slash_in_path (F24) contributed 0 splits in the champion LightGBM model and have been removed from the active feature set. Both are defined below for completeness but are not passed to the model at inference time.

4.1 Tier 1 — Strongest Features (5/5 or 4/5 paper support, HEFS-validated)

#	Feature	Type	Extraction	Papers	Evidence
F01	url_length	int	`len(url)`	5/5	[3]: threshold ≥54, 48.8% of phishing. [1] feature #4. [2] F1. [4]. [5].
F02	num_dots	int	`url.count('.')`	5/5	[1] feature #1. [2] F2. [3] subdomain rule. [4] dot_count. [5].
F03	num_subdomains	int	Count parts in subdomain split by `'.'`	5/5	[3]: 3 dots in domain = suspicious, 44.4%. [2] F24. [5] NoOfSubDomain. [1] #2.
F04	has_ip_address	bin	`PRUNED` — Anchored regex on parsed hostname: `^(\d{1,3}\.){3}\d{1,3}$` or hex IP. Not on full URL string.	4/5	[3]: 22.8%. [1] #17. [5] IsDomainIP. Removed: 0 splits in LightGBM.
F05	num_hyphens_url	int	`url.count('-')`	5/5	[1]: HEFS baseline feature. [3]: 26.4%. [2] F6.
F06	num_hyphens_hostname	int	`hostname.count('-')`	5/5	[1] feature #6 NumDashInHostname.
F07	https_flag	bin	`scheme == 'https'`	4/5	[3]: 92.8% phishing lacks HTTPS. [5] IsHTTPS.
F08	entropy_url	float	Shannon entropy: `E = -Σ(p_i × log₂(p_i))` where `p_i = freq(char)/len(url)`	3/5	[2] F40. [4] url_entropy. [5] via URLCharProb.
F09	num_numeric_chars	int	Count digits 0–9 in URL	4/5	[1]: HEFS baseline feature. [2] F4. [5].
F10	num_sensitive_words	int	`Σ url.lower().count(w)` for each word in 16-word list. Counts total occurrences, not just presence.	1/5	[1]: HEFS baseline feature. Selected by CDF-g + ensemble from 48 candidates.
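Two of the Tier 1 extractions, the Shannon entropy in F08 and the occurrence count in F10, are sketched below as a reading aid. The function names are illustrative; the word list is the 16-word SENSITIVE_WORDS table from Section 3.

```python
import math
from collections import Counter

SENSITIVE_WORDS = [
    "secure", "account", "webscr", "login", "ebayisapi", "signin",
    "banking", "confirm", "update", "verify", "password", "suspend",
    "paypal", "authenticate", "wallet", "credential",
]


def shannon_entropy(text: str) -> float:
    """E = -sum(p_i * log2(p_i)) with p_i = freq(char) / len(text)."""
    if not text:
        return 0.0
    n = len(text)
    # log2(n / c) equals -log2(c / n); this form avoids returning -0.0.
    return sum((c / n) * math.log2(n / c) for c in Counter(text).values())


def num_sensitive_words(url: str) -> int:
    """Total occurrences of all sensitive words, not just presence."""
    lowered = url.lower()
    return sum(lowered.count(word) for word in SENSITIVE_WORDS)
```

For `login.site.com/login` the count is 2, since the same keyword appears twice. A string of a single repeated character has entropy 0, and a string with two equally frequent characters has entropy 1 bit.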

4.2 Tier 2 — Strong Features (2–3 papers, novel derived features)

#	Feature	Type	Extraction	Papers	Evidence
F11	hostname_length	int	`len(parsed_hostname)`	3/5	[1] #21. [4] domain_name_length.
F12	path_length	int	`len(parsed_path)`	3/5	[1] #22. [2] F36. [4].
F13	url_char_prob	float	`Σ prob_table[char_i] / n` for alphanumeric chars. Prob table is CHAR_PROB_TABLE (Section 3), built from the Cisco Umbrella top-1M domains list.	1/5	[5] Eq. 1. Rank #1 in feature importance (2,697 splits, 15.5% of total).
F14	char_continuation_rate	float	`(max_alpha_seq + max_digit_seq + max_special_seq) / len(url)`	1/5	[5] §3.1.4. Lower rate = more randomized = suspicious.
F15	tld_legit_prob	float	Lookup TLD in frequency table built from top-1M sites.	2/5	[5] §3.1.4. [4] tld_popularity. Rank #4 in feature importance (2,081 splits).
F16	entropy_domain	float	Shannon entropy on domain string only.	2/5	[2] F41. Novel feature.
F17	num_query_params	int	`len(query.split('&'))` if query else 0. Counts raw delimiters, not unique keys.	4/5	[1] #11, #12. [2] F9, F10. Raw count catches `?id=1&id=2&id=3` as 3, not 1.
F18	num_special_chars	int	Count of !@#$%^&*~`\|\<>{} in URL.	2/5	[2] F5, F20. [5] NoOfObfuscatedChar.
F19	at_symbol_present	bin	`'@' in url`	4/5	[3]: 3.6% frequency. Browser ignores everything before @. [1] #7.
F20	pct_numeric_chars	float	`num_numeric_chars / max(len(url), 1)`	1/5	[4] percentage_numeric_chars. Ratio-normalized F09.
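The two derived features from [5] in this tier (F13 and F14) can be sketched as follows. Two points are assumptions rather than facts from the specification: `n` in F13 is taken to be the number of alphanumeric characters, and "special" in F14 is taken to mean any character that is not an ASCII letter or digit. The tiny probability table in the usage lines is made up for illustration.

```python
import re


def _longest_run(pattern: str, text: str) -> int:
    """Length of the longest substring matching the given run pattern."""
    return max((len(m) for m in re.findall(pattern, text)), default=0)


def char_continuation_rate(url: str) -> float:
    """(max_alpha_seq + max_digit_seq + max_special_seq) / len(url)."""
    if not url:
        return 0.0
    total = (
        _longest_run(r"[A-Za-z]+", url)
        + _longest_run(r"[0-9]+", url)
        + _longest_run(r"[^A-Za-z0-9]+", url)  # assumption: all other chars
    )
    return total / len(url)


def url_char_prob(url: str, prob_table: dict[str, float]) -> float:
    """Mean table probability over the URL's alphanumeric characters."""
    chars = [c for c in url.lower() if c.isascii() and c.isalnum()]
    if not chars:
        return 0.0
    return sum(prob_table.get(c, 0.0) for c in chars) / len(chars)


toy_table = {"a": 0.2, "b": 0.4}  # illustrative values only
rate = char_continuation_rate("abc123--")
prob = url_char_prob("a-b", toy_table)
```

A URL made of one letter run, one digit run, and one special run scores 1.0 on F14, while heavily interleaved strings such as `a1b2` score lower.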

4.3 Tier 3 — Useful Features (1–2 papers, lower importance)

#	Feature	Type	Extraction	Papers
F21	suspicious_file_ext	bin	Path ends in .exe/.zip/.rar/.scr/.bat/.cmd/.msi/.dll/.vbs/.js/.jar/.ps1/.wsf/.lnk/.7z/.cab	[4]
F22	path_depth	int	`path.count('/') - 1`, min 0	[1] #3. [2] F8.
F23	num_underscores	int	`url.count('_')`	[1] #9. [2] F7.
F24	double_slash_in_path	bin	`PRUNED` — `'//' in path` (after protocol)	[1] #24. Removed: 0 splits in LightGBM.
F25	query_length	int	`len(query_string)`	[1] #23.
F26	has_fragment	bin	Fragment component is non-empty.	[2] F38. [1] #13.
F27	has_repeated_digits	bin	Regex: `(\d)\1{2,}`	[2] F3. Novel.
F28	avg_subdomain_length	float	Mean length of subdomain parts.	[2] F27. Novel.
F29	tld_length	int	`len(tld_string)`	[4].
F30	token_count	int	Split URL by `/?.&=-_:@#+~%`, count non-empty tokens.	[4].
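A few of the Tier 3 extractions are short enough to sketch directly. The code below is illustrative and uses `urllib.parse.urlparse` for the path; the specification does not state which parser the production extractor uses for these features.

```python
import re
from urllib.parse import urlparse

REPEATED_DIGITS = re.compile(r"(\d)\1{2,}")       # same digit 3+ times in a row
TOKEN_DELIMS = re.compile(r"[/?.&=\-_:@#+~%]")    # delimiter set from F30


def has_repeated_digits(url: str) -> int:
    return int(bool(REPEATED_DIGITS.search(url)))


def path_depth(url: str) -> int:
    """path.count('/') - 1, floored at 0."""
    path = urlparse(url).path
    return max(path.count("/") - 1, 0)


def token_count(url: str) -> int:
    """Number of non-empty tokens after splitting on the delimiter set."""
    return sum(1 for token in TOKEN_DELIMS.split(url) if token)
```

For `http://a.com/x?y=1` the tokens are `http`, `a`, `com`, `x`, `y`, and `1`, so `token_count` returns 6. Note that `1212` does not trigger F27, because the regex requires the same digit to repeat consecutively.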

5. Feature Importance — Measured from Champion Model

Method: LightGBM feature_importances_ attribute. Values are split counts across all trees. Top 4 features account for 53.6% of total model importance. Two of four are novel derived features from [5]; two are Shannon entropy variants from [2].

Rank	Feature	Splits	Cumul. %	Tier	Source Paper
1	url_char_prob	2,697	15.5%	T2	[5] Eq. 1
2	entropy_domain	2,293	28.7%	T2	[2] F41
3	entropy_url	2,244	41.6%	T1	[2] F40
4	tld_legit_prob	2,081	53.6%	T2	[5] §3.1.4
5	char_continuation_rate	1,659	63.1%	T2	[5] §3.1.4
6	url_length	1,171	69.9%	T1	[1][2][3][4][5]
7	hostname_length	844	74.7%	T2	[1][4]
8	path_length	708	78.8%	T2	[1][2][4]
9	avg_subdomain_length	692	82.8%	T3	[2] F27
10	pct_numeric_chars	413	85.2%	T2	[4]
... remaining 18 active features account for 14.8% of importance. has_ip_address and double_slash_in_path contributed 0 splits and have been pruned.

Observation: Features traditionally considered essential — IpAddress, AtSymbol, DoubleSlash — contributed zero or near-zero importance. This aligns with [1] §5.5: "some frequently promoted features in existing phishing detection studies are not chosen as baseline features... phishers are employing new schemes to evade detection."

6. Model Selection

6.1 Candidates Tested

10 models evaluated. Selection driven by paper evidence. Data split: 70/15/15 stratified ([1] §5.2 uses 70/30; we carve validation from the 30%).

Model	Paper Rationale	Val MCC	Val Acc	Train Time	Single Latency
LightGBM	[5] Table 3: 99.99%, fastest training	0.99564	0.99784	3.8s	1,582 μs
VotingEnsemble	[5] §4.2.2: multi-model consensus	0.99555	0.99779	61.3s	62,462 μs
StackingEnsemble	[5] Table 3: 99.979%	0.99564	0.99784	163.1s	42,260 μs
XGBoost	[5] Table 3: 99.993% — highest in paper	0.99542	0.99773	3.7s	4,385 μs
CatBoost	[5] Table 3: 99.987%	0.99528	0.99766	10.0s	2,006 μs
RandomForest	[1] §5.3: consensus best. [5]: 99.982%	0.99506	0.99755	28.7s	55,746 μs
ExtraTrees	RF variant, random splits	0.99447	0.99726	16.6s	57,950 μs
DecisionTree	[1]: C4.5 at 94.37%	0.99349	0.99677	2.0s	1,323 μs
AdaBoost	[5] Table 3: 99.981%	0.98131	0.99069	39.0s	34,359 μs
LogisticRegression	[5] Table 3: 99.654%	0.96525	0.98260	2.8s	975 μs
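The specification gives only the ratios and the fact that the split is stratified. One way such a 70/15/15 stratified split could be produced with the standard library is sketched below; the function name, the seed, and the rounding of cut points are assumptions for illustration, not the benchmark's actual procedure.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists with per-class proportions kept."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        cut_train = round(n * fractions[0])
        cut_val = round(n * (fractions[0] + fractions[1]))
        train.extend(indices[:cut_train])
        val.extend(indices[cut_train:cut_val])
        test.extend(indices[cut_val:])
    return train, val, test


# Toy labels: 200 legitimate (0) and 300 phishing (1).
labels = [0] * 200 + [1] * 300
train_idx, val_idx, test_idx = stratified_split(labels)
```

Because each class is cut separately, every partition keeps the same positive rate as the full toy set (0.6 here), which is the property the benchmark relies on when comparing validation and test metrics.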

Ranking metric: MCC (Matthews Correlation Coefficient). [5] uses MCC as primary metric (Eq. 7). MCC accounts for all four confusion matrix quadrants. F1 ignores true negatives. Accuracy treats all errors equally. With 45.2/54.8 class split, MCC is the most informative single metric. Range: −1 to +1.

6.2 Why LightGBM over XGBoost

XGBoost achieved the highest accuracy in [5] Table 3 (99.993%). In our benchmark, LightGBM has the higher validation MCC (0.99564 vs 0.99542) and the lower single-inference latency (1,582 μs vs 4,385 μs); training time is effectively equal (3.8s vs 3.7s). StackingEnsemble ties LightGBM on validation MCC but is roughly 27× slower per inference (42,260 μs). LightGBM is therefore the better fit for a low-latency production service.

7. Final Results — Held-Out Test Set

Protocol: Test set touched exactly once. 44,896 URLs. Stratified. Positive rate: 0.5480. These are the official reported numbers.

7.1 Top 3 Models

Model	Accuracy	F1	MCC	AUC-ROC	Log Loss	FPR	FNR
LightGBM	0.99824	0.99839	0.99645	0.99929	0.01026	0.00099	0.00240
VotingEnsemble	0.99820	0.99835	0.99636	0.99931	0.01069	0.00103	0.00244
StackingEnsemble	0.99817	0.99833	0.99631	0.99936	0.01090	0.00118	0.00236

7.2 Champion: LightGBM

Classification Metrics

Accuracy	99.824%	79 errors out of 44,896
Precision	99.919%	Of flagged phishing, 99.919% correct
Recall	99.760%	Of actual phishing, 99.760% caught
F1 Score	99.839%	Harmonic mean of precision and recall
MCC	0.99645	Near-perfect across all quadrants
AUC-ROC	0.99929	Near-perfect class separation
Log Loss	0.01026	Low loss is consistent with well-calibrated probability estimates

Operational Metrics

FPR	0.099%	20 legit URLs blocked / 20,295
FNR	0.240%	59 phishing URLs missed / 24,601
Model size	0.8 MB	Compressed joblib
Training time	3.8 s	Kaggle 2-core CPU
Batch latency	7.6 μs/URL	Vectorized batch prediction
Single latency	1,582 μs/URL	Production API estimate
Feature extraction	~60,000 URLs/sec	Vectorized pandas ops

7.3 Confusion Matrix (Test Set)

	Predicted Legit	Predicted Phish
Actual Legit	20,275 (TN)	20 (FP)
Actual Phish	59 (FN)	24,542 (TP)
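The figures in Section 7.2 can be re-derived from the four confusion-matrix cells. The sketch below does so with the standard formulas; the `metrics` helper is a name introduced for this example.

```python
import math


def metrics(tn: int, fp: int, fn: int, tp: int) -> dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }


# Cells taken from the test-set confusion matrix above.
m = metrics(tn=20275, fp=20, fn=59, tp=24542)
```

Rounded to five decimals, these values match the reported accuracy, precision, recall, F1, MCC, FPR, and FNR for the champion model.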

8. Production Integration — SVC-03 Interface

8.1 Inference Flow

INPUT: Raw URL string from analysis.urls Kafka message. Example: https://secure-login.example.com/verify?token=abc123

FEATURE EXTRACTION: Extract 28 active features from URL string (30 defined; has_ip_address and double_slash_in_path excluded). No network calls. Uses pre-loaded lookup tables (CHAR_PROB_TABLE, TLD_LEGIT_PROB, SENSITIVE_WORDS). Parse URL components via regex. Compute entropy, char probabilities, continuation rate.

PREDICTION: Pass 28-feature vector to LightGBM. Returns predict_proba[:,1] as phishing probability (float 0.0–1.0). Binary label: probability ≥ 0.5 → phishing.

OUTPUT: ml_score: int 0–100 (probability × 100, rounded). Included in scores.url message per ARCH-SPEC-v2.1 Step 3a schema.
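The last two steps of the flow, from probability to binary label and to `ml_score`, might look like the sketch below. The function names are hypothetical. One assumption matters: Python's built-in `round` rounds exact halves to the nearest even integer, and the specification does not say which rounding rule the service uses.

```python
def to_label(probability: float, threshold: float = 0.5) -> int:
    """Binary label: 1 (phishing) when probability >= threshold."""
    return int(probability >= threshold)


def to_ml_score(probability: float) -> int:
    """Scale a phishing probability to an integer score in 0..100."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability out of range: {probability}")
    # Assumption: built-in round(); exact halves go to the even integer.
    return int(round(probability * 100))
```

A probability of 0.876 yields a score of 88 and the label 1; a probability of exactly 0.5 is labelled phishing because the comparison is inclusive.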

8.2 Confidence-Gated Routing

Follows [5] security profile design (§4.2.2). Different thresholds accommodate different risk tolerances. URLs in the uncertainty band are candidates for SVC-03's enrichment path (WHOIS, SSL, DNS) before final scoring.

Probability	Risk Level	SVC-03 Action
p ≥ 0.85	DANGEROUS	Score = ml_score. Skip enrichment. Emit immediately.
0.50 ≤ p < 0.85	SUSPICIOUS	Route to enrichment (WHOIS/SSL/DNS). Score = enriched model or ml_score.
0.30 ≤ p < 0.50	UNCERTAIN	Route to enrichment (WHOIS/SSL/DNS). Score = enriched model or ml_score.
p < 0.30	SAFE	Score = ml_score. Skip enrichment. Emit immediately.
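The routing table can be read as a simple threshold cascade. The sketch below assumes each band includes its lower bound, which matches the "probability ≥ 0.5" convention in Section 8.1; the `route` function and its return shape are illustrative.

```python
def route(probability: float) -> tuple[str, bool]:
    """Return (risk_level, needs_enrichment) for a phishing probability."""
    if probability >= 0.85:
        return ("DANGEROUS", False)   # emit immediately
    if probability >= 0.50:
        return ("SUSPICIOUS", True)   # enrichment path (WHOIS/SSL/DNS)
    if probability >= 0.30:
        return ("UNCERTAIN", True)    # enrichment path (WHOIS/SSL/DNS)
    return ("SAFE", False)            # emit immediately
```

Only the middle two bands trigger the enrichment path, so confident predictions at either end keep the low-latency fast path.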

8.3 Exported Artifacts

File	Size	Contents
`model.joblib`	0.8 MB	Trained LightGBM classifier (compress=3)
`config.json`	~50 KB	feature_names, char_prob_table, tld_legit_prob, sensitive_words
`metrics.json`	<1 KB	Test set performance metrics

9. Design Justifications

Decision	Rationale
URL-only features	[2] deliberately excluded content features "to optimize speed and responsiveness." [5] §3.1.2: URL features have high potential but cannot be sole defense — hence two-tier architecture. Content features reserved for enrichment path.
LightGBM over XGBoost	LightGBM has the highest test MCC (0.99645) and ties StackingEnsemble for the highest validation MCC (0.99564); XGBoost scores 0.99542 on validation. Single-inference: 1,582 μs vs 4,385 μs. Train: 3.8s vs 3.7s. Faster at inference with higher MCC than XGBoost.
MCC for ranking	[5] uses MCC as primary metric (Eq. 7). Accounts for all four confusion matrix quadrants. F1 ignores TN. Accuracy treats all errors equally. With 45/55 class split, MCC is most informative.
Strict hostname IP detection	Original regex matched against full URL string. False positive on `http://192.168.1.1.example.com`. Fix: anchored regex on parsed hostname only. Also guards TLD/domain/subdomain fallback parsers from treating IP octets as TLD components.
Raw delimiter query counting	`parse_qs` groups duplicate keys. `?id=1&id=2&id=3` returns length 1. [1] feature #11 counts total components. Raw `'&'` counting preserves this intent.
Occurrence-based sensitive words	`url.count(w)` not `w in url`. URL containing "login" twice (e.g. `login.site.com/login`) scores 2, not 1. Captures keyword stuffing.
Two source datasets only	DS2 (PhiUSIIL) and DS3 (LegitPhish) are the only datasets with raw URL strings and continuous encoding. Raw URLs required for consistent feature extraction. Pre-discretized datasets (DS1, DS6, DS7) cannot be combined. DS4 has no raw URLs and 47.6% duplicates.
70/15/15 split	[1] §5.2 uses 70/30. We carve validation from the 30% for model comparison. Test set touched exactly once.
tldextract dependency	Fallback TLD parser failed on bare domains from Cisco Umbrella file (urlparse requires scheme prefix). Installing `tldextract` resolved this. Feature `tld_legit_prob` jumped to rank #4 in importance with 2,081 splits.
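Two of the justifications above, strict hostname IP detection and raw delimiter query counting, are easy to check with the standard library. The sketch covers dotted-decimal addresses only; the hex-IP branch mentioned in F04 is omitted, and the function names are illustrative.

```python
import re
from urllib.parse import parse_qs, urlparse

IPV4_HOST = re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")   # anchored, hostname only
IPV4_ANYWHERE = re.compile(r"(\d{1,3}\.){3}\d{1,3}")  # unanchored, for contrast


def has_ip_address(url: str) -> int:
    """1 if the parsed hostname is a dotted-decimal IPv4 literal."""
    host = urlparse(url).hostname or ""
    return int(bool(IPV4_HOST.match(host)))


def num_query_params(url: str) -> int:
    """Count raw '&'-delimited components rather than unique keys."""
    query = urlparse(url).query
    return len(query.split("&")) if query else 0


tricky = "http://192.168.1.1.example.com"
matches_unanchored = bool(IPV4_ANYWHERE.search(tricky))  # the false positive
unique_keys = len(parse_qs("id=1&id=2&id=3"))            # parse_qs groups keys
```

The unanchored pattern matches the `192.168.1.1` prefix of the hostname, while the anchored hostname check rejects it. Likewise `parse_qs` collapses the three `id` parameters into one key, whereas the raw split reports three components.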

10. Limitations

#	Limitation	Impact	Mitigation
1	Training data from 2015–2023. Phishing tactics evolve.	Concept drift over time.	Periodic retraining. [5] incremental learning approach.
2	URL structure only. No page content analysis.	Sophisticated mimicry URLs evade detection.	Two-tier architecture. Enrichment path handles ambiguous cases.
3	English/Latin-script URL bias.	Non-Latin IDN URLs underrepresented.	Future dataset expansion. [4] acknowledges this.
4	has_ip_address and double_slash_in_path contributed 0 splits in the champion model.	Two features consumed memory and inference time without contributing signal.	Both pruned from active feature set in Phase 0 of the benchmark pipeline. 28 features now used at inference.
5	45.2/54.8 class split does not reflect real-world traffic.	Production prevalence is ~1% phishing. Threshold calibration needed.	Adjust decision threshold post-deployment using production data.
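Limitation 5 can be made concrete with a short calculation. Holding the test-set recall and false positive rate fixed, precision depends on how common phishing is in the traffic. The sketch below applies Bayes' rule; the 1% prevalence is the estimate quoted in the table, and the helper name is introduced for the example.

```python
def precision_at_prevalence(recall: float, fpr: float, prevalence: float) -> float:
    """Expected precision (PPV) for a given share of phishing in traffic."""
    true_positive_mass = recall * prevalence
    false_positive_mass = fpr * (1.0 - prevalence)
    return true_positive_mass / (true_positive_mass + false_positive_mass)


RECALL = 24542 / 24601   # from the test-set confusion matrix
FPR = 20 / 20295         # from the test-set confusion matrix

ppv_test = precision_at_prevalence(RECALL, FPR, 24601 / 44896)
ppv_prod = precision_at_prevalence(RECALL, FPR, 0.01)
```

At the test-set prevalence this reproduces the reported 99.919% precision. At 1% prevalence the same operating point would give roughly 91% precision, meaning about one in eleven flagged URLs would be legitimate, which is why the threshold should be recalibrated on production data.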

11. References

[1] K.L. Chiew, C.L. Tan, K. Wong, K.S.C. Yong, W.K. Tiong. "A new hybrid ensemble feature selection framework for machine learning-based phishing detection system." Information Sciences 484 (2019) 153–166. doi:10.1016/j.ins.2019.01.064
[2] M.A. Tamal, M.K. Islam, T. Bhuiyan, A. Sattar. "Dataset of suspicious phishing URL detection." Frontiers in Computer Science 6:1308634 (2024). doi:10.3389/fcomp.2024.1308634
[3] R.M. Mohammad, F. Thabtah, L. McCluskey. "An Assessment of Features Related to Phishing Websites using an Automated Technique." ICITST-2012, IEEE (2012).
[4] R.S. Potpelwar, U.V. Kulkarni, J.M. Waghmare. "LegitPhish: A large-scale annotated dataset for URL-based phishing detection." Data in Brief 63 (2025) 111972. doi:10.1016/j.dib.2025.111972
[5] A. Prasad, S. Chandra. "PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning." Computers & Security 136 (2024) 103545. doi:10.1016/j.cose.2023.103545