## Goal
Create a pipeline to quickly generate high-quality, diverse, realistic benchmarking and ML training data for secret scanners.
## Rough Idea
- Build up a repository of real private leaks to baseline this process
- Set up a benchmark system over the private data, using a collection of scanners with the SIG's patterns
- Experiment with ML-only scanning and feature extraction on the private data to baseline that process
- Set up a Synthetic Data Generation (SDG) pipeline to generate synthetic variants¹ of the private leaks
- Train the ML model on the synthetic data, and check whether the extracted features have a distribution similar to that of the private data
- Compare the quality of things built on the synthetic data against the private data (through things like cross-validation and hunting for new leaks)
- Publish the process and build up a large set of training data for folks to use
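The distribution check in the steps above could be prototyped with a simple summary-statistic comparison. This is only a sketch with illustrative names and thresholds; a real evaluation would more likely use a proper two-sample test such as Kolmogorov–Smirnov:

```python
from statistics import mean, stdev

def similar_distribution(real: list[float], synthetic: list[float],
                         tolerance: float = 0.25) -> bool:
    """Crude check that two numeric feature samples look alike.

    Compares means and standard deviations within a relative tolerance.
    The 25% tolerance is an arbitrary placeholder, not a recommendation.
    """
    def close(a: float, b: float) -> bool:
        return abs(a - b) <= tolerance * max(abs(a), abs(b), 1e-9)

    return close(mean(real), mean(synthetic)) and close(
        stdev(real), stdev(synthetic)
    )

# e.g. token lengths extracted from real vs. synthetic leaks
real_lengths = [50, 50, 50, 51, 50]
synthetic_lengths = [50, 50, 51, 50, 50]
print(similar_distribution(real_lengths, synthetic_lengths))  # prints True
```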
## Initial SDG Process Idea
Secrets themselves often have hidden structure that isn't documented, so for each secret type we create custom code that takes as much of that structure into account as possible. Such code may still miss some hidden structure, but as an example, for a Notion API token:
```python
import string
import random

prefix = "ntn_"
alphanum = string.ascii_letters + string.digits
token = "".join(
    (
        prefix,
        "".join(random.choices(string.digits, k=11)),
        "".join(random.choices(alphanum, k=35)),
    )
)
print(token)
```

To generate something like this:

```
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
```
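Based on the structure assumed by the snippet above ("ntn_", then 11 digits, then 35 alphanumerics), generated tokens can be sanity-checked against a regex before use. Note the pattern below is derived from the snippet, not from any official Notion specification:

```python
import random
import re
import string

# Pattern derived from the generator above: "ntn_", 11 digits, 35 alphanumerics.
# This is an assumption about Notion's token format, not an official spec.
NOTION_TOKEN_RE = re.compile(r"^ntn_\d{11}[A-Za-z0-9]{35}$")

def gen_notion_token() -> str:
    alphanum = string.ascii_letters + string.digits
    return (
        "ntn_"
        + "".join(random.choices(string.digits, k=11))
        + "".join(random.choices(alphanum, k=35))
    )

# Every generated token should match the expected shape.
for _ in range(1000):
    assert NOTION_TOKEN_RE.fullmatch(gen_notion_token())
```

A check like this could double as a smoke test that the generator stays aligned with the SIG's scanner pattern for each secret type.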
Then, for the context generation, it's important both to avoid license issues and to prevent anyone from using code comments, method names, etc. to search for the original leak. I believe this process should solve that (please correct me if you see any issues):
```python
synthetic_secret = gen_synthetic_secret(secret_type)
important_features = extract_important_features(private_leak)
description = llm(describe_prompt, private_leak)
synthetic_leak = llm(generate_prompt, description, important_features)
```

Where:
- `gen_synthetic_secret` is the secret-specific generating code, like the Notion snippet above.
- `extract_important_features` is custom code to pull out things we want to note about the source leak, like language, maybe file size, and other parameters that would be passed to an LLM to shape the output, while staying generic enough to avoid copyright issues (e.g. "a Python file with over 500 lines of code" is too generic to be copyrighted).
- `describe_prompt` would be a prompt asking the LLM to do something similar to `extract_important_features`, but capturing some higher-level features. It is important to get this prompt right and make sure it doesn't include any of the original content in the description. It might be safer to drop this and lean only on `extract_important_features` if that is enough, but being able to get some higher-level context would be nice. There would need to be a review process.
- `generate_prompt` would be the prompt to generate the synthetic data given the description and important features².
I would appreciate it if any SIG members with access to lawyers at their company could run this by them. I'll try to do the same³.
## Footnotes

1. The synthetic variants must NOT make it possible to find the original leak from the output, and must be produced in a way that avoids license issues.

2. Example prompt that I've had good results with in my limited testing:

   ````
   Role and Goal: You are an expert code generator specializing in creating realistic code snippets for cybersecurity testing. Your purpose is to help security professionals test the efficacy of their secret scanning tools.

   Core Task: Generate a high-quality, realistic code snippet based on the user's prompt. This snippet must contain a specific, synthetic (fake) credential provided by the user. The code should look authentic, as if a developer mistakenly hardcoded a secret and committed it to a repository.

   Critical Instructions:

   Plausibility is Key: The code must be a plausible example for the requested language and task. The placement of the fake credential should be natural (e.g., in a variable assignment, a configuration object, a connection string, etc.).

   Exact Credential: You MUST use the exact fake credential string provided by the user in the {{FAKE_CREDENTIAL}} field. Do not alter it.

   No Disclaimers in Code: Do NOT add comments or warnings like "// Do not hardcode secrets" or "// For testing purposes only" inside the generated code block. The output should purely simulate the vulnerable code.

   Format: Present the final output as a single, clean markdown code block with the correct language identifier (e.g., ```python).

   User Input:

   Prompt: {{PROMPT}}
   Fake Credential: {{FAKE_CREDENTIAL}}

   Example Usage: Here is an example of how a user would fill in the variables:

   Prompt: A simple Python script using the requests library to make a GET request to an API that requires a bearer token for authentication.
   Fake Credential: ghp_aBcDeFgHiJkLmNoPqRsTuVwXyZ1234567890
   ````

3. Any decisions about final synthetic data not being within the scope of copyright are the responsibility of the SIG member that submitted it, not their parent company. Any opinion expressed by a company's lawyer is only that, an opinion, and is not the official position of that company. Our goal is to be good stewards of the data we are reviewing, and we plan to use only a description of it and none of the original code. It is my understanding that simply saying a file is a Go file with 300 lines and 5 methods does not constitute a derivative work, since the generated code will not contain any of the original nor serve a similar use.