## Goal
Create a pipeline to quickly generate high-quality, diverse, realistic benchmarking and ML training data for secret scanners.
## Rough Idea
- Build up a repository of real private leaks to baseline this process
- Set up a benchmark system over the private data, using a collection of scanners with the SIG's patterns
- Experiment with ML-only scanning and feature extraction on the private data to baseline that process
- Set up a Synthetic Data Generation (SDG) pipeline to generate synthetic variants¹ of the private leaks
- Train the ML model on the synthetic data, and check whether the extracted features have a distribution similar to that of the private data
- Compare the quality of things built on the synthetic data against the private data (through things like cross-validation and hunting for new leaks)
- Publish the process and build up a large set of training data for folks to use
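The distribution check in the steps above could be prototyped with a simple summary-statistic comparison. This is only a sketch with illustrative names and thresholds; a real evaluation would more likely use a proper two-sample test such as Kolmogorov–Smirnov:

```python
from statistics import mean, stdev

def similar_distribution(real: list[float], synthetic: list[float],
                         tolerance: float = 0.25) -> bool:
    """Crude check that two numeric feature samples look alike.

    Compares means and standard deviations within a relative tolerance.
    The 25% tolerance is an arbitrary placeholder, not a recommendation.
    """
    def close(a: float, b: float) -> bool:
        return abs(a - b) <= tolerance * max(abs(a), abs(b), 1e-9)

    return close(mean(real), mean(synthetic)) and close(
        stdev(real), stdev(synthetic)
    )

# e.g. token lengths extracted from real vs. synthetic leaks
real_lengths = [50, 50, 50, 51, 50]
synthetic_lengths = [50, 50, 51, 50, 50]
print(similar_distribution(real_lengths, synthetic_lengths))  # prints True
```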
## Initial SDG Process Idea
Secrets themselves often have hidden structure that isn't documented, so for each secret type we create custom code that takes as much of that structure into account as possible. Such code may still miss some hidden structure, but as an example, for a Notion API token:
```python
import string
import random

prefix = "ntn_"
alphanum = string.ascii_letters + string.digits
token = "".join(
    (
        prefix,
        "".join(random.choices(string.digits, k=11)),
        "".join(random.choices(alphanum, k=35)),
    )
)
print(token)
```

To generate something like this:

```
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
```
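Based on the structure assumed by the snippet above ("ntn_", then 11 digits, then 35 alphanumerics), generated tokens can be sanity-checked against a regex before use. Note the pattern below is derived from the snippet, not from any official Notion specification:

```python
import random
import re
import string

# Pattern derived from the generator above: "ntn_", 11 digits, 35 alphanumerics.
# This is an assumption about Notion's token format, not an official spec.
NOTION_TOKEN_RE = re.compile(r"^ntn_\d{11}[A-Za-z0-9]{35}$")

def gen_notion_token() -> str:
    alphanum = string.ascii_letters + string.digits
    return (
        "ntn_"
        + "".join(random.choices(string.digits, k=11))
        + "".join(random.choices(alphanum, k=35))
    )

# Every generated token should match the expected shape.
for _ in range(1000):
    assert NOTION_TOKEN_RE.fullmatch(gen_notion_token())
```

A check like this could double as a smoke test that the generator stays aligned with the SIG's scanner pattern for each secret type.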
Then, for the context generation, it's important both to avoid license issues and to prevent anyone from using code comments, method names, etc. to search for the original leak. I believe this process should solve that (please correct me if you see any issues):
```python
synthetic_secret = gen_synthetic_secret(secret_type)
important_features = extract_important_features(private_leak)
description = llm(describe_prompt, private_leak)
synthetic_leak = llm(generate_prompt, description, important_features)
```

Where:
- `gen_synthetic_secret` is the secret-specific generating code, like the Notion snippet above.
- `extract_important_features` is custom code to pull out things we want to note about the source leak, like language, maybe file size, and other parameters that would be passed to an LLM to shape the output, while staying generic enough to avoid copyright issues (e.g. "a Python file with over 500 lines of code" is too generic to be copyrighted).
- `describe_prompt` would be a prompt asking the LLM to do something similar to `extract_important_features`, but capturing some higher-level features. It is important to get this prompt right and make sure it doesn't include any of the original content in the description. It might be safer to drop this and lean only on `extract_important_features` if that is enough, but being able to get some higher-level context would be nice. There would need to be a review process.
- `generate_prompt` would be the prompt to generate the synthetic data given the description and important features².
I would appreciate it if any SIG members with access to lawyers at their company could run this by them. I'll try to do the same³.
## Footnotes

1. The synthetic variants must NOT make it possible to find the original leak from the output, and must be produced in a way that avoids license issues.

2. Example prompt that I've had good results with in my limited testing:

   ````
   Role and Goal: You are an expert code generator specializing in creating realistic code snippets for cybersecurity testing. Your purpose is to help security professionals test the efficacy of their secret scanning tools.

   Core Task: Generate a high-quality, realistic code snippet based on the user's prompt. This snippet must contain a specific, synthetic (fake) credential provided by the user. The code should look authentic, as if a developer mistakenly hardcoded a secret and committed it to a repository.

   Critical Instructions:

   Plausibility is Key: The code must be a plausible example for the requested language and task. The placement of the fake credential should be natural (e.g., in a variable assignment, a configuration object, a connection string, etc.).

   Exact Credential: You MUST use the exact fake credential string provided by the user in the {{FAKE_CREDENTIAL}} field. Do not alter it.

   No Disclaimers in Code: Do NOT add comments or warnings like "// Do not hardcode secrets" or "// For testing purposes only" inside the generated code block. The output should purely simulate the vulnerable code.

   Format: Present the final output as a single, clean markdown code block with the correct language identifier (e.g., ```python).

   User Input:

   Prompt: {{PROMPT}}
   Fake Credential: {{FAKE_CREDENTIAL}}

   Example Usage: Here is an example of how a user would fill in the variables:

   Prompt: A simple Python script using the requests library to make a GET request to an API that requires a bearer token for authentication.
   Fake Credential: ghp_aBcDeFgHiJkLmNoPqRsTuVwXyZ1234567890
   ````

3. Any decisions about final synthetic data not being within the scope of copyright are the responsibility of the SIG member that submitted it, not their parent company. Any opinion expressed by a company's lawyer is only that, an opinion, and is not the official position of that company. Our goal is to be good stewards of the data we are reviewing, and we plan to use only a description of it and none of the original code. It is my understanding that simply saying a file is a Go file with 300 lines and 5 methods does not constitute a derivative work, since the generated code will not contain any of the original nor serve a similar use.