IanHoang (Contributor) commented Nov 3, 2025

Description

Adds synthetic data generation documentation to the OpenSearch Benchmark (OSB) documentation. Also adds the 2.1.0 docs.

Issues Resolved

N/A

Version

All

Frontend features

N/A

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

E2E Testing: [image]

github-actions bot commented Nov 3, 2025

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

Signed-off-by: Ian Hoang <[email protected]>
@kolchfa-aws added the "Tech review" (PR: Tech review in progress) and "backport 3.3" labels Nov 4, 2025

---

#### `sample_vectors` (optional, highly recommended)
Member

This is helpful, but it might be nice to include how to actually use sample vectors once they're obtained?

Contributor Author

There's an explanation below this section, but I've added clarity in the next revision.


---

#### `distribution_type` (default: "gaussian")
Member

same here, even just a small code snippet showing an example of using the flag or something

Contributor Author

Each of these params has a sample YAML config beneath it. The config is provided as mentioned here:

The following are parameters that users can add to their SDG Config (YAML Config) to fine-tune generation of dense vectors.

Let me know if there's something else that can be added?
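
For illustration only, here is a minimal Python sketch of what a `distribution_type` choice could mean for generated dense vectors. It is not OSB's implementation and does not reflect the SDG config schema: the function name `generate_vector`, the `"uniform"` option, and the value ranges are all hypothetical; only the `"gaussian"` default comes from the heading under review.

```python
import random


def generate_vector(dimension, distribution_type="gaussian", rng=None):
    """Return one dense vector as a list of floats.

    Hypothetical helper: the name, the "uniform" option, and the value
    ranges are assumptions made for this sketch, not OSB behavior.
    """
    if rng is None:
        rng = random.Random()
    if distribution_type == "gaussian":
        # Components drawn from a normal distribution (mean 0, std dev 1).
        return [rng.gauss(0.0, 1.0) for _ in range(dimension)]
    if distribution_type == "uniform":
        # Components drawn evenly from the interval [-1, 1].
        return [rng.uniform(-1.0, 1.0) for _ in range(dimension)]
    raise ValueError(f"unsupported distribution_type: {distribution_type!r}")


# A seeded generator makes the output reproducible across runs.
rng = random.Random(42)
gaussian_vec = generate_vector(8, "gaussian", rng)
uniform_vec = generate_vector(8, "uniform", rng)
```

A snippet of this kind next to each parameter would show readers the effect of the flag, which is what the review comment asks for.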

| Production Simulation | Complex | Must match actual embedding model behavior |
| Search Quality Testing | Complex | Need proper vector clusters for recall/precision testing |

**Rule of thumb**: If you're testing search quality or comparing algorithms, use complex configuration with sample vectors.
Member

pretty much the same note as above but for all these flags 😄


This documentation is for advanced options related to synthetic data generation. Read more in the following sections.

{% include cards.html cards=page.more_cards %}
Member

what's this for?

Contributor Author

This is the index page for the advanced section, and this line just lists all the related pages as cards.

Comment on lines +19 to +20
* **Required**: Custom logic defined in Python module
* **Optional**: Synthetic Data Generation Config
Member

could be nice to link to these sections (within the docs site or outside sources) if we have them readily available

Contributor Author

There's a sample config below but I'll link it to the full config available in OSB



[NOTE] ✨ Dashboard link to monitor processes and task streams: [http://127.0.0.1:8787/status]
[NOTE] ✨ For users who are running generation on a virtual machine, consider SSH port forwarding (tunneling) to localhost to view dashboard.
Member

is it worth showing a screenshot of this dashboard? so users can see what should appear for them when running this
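
As a companion to the port-forwarding note, here is a small Python sketch that builds the `ssh` command for tunneling the dashboard port to localhost. The port 8787 comes from the dashboard link above; the user name, host name, and the helper `build_tunnel_command` are placeholders invented for this example.

```python
import shlex


def build_tunnel_command(user, remote_host, port=8787):
    """Build an ssh command that forwards a local port to the same port on the VM.

    -N: do not run a remote command, only forward ports.
    -L: forward local <port> to 127.0.0.1:<port> as seen from the remote host.
    """
    forward = f"{port}:127.0.0.1:{port}"
    return ["ssh", "-N", "-L", forward, f"{user}@{remote_host}"]


# Placeholder user and host; substitute the values for your own VM.
command = build_tunnel_command("ec2-user", "my-vm.example.com")
print(shlex.join(command))
# prints: ssh -N -L 8787:127.0.0.1:8787 ec2-user@my-vm.example.com
```

With the tunnel running, opening http://127.0.0.1:8787/status in a local browser should reach the dashboard served on the VM.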

Comment on lines +12 to +13
* OpenSearch index mappings
* Custom logic (via Python module)
Member

can we link to examples?

Contributor Author

There are some examples at the bottom of this page. Let me know if you're looking for a different example?
