Add Synthetic Data Generation & Features to OSB Documentation #11480
base: main
Signed-off-by: Ian Hoang <[email protected]>
Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged. Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer. When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.
_benchmark/features/synthetic-data-generation/advanced/generating-vectors.md
#### `sample_vectors` (optional, highly recommended)
This is helpful, but it might be nice to include how to actually use sample vectors once they're obtained?
There's an explanation below this section, but I've clarified it in the next revision.
#### `distribution_type` (default: "gaussian")
same here, even just a small code snippet showing an example of using the flag or something
Each of these params has a sample YAML config beneath it. The config is simply provided as mentioned here:
The following are parameters that users can add to their SDG Config (YAML Config) to fine-tune generation of dense vectors.
Let me know if there's something else that can be added?
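To make the role of `sample_vectors` and a "gaussian" `distribution_type` more concrete, here is a rough conceptual sketch. It is not OSB's implementation and does not use the SDG config format; the function name, the `noise_std` parameter, and the "pick a sample vector, then add Gaussian noise" approach are all assumptions made purely for illustration of why seeding generation with real vectors tends to produce clustered output.

```python
import random


def synthetic_vectors(sample_vectors, count, noise_std=0.05, seed=None):
    """Illustrative only: draw `count` vectors clustered around the samples.

    Hypothetical helper, not part of OSB. Each output vector is one of the
    provided sample vectors plus independent Gaussian noise per dimension.
    """
    rng = random.Random(seed)
    generated = []
    for _ in range(count):
        # Pick a sample vector to act as the cluster center.
        center = rng.choice(sample_vectors)
        # Perturb every dimension with zero-mean Gaussian noise.
        generated.append([x + rng.gauss(0.0, noise_std) for x in center])
    return generated


if __name__ == "__main__":
    # Two made-up 3-dimensional "embeddings" used as cluster centers.
    samples = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.3]]
    for vec in synthetic_vectors(samples, count=3, seed=42):
        print([round(v, 3) for v in vec])
```

In this sketch, vectors stay close to the supplied samples, which is the kind of cluster structure the recall/precision use cases in the docs seem to call for; without samples, there would be nothing realistic to cluster around.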
| Production Simulation | Complex | Must match actual embedding model behavior |
| Search Quality Testing | Complex | Need proper vector clusters for recall/precision testing |
**Rule of thumb**: If you're testing search quality or comparing algorithms, use complex configuration with sample vectors.
pretty much the same note as above but for all these flags 😄
This documentation is for advanced options related to synthetic data generation. Read more in the following sections.
{% include cards.html cards=page.more_cards %}
what's this for?
This is the index page for the advanced section, and this line just lists all the related pages as cards.
* **Required**: Custom logic defined in Python module
* **Optional**: Synthetic Data Generation Config
could be nice to link to these sections (within the docs site or outside sources) if we have them readily available
There's a sample config below but I'll link it to the full config available in OSB
[NOTE] ✨ Dashboard link to monitor processes and task streams: [http://127.0.0.1:8787/status]
[NOTE] ✨ For users who are running generation on a virtual machine, consider SSH port forwarding (tunneling) to localhost to view dashboard.
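For the SSH tunneling note above, a minimal sketch of what the port-forwarding command could look like. It assumes the dashboard listens on port 8787 on the VM (as in the dashboard link) and that you can reach the VM over SSH; `user@vm-host` and the helper name are placeholders, not anything defined by OSB.

```python
import shlex


def dashboard_tunnel_command(remote, local_port=8787, dashboard_port=8787):
    """Build an ssh command that forwards a local port to the VM's dashboard.

    Hypothetical helper for illustration; `remote` is a placeholder such as
    "user@vm-host".
    """
    forward = f"{local_port}:127.0.0.1:{dashboard_port}"
    # -N: open the tunnel without running a remote command.
    # -L: forward local_port on this machine to 127.0.0.1:dashboard_port
    #     as seen from the VM.
    return ["ssh", "-N", "-L", forward, remote]


if __name__ == "__main__":
    # Print the command so it can be copied into a terminal.
    print(shlex.join(dashboard_tunnel_command("user@vm-host")))
```

Running the printed command on your local machine while generation is in progress should make the dashboard reachable at http://127.0.0.1:8787/status in a local browser, assuming the defaults above match your setup.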
is it worth showing a screenshot of this dashboard? so users can see what should appear for them when running this
* OpenSearch index mappings
* Custom logic (via Python module)
can we link to examples?
There are some examples at the bottom of this page. Let me know if you're looking for a different example.
Description
Adds synthetic data generation documentation to the OSB documentation. Also adds the 2.1.0 docs.
Issues Resolved
N/A
Version
All
Frontend features
N/A
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
E2E Testing:
