Commit f544390

Merge pull request #21 from linuxfoundation/andrest50/reindex-script
[LFXV2-567] Add data migration script for access query fields
2 parents 8765755 + 70c809e commit f544390

3 files changed: +702 -0 lines changed

pkg/env/env.go

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
// Copyright The Linux Foundation and each contributor to LFX.
// SPDX-License-Identifier: MIT

// Package env provides utilities for reading environment variables with type conversion and default values.
package env

import (
	"os"
	"strconv"
	"time"
)

// GetString returns the value of the environment variable or the default value if not set or empty.
func GetString(key, defaultValue string) string {
	if value := os.Getenv(key); value != "" {
		return value
	}
	return defaultValue
}

// GetInt returns the value of the environment variable as an integer or the default value if not set, empty, or invalid.
func GetInt(key string, defaultValue int) int {
	if value := os.Getenv(key); value != "" {
		if intValue, err := strconv.Atoi(value); err == nil {
			return intValue
		}
	}
	return defaultValue
}

// GetBool returns the value of the environment variable as a boolean or the default value if not set, empty, or invalid.
func GetBool(key string, defaultValue bool) bool {
	if value := os.Getenv(key); value != "" {
		if boolValue, err := strconv.ParseBool(value); err == nil {
			return boolValue
		}
	}
	return defaultValue
}

// GetDuration returns the value of the environment variable as a time.Duration or the default value if not set, empty, or invalid.
func GetDuration(key string, defaultValue time.Duration) time.Duration {
	if value := os.Getenv(key); value != "" {
		if duration, err := time.ParseDuration(value); err == nil {
			return duration
		}
	}
	return defaultValue
}
Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
# Access Query Fields Migration Script

This script migrates existing OpenSearch documents by adding the `access_check_query` and `history_check_query` fields introduced in [PR #20](https://github.com/linuxfoundation/lfx-v2-indexer-service/pull/20).

## Background

The LFX indexer service was recently updated to include new document fields used for Fine-Grained Authorization (FGA) of documents via the [query service](https://github.com/linuxfoundation/lfx-v2-query-service):

- `access_check_query`: `access_check_object` + "#" + `access_check_relation`
- `history_check_query`: `history_check_object` + "#" + `history_check_relation`

Newly indexed documents get these fields automatically (implemented in [PR #20](https://github.com/linuxfoundation/lfx-v2-indexer-service/pull/20)), but documents indexed before that change must be migrated.

## Usage

### Basic Usage

```bash
# Run in dry-run mode to see what would be changed
DRY_RUN=true go run scripts/migration/001_add_access_query_fields/main.go

# Run the actual migration
go run scripts/migration/001_add_access_query_fields/main.go
```

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `OPENSEARCH_URL` | `http://localhost:9200` | OpenSearch cluster URL |
| `OPENSEARCH_INDEX` | `resources` | Target index name |
| `BATCH_SIZE` | `100` | Number of documents to process per batch |
| `DRY_RUN` | `false` | If true, only log what would be updated without making changes |
| `SCROLL_TIMEOUT` | `5m` | Scroll context timeout |
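
As a rough illustration of how the variables in the table map onto typed settings, here is a minimal sketch. It is not the script's actual code: the `Config` type and `LoadConfig` function are hypothetical names, and the lookup function is injected so the example can run without touching the real environment. The fallback behaviour (unset, empty, or unparsable values yield the default) mirrors the `pkg/env` helpers added in this commit.

```go
package main

import (
	"strconv"
	"time"
)

// Config holds the settings from the table above.
// The type and field names are hypothetical, chosen for this sketch.
type Config struct {
	OpenSearchURL string
	Index         string
	BatchSize     int
	DryRun        bool
	ScrollTimeout time.Duration
}

// LoadConfig starts from the documented defaults and overrides each one
// only when getenv returns a value that parses cleanly.
// Pass os.Getenv as getenv to read the real environment.
func LoadConfig(getenv func(string) string) Config {
	cfg := Config{
		OpenSearchURL: "http://localhost:9200",
		Index:         "resources",
		BatchSize:     100,
		DryRun:        false,
		ScrollTimeout: 5 * time.Minute,
	}
	if v := getenv("OPENSEARCH_URL"); v != "" {
		cfg.OpenSearchURL = v
	}
	if v := getenv("OPENSEARCH_INDEX"); v != "" {
		cfg.Index = v
	}
	// An empty or malformed value makes the parser return an error,
	// in which case the default is kept.
	if n, err := strconv.Atoi(getenv("BATCH_SIZE")); err == nil {
		cfg.BatchSize = n
	}
	if b, err := strconv.ParseBool(getenv("DRY_RUN")); err == nil {
		cfg.DryRun = b
	}
	if d, err := time.ParseDuration(getenv("SCROLL_TIMEOUT")); err == nil {
		cfg.ScrollTimeout = d
	}
	return cfg
}
```

Calling `LoadConfig(os.Getenv)` would read the process environment. Note that, as with the `pkg/env` helpers, an invalid value such as `BATCH_SIZE=abc` falls back to the default silently rather than failing.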

## Safety Features

- **Dry Run Mode**: Use `DRY_RUN=true` to preview changes without applying them
- **Idempotent**: Safe to run multiple times; documents that already have the new fields are skipped
- **Graceful Shutdown**: Responds to SIGINT/SIGTERM signals
- **Progress Tracking**: Shows detailed progress and statistics
- **Error Handling**: Continues processing even if individual batches fail
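
The graceful-shutdown and continue-on-error behaviour listed above could be structured roughly as follows. This is a sketch under assumptions, not the script's actual code: `shutdownContext`, `runBatches`, `runStats`, and `errBulkRejected` are hypothetical names, and batches are reduced to slices of document IDs. `signal.NotifyContext` is the standard-library call (Go 1.16+) that cancels a context when one of the listed signals arrives.

```go
package main

import (
	"context"
	"errors"
	"os"
	"os/signal"
	"syscall"
)

// errBulkRejected is a hypothetical sentinel standing in for a failed
// bulk request; a real apply function would return its own errors.
var errBulkRejected = errors.New("bulk request rejected")

// runStats is a hypothetical summary of one run.
type runStats struct {
	Succeeded   int
	Failed      int
	Interrupted bool
}

// shutdownContext returns a context that is cancelled on SIGINT or SIGTERM.
// The caller should invoke the returned stop function when done.
func shutdownContext() (context.Context, context.CancelFunc) {
	return signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
}

// runBatches applies each batch in order. A failed batch is counted and
// skipped so later batches still run; a cancelled context stops the loop
// before the next batch starts.
func runBatches(ctx context.Context, batches [][]string, apply func(ids []string) error) runStats {
	var stats runStats
	for _, batch := range batches {
		select {
		case <-ctx.Done():
			stats.Interrupted = true
			return stats
		default:
		}
		if err := apply(batch); err != nil {
			stats.Failed++
			continue
		}
		stats.Succeeded++
	}
	return stats
}
```

Checking the context between batches, rather than aborting mid-batch, means an interrupted run never leaves a bulk request half-sent; because the migration is idempotent, re-running it picks up whatever was left.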

## Migration Logic

The script:

1. Searches for documents that have access control fields but are missing the new query fields
2. For each document, constructs the query fields only if both object and relation are non-empty
3. Updates documents in batches using the OpenSearch bulk API
4. Provides detailed statistics and progress reporting
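
To make step 3 concrete, the following sketch builds the newline-delimited body that the OpenSearch bulk API expects for partial updates: one action line followed by one `doc` line per document, each terminated by a newline. The `DocUpdate` type and `BuildBulkBody` function are hypothetical names for illustration; the script's own implementation may differ.

```go
package main

import (
	"bytes"
	"encoding/json"
)

// bulkAction is the action line of a bulk "update" operation.
type bulkAction struct {
	Update bulkTarget `json:"update"`
}

// bulkTarget identifies the document the action applies to.
type bulkTarget struct {
	Index string `json:"_index"`
	ID    string `json:"_id"`
}

// partialDoc wraps the fields to merge into the existing document.
type partialDoc struct {
	Doc map[string]string `json:"doc"`
}

// DocUpdate is a hypothetical pairing of a document ID with the new
// fields to set on it.
type DocUpdate struct {
	ID     string
	Fields map[string]string
}

// BuildBulkBody renders updates as NDJSON for POST /_bulk.
// Updates with no fields are left out of the request.
func BuildBulkBody(index string, updates []DocUpdate) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode writes a trailing newline after each value
	for _, u := range updates {
		if len(u.Fields) == 0 {
			continue
		}
		if err := enc.Encode(bulkAction{Update: bulkTarget{Index: index, ID: u.ID}}); err != nil {
			return nil, err
		}
		if err := enc.Encode(partialDoc{Doc: u.Fields}); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}
```

Using the `update` action with a `doc` payload merges the new fields into each document instead of replacing it, so the existing content is left untouched.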

### Query Construction Rules

- `access_check_query` is created only if both `access_check_object` and `access_check_relation` are non-empty
- `history_check_query` is created only if both `history_check_object` and `history_check_relation` are non-empty
- Format: `{object}#{relation}` (e.g., `committee:abc123#viewer`)
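
The rules above can be sketched in a few lines of Go. The function and type names (`BuildCheckQuery`, `Document`, `PlanUpdate`) are hypothetical, and the per-field skip in `PlanUpdate` is an assumption about how idempotency might be achieved, not a description of the script's code.

```go
package main

// BuildCheckQuery joins object and relation as "{object}#{relation}".
// It reports false when either part is empty, in which case no query
// field should be written.
func BuildCheckQuery(object, relation string) (string, bool) {
	if object == "" || relation == "" {
		return "", false
	}
	return object + "#" + relation, true
}

// Document is a hypothetical view of the fields relevant to the migration.
type Document struct {
	AccessCheckObject    string
	AccessCheckRelation  string
	AccessCheckQuery     string
	HistoryCheckObject   string
	HistoryCheckRelation string
	HistoryCheckQuery    string
}

// PlanUpdate returns the query fields that still need to be set.
// Fields that are already populated are left alone (assumed behaviour),
// so running it twice on the same document yields an empty plan.
func PlanUpdate(d Document) map[string]string {
	fields := map[string]string{}
	if d.AccessCheckQuery == "" {
		if q, ok := BuildCheckQuery(d.AccessCheckObject, d.AccessCheckRelation); ok {
			fields["access_check_query"] = q
		}
	}
	if d.HistoryCheckQuery == "" {
		if q, ok := BuildCheckQuery(d.HistoryCheckObject, d.HistoryCheckRelation); ok {
			fields["history_check_query"] = q
		}
	}
	return fields
}
```

For example, `BuildCheckQuery("committee:abc123", "viewer")` yields `committee:abc123#viewer`, while an empty relation yields no query at all.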

## Example Output

```text
Starting access query fields migration...
=== Migration Configuration ===
OpenSearch URL: http://opensearch:9200
Index Name: resources
Batch Size: 100
Dry Run: false
Scroll Timeout: 5m0s
==============================
✓ Connected to OpenSearch successfully
Searching for documents that need migration...
Found 1250 documents that may need migration

Processing batch 1 (100 documents)...
Progress: 100/1250 documents (8.0%)

Processing batch 2 (100 documents)...
Progress: 200/1250 documents (16.0%)
...

=== Migration Statistics ===
Total Documents Found: 1250
Documents Processed: 1250
Documents Updated: 987
Documents Skipped: 263
Documents with Errors: 0
Duration: 45.6s
Processing Rate: 27.4 docs/sec
============================

✓ Migration completed successfully!
```
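
The percentage and rate figures in the sample output follow from simple arithmetic: progress is processed divided by total, and the rate is processed documents divided by elapsed seconds. A small sketch, with hypothetical function names, that reproduces those lines:

```go
package main

import "fmt"

// ProgressLine formats a progress message in the style of the sample output.
// A zero total is reported as 0.0% to avoid dividing by zero.
func ProgressLine(processed, total int) string {
	pct := 0.0
	if total > 0 {
		pct = float64(processed) / float64(total) * 100
	}
	return fmt.Sprintf("Progress: %d/%d documents (%.1f%%)", processed, total, pct)
}

// RateLine formats the throughput as documents per second.
func RateLine(processed int, elapsedSeconds float64) string {
	rate := 0.0
	if elapsedSeconds > 0 {
		rate = float64(processed) / elapsedSeconds
	}
	return fmt.Sprintf("Processing Rate: %.1f docs/sec", rate)
}
```

With the numbers shown above, `ProgressLine(100, 1250)` gives 8.0% and `RateLine(1250, 45.6)` gives 27.4 docs/sec.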

## Troubleshooting

### Connection Issues

- Verify OpenSearch is running and accessible
- Check the `OPENSEARCH_URL` environment variable
- Ensure network connectivity and authentication if required

### Performance Tuning

- Increase `BATCH_SIZE` to reduce the number of bulk requests when processing large datasets
- Increase `SCROLL_TIMEOUT` if a batch takes long enough that the scroll context expires before the next batch is fetched
- Monitor OpenSearch cluster performance during migration

### Partial Failures

- The script continues processing even if individual batches fail
- Check the error logs for specific failure reasons
- Re-run the script to retry failed documents (it's idempotent)

## Testing

Always test in a non-production environment first:

1. Run with `DRY_RUN=true` to preview changes
2. Test with a small `BATCH_SIZE` initially
3. Verify the query fields are constructed correctly
4. Check that no data is corrupted

## Technical Details

- Uses the OpenSearch scroll API for efficient processing of large result sets
- Applies changes with bulk updates rather than one request per document
- Fetches only the necessary fields to minimize network transfer
- Handles SIGINT/SIGTERM for graceful shutdown
- Tracks errors and statistics throughout the run
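
To illustrate the scroll search and field filtering mentioned above, here is a sketch of a search body that selects documents having a check object but no corresponding query field, and restricts `_source` to the six fields involved. The exact query the script sends is not shown in this README, so the structure below is an assumption; `missingQuery`, `SearchBody`, and `SearchBodyJSON` are hypothetical names. Such a body would typically be sent to `POST /{index}/_search?scroll=5m`, with the scroll duration taken from `SCROLL_TIMEOUT`.

```go
package main

import "encoding/json"

// missingQuery matches documents where sourceField exists but queryField
// does not, using the query DSL's bool and exists clauses.
func missingQuery(sourceField, queryField string) map[string]any {
	return map[string]any{
		"bool": map[string]any{
			"must": []any{
				map[string]any{"exists": map[string]any{"field": sourceField}},
			},
			"must_not": []any{
				map[string]any{"exists": map[string]any{"field": queryField}},
			},
		},
	}
}

// SearchBody builds the initial scroll request body. batchSize becomes the
// page size, and _source limits each hit to the fields the migration reads.
func SearchBody(batchSize int) map[string]any {
	return map[string]any{
		"size": batchSize,
		"_source": []string{
			"access_check_object", "access_check_relation", "access_check_query",
			"history_check_object", "history_check_relation", "history_check_query",
		},
		"query": map[string]any{
			"bool": map[string]any{
				"should": []any{
					missingQuery("access_check_object", "access_check_query"),
					missingQuery("history_check_object", "history_check_query"),
				},
				"minimum_should_match": 1,
			},
		},
	}
}

// SearchBodyJSON serializes the body for use in an HTTP request.
func SearchBodyJSON(batchSize int) ([]byte, error) {
	return json.Marshal(SearchBody(batchSize))
}
```

Filtering on the server keeps already-migrated documents out of the scroll, which is one way a re-run could stay cheap.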
