@ubaskota (Contributor) commented Nov 4, 2025

Description of changes:
Use os.replace() for atomic file operations to eliminate race conditions during concurrent access to the file cache. This ensures that writes are either fully completed or not applied at all, preventing partial writes that could leave the JSON file in an invalid state when multiple processes access the cache simultaneously.
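
As a rough illustration of the approach described above (not the code from this PR), the sketch below writes JSON to a temporary file in the same directory and then swaps it into place with os.replace(). The function name `atomic_write_json` and the `.tmp` suffix are hypothetical and chosen only for this example.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write data as JSON so readers see the old file or the new one, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target: os.replace() is only atomic
    # when source and destination are on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        # Swap the fully written file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Best-effort removal of the leftover temp file, then re-raise.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

If serialization fails halfway, only the temporary file is affected, so an existing cache file at `path` keeps its previous, valid contents.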

Note: This change is based on the approach suggested in #3544 by @ranlz77. Thanks for identifying this issue and proposing the initial solution.

Tests:
I verified that all tests, including the newly added unit tests, pass.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov-commenter commented Nov 4, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.10%. Comparing base (8121342) to head (2b225c5).
⚠️ Report is 57 commits behind head on develop.

Files with missing lines Patch % Lines
botocore/utils.py 50.00% 12 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3586      +/-   ##
===========================================
- Coverage    93.17%   93.10%   -0.07%     
===========================================
  Files           68       68              
  Lines        15411    15432      +21     
===========================================
+ Hits         14359    14368       +9     
- Misses        1052     1064      +12     

Comment on lines +3585 to +3586
if hasattr(os, 'fchmod'):
    os.fchmod(temp_fd, 0o600)
Contributor

This shouldn't be necessary – mkstemp's documentation says "The file is readable and writable only by the creating user ID.".
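
One way to check this claim locally is to inspect the mode of a freshly created file. This is a small sketch; the helper name `mkstemp_mode` is made up for the example, and the permission bits are only meaningful on POSIX systems.

```python
import os
import stat
import tempfile


def mkstemp_mode():
    """Return the permission bits of a file freshly created by mkstemp()."""
    fd, path = tempfile.mkstemp()
    try:
        # S_IMODE strips the file-type bits and keeps only the permissions.
        return stat.S_IMODE(os.fstat(fd).st_mode)
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    # On a typical POSIX system this is expected to show owner-only access.
    print(oct(mkstemp_mode()))
```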

Contributor Author

Yes, you are right. This isn't needed. Thanks for pointing it out.

if hasattr(os, 'fchmod'):
    os.fchmod(temp_fd, 0o600)
with os.fdopen(temp_fd, 'w') as f:
    temp_fd = None
Contributor

os.fdopen is an alias of open. Exiting the context manager (for example when unwinding from an exception) should close the underlying fd, so I'm not sure the temp_fd = None + cleanup dance in except Exception below is necessary?

Contributor Author

You're correct that os.fdopen() takes ownership of the file descriptor and that the context manager will close it automatically. However, the cleanup code handles exceptions raised before os.fdopen() and prevents a double close after it. The exception handler also removes the orphaned temporary file from disk in both cases. For example, if an exception is raised before os.fdopen() (say os.fchmod() fails), temp_fd is still open and must be closed manually to avoid leaking the file descriptor.
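
The ownership hand-off described in this reply can be sketched roughly as follows. This is an illustration of the pattern under discussion, not the PR's actual code; the function name `write_via_temp` and its parameters are hypothetical.

```python
import os
import tempfile


def write_via_temp(directory, text):
    """Write text to a new temp file in directory and return its path."""
    temp_fd, temp_path = tempfile.mkstemp(dir=directory)
    try:
        # Any failure before this point leaves temp_fd open and owned by us.
        with os.fdopen(temp_fd, "w") as f:
            # The file object now owns the descriptor and will close it,
            # so forget our copy to avoid closing it a second time below.
            temp_fd = None
            f.write(text)
        return temp_path
    except Exception:
        if temp_fd is not None:
            # The failure happened before os.fdopen() took ownership.
            os.close(temp_fd)
        try:
            # Remove the orphaned temp file in either case.
            os.unlink(temp_path)
        except OSError:
            pass
        raise
```

Resetting `temp_fd` to `None` is what lets a single `except` block serve both failure windows: it closes the raw descriptor only when no file object has taken it over.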

Comment on lines +3590 to +3591
f.flush()
os.fsync(f.fileno())
Contributor

Is there a particular reason to add an explicit flush + fsync?

Contributor Author

Closing the file doesn’t guarantee the data has reached the disk. f.write() sends data to Python’s buffer, then to the OS buffer, and the OS writes it to disk later.

If you call os.replace() right after closing and the system crashes before the OS flushes its buffer, you can end up with a partial or corrupt file. flush() pushes Python’s buffer to the OS, and fsync() forces the OS to commit it to disk. Together they ensure the data is fully written before you proceed. More here: https://docs.python.org/3/library/os.html#os.fsync
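
A minimal sketch of the two-step sequence described above, with a hypothetical helper name `write_and_sync`; how strong the resulting durability guarantee is depends on the operating system and filesystem.

```python
import os


def write_and_sync(path, text):
    """Write text to path and ask the OS to commit it to storage before returning."""
    with open(path, "w") as f:
        f.write(text)
        # Step 1: push Python's userspace buffer down to the OS.
        f.flush()
        # Step 2: ask the OS to write its cached data for this file to disk.
        os.fsync(f.fileno())
```

In the flow discussed in this thread, a call like this would be made on the temporary file, and os.replace() would run only after it returns.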

@ubaskota
Contributor Author

This PR will be closed shortly. Per our open source contribution guidelines, we'll update the original PR #3544 instead.

@ubaskota
Contributor Author

Implementation moved to: #3597

@ubaskota ubaskota closed this Nov 26, 2025