GH-44910: [Swift] Fix IPC stream reader and writer impl #45029
Conversation
Force-pushed eb8d290 to fc1e28e
Could you add a comment that explains the difference between fromMemoryStream and fromFileStream?
Is this the size of the length data?
How about using UInt32 instead of Int32, since the length data is UInt32, not Int32?
I looked at the length var and it is already UInt32. From a couple of lines above: `var length = getUInt32(fileData, offset: offset)`. Please let me know if this matches what you are seeing.
Sorry. I don't remember this, but I think that I was referring to `var offset: Int = 0`...
I will change `offset += Int(MemoryLayout.size)` to `offset += Int(MemoryLayout.size)`. The variable `offset` is an `Int` due to the parameter type in the call to the buffer's `loadUnaligned`.
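For context, a minimal sketch of the pattern being discussed (the helper name `getUInt32` mirrors the snippet quoted above and is hypothetical, not the actual Arrow Swift API): the length prefix is read as `UInt32` to match the on-wire type, while the cursor stays an `Int` because `loadUnaligned(fromByteOffset:)` takes an `Int`.

```swift
import Foundation

// Hypothetical helper mirroring the discussion: read a UInt32 length prefix
// from raw bytes at an Int offset.
func getUInt32(_ data: Data, offset: Int) -> UInt32 {
    data.withUnsafeBytes { raw in
        raw.loadUnaligned(fromByteOffset: offset, as: UInt32.self)
    }
}

// Build a tiny buffer containing a single UInt32 length value.
var fileData = Data()
var lengthValue: UInt32 = 24
withUnsafeBytes(of: &lengthValue) { fileData.append(contentsOf: $0) }

var offset: Int = 0
let length = getUInt32(fileData, offset: offset)  // UInt32, as noted above
offset += MemoryLayout<UInt32>.size               // advance the Int cursor
```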
@kou I hope all is well. Please review again when you get a chance.
Can we use a different name for this? This name may be confusing because the Apache Arrow specification uses:
- "IPC Streaming Format" https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format
- "IPC File Format" https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
If we use "File" and "Stream" in this method name, users may think that it is for the "IPC Streaming Format" stored in a file.
Gotcha, I will change the name to fromStream.
How about using readStreaming (for the Arrow streaming format) and readFile (for the Arrow file format) instead of fromMemoryStream (for the Arrow streaming format) and fromStream (for the Arrow file format)?
```diff
-    /*
-    The Memory stream format is for reading the arrow streaming protocol. This
-    format is slightly different from the File format protocol as it doesn't contain
-    a header and footer
-    */
-    public func fromMemoryStream( // swiftlint:disable:this function_body_length
+    /*
+    This is for reading the Arrow streaming format. The Arrow streaming format
+    is slightly different from the Arrow File format as it doesn't contain a header
+    and footer.
+    */
+    public func readStreaming( // swiftlint:disable:this function_body_length
```
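As an illustration of the framing that a streaming reader like `readStreaming` has to walk, here is a minimal sketch. Assumptions, not the actual PR implementation: each IPC message is preceded by a 0xFFFFFFFF continuation marker and a 32-bit metadata length, a zero length marks end-of-stream, and flatbuffer decoding of the metadata itself is omitted.

```swift
import Foundation

// Assumed framing: [0xFFFFFFFF][len: UInt32][len bytes of metadata] ... [0xFFFFFFFF][0]
let continuation: UInt32 = 0xFFFFFFFF

func readUInt32(_ data: Data, _ offset: Int) -> UInt32 {
    data.withUnsafeBytes { $0.loadUnaligned(fromByteOffset: offset, as: UInt32.self) }
}

// Collect the metadata lengths of each message until the end-of-stream marker.
func messageLengths(in stream: Data) -> [Int] {
    var lengths: [Int] = []
    var offset = 0
    while offset + 8 <= stream.count {
        guard readUInt32(stream, offset) == continuation else { break }
        let len = readUInt32(stream, offset + 4)
        if len == 0 { break }              // end-of-stream marker
        lengths.append(Int(len))
        offset += 8 + Int(len)             // marker + length + metadata bytes
    }
    return lengths
}

// Build a tiny fake stream: one 8-byte message, then the end-of-stream marker.
var stream = Data()
func appendU32(_ v: UInt32) {
    withUnsafeBytes(of: v) { stream.append(contentsOf: $0) }
}
appendU32(continuation); appendU32(8); stream.append(Data(count: 8))
appendU32(continuation); appendU32(0)
```

Note there is no header or footer to skip: the reader can start consuming messages at offset 0, which is the difference the comment above is drawing against the file format.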
```diff
-    /*
-    The File stream format supports random accessing the data. This format contains
-    a header and footer around the streaming format.
-    */
-    public func fromStream( // swiftlint:disable:this function_body_length
+    /*
+    This is for reading the Arrow file format. The Arrow file format supports
+    random accessing the data. The Arrow file format contains a header and
+    footer around the Arrow streaming format.
+    */
+    public func readFile( // swiftlint:disable:this function_body_length
```
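To make the "header and footer around the streaming format" distinction concrete, a minimal sketch: per the Arrow columnar spec, a file-format payload starts and ends with the magic bytes "ARROW1". This hypothetical check validates only that framing; a real `readFile` also parses the flatbuffer footer to support random access.

```swift
import Foundation

// The Arrow file format wraps the streaming format with "ARROW1" magic bytes
// at both ends (plus a footer before the trailing magic). This checks only
// the framing, not the footer contents.
let magic = Data("ARROW1".utf8)

func looksLikeArrowFile(_ data: Data) -> Bool {
    data.count >= magic.count * 2 &&
        data.prefix(magic.count) == magic &&
        data.suffix(magic.count) == magic
}

// A fake file: magic, stand-in bytes for the stream + footer, magic again.
var fake = Data()
fake.append(magic)
fake.append(Data(count: 16))
fake.append(magic)
```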
I see that @dongjoon-hyun is using this Swift Arrow implementation in the Spark Connect Client for Swift. Has this issue been fixed downstream in that repo?
I've been following Apache Arrow activity already in order to consume the official Apache Arrow release eventually when it's ready. 😄
For the record, Apache Spark Connect for Swift is a user of Apache Arrow. For the required changes, I've already contributed back, except one thing (Swift 6 compilation stuff). Other than that, there are no new features or bug fixes for this layer.
Thanks very much for your contributions @dongjoon-hyun!
dongjoon-hyun left a comment
Thank you, @abandy and all. It looks good to me.
It seems that apache/spark-connect-swift bundles Apache Arrow Swift (apache/spark-connect-swift@fe8322d) instead of referring to a package in https://github.com/apache/spark-connect-swift/blob/main/Package.swift. Is it only for backporting unreleased features/fixes? (Will apache/spark-connect-swift use Apache Arrow Swift as a package when we release 21.0.0?)
I do not have privileges to merge. @dongjoon-hyun or @kou, can you please merge when you get a chance?
To @kou and @abandy, as a user, I really appreciated your efforts on Apache Arrow.
To @kou, as a member of the Apache Spark PMC, I can say that the Apache Spark community has no intention to duplicate Apache Arrow. I clearly mentioned this in the following PR from the beginning, when I started with 19.0.1. The Apache Spark community uses only the committed Apache Arrow codebase. To be honest, I've been monitoring, evaluating, and waiting for Apache Arrow Swift for longer than you might guess, but it didn't meet my expectations. There are a few reasons why we couldn't start as a package consumer. The most important is the lack of Swift 6 support. In addition, there was some instability in Linux environments (due to a potential data race). I started inevitably …
To @abandy, …
As a side note, @kou, as a user, I hope the Apache Arrow community publishes … As of now, you can see that …
kou left a comment
Sorry for hijacking this PR for Apache Spark Connect Client for Swift.
@dongjoon-hyun Could you open an issue for the remaining issues for Apache Spark Connect Client for Swift? Let's use that issue for further discussion.
Can we use the same naming rules for the writer as for the reader?
Oh, not at all. The Apache Arrow community is a big ecosystem. I'm happy to monitor the community decision and collaborate as a user.
Definitely, will do in a proper way.
I clarify this: "remaining Apache Arrow Swift issues such as publishing to Swift Package Index"
@kou please review and merge when you get a chance.
kou left a comment
+1
Ah, we should have updated the PR description before merging this...
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 8893e88. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Rationale for this change
Fixes the incorrect IPC stream format issue.
Changes have been tested with:
This PR includes breaking changes to public APIs.
Writer and reader APIs have changed:
Reader: `fromStream` -> `fromFileStream`
Writer: `toStream` -> `toFileStream`