@abandy
Contributor

@abandy abandy commented Dec 14, 2024

Rationale for this change

Fixes the incorrect IPC stream format issue.

Changes have been tested with:

  1. directions from GH-40488: [Swift] Add simple get swift example arrow-experiments#41 (comment)
  2. generated file using generate.py from https://github.com/apache/arrow-experiments/tree/main/data/rand-many-types (removed currently unsupported Swift types)

This PR includes breaking changes to public APIs.
Writer and reader APIs have changed:

  • Reader: fromStream -> fromFileStream
  • Writer: toStream -> toFileStream

@abandy abandy requested a review from kou as a code owner December 14, 2024 19:09
@github-actions

⚠️ GitHub issue #44910 has been automatically assigned in GitHub to PR creator.

@abandy abandy force-pushed the GH-44910 branch 2 times, most recently from eb8d290 to fc1e28e Compare December 14, 2024 19:38
@kou kou changed the title from "GH-44910: [Swift] fix ipc stream reader and writer impl" to "GH-44910: [Swift] Fix IPC stream reader and writer impl" Dec 15, 2024
Member

Could you add a comment that explains the difference between fromMemoryStream and fromFileStream?

Member

Is this the size of the length data?
How about using UInt32 instead of Int32, since the length data is UInt32?

Contributor Author

I looked at the length var and it is already UInt32. From a couple of lines above: var length = getUInt32(fileData, offset: offset). Please let me know if this matches what you are seeing.

Member

Sorry. I don't remember this, but I think that I was referring to var offset: Int = 0...

Contributor Author

I will change offset += Int(MemoryLayout<Int32>.size) to offset += Int(MemoryLayout<UInt32>.size). The variable offset is an Int because of the parameter type in the call to the buffer's loadUnaligned.
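
For illustration, the offset bookkeeping discussed in this thread can be sketched as follows. This is a minimal sketch, not the PR's code: `readUInt32LE` is a hypothetical helper standing in for the `getUInt32` call quoted above, and the sample bytes are made up. It assumes Swift 5.7 or later, where `UnsafeRawBufferPointer.loadUnaligned(fromByteOffset:as:)` is available.

```swift
import Foundation

// Hypothetical helper (not the PR's getUInt32): reads a little-endian UInt32
// at `offset` without requiring the address to be 4-byte aligned.
func readUInt32LE(_ data: Data, offset: Int) -> UInt32 {
    precondition(offset >= 0 && offset + MemoryLayout<UInt32>.size <= data.count,
                 "read past end of buffer")
    let raw = data.withUnsafeBytes { buffer in
        // loadUnaligned takes an Int byte offset, which is why `offset` is an Int.
        buffer.loadUnaligned(fromByteOffset: offset, as: UInt32.self)
    }
    return UInt32(littleEndian: raw)
}

// Made-up sample: the values 8 and 300, each encoded as a little-endian UInt32.
let sampleBytes: [UInt8] = [0x08, 0x00, 0x00, 0x00, 0x2C, 0x01, 0x00, 0x00]
let sample = Data(sampleBytes)

var offset: Int = 0
let first = readUInt32LE(sample, offset: offset)
offset += MemoryLayout<UInt32>.size  // advance by the width of the value just read
let second = readUInt32LE(sample, offset: offset)
offset += MemoryLayout<UInt32>.size
```

Advancing by `MemoryLayout<UInt32>.size` rather than `MemoryLayout<Int32>.size` does not change the arithmetic (both are 4 bytes), but it keeps the step size tied to the type that was actually read.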

@github-actions github-actions bot added awaiting changes and removed awaiting review labels Dec 15, 2024
@github-actions github-actions bot added awaiting change review and removed awaiting changes labels Jan 17, 2025
@abandy
Contributor Author

abandy commented May 1, 2025

@kou I hope all is well. Please review again when you get a chance.

Member

Can we use a different name for this? The current name may be confusing because the Apache Arrow specification uses:

If we use "File" and "Stream" in this method name, users may think that this is for the "IPC Streaming Format" stored in a file.

Contributor Author

Gotcha, I will change the name to fromStream.

@github-actions github-actions bot added awaiting changes and removed awaiting change review labels May 2, 2025
@github-actions github-actions bot added awaiting change review and removed awaiting changes labels May 3, 2025
Comment on lines 219 to 224
Member

How about using readStreaming (for the Arrow streaming format) and readFile (for the Arrow file format) instead of fromMemoryStream (for the Arrow streaming format) and fromStream (for the Arrow file format)?

Suggested change
- /*
- The Memory stream format is for reading the arrow streaming protocol. This
- format is slightly different from the File format protocol as it doesn't contain
- a header and footer
- */
- public func fromMemoryStream( // swiftlint:disable:this function_body_length
+ /*
+ This is for reading the Arrow streaming format. The Arrow streaming format
+ is slightly different from the Arrow File format as it doesn't contain a header
+ and footer.
+ */
+ public func readStreaming( // swiftlint:disable:this function_body_length

Comment on lines 284 to 288
Member

Suggested change
- /*
- The File stream format supports random accessing the data. This format contains
- a header and footer around the streaming format.
- */
- public func fromStream( // swiftlint:disable:this function_body_length
+ /*
+ This is for reading the Arrow file format. The Arrow file format supports
+ random accessing the data. The Arrow file format contains a header and
+ footer around the Arrow streaming format.
+ */
+ public func readFile( // swiftlint:disable:this function_body_length
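
As a rough illustration of the distinction drawn in these suggested comments, the sketch below guesses which layout a buffer uses. It relies on two details of the Arrow columnar format specification: the file format starts and ends with the magic string `ARROW1`, and current streaming-format messages start with the continuation marker `0xFFFFFFFF`. The names `ArrowIPCLayout` and `detectLayout` are hypothetical and are not part of the Arrow Swift API. Streams written by Arrow versions before 0.15 omit the continuation marker and would be reported as unknown here.

```swift
import Foundation

// Hypothetical names for illustration; not part of the Arrow Swift API.
enum ArrowIPCLayout {
    case file       // magic header, stream-style messages, footer, trailing magic
    case streaming  // bare sequence of encapsulated messages
    case unknown

    static let magic: [UInt8] = Array("ARROW1".utf8)                // 6 bytes
    static let continuationMarker: [UInt8] = [0xFF, 0xFF, 0xFF, 0xFF]
}

func detectLayout(_ data: Data) -> ArrowIPCLayout {
    let magic = ArrowIPCLayout.magic
    // The file format carries the magic string at both ends.
    if data.count >= magic.count * 2,
       data.starts(with: magic),
       data.suffix(magic.count).elementsEqual(magic) {
        return .file
    }
    // Otherwise, look for the continuation marker that opens a stream message.
    if data.starts(with: ArrowIPCLayout.continuationMarker) {
        return .streaming
    }
    return .unknown
}

// Made-up buffers that only mimic the outer framing, not real Arrow data.
var fakeFileBytes = ArrowIPCLayout.magic
fakeFileBytes += [0, 0]        // padding after the header magic
fakeFileBytes += [1, 2, 3, 4]  // stand-in for messages and footer
fakeFileBytes += ArrowIPCLayout.magic
let fakeFile = Data(fakeFileBytes)

var fakeStreamBytes = ArrowIPCLayout.continuationMarker
fakeStreamBytes += [0x10, 0, 0, 0]  // stand-in for a metadata length
let fakeStream = Data(fakeStreamBytes)
```

A check like this only inspects the outer framing. It does not validate the footer or any message metadata, so it is a heuristic for choosing between a file reader and a streaming reader rather than a validity test.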

@github-actions github-actions bot added awaiting changes and removed awaiting change review labels May 3, 2025
@github-actions github-actions bot added awaiting change review and removed awaiting changes labels May 4, 2025
@ianmcook
Member

ianmcook commented May 6, 2025

I see that @dongjoon-hyun is using this Swift Arrow implementation in the Spark Connect Client for Swift. Has this issue been fixed downstream in that repo?

@dongjoon-hyun
Member

I've already been following Apache Arrow activity so that I can eventually consume the official Apache Arrow release when it's ready. 😄

@dongjoon-hyun
Member

dongjoon-hyun commented May 6, 2025

For the record, Apache Spark Connect for Swift is a user of Apache Arrow. I've already contributed back the required changes, except for one thing (Swift 6 compilation stuff). Other than that, there are no new features or bug fixes for this layer.

@ianmcook
Member

ianmcook commented May 6, 2025

Thanks very much for your contributions @dongjoon-hyun!

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you, @abandy and all. It looks good to me.

@kou
Member

kou commented May 7, 2025

It seems that apache/spark-connect-swift bundles Apache Arrow Swift (apache/spark-connect-swift@fe8322d) instead of referring to a package in https://github.com/apache/spark-connect-swift/blob/main/Package.swift .

Is it only for backporting unreleased features/fixes? (Will apache/spark-connect-swift use Apache Arrow Swift as a package when we release 21.0.0?)

@abandy
Contributor Author

abandy commented May 7, 2025

I do not have privileges to merge. @dongjoon-hyun or @kou can you please merge when you get a chance?

@dongjoon-hyun
Member

dongjoon-hyun commented May 7, 2025

To @kou and @abandy, as a user, I really appreciate your efforts on Apache Arrow.

To @kou ,

Will apache/spark-connect-swift use Apache Arrow Swift as a package when we release 21.0.0?

As a member of the Apache Spark PMC, I can say that the Apache Spark community has no intention of duplicating Apache Arrow. I clearly mentioned this in the following PR from the beginning, when I started with 19.0.1.

The Apache Spark community uses only the committed Apache Arrow codebase. To be honest, I've been monitoring, evaluating, and waiting for Apache Arrow Swift for longer than you might guess, but it didn't meet my expectations. There are a few reasons why we couldn't start as a package consumer. The most important is the lack of Swift 6 support. In addition, there is some instability in Linux environments (due to a potential data race).

I inevitably started the Spark Connect Swift Client as v0.1 because Apache Spark 4.0 is already at RC4. In the end, I hope to remove, and will remove, all copied content from the Apache Spark repository when Apache Arrow Swift is ready to be used directly.

To @abandy ,

  • Sorry for confusing you. Although I'm an ASF member and an Apache Spark PMC member, I'm just a user in the Apache Arrow community. I just tested your PR and approved it as an audience member. I have no permissions here.
  • I'm wondering if we have a roadmap or ETA for the remaining development. For example, list type support?

@dongjoon-hyun
Member

dongjoon-hyun commented May 7, 2025

As a side note, @kou, as a user, I hope the Apache Arrow community at least publishes an Apache Arrow package on the Swift Package Index site under the Apache namespace. That could be the beginning of a consumable Apache Arrow package.

As of now, you can see that Apache Spark Connect Client for Swift is the only registered Swift Package under Apache.

Member

@kou kou left a comment

Sorry for hijacking this PR for Apache Spark Connect Client for Swift.

@dongjoon-hyun Could you open an issue for the remaining issues for Apache Spark Connect Client for Swift? Let's use the issue for further discussion.

Member

Can we use the same naming rules for the writer as for the reader?

@github-actions github-actions bot added awaiting changes and removed awaiting change review labels May 8, 2025
@dongjoon-hyun
Member

Oh, not at all. Apache Arrow community is a big eco-system. I'm happy to monitor the community decision and collaborate as a user.

Sorry for hijacking this PR for Apache Spark Connect Client for Swift.

Definitely, will do in a proper way.

Could you open an issue for the remaining issues for Apache Spark Connect Client for Swift?

@kou
Member

kou commented May 9, 2025

Could you open an issue for the remaining issues for Apache Spark Connect Client for Swift?

To clarify, I meant: "remaining Apache Arrow Swift issues such as publishing to Swift Package Index"

@github-actions github-actions bot added awaiting change review and removed awaiting changes labels May 12, 2025
@abandy
Contributor Author

abandy commented May 20, 2025

@kou please review and merge when you get a chance.

Member

@kou kou left a comment

+1

@kou kou merged commit 8893e88 into apache:main May 20, 2025
7 checks passed
@kou kou removed the awaiting change review label May 20, 2025
@github-actions github-actions bot added the awaiting merge label May 20, 2025
@kou
Member

kou commented May 20, 2025

Ah, we should have updated the PR description before we merged this...

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 8893e88.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.
