I've been playing with the yahoo data source and one thing occurs to me: in its current implementation the user is locked into TimeArray. That is not always convenient; a user may prefer to work with other data formats such as DataFrames, Temporal, or some custom format. What I am proposing is an interface like this:
data = yahoo("SPY", <SINK>)
# for example
data = yahoo("SPY", DataFrame) # download data and export it to DataFrame
data = yahoo("SPY", Temporal) # download data and export it to Temporal
...
Now, SINK can be anything: DataFrame, TimeArray, or whatever the user wants. We can emit TimeArray by default, for example, but that wouldn't limit the user.
In order to do that we can wrap CSV.File in a special structure that conforms to the Tables.jl protocol. The idea is that if we define yahoo, for example, as
function yahoo(sym::AbstractString = "^GSPC", opt::YahooOpt = YahooOpt(), sink = DataFrame)
    host = rand(["query1", "query2"])
    url = "https://$host.finance.yahoo.com/v7/finance/download/$sym"
    res = HTTP.get(url, query = opt)
    @assert res.status == 200
    csv = CSV.File(res.body, missingstrings = ["null"])
    return sink(csv)
end
then this function provides a DataFrame sink by default. For it to work with TimeArray, one only needs to implement a csv -> TimeArray conversion, which could look like
function TimeArray(csv)
    sch = TimeSeries.Tables.schema(csv)
    TimeArray(csv, timestamp = first(sch.names)) |> cleanup_colname!
end
and something similar for Temporal.
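One detail about the signature above: with sink as the third positional argument, a call like yahoo("SPY", DataFrame) would not match any method, because the second positional argument must be a YahooOpt. Below is a minimal offline sketch of one way to allow both call shapes through extra methods. Everything in it (FakeOpt, fetch_rows, load, closes) is a hypothetical stand-in invented for illustration, not part of MarketData.jl.

```julia
# Hypothetical stand-in for YahooOpt so the sketch runs without network access.
struct FakeOpt
    interval::String
end
FakeOpt() = FakeOpt("1d")

# Pretend download: returns a column table (a NamedTuple of vectors).
fetch_rows(sym, opt) = (timestamp = [1, 2, 3], Close = [10.0, 10.5, 11.0])

# Full method: the sink is any callable applied to the raw table.
load(sym::AbstractString, opt::FakeOpt, sink) = sink(fetch_rows(sym, opt))

# Convenience methods so either the options or the sink can be omitted.
load(sym::AbstractString, sink) = load(sym, FakeOpt(), sink)
load(sym::AbstractString, opt::FakeOpt = FakeOpt()) = load(sym, opt, identity)

# A toy sink: any function or constructor that accepts the table works.
closes(tbl) = collect(tbl.Close)

load("SPY", closes)            # sink only, default options
load("SPY", FakeOpt("1wk"))    # options only, default sink (identity)
```

Because the method taking a FakeOpt is more specific than the one taking an arbitrary sink, dispatch picks the right one without ambiguity. A keyword argument (sink = ...) would be another reasonable design.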
The problem with this direct approach is that it is not general. If some other data source does not put the datetime column in the first position, it will break. So we can do something smarter, like defining a structure
struct TimeDataWrapper{T1, T2}
    meta::T1
    data::T2
end
and use it
sch = (; schema = TimeSeries.Tables.schema(csv), timestamp = 1) # or something similar
timedatawrapper = TimeDataWrapper(sch, csv)
return sink(timedatawrapper)
This structure should implement the corresponding Tables.jl methods and at the same time provide the necessary information in the meta field (such as where the datetime column is located). That way, every sink that understands this structure can convert the data source to its own format without any problems.
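To make the idea concrete, here is a rough sketch of how the wrapper could forward the Tables.jl interface to the wrapped source and expose the timestamp hint. It assumes Tables.jl is installed. The helper timestamp_name, the toy sink ToySeries, and the use of a plain NamedTuple in place of CSV.File are all assumptions made for illustration.

```julia
using Tables   # assumes Tables.jl is available in the environment
using Dates

# Wrapper from the proposal: `meta` carries hints for sinks, `data` is any Tables.jl source.
struct TimeDataWrapper{T1, T2}
    meta::T1
    data::T2
end

# Forward the Tables.jl interface to the wrapped source.
Tables.istable(::Type{<:TimeDataWrapper}) = true
Tables.columnaccess(::Type{<:TimeDataWrapper}) = true
Tables.columns(w::TimeDataWrapper) = Tables.columns(w.data)
Tables.schema(w::TimeDataWrapper) = Tables.schema(w.data)

# Hypothetical helper: name of the timestamp column, looked up by position from `meta`.
timestamp_name(w::TimeDataWrapper) =
    Tables.columnnames(Tables.columns(w))[w.meta.timestamp]

# Hypothetical toy sink: separates the timestamp column from the value columns.
struct ToySeries
    timestamp::Vector{Date}
    values::NamedTuple
end
function ToySeries(w::TimeDataWrapper)
    cols = Tables.columntable(w)
    ts = timestamp_name(w)
    others = filter(!=(ts), keys(cols))          # every column except the timestamp
    return ToySeries(collect(cols[ts]), NamedTuple{others}(cols))
end

# Stand-in for CSV.File: a NamedTuple of vectors is already a Tables.jl column table.
# The date column is deliberately NOT first, to show that `meta` handles that case.
raw = (Open = [1.0, 2.0], Date = [Date(2020, 1, 1), Date(2020, 1, 2)])
w = TimeDataWrapper((; timestamp = 2), raw)

series = ToySeries(w)
```

Since the wrapper itself satisfies the Tables.jl interface, sinks that know nothing about meta can still consume it as an ordinary table, while time-aware sinks can read the extra hint.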
We can do it in a few small steps:
- Make this change in MarketData.jl. As long as TimeDataWrapper lives inside MarketData.jl, functions like TimeArray(x::TimeDataWrapper) are not type piracy. As a result, we get a function that can export its data to the TimeArray and DataFrame formats. Just to clarify, DataFrame support comes from the fact that TimeDataWrapper follows the Tables.jl API.
- Extract this functionality to a separate lightweight package, MarketDataInterface.jl, and ask the owner of TimeSeries.jl to provide support for this package.
- We can try to work with the owner of Temporal.jl and ask him to provide support.
- I am currently reviving Timestamps.jl and can write the necessary support for it as well.
As a result, we will have a generic method that works with multiple sinks. Instead of forcing users into a particular package for financial data, we let them use a single package for data sourcing and any package they like for further data processing. It's a win-win situation.
As a further step, Quandl.jl can be revived and go through the same procedure, so we will have multiple financial data sources with the same consistent logic.
If this proposal is OK, I can start with the first step and we will see how it works out.