Proposal to add sink feature to data downloaders #71

@Arkoniak

I've been playing with the yahoo data source, and one thing occurred to me: in its current implementation the user is locked into TimeArray. That is not always convenient; a user may prefer to work with other data formats such as DataFrames, Temporal, or some custom format. What I am proposing is an interface like this:

data = yahoo("SPY", <SINK>)
# for example
data = yahoo("SPY", DataFrame) # download data and export it to DataFrame
data = yahoo("SPY", Temporal)    # download data and export it to Temporal
...

Now, SINK can be anything: DataFrame, TimeArray, or whatever the user wants. We can emit TimeArray by default, for example, but that wouldn't limit the user.

To do that, we can wrap CSV.File in a special structure that conforms to the Tables.jl protocol. The idea is that if we, for example, define yahoo as

function yahoo(sym::AbstractString = "^GSPC", opt::YahooOpt = YahooOpt(), sink = DataFrame)
    host = rand(["query1", "query2"])
    url  = "https://$host.finance.yahoo.com/v7/finance/download/$sym"
    res  = HTTP.get(url, query = opt)
    @assert res.status == 200
    csv = CSV.File(res.body, missingstrings = ["null"])
    return sink(csv)
end

then this function provides a DataFrame sink by default. For it to work with TimeArray, one only needs to implement a csv -> TimeArray interface, which could look like

function TimeArray(csv)
    sch = TimeSeries.Tables.schema(csv)
    TimeArray(csv, timestamp = first(sch.names)) |> cleanup_colname!
end

and something similar for Temporal.
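
To make the sink idea concrete without touching the network, here is a minimal sketch in plain Julia. Everything in it is hypothetical and only for illustration: `fake_csv` stands in for the parsed `CSV.File` (with made-up values), `load_prices` stands in for `yahoo`, and `CloseSeries` stands in for a user's custom format. The only point it shows is that a sink is any callable that accepts the table.

```julia
# Hypothetical in-memory stand-in for the parsed CSV: a NamedTuple of columns.
# No download happens here; the values are made up for illustration.
fake_csv() = (timestamp = ["2020-01-02", "2020-01-03"],
              Open      = [100.0, 101.5],
              Close     = [101.0, 102.25])

# `sink` is any callable that takes the table and returns the target format.
load_prices(sink = identity) = sink(fake_csv())

# A custom sink: a plain struct with a constructor that accepts the table.
struct CloseSeries
    timestamp::Vector{String}
    close::Vector{Float64}
end
CloseSeries(tbl::NamedTuple) = CloseSeries(tbl.timestamp, tbl.Close)

raw    = load_prices()             # default sink returns the table unchanged
series = load_prices(CloseSeries)  # the caller picks the output format
rows   = load_prices(tbl -> collect(zip(tbl.timestamp, tbl.Close)))  # closures work too
```

Since the data function never names a concrete output type, supporting a new format only requires a constructor (or function) on the consumer's side.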

The problem with this direct approach is that it is not general. If, in some other data source, the datetime column is not in the first position, it will break. So we can do something smarter, like defining a structure

struct TimeDataWrapper{T1, T2}
   meta::T1
   data::T2
end

and use it

  sch = (; schema = TimeSeries.Tables.schema(csv), timestamp = 1) # or something similar
  timedatawrapper = TimeDataWrapper(sch, csv)
  return sink(timedatawrapper)

This structure should implement the corresponding Tables.jl methods and, at the same time, provide the necessary information in the meta field (such as where the datetime column is located). So every sink that can use this structure can convert the data source to its own format without any problems.
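
A rough sketch of what the Tables.jl side could look like, assuming the Tables.jl package is installed. The wrapper forwards column access to the wrapped data. `timestamp_column` is a hypothetical helper name, and a NamedTuple of vectors is used as a stand-in for `CSV.File` so the snippet runs without a download; the values are made up.

```julia
using Tables   # assumes the Tables.jl package is available
using Dates

struct TimeDataWrapper{T1, T2}
    meta::T1
    data::T2
end

# Declare the wrapper a column-oriented table and forward to the wrapped data.
Tables.istable(::Type{<:TimeDataWrapper}) = true
Tables.columnaccess(::Type{<:TimeDataWrapper}) = true
Tables.columns(w::TimeDataWrapper) = Tables.columns(w.data)

# Hypothetical helper: find the datetime column through the metadata
# instead of assuming that it comes first.
timestamp_column(w::TimeDataWrapper) =
    Tables.getcolumn(Tables.columns(w), w.meta.timestamp)

# Stand-in for CSV.File: a NamedTuple of vectors is already a Tables.jl table.
data = (Open = [100.0, 101.5],
        timestamp = [Date(2020, 1, 2), Date(2020, 1, 3)])
w = TimeDataWrapper((; timestamp = 2), data)

cols = Tables.columntable(w)   # generic Tables.jl consumers can read the wrapper
ts   = timestamp_column(w)     # time-aware sinks can locate the datetime column
```

Generic sinks such as DataFrame would only see an ordinary table, while a time-series sink could dispatch on `TimeDataWrapper` and read `meta` to decide which column holds the timestamps.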

We can do it in a few small steps

  1. Make this change in MarketData.jl. As long as TimeDataWrapper lives inside MarketData.jl, functions like TimeArray(x::TimeDataWrapper) are not type piracy. As a result, we get a function that can export its data to both TimeArray and DataFrame formats. Just to clarify, DataFrame support comes from the fact that TimeDataWrapper follows the Tables.jl API.
  2. Extract this functionality to a separate lightweight package MarketDataInterface.jl and ask the owner of TimeSeries.jl to provide support for this package.
  3. We can try to work with the owner of Temporal.jl and ask him to provide support.
  4. I am currently reviving Timestamps.jl and can write the necessary support for it as well.

As a result, we will have a generic method that can work with multiple sinks. Instead of being forced to use a particular package for financial data, users will be able to use a single package for data sourcing and any package they like for further data processing. It's a win-win situation.

As a further step, Quandl.jl can be revived and it can go through the same procedure. So we will have multiple financial data sources with the same consistent logic.

If this proposal is ok, I can try to go with the first step and we will see how it works out.
