Should / can we improve the handling of tabular data? #19586

bernt-matthias · 2025-02-11T12:42:17Z

bernt-matthias
Feb 11, 2025
Collaborator

Problem: The tabular datatype sets barely any metadata (only columns, plus data_lines/comment_lines, which are set only for not-too-big datasets), but tabular is our main datatype for tabular data, because most tab-delimited data is sniffed as tabular and we "push" tool authors toward this datatype.

Many tools for processing tabular data could make good use of metadata: column_names could be used in data_column parameters, and comment_lines could be used to automatically treat the header differently from the data lines. Also, the display is nicer and more user-friendly if there are column names.

The reason for the abundant use of tabular is that it is very generic and makes only a few assumptions (e.g. on the presence of a header).

Some possible improvements:

Make more use of tsv, which sets more metadata (i.e. if the assumptions for tsv are fulfilled):
- use tsv more often as output type (for tab-delimited data with a header and a constant number of columns)
- instead of allowing only tabular as the format for input parameters, the more specialized tsv might always be added.
- changing the datatype hierarchy might be a good alternative that saves a lot of tool dev hours, i.e. tsv could be automatically accepted for tabular (since it's a more specialized format).
Make tools provide the missing metadata of tabular output (e.g. by adding this to the IUC guidelines):
- use <action name="METADATA_NAME" type="metadata" default="METADATA_VALUE"/> or
- use tool-provided metadata (which is barely documented) more often
Create a tool that adds metadata for tabular (without duplicating the data).
- or is the metadata setting dialog sufficient / can it be used in workflows?
- the idea would be that the user can set checkboxes for assumptions on the data:
  - first line contains header
  - first character of the header line is #
- the advantage over the Tabular.set_meta function would be that we also get metadata for large data
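
To make the "checkbox" idea a bit more concrete, here is a minimal sketch of how such a tool could derive column_names from the first line only, which is why it would also work for large data. The function name and its parameters are hypothetical and not part of Galaxy; the two boolean flags stand in for the two checkboxes.

```python
from typing import List, Optional


def column_names_from_header(
    path: str,
    first_line_is_header: bool = True,
    header_starts_with_hash: bool = False,
    delimiter: str = "\t",
) -> Optional[List[str]]:
    """Return column names taken from the first line, or None if unavailable.

    Only the first line is read, so the cost does not depend on file size.
    """
    if not first_line_is_header:
        return None
    with open(path, encoding="utf-8") as handle:
        first_line = handle.readline().rstrip("\r\n")
    if not first_line:
        return None
    if header_starts_with_hash:
        # The user claimed a '#'-prefixed header; refuse if that is not true.
        if not first_line.startswith("#"):
            return None
        first_line = first_line[1:]
    return first_line.split(delimiter)
```

How the resulting names would be written into the dataset's metadata is left open here, since that depends on the mechanism chosen above (action tag, tool-provided metadata, or a dedicated tool).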

Some facts:

Datatype hierarchy:

TabularData
- Tabular: tabular
- BaseCSV
  - CSV: csv
  - TSV: tsv

Sniffers exist only for csv and tsv. But if a file is sniffed as tsv, the datatype is overwritten as tabular (in most cases).
The sniffers for csv and tsv use Python's csv module:

check if the file can be parsed as csv/tsv
has 2 or more columns
for tsv: consistent number of columns
csv.Sniffer().sniff() returns the expected type
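
The four checks above can be sketched roughly as follows. This is not Galaxy's sniffer code, just an illustration built on the standard csv module; the function name, the max_rows limit, and the choice to skip empty rows are assumptions.

```python
import csv
from itertools import islice


def passes_sniff_checks(
    path: str, delimiter: str, constant_width: bool, max_rows: int = 100
) -> bool:
    """Apply the listed checks to the first max_rows lines of a file."""
    with open(path, newline="", encoding="utf-8") as handle:
        lines = list(islice(handle, max_rows))
    try:
        # Check 1: the sample can be parsed with the expected delimiter.
        rows = [row for row in csv.reader(lines, delimiter=delimiter) if row]
        # Check 4 (first half): let the heuristic Sniffer guess a dialect.
        dialect = csv.Sniffer().sniff("".join(lines))
    except csv.Error:
        return False
    if not rows:
        return False
    widths = [len(row) for row in rows]
    # Check 2: at least two columns in every parsed row.
    if min(widths) < 2:
        return False
    # Check 3: tsv requires a constant number of columns, csv does not.
    if constant_width and len(set(widths)) != 1:
        return False
    # Check 4 (second half): the guessed delimiter matches the expected one.
    return dialect.delimiter == delimiter
```

Under these assumptions, a tsv check would be passes_sniff_checks(path, "\t", constant_width=True) and a csv check passes_sniff_checks(path, ",", constant_width=False). Note that csv.Sniffer is a heuristic, so short or unusual files can be misclassified.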

Possible metadata:

comment_lines
data_lines
columns
column_names
delimiter

Automatically set metadata for tabular:

columns is the maximum number of columns over all considered lines (max 100k)
delimiter: tab
number of data_lines and comment lines (lines starting with '#') are counted (files with more than 100k lines will not have number of data/comment lines)
column_names is never set
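
As a rough illustration of the behaviour listed above (not the actual Tabular.set_meta implementation), the metadata could be computed like this. The function name is hypothetical, and two details are assumptions of this sketch: empty lines are skipped, and comment lines do not contribute to the column count.

```python
from typing import Any, Dict

MAX_LINES = 100_000  # limit mentioned above


def tabular_metadata(path: str, max_lines: int = MAX_LINES) -> Dict[str, Any]:
    """Compute tabular-style metadata from at most max_lines lines."""
    columns = 0
    data_lines = 0
    comment_lines = 0
    truncated = False
    with open(path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if index >= max_lines:
                # File is larger than the limit: line counts stay unset.
                truncated = True
                break
            line = line.rstrip("\r\n")
            if line.startswith("#"):
                comment_lines += 1
                continue
            if not line:
                continue  # assumption: empty lines are ignored
            data_lines += 1
            columns = max(columns, len(line.split("\t")))
    return {
        "columns": columns,
        "delimiter": "\t",
        "data_lines": None if truncated else data_lines,
        "comment_lines": None if truncated else comment_lines,
        "column_names": None,  # never set for tabular
    }
```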

Automatically set metadata for csv and tsv

first line is header, defines column_names
column_types are derived from the 2nd line
comment_lines is 1 iff 2 lines can be read
data_lines is number of lines - 1
columns is max of number of columns of 1st (header) and 2nd line

An important difference between csv and tsv is that csv allows for an inconsistent number of columns.
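
The rules for csv and tsv listed above can be summarized in a short sketch. Again, this is an illustration and not the code in Galaxy: the function names are hypothetical, and the type guessing (int, then float, else str) is a simplifying assumption.

```python
import csv
from typing import Any, Dict


def guess_type(value: str) -> str:
    """Very rough type guess for a single cell."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "str"


def header_based_metadata(path: str, delimiter: str = "\t") -> Dict[str, Any]:
    """Derive metadata from the header line and the second line."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)
        second = next(reader, None)
        remaining = sum(1 for _ in reader)
    total = (header is not None) + (second is not None) + remaining
    return {
        "column_names": header or [],
        "column_types": [guess_type(value) for value in second or []],
        # 1 if and only if two lines could be read
        "comment_lines": 1 if second is not None else 0,
        "data_lines": max(total - 1, 0),
        "columns": max(len(header or []), len(second or [])),
    }
```

Because only the first two lines are inspected for names, types, and the column count, a csv file whose later rows have a different width would still get its metadata from the first two lines.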

mvdbeek · 2025-02-11T13:55:32Z

mvdbeek
Feb 11, 2025
Maintainer

Storing metadata on datasets is expensive in terms of database size, fetching them on demand from datasets might also not be feasible. Ideally everything should be based on column numbers or more precise datatypes. If a user needs to provide the right value to pick the column from a dataset I would say we have already failed from a UX perspective. It might be preferable to instead use specific datatypes that dictate the column layout.

1 reply

hechth Apr 17, 2025

So you mean instead of having a flexible table you have a specialized data type which is a "XYZ_Table" which you know must have 4 columns, the first being a name, the second being of type int etc.?

I don't think you can ever enforce that level of standardization.

hechth · 2025-04-17T07:48:13Z

hechth
Apr 17, 2025

The way tables are handled in Galaxy, from the perspective of a user and tool developer, has multiple issues.

The distinction between tsv and tabular is really not obvious, and I didn't know tsv was even a real datatype until reading this. For compatibility reasons, we recently changed all tools from outputting tsv or csv to tabular.
There is no guideline on what the recommended format is. There is also no "coherence" in the Galaxy world. If, when uploading a table, the user were asked to provide all information about the table, we could then convert it to an internal format that all tools could be based on.
Parquet is a third-class citizen. Binary tables avoid many of the problems of purely text-based representations, as they have proper encoding for column types and names, validation, etc., but the support for parquet in Galaxy is super poor. Also, the uptake in the community is not there yet, but this isn't helping it.

Question is, how do we improve this? I think the idea of setting metadata for a table via a tool is not bad, but it doesn't help during tool development that you cannot make assumptions about tabular inputs.

bernt-matthias · 2025-04-17T13:01:32Z

bernt-matthias
Apr 17, 2025
Collaborator Author

> Storing metadata on datasets is expensive in terms of database size, fetching them on demand from datasets might also not be feasible.

We determine quite a bit of metadata for all sorts of datasets and basically make no use of it at all (at least on the tool side, and I do not know of any other use). Here I see an example where metadata really could help and the extra storage might really be worth it. But probably I lack a bit of background knowledge... :)

> If a user needs to provide the right value to pick the column from a dataset I would say we have already failed from a UX perspective.

If you work with tabular data, you need to be able to select columns / rows. Numbers might be good for some, but I think many users would benefit from a column-name-based selection.

> Question is, how do we improve this?

We could have a mini tool that has a single select (somehow filled from the first line of a dataset, split by some delimiter; maybe via a from_first_line attribute in <options>) and outputs a column number... but this would only work in workflows.
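
The core of such a mini tool would be small. A minimal sketch, assuming a tab-delimited file with a header in the first line; both function names are hypothetical, and the column numbers are 1-based on the assumption that this is what downstream column parameters expect.

```python
from typing import List, Tuple


def column_options(path: str, delimiter: str = "\t") -> List[Tuple[str, int]]:
    """Return (column name, 1-based column number) pairs from the first line."""
    with open(path, encoding="utf-8") as handle:
        names = handle.readline().rstrip("\r\n").split(delimiter)
    return [(name, number) for number, name in enumerate(names, start=1)]


def column_number_for(path: str, column_name: str, delimiter: str = "\t") -> int:
    """Translate a selected column name into its 1-based column number."""
    for name, number in column_options(path, delimiter):
        if name == column_name:
            return number
    raise KeyError(f"no column named {column_name!r} in the first line")
```

The first function corresponds to filling the select, the second to the tool's output. If a name occurs more than once, this sketch returns the first match.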

bernt-matthias · 2025-04-17T13:02:10Z

bernt-matthias
Apr 17, 2025
Collaborator Author

> So you mean instead of having a flexible table you have a specialized data type which is a "XYZ_Table"

Maybe just tabular_with_header, which is essentially tsv; unfortunately, tsv is not derived from tabular.

1 reply

hechth Apr 17, 2025

tsv is not derived from tabular? That is mad.
