Should / can we improve the handling of tabular data? #19586
Replies: 4 comments 2 replies
-
Storing metadata on datasets is expensive in terms of database size, and fetching it on demand from datasets might not be feasible either. Ideally everything should be based on column numbers or more precise datatypes. If a user needs to provide the right value to pick the column from a dataset, I would say we have already failed from a UX perspective. It might be preferable to instead use specific datatypes that dictate the column layout.
-
The way tables are handled in Galaxy has multiple issues, from the perspective of both users and tool developers.
The question is how we improve this. I think the idea of setting metadata for a table via a tool is not bad, but it does not help that, during tool development, you cannot make assumptions about tabular inputs.
-
We determine quite a bit of metadata for all sorts of datasets and basically make no use of it at all (at least on the tool side, and I do not know of any other use). Here I see an example where metadata really could help and the extra storage might be worth it. But I probably lack a bit of background knowledge... :)
If you work with tabular data you need to be able to select columns / rows. Numbers might be good for some, but I think many users would benefit from a column-name-based selection.
We could have a mini tool that has a single select (somehow filled from the first line of a dataset - split by some delimiter ... maybe a
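A minimal sketch of the "fill a select from the first line" idea, in plain Python rather than Galaxy code. The function name `column_options` and its return shape are hypothetical, chosen only for illustration; the delimiter is assumed to be known up front and to default to a tab.

```python
import csv
import io


def column_options(first_line, delimiter="\t"):
    """Return (column name, 1-based column number) pairs from a header line.

    Hypothetical helper: it only illustrates how a select could be filled
    from the first line of a dataset; it is not part of Galaxy.
    """
    # csv.reader copes with quoted fields that contain the delimiter.
    header = next(csv.reader(io.StringIO(first_line), delimiter=delimiter), [])
    return [(name.strip(), index) for index, name in enumerate(header, start=1)]


# The user picks a name, the tool still receives a column number.
for name, number in column_options("gene_id\tlength\tcount\n"):
    print(f"{name} -> column {number}")
```

The pairs keep the column number next to the name, so a tool could still work on numbers internally while the user only sees names.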
-
Maybe just
-
Problem: The `tabular` datatype sets barely any metadata (except for `columns` and `data_lines`/`comment_lines`, which are set only for not-too-big datasets), but `tabular` is our main datatype for tabular data, because most tab-delimited data is sniffed as `tabular` and we "push" tool authors toward this datatype.

Many tools for processing tabular data could make good use of metadata: `column_names` could be used in `data_column` parameters, and `comment_lines` could be used to automatically treat the header differently from the data lines. The display is also nicer and more user-friendly if there are column names.

The reason for the abundant use of `tabular` is that it is very generic and makes only few assumptions (e.g. on the presence of a header).

Some possible improvements:
- `tsv` which sets more metadata (i.e. if the assumptions for `tsv` are fulfilled):
  - `tsv` more often as output type (for tab-delimited data with header and constant column number)
  - `tabular` as `format` for input parameters, the more specialized `tsv` might always be added.
  - `tsv` could be automatically accepted for `tabular` (since it's a more specialized format).
- `tabular` output (e.g. by adding IUC guidelines)
- `<action name="METADATA_NAME" type="metadata" default="METADATA_VALUE"/>`
- or `tabular` (without duplicating the data).
- #`Tabular.set_meta` function would be that we get metadata also for large data

Some facts:
Datatype hierarchy:

- `Tabular`: `tabular`
- `BaseCSV`
  - `CSV`: `csv`
  - `TSV`: `tsv`

Sniffers exist only for `csv` and `tsv`. But if a file is sniffed as `tsv`, the datatype is overwritten with `tabular` (in most cases).

The sniffers for `csv` and `tsv` use Python's `csv` module:

- `csv`/`tsv`
- `tsv`: consistent number of columns

Possible metadata:

- `comment_lines`
- `data_lines`
- `columns`
- `column_names`
- `delimiter`

Automatically set metadata for `tabular`:

- `columns` is the maximum number of columns over all considered lines (max 100k)
- `delimiter`: tab
- comment lines (starting with `'#'`) are counted (files with more than 100k lines will not have the number of data/comment lines)
- `column_names` is never set

Automatically set metadata for `csv` and `tsv`:

- `column_names`
- `column_types` are derived from the 2nd line
- `comment_lines` is 1 iff 2 lines can be read
- `data_lines` is the number of lines - 1
- `columns` is the maximum of the number of columns of the 1st (header) and 2nd line

An important difference between `csv` and `tsv` is that `csv` allows for an inconsistent number of columns.