# bids2table
bids2table is a library for efficiently indexing and querying large-scale BIDS neuroimaging datasets and derivatives. It aims to improve upon the efficiency of PyBIDS by leveraging modern data science tools.
bids2table represents a BIDS dataset index as a single table with columns for BIDS entities and file metadata. The index is constructed using Arrow and stored in Parquet format, a binary tabular file format optimized for efficient storage and retrieval.
## Installation
A pre-release version of bids2table can be installed with

```sh
pip install bids2table
```

The latest development version can be installed with

```sh
pip install git+https://github.com/childmindresearch/bids2table.git
```
## Quickstart
The main entrypoint to the library is the `bids2table.bids2table` function, which builds the index.

```python
tab = bids2table("path/to/dataset")
```
You can also build the index in parallel:

```python
tab = bids2table("path/to/dataset", workers=8)
```
To save the index to disk as a Parquet dataset for later reuse, run

```python
tab = bids2table("path/to/dataset", persistent=True)
```

By default this saves the index to an `index.b2t` directory under the dataset root directory. To change the output destination, use the `index_path` argument.
To generate and save an index from the command line, you can use the `bids2table` CLI.

```sh
usage: bids2table [-h] [--output OUTPUT] [--incremental] [--overwrite] [--workers COUNT]
                  [--worker_id RANK] [--verbose]
                  ROOT
```

See `bids2table --help` for more information.
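When workers are scheduled externally (e.g. by a cluster job array), the `--workers` and `--worker_id` flags shown in the usage above can be combined so that each process builds one shard of the index. A minimal sketch; the paths are illustrative, not defaults:

```sh
# Launch 4 shards of the indexing job in the background; each worker
# writes its own partition of the Parquet index (paths are illustrative).
for i in 0 1 2 3; do
  bids2table path/to/dataset --output path/to/index.b2t --workers 4 --worker_id "$i" &
done
wait
```

In practice each shard would more likely run as a separate scheduled job rather than backgrounded in one shell.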
## Table representation
The generated index is represented as a `bids2table.BIDSTable`, which is just a subclass of a `pandas.DataFrame`. Each row in the table corresponds to a BIDS data file, and the columns are organized into groups:

- dataset (`BIDSTable.ds`): dataset name, relative dataset path, and the JSON dataset description
- entities (`BIDSTable.ent`): all valid BIDS entities plus an `extra_entities` dict containing any extra entities
- metadata (`BIDSTable.meta`): BIDS JSON metadata
- file info (`BIDSTable.finfo`): general file info including the full file path and last modified time
The `BIDSTable` also makes it easy to access some of the key characteristics of your dataset. The BIDS datatypes, modalities, subjects, and entities present in the dataset are each accessible as properties of the table.

In addition, the associated JSON metadata for each file can be conveniently accessed via the `BIDSTable.flat_meta` property.
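The flattening performed by `flat_meta` amounts to `pandas.json_normalize` over the per-file JSON dicts. A standalone sketch of the idea using plain pandas and made-up metadata (not the bids2table API itself):

```python
import pandas as pd

# Hypothetical sidecar metadata for two files.
meta = [
    {"RepetitionTime": 2.5, "EchoTime": 0.03},
    {"RepetitionTime": 2.5},
]

# Each metadata field becomes its own column; fields missing for a
# given file come out as NaN.
flat = pd.json_normalize(meta, max_level=0)
```

This is why metadata fields like `RepetitionTime` can be filtered on just like entity columns.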
### Filtering
To filter the table for files matching certain criteria, you can use the `BIDSTable.filter` method, which selects rows based on whether a specified column meets a condition.

```python
filtered = (
    tab
    .filter("task", "rest")
    .filter("sub", items=["04", "06"])
    .filter("RepetitionTime", 2.5)
)
```
To apply multiple filters at once, you can also use `BIDSTable.filter_multi`:

```python
filtered = tab.filter_multi(
    task="rest",
    sub={"items": ["04", "06"]},
    RepetitionTime=2.5,
)
```
This is similar to the `BIDSLayout.get()` method in PyBIDS, where each `key=value` pair specifies the column to filter on and the condition to apply.
### Advanced usage
For more advanced usage that goes beyond what's supported in this higher-level interface, you can also interact directly with the underlying `DataFrame`.

The column labels of the raw table indicate the group as a prefix, e.g. `ent__*` for BIDS entities. However, you may find one of the alternative views of the table more useful:

- `BIDSTable.nested`: Columns organized as a nested pandas `MultiIndex`.
- `BIDSTable.flat`: Flattened columns without any nesting or group prefix.
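The relationship between the raw prefixed labels and the nested view can be illustrated with plain pandas. A toy sketch (the column names are made up to mirror the `group__name` layout, not taken from a real index):

```python
import pandas as pd

# A toy table with prefixed columns, mimicking the raw index layout.
df = pd.DataFrame(
    {
        "ent__sub": ["04"],
        "ent__task": ["rest"],
        "finfo__file_path": ["/data/sub-04_task-rest_bold.nii.gz"],
    }
)

# Split each "group__name" label into a two-level MultiIndex, which is
# roughly what the nested view provides.
df.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split("__", 1)) for c in df.columns]
)

# Groups can now be selected by their top-level label.
sub = df["ent"]["sub"]
```

Dropping the first level of such a `MultiIndex` gives the flattened, prefix-free view.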
You should avoid manipulating the table in place if possible, as this may interfere with the higher-level accessors. If you must manipulate in place, consider converting the `BIDSTable` to a plain `DataFrame` first.

```python
df = pd.DataFrame(tab)
```
```python
# Register elbow extension types
import elbow.dtypes  # noqa

from ._b2t import bids2table
from ._version import __version__, __version_tuple__  # noqa
from .entities import BIDSEntities, parse_bids_entities
from .table import BIDSFile, BIDSTable, join_bids_path

__all__ = [
    "bids2table",
    "BIDSTable",
    "BIDSFile",
    "BIDSEntities",
    "parse_bids_entities",
    "join_bids_path",
]
```
```python
def bids2table(
    root: StrOrPath,
    *,
    with_meta: bool = True,
    persistent: bool = False,
    index_path: Optional[StrOrPath] = None,
    exclude: Optional[List[str]] = None,
    incremental: bool = False,
    overwrite: bool = False,
    workers: Optional[int] = None,
    worker_id: Optional[int] = None,
    return_table: bool = True,
) -> Optional[BIDSTable]:
    """
    Index a BIDS dataset directory and load as a pandas DataFrame.

    Args:
        root: path to BIDS dataset
        with_meta: extract JSON sidecar metadata. Excluding metadata can result in much
            faster indexing.
        persistent: whether to save index to disk as a Parquet dataset
        index_path: path to BIDS Parquet index to generate or load. Defaults to
            `root / "index.b2t"`. Index generation requires `persistent=True`.
        exclude: optional list of directory names or glob patterns to exclude from
            indexing.
        incremental: update index incrementally with only new or changed files.
        overwrite: overwrite previous index.
        workers: number of parallel processes. If `None` or 1, run in the main
            process. Setting to <= 0 runs as many processes as there are cores
            available.
        worker_id: optional worker ID to use when scheduling parallel tasks externally.
            Specifying the number of workers is required in this case. Incompatible
            with overwrite.
        return_table: whether to return the BIDS table or just build the persistent
            index.

    Returns:
        A `BIDSTable` representing the indexed dataset(s), or `None` if `return_table`
        is `False`.
    """
    if not (return_table or persistent):
        raise ValueError("persistent and return_table should not both be False")

    root = Path(root).expanduser().resolve()
    if not root.is_dir():
        raise FileNotFoundError(f"root directory {root} does not exist")

    if exclude is None:
        exclude = []

    source = Crawler(
        root=root,
        include=["sub-*"],  # find subject dirs
        skip=["sub-*"] + exclude,  # but don't crawl into subject dirs
        dirs_only=True,
        follow_links=True,
    )
    extract = partial(extract_bids_subdir, exclude=exclude, with_meta=with_meta)

    if index_path is None:
        index_path = root / "index.b2t"
    else:
        index_path = Path(index_path).resolve()

    stale = overwrite or incremental or worker_id is not None
    if index_path.exists() and not stale:
        if return_table:
            logger.info("Loading cached index %s", index_path)
            tab = BIDSTable.from_parquet(index_path)
        else:
            logger.info("Found cached index %s; nothing to do", index_path)
            tab = None
        return tab

    if not persistent:
        logger.info("Building index in memory")
        df = build_table(
            source=source,
            extract=extract,
            workers=workers,
            worker_id=worker_id,
        )
        tab = BIDSTable.from_df(df)
        return tab

    logger.info("Building persistent Parquet index")
    build_parquet(
        source=source,
        extract=extract,
        output=index_path,
        incremental=incremental,
        overwrite=overwrite,
        workers=workers,
        worker_id=worker_id,
        path_column="file__file_path",
        mtime_column="file__mod_time",
    )
    tab = BIDSTable.from_parquet(index_path) if return_table else None
    return tab
```
Index a BIDS dataset directory and load as a pandas DataFrame.

Arguments:

- root: path to BIDS dataset
- with_meta: extract JSON sidecar metadata. Excluding metadata can result in much faster indexing.
- persistent: whether to save index to disk as a Parquet dataset
- index_path: path to BIDS Parquet index to generate or load. Defaults to `root / "index.b2t"`. Index generation requires `persistent=True`.
- exclude: optional list of directory names or glob patterns to exclude from indexing.
- incremental: update index incrementally with only new or changed files.
- overwrite: overwrite previous index.
- workers: number of parallel processes. If `None` or 1, run in the main process. Setting to <= 0 runs as many processes as there are cores available.
- worker_id: optional worker ID to use when scheduling parallel tasks externally. Specifying the number of workers is required in this case. Incompatible with overwrite.
- return_table: whether to return the BIDS table or just build the persistent index.

Returns:

A `BIDSTable` representing the indexed dataset(s), or `None` if `return_table` is `False`.
```python
class BIDSTable(pd.DataFrame):
    """
    A table representing one or more BIDS datasets.
    """

    @cached_property
    def ds(self) -> pd.DataFrame:
        """
        The dataset (`ds`) subtable.
        """
        return self.nested["ds"]

    @cached_property
    def ent(self) -> pd.DataFrame:
        """
        The entities (`ent`) subtable.
        """
        return self.nested["ent"]

    @cached_property
    def meta(self) -> pd.DataFrame:
        """
        The metadata (`meta`) subtable.
        """
        return self.nested["meta"]

    @cached_property
    def finfo(self) -> pd.DataFrame:
        """
        The file info (`finfo`) subtable.
        """
        return self.nested["finfo"]

    @cached_property
    def files(self) -> List["BIDSFile"]:
        """
        Convert the table to a list of structured `BIDSFile`s.
        """

        def to_dict(val):
            if pd.isna(val):
                return {}
            return dict(val)

        return [
            BIDSFile(
                dataset=row["ds"]["dataset"],
                root=Path(row["ds"]["dataset_path"]),
                path=Path(row["finfo"]["file_path"]),
                entities=BIDSEntities.from_dict(row["ent"]),
                metadata=to_dict(row["meta"]["json"]),
            )
            for _, row in self.nested.iterrows()
        ]

    @cached_property
    def datatypes(self) -> List[str]:
        """
        Get all datatypes present in the table.
        """
        return self.ent["datatype"].unique().tolist()

    @cached_property
    def modalities(self) -> List[str]:
        """
        Get all modalities present in the table.
        """
        # TODO: Is this the right way to get the modality
        return self.ent["mod"].unique().tolist()

    @cached_property
    def subjects(self) -> List[str]:
        """
        Get all unique subjects in the table.
        """
        return self.ent["sub"].unique().tolist()

    @cached_property
    def entities(self) -> List[str]:
        """
        Get all entity keys with at least one non-NA entry in the table.
        """
        entities = self.ent.dropna(axis=1, how="all").columns.tolist()
        special = set(BIDSEntities.special())
        return [key for key in entities if key not in special]

    @cached_property
    def flat_meta(self) -> pd.DataFrame:
        """
        A table of flattened JSON metadata where each metadata field is converted to its
        own column, with nested levels separated by `'.'`.

        See also:

        - [`pd.json_normalize`](https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html):
          more general function in pandas.
        """
        # Need to replace None with empty dict for max_level=0 to work.
        metadata = pd.json_normalize(
            self["meta__json"].map(lambda v: v or {}), max_level=0
        )
        metadata.index = self.index
        return metadata

    @cached_property
    def nested(self) -> pd.DataFrame:
        """
        A copy of the table with column labels organized in a nested
        [`MultiIndex`](https://pandas.pydata.org/docs/user_guide/advanced.html#hierarchical-indexing-multiindex).
        """
        # Cast back to the base class since we no longer have the full BIDS table
        # structure.
        return pd.DataFrame(flat_to_multi_columns(self))

    @cached_property
    def flat(self) -> pd.DataFrame:
        """
        A copy of the table with subtable prefixes e.g. `ds__`, `ent__` removed.
        """
        return self.nested.droplevel(0, axis=1)

    def filter(
        self,
        key: str,
        value: Optional[Any] = None,
        *,
        items: Optional[Iterable[Any]] = None,
        contains: Optional[str] = None,
        regex: Optional[str] = None,
        func: Optional[Callable[[Any], bool]] = None,
    ) -> "BIDSTable":
        """
        Filter the rows of the table.

        Args:
            key: Column to filter. Can be a metadata field, BIDS entity name, or any
                unprefixed column label in the `flat` table.
            value: Keep rows with this exact value.
            items: Keep rows whose value is in `items`.
            contains: Keep rows whose value contains `contains` (string only).
            regex: Keep rows whose value matches `regex` (string only).
            func: Apply an arbitrary function and keep values that evaluate to `True`.

        Returns:
            A filtered BIDS table.

        Example::
            filtered = (
                tab
                .filter("task", "rest")
                .filter("sub", items=["04", "06"])
                .filter("RepetitionTime", 2.0)
            )
        """
        # NOTE: Should be careful about reinventing a new style of query API. There are
        # some obvious things this can't do:
        #   - comparison operators <, >, <=, >=
        #   - negation
        #   - combining filters with 'or' instead of 'and'
        # At the bottom of this rabbit hole are more general query interfaces like those
        # already implemented in pandas, duckdb, polars. The goal should be not to
        # create a new one, but to make the 95% of use cases as easy as possible, and
        # empower users to interact with the underlying table using their more powerful
        # tool of choice if necessary.
        if sum(k is not None for k in [value, items, contains, regex, func]) != 1:
            raise ValueError(
                "Exactly one of value, items, contains, regex, or func must not be None"
            )

        try:
            # JSON metadata field
            # NOTE: Assuming all JSON metadata fields are uppercase.
            if key[:1].isupper():
                col = self.flat_meta[key]
            # Long name entity
            elif key in ENTITY_NAMES_TO_KEYS:
                col = self.ent[ENTITY_NAMES_TO_KEYS[key]]
            # Any other unprefixed column
            else:
                col = self.flat[key]
        except KeyError as exc:
            raise KeyError(
                f"Invalid key {key}; expected a valid BIDS entity or metadata field "
                "present in the dataset"
            ) from exc

        if value is not None:
            mask = col == value
        elif items is not None:
            mask = col.isin(items)
        elif contains is not None:
            mask = col.str.contains(contains)
        elif regex is not None:
            mask = col.str.match(regex)
        else:
            mask = col.apply(func)
        mask = mask.fillna(False).astype(bool)

        return self.loc[mask]

    def filter_multi(self, **filters) -> "BIDSTable":
        """
        Apply multiple filters to the table sequentially.

        Args:
            filters: A mapping of column labels to queries. Each query can either be
                a single value for an exact equality check or a `dict` for a more
                complex query, e.g. `{"items": [1, 2, 3]}`, that's passed through to
                `filter`.

        Returns:
            A filtered BIDS table.

        Example::
            filtered = tab.filter_multi(
                task="rest",
                sub={"items": ["04", "06"]},
                RepetitionTime=2.0,
            )
        """
        tab = self.copy(deep=False)

        for k, query in filters.items():
            if not isinstance(query, dict):
                query = {"value": query}
            tab = tab.filter(k, **query)
        return tab

    def sort_entities(
        self, by: Union[str, List[str]], inplace: bool = False
    ) -> "BIDSTable":
        """
        Sort the values of the table by entities.

        Args:
            by: label or list of labels. Can be `"dataset"` or a short or long entity
                name.
            inplace: sort the table in place

        Returns:
            A sorted BIDS table.
        """
        if isinstance(by, str):
            by = [by]

        # TODO: what about sorting by other columns, e.g. file_path?
        def add_prefix(k: str):
            if k == "dataset":
                k = f"ds__{k}"
            elif k in ENTITY_NAMES_TO_KEYS:
                k = f"ent__{ENTITY_NAMES_TO_KEYS[k]}"
            else:
                k = f"ent__{k}"
            return k

        by = [add_prefix(k) for k in by]
        out = self.sort_values(by, inplace=inplace)
        if inplace:
            return self
        return out

    def with_meta(self, inplace: bool = False) -> "BIDSTable":
        """
        Returns a new BIDS table complete with JSON sidecar metadata.
        """
        out = self if inplace else self.copy()
        file_paths = out.finfo["file_path"]
        meta_json = file_paths.apply(lambda path: extract_metadata(path)["json"])
        out.loc[:, "meta__json"] = meta_json
        return out

    @classmethod
    def from_df(cls, df: pd.DataFrame) -> "BIDSTable":
        """
        Create a BIDS table from a pandas `DataFrame` generated by `bids2table`.
        """
        return cls(df)

    @classmethod
    def from_parquet(cls, path: Path) -> "BIDSTable":
        """
        Read a BIDS table from a Parquet file or dataset directory generated by
        `bids2table`.
        """
        df = pd.read_parquet(path)
        return cls.from_df(df)

    @property
    def _constructor(self):
        # Makes sure that dataframe slices return a subclass instance
        # https://pandas.pydata.org/docs/development/extending.html#override-constructor-properties
        return BIDSTable
```
A table representing one or more BIDS datasets.
A table of flattened JSON metadata where each metadata field is converted to its own column, with nested levels separated by `'.'`.

See also: `pd.json_normalize`, a more general function in pandas.
A copy of the table with column labels organized in a nested pandas `MultiIndex`.
A copy of the table with subtable prefixes, e.g. `ds__`, `ent__`, removed.
Filter the rows of the table.

Arguments:

- key: Column to filter. Can be a metadata field, BIDS entity name, or any unprefixed column label in the `flat` table.
- value: Keep rows with this exact value.
- items: Keep rows whose value is in `items`.
- contains: Keep rows whose value contains `contains` (string only).
- regex: Keep rows whose value matches `regex` (string only).
- func: Apply an arbitrary function and keep values that evaluate to `True`.

Returns:

A filtered BIDS table.

Example:

```python
filtered = (
    tab
    .filter("task", "rest")
    .filter("sub", items=["04", "06"])
    .filter("RepetitionTime", 2.0)
)
```
Apply multiple filters to the table sequentially.

Arguments:

- filters: A mapping of column labels to queries. Each query can either be a single value for an exact equality check or a `dict` for a more complex query, e.g. `{"items": [1, 2, 3]}`, that's passed through to `filter`.

Returns:

A filtered BIDS table.

Example:

```python
filtered = tab.filter_multi(
    task="rest",
    sub={"items": ["04", "06"]},
    RepetitionTime=2.0,
)
```
Sort the values of the table by entities.

Arguments:

- by: label or list of labels. Can be `"dataset"` or a short or long entity name.
- inplace: sort the table in place.

Returns:

A sorted BIDS table.
Returns a new BIDS table complete with JSON sidecar metadata.
Create a BIDS table from a pandas `DataFrame` generated by `bids2table`.
Read a BIDS table from a Parquet file or dataset directory generated by `bids2table`.
```python
@dataclass
class BIDSFile:
    """
    A structured BIDS file.
    """

    dataset: str
    """Parent BIDS dataset."""
    root: Path
    """Path to parent dataset."""
    path: Path
    """File path."""
    entities: BIDSEntities
    """BIDS entities."""
    metadata: Dict[str, Any] = field(default_factory=dict)
    """BIDS JSON metadata."""

    @property
    def relative_path(self) -> Path:
        """
        The file path relative to the dataset root.
        """
        return self.path.relative_to(self.root)
```
A structured BIDS file.
```python
@lru_cache()
def parse_bids_entities(path: StrOrPath) -> Dict[str, str]:
    """
    Parse all BIDS filename ``"{key}-{value}"`` entities as well as special entities:

    - datatype
    - suffix
    - ext (extension)

    from the file path.

    .. note:: This function does not validate entities.
    """
    path = Path(path)
    entities = {}

    # datatype
    match = re.search(
        f"/({'|'.join(BIDS_DATATYPES)})/",
        path.as_posix(),
    )
    datatype = match.group(1) if match is not None else None

    filename = path.name
    parts = filename.split("_")

    # suffix and extension
    suffix_ext = parts.pop()
    idx = suffix_ext.find(".")
    if idx < 0:
        suffix, ext = suffix_ext, None
    else:
        suffix, ext = suffix_ext[:idx], suffix_ext[idx:]

    # suffix is actually an entity, put back in list
    if "-" in suffix:
        parts.append(suffix)
        suffix = None

    # parse entities
    for part in parts:
        if "-" in part:
            key, val = part.split("-", maxsplit=1)
        else:
            key, val = part, ""
        entities[key] = val

    for k, v in zip(["datatype", "suffix", "ext"], [datatype, suffix, ext]):
        if v is not None:
            entities[k] = v
    return entities
```
Parse all BIDS filename `"{key}-{value}"` entities as well as special entities:

- datatype
- suffix
- ext (extension)

from the file path.

Note: this function does not validate entities.
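The steps above can be exercised standalone. The following is a trimmed re-implementation for illustration only, not the library function itself: the datatype list is abbreviated and, matching the note above, no entity validation is done.

```python
import re
from pathlib import Path

# Abbreviated datatype list (the real BIDS_DATATYPES constant is longer).
DATATYPES = ["anat", "func", "dwi", "fmap", "eeg"]


def parse_entities(path: str) -> dict:
    """Parse "{key}-{value}" entities plus datatype, suffix, and ext."""
    p = Path(path)
    entities = {}

    # datatype: a parent directory named after a known datatype
    match = re.search(f"/({'|'.join(DATATYPES)})/", p.as_posix())
    if match is not None:
        entities["datatype"] = match.group(1)

    parts = p.name.split("_")

    # suffix and extension come from the last underscore-separated part
    suffix_ext = parts.pop()
    suffix, dot, rest = suffix_ext.partition(".")
    ext = dot + rest if dot else None

    # a "suffix" containing '-' is actually a trailing entity
    if "-" in suffix:
        parts.append(suffix)
        suffix = None

    for part in parts:
        key, _, val = part.partition("-")
        entities[key] = val

    if suffix:
        entities["suffix"] = suffix
    if ext:
        entities["ext"] = ext
    return entities
```

For example, `parse_entities("ds/sub-01/anat/sub-01_run-2_T1w.nii.gz")` yields the `sub` and `run` entities along with `datatype`, `suffix`, and `ext`.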
```python
def join_bids_path(
    row: Union[pd.Series, Dict[str, Any]],
    prefix: Optional[Union[str, Path]] = None,
    valid_only: bool = True,
) -> Path:
    """
    Reconstruct a BIDS path from a table row or entities dict.

    Args:
        row: row from a `BIDSTable` or `BIDSTable.ent` subtable.
        prefix: output file prefix path.
        valid_only: only include valid BIDS entities.

    Example::

        tab = BIDSTable.from_parquet("dataset/index.b2t")
        paths = tab.apply(join_bids_path, axis=1)
    """
    # Filter in case input is a row from the raw dataframe and not the entities group.
    row = _filter_row(row, group="ent")
    entities = BIDSEntities.from_dict(row, valid_only=valid_only)
    path = entities.to_path(prefix=prefix, valid_only=valid_only)
    return path
```
Reconstruct a BIDS path from a table row or entities dict.

Arguments:

- row: row from a `BIDSTable` or `BIDSTable.ent` subtable.
- prefix: output file prefix path.
- valid_only: only include valid BIDS entities.

Example:

```python
tab = BIDSTable.from_parquet("dataset/index.b2t")
paths = tab.apply(join_bids_path, axis=1)
```