bids2table
Index BIDS datasets fast, locally or in the cloud.
Installation
To install the latest release from PyPI, run
pip install bids2table
To install with S3 support, include the s3 extra
pip install bids2table[s3]
The latest development version can be installed with
pip install "bids2table[s3] @ git+https://github.com/childmindresearch/bids2table.git"
Usage
To run these examples, you will need to clone the bids-examples repo.
git clone -b 1.9.0 https://github.com/bids-standard/bids-examples.git
Finding BIDS datasets
You can search a directory for valid BIDS datasets using b2t2 find
(bids2table) clane$ b2t2 find bids-examples | head -n 10
bids-examples/asl002
bids-examples/ds002
bids-examples/ds005
bids-examples/asl005
bids-examples/ds051
bids-examples/eeg_rishikesh
bids-examples/asl004
bids-examples/asl003
bids-examples/ds003
bids-examples/eeg_cbm
Indexing datasets from the command line
Indexing datasets is done with b2t2 index. Here we index a single example dataset, saving the output as a parquet file.
(bids2table) clane$ b2t2 index -o ds102.parquet bids-examples/ds102
ds102: 100%|███████████████████████████████████████| 26/26 [00:00<00:00, 154.12it/s, sub=26, N=130]
You can also index a list of datasets. Note that each iteration in the progress bar represents one dataset.
(bids2table) clane$ b2t2 index -o bids-examples.parquet bids-examples/*
100%|████████████████████████████████████████████| 87/87 [00:00<00:00, 113.59it/s, ds=None, N=9727]
You can pipe the output of b2t2 find to b2t2 index to create an index of all datasets under a root directory.
(bids2table) clane$ b2t2 find bids-examples | b2t2 index -o bids-examples.parquet
97it [00:01, 96.05it/s, ds=ieeg_filtered_speech, N=10K]
The resulting index will include both top-level datasets (as in the previous command) as well as nested derivatives datasets.
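The index is an ordinary parquet file, so you can inspect it with any parquet reader. For example, with pandas (a quick sketch; the dataset column is part of the index schema described below):

import pandas as pd
# Load the combined index written by `b2t2 index`.
df = pd.read_parquet("bids-examples.parquet")
# Count indexed files per dataset.
print(df["dataset"].value_counts().head())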
Indexing datasets hosted on S3
bids2table supports indexing datasets hosted on S3 via cloudpathlib. To use this functionality, install bids2table with the s3 extra, or install cloudpathlib directly
pip install cloudpathlib[s3]
As an example, here we index all datasets on OpenNeuro
(bids2table) clane$ b2t2 index -o openneuro.parquet \
-j 8 --use-threads s3://openneuro.org/ds*
100%|█████████████████████████████████████| 1408/1408 [12:25<00:00, 1.89it/s, ds=ds006193, N=1.2M]
Using 8 threads, we can index all ~1400 OpenNeuro datasets (1.2M files) in less than 15 minutes.
Indexing datasets from Python
You can also index datasets using the Python API.
import bids2table as b2t2
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# Index a single dataset.
tab = b2t2.index_dataset("bids-examples/ds102")
# Find and index a batch of datasets.
tabs = b2t2.batch_index_dataset(
b2t2.find_bids_datasets("bids-examples"),
)
tab = pa.concat_tables(tabs)
# Index a dataset on S3.
tab = b2t2.index_dataset("s3://openneuro.org/ds000224")
# Save as parquet.
pq.write_table(tab, "ds000224.parquet")
# Convert to a pandas dataframe.
df = tab.to_pandas(types_mapper=pd.ArrowDtype)
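The resulting dataframe can then be explored with the usual pandas operations, for example:

# The "dataset", "root", and "path" columns are always part of the index schema;
# the remaining entity columns follow the BIDS schema.
print(df.columns.tolist())
print(df["path"].head())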
def index_dataset(
    root: str | PathT,
    include_subjects: str | list[str] | None = None,
    max_workers: int | None = 0,
    chunksize: int = 32,
    executor_cls: type[Executor] = ProcessPoolExecutor,
    show_progress: bool = False,
) -> pa.Table:
    """Index a BIDS dataset.

    Args:
        root: BIDS dataset root directory.
        include_subjects: Glob pattern or list of patterns for matching subjects to
            include in the index.
        max_workers: Number of indexing processes to run in parallel. Setting
            `max_workers=0` (the default) uses the main process only. Setting
            `max_workers=None` starts as many workers as there are available CPUs. See
            `concurrent.futures.ProcessPoolExecutor` for details.
        chunksize: Number of subjects per process task. Only used for
            `ProcessPoolExecutor` when `max_workers > 0`.
        executor_cls: Executor class to use for parallel indexing.
        show_progress: Show progress bar.

    Returns:
        An Arrow table index of the BIDS dataset.
    """
    root = as_path(root)

    schema = get_arrow_schema()

    dataset, _ = _get_bids_dataset(root)
    if dataset is None:
        _logger.warning(f"Path {root} is not a valid BIDS dataset directory.")
        return pa.Table.from_pylist([], schema=schema)

    subject_dirs = _find_bids_subject_dirs(root, include_subjects)
    subject_dirs = sorted(subject_dirs, key=lambda p: p.name)
    if len(subject_dirs) == 0:
        _logger.warning(f"Path {root} contains no matching subject dirs.")
        return pa.Table.from_pylist([], schema=schema)

    func = partial(_index_bids_subject_dir, schema=schema, dataset=dataset)

    tables = []
    file_count = 0
    for sub, table in (
        pbar := tqdm(
            _pmap(func, subject_dirs, max_workers, chunksize, executor_cls),
            desc=dataset,
            total=len(subject_dirs),
            disable=not show_progress,
        )
    ):
        file_count += len(table)
        pbar.set_postfix(dict(sub=sub, N=_hfmt(file_count)), refresh=False)
        tables.append(table)

    # NOTE: concat_tables produces a table where each column is a ChunkedArray, with one
    # chunk per original subject table. Is it better to keep the original chunks (one
    # per subject) or merge using `combine_chunks`?
    table = pa.concat_tables(tables).combine_chunks()
    return table
Index a BIDS dataset.
Arguments:
- root: BIDS dataset root directory.
- include_subjects: Glob pattern or list of patterns for matching subjects to include in the index.
- max_workers: Number of indexing processes to run in parallel. Setting max_workers=0 (the default) uses the main process only. Setting max_workers=None starts as many workers as there are available CPUs. See concurrent.futures.ProcessPoolExecutor for details.
- chunksize: Number of subjects per process task. Only used for ProcessPoolExecutor when max_workers > 0.
- executor_cls: Executor class to use for parallel indexing.
- show_progress: Show progress bar.
Returns:
An Arrow table index of the BIDS dataset.
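For example, to index the same example dataset using several worker processes with a progress bar:

import bids2table as b2t2
tab = b2t2.index_dataset(
    "bids-examples/ds102",
    max_workers=4,
    show_progress=True,
)
print(tab.num_rows)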
def batch_index_dataset(
    roots: list[str | PathT],
    max_workers: int | None = 0,
    executor_cls: type[Executor] = ProcessPoolExecutor,
    show_progress: bool = False,
) -> Generator[pa.Table, None, None]:
    """Index a batch of BIDS datasets.

    Args:
        roots: List of BIDS dataset root directories.
        max_workers: Number of indexing processes to run in parallel. Setting
            `max_workers=0` (the default) uses the main process only. Setting
            `max_workers=None` starts as many workers as there are available CPUs. See
            `concurrent.futures.ProcessPoolExecutor` for details.
        executor_cls: Executor class to use for parallel indexing.
        show_progress: Show progress bar.

    Yields:
        An Arrow table index for each BIDS dataset.
    """
    file_count = 0
    for dataset, table in (
        pbar := tqdm(
            _pmap(_batch_index_func, roots, max_workers, executor_cls=executor_cls),
            total=len(roots) if isinstance(roots, Sequence) else None,
            disable=show_progress not in {True, "dataset"},
        )
    ):
        file_count += len(table)
        pbar.set_postfix(dict(ds=dataset, N=_hfmt(file_count)), refresh=False)
        yield table
Index a batch of BIDS datasets.
Arguments:
- roots: List of BIDS dataset root directories.
- max_workers: Number of indexing processes to run in parallel. Setting max_workers=0 (the default) uses the main process only. Setting max_workers=None starts as many workers as there are available CPUs. See concurrent.futures.ProcessPoolExecutor for details.
- executor_cls: Executor class to use for parallel indexing.
- show_progress: Show progress bar.
Yields:
An Arrow table index for each BIDS dataset.
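Since tables are yielded one dataset at a time, you can write the index incrementally instead of holding it all in memory. A minimal sketch using pyarrow's ParquetWriter:

import pyarrow.parquet as pq
import bids2table as b2t2
schema = b2t2.get_arrow_schema()
roots = list(b2t2.find_bids_datasets("bids-examples"))
# Stream each per-dataset table into a single parquet file.
with pq.ParquetWriter("bids-examples.parquet", schema) as writer:
    for tab in b2t2.batch_index_dataset(roots, show_progress=True):
        if tab.num_rows > 0:
            writer.write_table(tab)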
def find_bids_datasets(
    root: str | PathT,
    exclude: str | list[str] | None = None,
    maxdepth: int | None = None,
) -> Generator[PathT, None, None]:
    """Find all BIDS datasets under a root directory.

    Args:
        root: Root path to begin search.
        exclude: Glob pattern or list of patterns matching sub-directory names to
            exclude from the search.
        maxdepth: Maximum depth to search.

    Yields:
        Root paths of all BIDS datasets under `root`.
    """
    root = as_path(root)

    if isinstance(exclude, str):
        exclude = [exclude]
    elif exclude is None:
        exclude = []
    exclude = [re.compile(fnmatch.translate(pat)) for pat in exclude]

    entry_count = 1
    ds_count = 0

    if _is_bids_dataset(root):
        ds_count += 1
        yield root

    # Tuple of path, depth
    stack = [(root, 0)]

    while stack:
        top, depth = stack.pop()

        inside_bids = _is_bids_dataset(top)
        depth += 1

        for entry in top.iterdir():
            entry_count += 1

            if any(re.fullmatch(pat, entry.name) for pat in exclude):
                continue

            if _is_bids_dataset(entry):
                ds_count += 1
                yield entry

            # Checks if we should descend into this directory.
            # Check not reached final depth.
            descend = maxdepth is None or depth < maxdepth
            # Heuristic checks whether the filename looks like a (visible) directory.
            descend = descend and not (entry.suffix or entry.name.startswith("."))
            # Only descend into specific subdirectories of BIDS directories.
            descend = descend and (
                not inside_bids or entry.name in _BIDS_NESTED_PARENT_DIRNAMES
            )
            # Finally, check if actually a directory (which is slow so we want to
            # short-circuit as much as possible).
            if descend and entry.is_dir():
                stack.append((entry, depth))
Find all BIDS datasets under a root directory.
Arguments:
- root: Root path to begin search.
- exclude: Glob pattern or list of patterns matching sub-directory names to exclude from the search.
- maxdepth: Maximum depth to search.
Yields:
Root paths of all BIDS datasets under root.
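For example, to search a directory tree while skipping some sub-directories and limiting the recursion depth (the patterns here are only illustrative):

import bids2table as b2t2
for ds_root in b2t2.find_bids_datasets(
    "bids-examples",
    exclude=["derivatives", "code"],
    maxdepth=3,
):
    print(ds_root)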
def get_arrow_schema() -> pa.Schema:
    """Get Arrow schema of the BIDS dataset index."""
    entity_schema = get_bids_entity_arrow_schema()
    index_fields = {
        name: pa.field(name, cfg["dtype"], metadata=cfg["metadata"])
        for name, cfg in _INDEX_ARROW_FIELDS.items()
    }
    fields = [
        index_fields["dataset"],
        *entity_schema,
        index_fields["extra_entities"],
        index_fields["root"],
        index_fields["path"],
    ]

    metadata = {
        **entity_schema.metadata,
        "bids2table_version": importlib.metadata.version(__package__),
    }
    schema = pa.schema(fields, metadata=metadata)
    return schema
Get Arrow schema of the BIDS dataset index.
def get_column_names() -> enum.StrEnum:
    """Get an enum of the BIDS index columns."""
    # TODO: It might be nice if the column names were statically available. One option
    # would be to generate a static _schema.py module at install time (similar to how
    # _version.py is generated) which defines the static default schema and column
    # names.
    schema = get_arrow_schema()
    items = []
    for f in schema:
        name = f.metadata["name".encode()].decode()
        items.append((name, name))

    BIDSColumn = enum.StrEnum("BIDSColumn", items)
    BIDSColumn.__doc__ = "Enum of BIDS index column names."
    return BIDSColumn
Get an enum of the BIDS index columns.
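These two functions are useful for discovering what the index contains, for example:

import bids2table as b2t2
# Arrow field names and types of the index.
schema = b2t2.get_arrow_schema()
print(schema.names)
# Enum of index column names.
BIDSColumn = b2t2.get_column_names()
print([c.value for c in BIDSColumn])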
def parse_bids_entities(path: str | Path) -> dict[str, str]:
    """Parse entities from BIDS file path.

    Parses all BIDS filename `"{key}-{value}"` entities as well as special entities:
    datatype, suffix, ext (extension). Does not validate entities or cast to types.

    Args:
        path: BIDS path to parse.

    Returns:
        A dict mapping BIDS entity keys to values.
    """
    if isinstance(path, str):
        path = Path(path)
    entities = {}

    filename = path.name
    parts = filename.split("_")

    datatype = _parse_bids_datatype(path)

    # Get suffix and extension.
    suffix_ext = parts.pop()
    suffix, dot, ext = suffix_ext.partition(".")
    ext = dot + ext if ext else None

    # Suffix is actually an entity, put back in list.
    if "-" in suffix:
        parts.append(suffix)
        suffix = None

    # Split entities, skipping any that don't contain a '-'.
    for part in parts:
        if "-" in part:
            key, val = part.split("-", maxsplit=1)
            entities[key] = val

    for k, v in zip(["datatype", "suffix", "ext"], [datatype, suffix, ext]):
        if v is not None:
            entities[k] = v
    return entities
Parse entities from BIDS file path.
Parses all BIDS filename "{key}-{value}" entities as well as special entities: datatype, suffix, ext (extension). Does not validate entities or cast to types.
Arguments:
- path: BIDS path to parse.
Returns:
A dict mapping BIDS entity keys to values.
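For example (the file path below is made up for illustration):

import bids2table as b2t2
entities = b2t2.parse_bids_entities(
    "sub-01/ses-1/func/sub-01_ses-1_task-rest_run-2_bold.nii.gz"
)
print(entities)
# roughly: {'sub': '01', 'ses': '1', 'task': 'rest', 'run': '2',
#           'datatype': 'func', 'suffix': 'bold', 'ext': '.nii.gz'}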
def validate_bids_entities(
    entities: dict[str, Any],
) -> tuple[dict[str, BIDSValue], dict[str, Any]]:
    """Validate BIDS entities.

    Validates the type and allowed values of each entity against the BIDS schema.

    Args:
        entities: dict mapping BIDS keys to unvalidated entities

    Returns:
        A tuple of `(valid_entities, extra_entities)`, where `valid_entities` is a
        mapping of valid BIDS keys to type-casted values, and `extra_entities` a
        mapping of any leftover entity mappings that didn't match a known entity or
        failed validation.
    """
    valid_entities = {}
    extra_entities = {}

    for name, value in entities.items():
        if name in _BIDS_NAME_ENTITY_MAP:
            entity = _BIDS_NAME_ENTITY_MAP[name]
            cfg = _BIDS_ENTITY_SCHEMA[entity]
            typ = _BIDS_FORMAT_PY_TYPE_MAP[cfg["format"]]

            # Cast to target type.
            try:
                value = typ(value)
            except ValueError:
                _logger.warning(
                    f"Unable to coerce {repr(value)} to type {typ} for entity '{name}'.",
                )
                extra_entities[name] = value
                continue

            # Check allowed values.
            if "enum" in cfg and value not in cfg["enum"]:
                _logger.warning(
                    f"Value {value} for entity '{name}' isn't one of the "
                    f"allowed values: {cfg['enum']}.",
                )
                extra_entities[name] = value
                continue

            valid_entities[name] = value
        else:
            extra_entities[name] = value

    return valid_entities, extra_entities
Validate BIDS entities.
Validates the type and allowed values of each entity against the BIDS schema.
Arguments:
- entities: dict mapping BIDS keys to unvalidated entities
Returns:
A tuple of (valid_entities, extra_entities), where valid_entities maps valid BIDS keys to type-cast values and extra_entities holds any leftover entities that didn't match a known entity or failed validation.
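Typically applied to the output of parse_bids_entities, for example:

import bids2table as b2t2
entities = b2t2.parse_bids_entities(
    "sub-01/func/sub-01_task-rest_run-2_bold.nii.gz"
)
valid, extra = b2t2.validate_bids_entities(entities)
# Index-valued entities such as run are cast to int; unknown or invalid
# entities are returned separately in `extra`.
print(valid)
print(extra)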
def set_bids_schema(path: str | Path | None = None) -> None:
    """Set the BIDS schema."""
    global _BIDS_SCHEMA, _BIDS_ENTITY_SCHEMA, _BIDS_NAME_ENTITY_MAP
    global _BIDS_ENTITY_ARROW_SCHEMA

    schema = bidsschematools.schema.load_schema(path)
    entity_schema = {
        entity: schema.objects.entities[entity].to_dict()
        for entity in schema.rules.entities
    }
    # Also include special extra entities (datatype, suffix, extension).
    entity_schema.update(_BIDS_SPECIAL_ENTITY_SCHEMA)
    name_entity_map = {cfg["name"]: entity for entity, cfg in entity_schema.items()}

    _BIDS_SCHEMA = schema
    _BIDS_ENTITY_SCHEMA = entity_schema
    _BIDS_NAME_ENTITY_MAP = name_entity_map

    _BIDS_ENTITY_ARROW_SCHEMA = _bids_entity_arrow_schema(
        entity_schema,
        bids_version=schema["bids_version"],
        schema_version=schema["schema_version"],
    )
Set the BIDS schema.
def get_bids_schema() -> Namespace:
    """Get the current BIDS schema."""
    return _BIDS_SCHEMA
Get the current BIDS schema.
def get_bids_entity_arrow_schema() -> pa.Schema:
    """Get the current BIDS entity schema in Arrow format."""
    return _BIDS_ENTITY_ARROW_SCHEMA
Get the current BIDS entity schema in Arrow format.
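For example, to reload the default schema bundled with bidsschematools and inspect the resulting entity fields (a small sketch):

import bids2table as b2t2
b2t2.set_bids_schema()  # path=None loads the default bundled schema
entity_schema = b2t2.get_bids_entity_arrow_schema()
print(entity_schema.names)
print(entity_schema.metadata)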
def format_bids_path(entities: dict[str, Any], int_format: str = "%d") -> Path:
    """Construct a formatted BIDS path from entities dict.

    Args:
        entities: dict mapping BIDS keys to values.
        int_format: format string for integer (index) BIDS values.

    Returns:
        A formatted `Path` instance.
    """
    special = {"datatype", "suffix", "ext"}

    # Formatted key-value entities.
    entities_fmt = []
    for name, value in entities.items():
        if name not in special:
            if isinstance(value, int):
                value = int_format % value
            entities_fmt.append(f"{name}-{value}")
    name = "_".join(entities_fmt)

    # Append suffix and extension.
    if suffix := entities.get("suffix"):
        name += f"_{suffix}"
    if ext := entities.get("ext"):
        name += ext

    # Prepend parent directories.
    path = Path(name)
    if datatype := entities.get("datatype"):
        path = datatype / path
    if ses := entities.get("ses"):
        path = f"ses-{ses}" / path
    path = f"sub-{entities['sub']}" / path
    return path
Construct a formatted BIDS path from entities dict.
Arguments:
- entities: dict mapping BIDS keys to values.
- int_format: format string for integer (index) BIDS values.
Returns:
A formatted Path instance.
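This is roughly the inverse of parse_bids_entities, for example:

import bids2table as b2t2
entities = {
    "sub": "01",
    "ses": "1",
    "task": "rest",
    "run": 2,
    "datatype": "func",
    "suffix": "bold",
    "ext": ".nii.gz",
}
print(b2t2.format_bids_path(entities))
# sub-01/ses-1/func/sub-01_ses-1_task-rest_run-2_bold.nii.gz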
def cloudpathlib_is_available() -> bool:
    """Check if cloudpathlib is available."""
    return _CLOUDPATHLIB_AVAILABLE
Check if cloudpathlib is available.
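This can be used as a guard before indexing S3 paths, for example:

import bids2table as b2t2
if b2t2.cloudpathlib_is_available():
    tab = b2t2.index_dataset("s3://openneuro.org/ds000224")
else:
    print("Install the s3 extra (or cloudpathlib) to index S3 datasets.")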