# bids2table
Index BIDS datasets fast, locally or in the cloud.
## Installation

To install the latest release from PyPI, you can run

```sh
pip install bids2table
```

To install with S3 support, include the `s3` extra

```sh
pip install bids2table[s3]
```

The latest development version can be installed with

```sh
pip install "bids2table[s3] @ git+https://github.com/childmindresearch/bids2table.git"
```
## Usage

To run these examples, you will need to clone the [bids-examples](https://github.com/bids-standard/bids-examples) repo.

```sh
git clone -b 1.9.0 https://github.com/bids-standard/bids-examples.git
```
### Finding BIDS datasets

You can search a directory for valid BIDS datasets using `b2t2 find`

```
(bids2table) clane$ b2t2 find bids-examples | head -n 10
bids-examples/asl002
bids-examples/ds002
bids-examples/ds005
bids-examples/asl005
bids-examples/ds051
bids-examples/eeg_rishikesh
bids-examples/asl004
bids-examples/asl003
bids-examples/ds003
bids-examples/eeg_cbm
```
### Indexing datasets from the command line

Indexing datasets is done with `b2t2 index`. Here we index a single example dataset, saving the output as a parquet file.

```
(bids2table) clane$ b2t2 index -o ds102.parquet bids-examples/ds102
ds102: 100%|███████████████████████████████████████| 26/26 [00:00<00:00, 154.12it/s, sub=26, N=130]
```

You can also index a list of datasets. Note that each iteration in the progress bar represents one dataset.

```
(bids2table) clane$ b2t2 index -o bids-examples.parquet bids-examples/*
100%|████████████████████████████████████████████| 87/87 [00:00<00:00, 113.59it/s, ds=None, N=9727]
```

You can pipe the output of `b2t2 find` to `b2t2 index` to create an index of all datasets under a root directory.

```
(bids2table) clane$ b2t2 find bids-examples | b2t2 index -o bids-examples.parquet
97it [00:01, 96.05it/s, ds=ieeg_filtered_speech, N=10K]
```
The resulting index will include both top-level datasets (as in the previous command) and nested derivatives datasets.
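The saved index is a regular parquet file, so it can be loaded with any parquet reader. For example, a minimal sketch using pandas (the `dataset` column name follows the index schema returned by `get_arrow_schema`):

```python
import pandas as pd

# Load the index written by `b2t2 index`.
df = pd.read_parquet("bids-examples.parquet")

# Count indexed files per dataset.
print(df.groupby("dataset").size())
```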
### Indexing datasets hosted on S3

bids2table supports indexing datasets hosted on S3 via [cloudpathlib](https://github.com/drivendataorg/cloudpathlib). To use this functionality, make sure to install bids2table with the `s3` extra, or install cloudpathlib directly

```sh
pip install cloudpathlib[s3]
```
As an example, here we index all datasets on [OpenNeuro](https://openneuro.org/)

```
(bids2table) clane$ b2t2 index -o openneuro.parquet \
    -j 8 --use-threads s3://openneuro.org/ds*
100%|█████████████████████████████████████| 1408/1408 [12:25<00:00, 1.89it/s, ds=ds006193, N=1.2M]
```
Using 8 threads, we can index all ~1400 OpenNeuro datasets (1.2M files) in less than 15 minutes.
### Indexing datasets from Python

You can also index datasets using the Python API.

```python
import bids2table as b2t2
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Index a single dataset.
tab = b2t2.index_dataset("bids-examples/ds102")

# Find and index a batch of datasets.
tabs = b2t2.batch_index_dataset(
    b2t2.find_bids_datasets("bids-examples"),
)
tab = pa.concat_tables(tabs)

# Index a dataset on S3.
tab = b2t2.index_dataset("s3://openneuro.org/ds000224")

# Save as parquet.
pq.write_table(tab, "ds000224.parquet")

# Convert to a pandas dataframe.
df = tab.to_pandas(types_mapper=pd.ArrowDtype)
```
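Continuing from the snippet above, the index can then be queried like any other dataframe. This is just a sketch; the entity column names (`sub`, `task`, `suffix`) are assumptions based on the index schema:

```python
# Select all BOLD runs in the index (column names assumed from the schema).
bold = df[df["suffix"] == "bold"]
print(bold[["sub", "task", "path"]].head())
```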
## API reference

### index_dataset

```python
def index_dataset(
    root: str | PathT,
    include_subjects: str | list[str] | None = None,
    max_workers: int | None = 0,
    chunksize: int = 32,
    executor_cls: type[Executor] = ProcessPoolExecutor,
    show_progress: bool = False,
) -> pa.Table:
    """Index a BIDS dataset.

    Args:
        root: BIDS dataset root directory.
        include_subjects: Glob pattern or list of patterns for matching subjects to
            include in the index.
        max_workers: Number of indexing processes to run in parallel. Setting
            `max_workers=0` (the default) uses the main process only. Setting
            `max_workers=None` starts as many workers as there are available CPUs. See
            `concurrent.futures.ProcessPoolExecutor` for details.
        chunksize: Number of subjects per process task. Only used for
            `ProcessPoolExecutor` when `max_workers > 0`.
        executor_cls: Executor class to use for parallel indexing.
        show_progress: Show progress bar.

    Returns:
        An Arrow table index of the BIDS dataset.
    """
    root = as_path(root)

    schema = get_arrow_schema()

    dataset, _ = _get_bids_dataset(root)
    if dataset is None:
        _logger.warning(f"Path {root} is not a valid BIDS dataset directory.")
        return pa.Table.from_pylist([], schema=schema)

    subject_dirs = _find_bids_subject_dirs(root, include_subjects)
    subject_dirs = sorted(subject_dirs, key=lambda p: p.name)
    if len(subject_dirs) == 0:
        _logger.warning(f"Path {root} contains no matching subject dirs.")
        return pa.Table.from_pylist([], schema=schema)

    func = partial(_index_bids_subject_dir, schema=schema, dataset=dataset)

    tables = []
    file_count = 0
    for sub, table in (
        pbar := tqdm(
            _pmap(func, subject_dirs, max_workers, chunksize, executor_cls),
            desc=dataset,
            total=len(subject_dirs),
            disable=not show_progress,
        )
    ):
        file_count += len(table)
        pbar.set_postfix(dict(sub=sub, N=_hfmt(file_count)), refresh=False)
        tables.append(table)

    # NOTE: concat_tables produces a table where each column is a ChunkedArray, with one
    # chunk per original subject table. Is it better to keep the original chunks (one
    # per subject) or merge using `combine_chunks`?
    table = pa.concat_tables(tables).combine_chunks()
    return table
```
Index a BIDS dataset.

Arguments:
- root: BIDS dataset root directory.
- include_subjects: Glob pattern or list of patterns for matching subjects to include in the index.
- max_workers: Number of indexing processes to run in parallel. Setting `max_workers=0` (the default) uses the main process only. Setting `max_workers=None` starts as many workers as there are available CPUs. See `concurrent.futures.ProcessPoolExecutor` for details.
- chunksize: Number of subjects per process task. Only used for `ProcessPoolExecutor` when `max_workers > 0`.
- executor_cls: Executor class to use for parallel indexing.
- show_progress: Show progress bar.

Returns:
An Arrow table index of the BIDS dataset.
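For example, to index one of the example datasets with a few worker processes and a progress bar (a minimal sketch; the path is a placeholder):

```python
import bids2table as b2t2

# Index a single dataset using 4 worker processes and show a progress bar.
tab = b2t2.index_dataset(
    "bids-examples/ds102",
    max_workers=4,
    show_progress=True,
)
print(tab.num_rows, tab.column_names)
```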
### batch_index_dataset

```python
def batch_index_dataset(
    roots: list[str | PathT],
    max_workers: int | None = 0,
    executor_cls: type[Executor] = ProcessPoolExecutor,
    show_progress: bool = False,
) -> Generator[pa.Table, None, None]:
    """Index a batch of BIDS datasets.

    Args:
        roots: List of BIDS dataset root directories.
        max_workers: Number of indexing processes to run in parallel. Setting
            `max_workers=0` (the default) uses the main process only. Setting
            `max_workers=None` starts as many workers as there are available CPUs. See
            `concurrent.futures.ProcessPoolExecutor` for details.
        executor_cls: Executor class to use for parallel indexing.
        show_progress: Show progress bar.

    Yields:
        An Arrow table index for each BIDS dataset.
    """
    file_count = 0
    for dataset, table in (
        pbar := tqdm(
            _pmap(_batch_index_func, roots, max_workers, executor_cls=executor_cls),
            total=len(roots) if isinstance(roots, Sequence) else None,
            disable=show_progress not in {True, "dataset"},
        )
    ):
        file_count += len(table)
        pbar.set_postfix(dict(ds=dataset, N=_hfmt(file_count)), refresh=False)
        yield table
```
Index a batch of BIDS datasets.

Arguments:
- roots: List of BIDS dataset root directories.
- max_workers: Number of indexing processes to run in parallel. Setting `max_workers=0` (the default) uses the main process only. Setting `max_workers=None` starts as many workers as there are available CPUs. See `concurrent.futures.ProcessPoolExecutor` for details.
- executor_cls: Executor class to use for parallel indexing.
- show_progress: Show progress bar.

Yields:
An Arrow table index for each BIDS dataset.
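Since the tables are yielded lazily, one per dataset, you typically collect or concatenate them yourself. A minimal sketch:

```python
import bids2table as b2t2
import pyarrow as pa

# Index every dataset found under bids-examples and combine into one table.
tabs = b2t2.batch_index_dataset(
    b2t2.find_bids_datasets("bids-examples"),
    show_progress=True,
)
tab = pa.concat_tables(tabs)
```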
### find_bids_datasets

```python
def find_bids_datasets(
    root: str | PathT,
    exclude: str | list[str] | None = None,
    follow_symlinks: bool = True,
    log_frequency: int = 100,
) -> Generator[PathT, None, None]:
    """Find all BIDS datasets under a root directory.

    Args:
        root: Root path to begin search.
        exclude: Glob pattern or list of patterns matching sub-directory names to
            exclude from the search.
        follow_symlinks: Search into symlinks that point to directories.

    Yields:
        Root paths of all BIDS datasets under `root`.
    """
    root = as_path(root)

    dir_count = 0
    ds_count = 0

    # NOTE: Path.walk was introduced in 3.12. Otherwise, could use an older python.
    for dirpath, dirnames, _ in root.walk(follow_symlinks=follow_symlinks):
        dir_count += 1

        if _is_bids_dataset(dirpath):
            ds_count += 1
            yield dirpath

        # Only descend into specific sub-directories that are allowed to contain
        # sub-datasets.
        _filter_dirnames(dirnames, _BIDS_NESTED_PARENT_DIRNAMES)

        # Filter sub-directories to descend into.
        if exclude:
            matches = _filter_exclude(dirnames, exclude)
            _filter_dirnames(dirnames, matches)

        if log_frequency and dir_count % log_frequency == 0:
            _logger.info(
                "Searched %d directories; found %d BIDS datasets.", dir_count, ds_count
            )

    if log_frequency:
        _logger.info(
            "Searched %d directories; found %d BIDS datasets.", dir_count, ds_count
        )
```
Find all BIDS datasets under a root directory.

Arguments:
- root: Root path to begin search.
- exclude: Glob pattern or list of patterns matching sub-directory names to exclude from the search.
- follow_symlinks: Search into symlinks that point to directories.
- log_frequency: Log search progress every `log_frequency` directories (set to 0 to disable progress logging).

Yields:
Root paths of all BIDS datasets under `root`.
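For example, to search a directory tree while skipping certain sub-directories (a minimal sketch; the `sourcedata*` pattern is only an illustration):

```python
import bids2table as b2t2

# Yield the root of every BIDS dataset under the current directory,
# skipping sub-directories whose names match the exclude pattern.
for root in b2t2.find_bids_datasets(".", exclude="sourcedata*"):
    print(root)
```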
### get_arrow_schema

```python
def get_arrow_schema() -> pa.Schema:
    """Get Arrow schema of the BIDS dataset index."""
    entity_schema = get_bids_entity_arrow_schema()
    index_fields = {
        name: pa.field(name, cfg["dtype"], metadata=cfg["metadata"])
        for name, cfg in _INDEX_ARROW_FIELDS.items()
    }
    fields = [
        index_fields["dataset"],
        *entity_schema,
        index_fields["extra_entities"],
        index_fields["root"],
        index_fields["path"],
    ]

    metadata = {
        **entity_schema.metadata,
        "bids2table_version": importlib.metadata.version(__package__),
    }
    schema = pa.schema(fields, metadata=metadata)
    return schema
```
Get Arrow schema of the BIDS dataset index.
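To see which columns the index contains, you can inspect the schema directly (a minimal sketch):

```python
import bids2table as b2t2

schema = b2t2.get_arrow_schema()
# The index has a `dataset` column, one column per BIDS entity, plus
# `extra_entities`, `root`, and `path`.
print(schema.names)
```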
### get_column_names

```python
def get_column_names() -> enum.StrEnum:
    """Get an enum of the BIDS index columns."""
    # TODO: It might be nice if the column names were statically available. One option
    # would be to generate a static _schema.py module at install time (similar to how
    # _version.py is generated) which defines the static default schema and column
    # names.
    schema = get_arrow_schema()
    items = []
    for f in schema:
        name = f.metadata["name".encode()].decode()
        items.append((name, name))

    BIDSColumn = enum.StrEnum("BIDSColumn", items)
    BIDSColumn.__doc__ = "Enum of BIDS index column names."
    return BIDSColumn
```
Get an enum of the BIDS index columns.
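The returned enum can be used in place of bare column-name strings (a minimal sketch; the exact member names depend on the active BIDS schema):

```python
import bids2table as b2t2

# Enum whose members are the BIDS index column names.
BIDSColumn = b2t2.get_column_names()
print([col.value for col in BIDSColumn])
```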
### parse_bids_entities

```python
def parse_bids_entities(path: str | Path) -> dict[str, str]:
    """Parse entities from BIDS file path.

    Parses all BIDS filename `"{key}-{value}"` entities as well as special entities:
    datatype, suffix, ext (extension). Does not validate entities or cast to types.

    Args:
        path: BIDS path to parse.

    Returns:
        A dict mapping BIDS entity keys to values.
    """
    if isinstance(path, str):
        path = Path(path)
    entities = {}

    filename = path.name
    parts = filename.split("_")

    datatype = _parse_bids_datatype(path)

    # Get suffix and extension.
    suffix_ext = parts.pop()
    suffix, dot, ext = suffix_ext.partition(".")
    ext = dot + ext if ext else None

    # Suffix is actually an entity, put back in list.
    if "-" in suffix:
        parts.append(suffix)
        suffix = None

    # Split entities, skipping any that don't contain a '-'.
    for part in parts:
        if "-" in part:
            key, val = part.split("-", maxsplit=1)
            entities[key] = val

    for k, v in zip(["datatype", "suffix", "ext"], [datatype, suffix, ext]):
        if v is not None:
            entities[k] = v
    return entities
```
Parse entities from BIDS file path.

Parses all BIDS filename `"{key}-{value}"` entities as well as special entities: datatype, suffix, ext (extension). Does not validate entities or cast to types.

Arguments:
- path: BIDS path to parse.

Returns:
A dict mapping BIDS entity keys to values.
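For example (a minimal sketch; the printed dict is illustrative of the keys described above):

```python
import bids2table as b2t2

entities = b2t2.parse_bids_entities(
    "bids-examples/ds102/sub-01/func/sub-01_task-flanker_run-1_bold.nii.gz"
)
# Expected to contain the filename entities plus the special keys, e.g.
# {"sub": "01", "task": "flanker", "run": "1",
#  "datatype": "func", "suffix": "bold", "ext": ".nii.gz"}
print(entities)
```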
### validate_bids_entities

```python
def validate_bids_entities(
    entities: dict[str, Any],
) -> tuple[dict[str, BIDSValue], dict[str, Any]]:
    """Validate BIDS entities.

    Validates the type and allowed values of each entity against the BIDS schema.

    Args:
        entities: dict mapping BIDS keys to unvalidated entities

    Returns:
        A tuple of `(valid_entities, extra_entities)`, where `valid_entities` is a
        mapping of valid BIDS keys to type-casted values, and `extra_entities` a
        mapping of any leftover entity mappings that didn't match a known entity or
        failed validation.
    """
    valid_entities = {}
    extra_entities = {}

    for name, value in entities.items():
        if name in _BIDS_NAME_ENTITY_MAP:
            entity = _BIDS_NAME_ENTITY_MAP[name]
            cfg = _BIDS_ENTITY_SCHEMA[entity]
            typ = _BIDS_FORMAT_PY_TYPE_MAP[cfg["format"]]

            # Cast to target type.
            try:
                value = typ(value)
            except ValueError:
                _logger.warning(
                    f"Unable to coerce {repr(value)} to type {typ} for entity '{name}'.",
                )
                extra_entities[name] = value
                continue

            # Check allowed values.
            if "enum" in cfg and value not in cfg["enum"]:
                _logger.warning(
                    f"Value {value} for entity '{name}' isn't one of the "
                    f"allowed values: {cfg['enum']}.",
                )
                extra_entities[name] = value
                continue

            valid_entities[name] = value
        else:
            extra_entities[name] = value

    return valid_entities, extra_entities
```
Validate BIDS entities.

Validates the type and allowed values of each entity against the BIDS schema.

Arguments:
- entities: dict mapping BIDS keys to unvalidated entities

Returns:
A tuple of `(valid_entities, extra_entities)`, where `valid_entities` is a mapping of valid BIDS keys to type-casted values, and `extra_entities` a mapping of any leftover entity mappings that didn't match a known entity or failed validation.
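A typical pattern is to pair this with `parse_bids_entities` (a minimal sketch; the cast of `run` to an int and the unknown key landing in the extras follow the behavior described above):

```python
import bids2table as b2t2

entities = {"sub": "01", "run": "1", "foo": "bar"}
valid, extra = b2t2.validate_bids_entities(entities)
print(valid)  # e.g. {"sub": "01", "run": 1}
print(extra)  # e.g. {"foo": "bar"}
```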
### set_bids_schema

```python
def set_bids_schema(path: str | Path | None = None) -> None:
    """Set the BIDS schema."""
    global _BIDS_SCHEMA, _BIDS_ENTITY_SCHEMA, _BIDS_NAME_ENTITY_MAP
    global _BIDS_ENTITY_ARROW_SCHEMA

    schema = bidsschematools.schema.load_schema(path)
    entity_schema = {
        entity: schema.objects.entities[entity].to_dict()
        for entity in schema.rules.entities
    }
    # Also include special extra entities (datatype, suffix, extension).
    entity_schema.update(_BIDS_SPECIAL_ENTITY_SCHEMA)
    name_entity_map = {cfg["name"]: entity for entity, cfg in entity_schema.items()}

    _BIDS_SCHEMA = schema
    _BIDS_ENTITY_SCHEMA = entity_schema
    _BIDS_NAME_ENTITY_MAP = name_entity_map

    _BIDS_ENTITY_ARROW_SCHEMA = _bids_entity_arrow_schema(
        entity_schema,
        bids_version=schema["bids_version"],
        schema_version=schema["schema_version"],
    )
```
Set the BIDS schema.
### get_bids_schema

```python
def get_bids_schema() -> Namespace:
    """Get the current BIDS schema."""
    return _BIDS_SCHEMA
```
Get the current BIDS schema.
### get_bids_entity_arrow_schema

```python
def get_bids_entity_arrow_schema() -> pa.Schema:
    """Get the current BIDS entity schema in Arrow format."""
    return _BIDS_ENTITY_ARROW_SCHEMA
```
Get the current BIDS entity schema in Arrow format.
### format_bids_path

```python
def format_bids_path(entities: dict[str, Any], int_format: str = "%d") -> Path:
    """Construct a formatted BIDS path from entities dict.

    Args:
        entities: dict mapping BIDS keys to values.
        int_format: format string for integer (index) BIDS values.

    Returns:
        A formatted `Path` instance.
    """
    special = {"datatype", "suffix", "ext"}

    # Formatted key-value entities.
    entities_fmt = []
    for name, value in entities.items():
        if name not in special:
            if isinstance(value, int):
                value = int_format % value
            entities_fmt.append(f"{name}-{value}")
    name = "_".join(entities_fmt)

    # Append suffix and extension.
    if suffix := entities.get("suffix"):
        name += f"_{suffix}"
    if ext := entities.get("ext"):
        name += ext

    # Prepend parent directories.
    path = Path(name)
    if datatype := entities.get("datatype"):
        path = datatype / path
    if ses := entities.get("ses"):
        path = f"ses-{ses}" / path
    path = f"sub-{entities['sub']}" / path
    return path
```
Construct a formatted BIDS path from entities dict.

Arguments:
- entities: dict mapping BIDS keys to values.
- int_format: format string for integer (index) BIDS values.

Returns:
A formatted `Path` instance.
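Combined with `parse_bids_entities`, this can rebuild a relative BIDS path from an entities dict (a minimal sketch; the expected output follows the ordering rules in the source above):

```python
import bids2table as b2t2

entities = {
    "sub": "01",
    "ses": "01",
    "task": "rest",
    "run": 1,
    "datatype": "func",
    "suffix": "bold",
    "ext": ".nii.gz",
}
path = b2t2.format_bids_path(entities)
# Expected: sub-01/ses-01/func/sub-01_ses-01_task-rest_run-1_bold.nii.gz
print(path)
```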
### cloudpathlib_is_available

```python
def cloudpathlib_is_available() -> bool:
    """Check if cloudpathlib is available."""
    return _CLOUDPATHLIB_AVAILABLE
```
Check if cloudpathlib is available.