bids2table
Index BIDS datasets fast, locally or in the cloud.
Installation
Install the core package using pip:
pip install bids2table
Variants
Depending on your use case, you may need extra dependencies. Choose the matching option:
| If you want to... | Run this command |
|---|---|
| Add cloud storage support (S3, GCS) | pip install bids2table[cloud] |
| Enable pybids compatibility | pip install bids2table[pybids] |
| Install everything | pip install bids2table[cloud,pybids] |
Deprecation Warning: Previous versions used bids2table[s3] for cloud support.
While the s3 extra still works for now, it will be removed in the next major
release. Please update your installation scripts to use [cloud].
Development Version
To try the latest features from the main branch, install directly from GitHub:
pip install "bids2table[cloud,pybids] @ git+https://github.com/childmindresearch/bids2table.git"
Usage
To run these examples, you will need to clone the bids-examples repo.
git clone -b 1.9.0 https://github.com/bids-standard/bids-examples.git
Finding BIDS datasets
You can search a directory for valid BIDS datasets using b2t2 find:
(bids2table) clane$ b2t2 find bids-examples | head -n 10
bids-examples/asl002
bids-examples/ds002
bids-examples/ds005
bids-examples/asl005
bids-examples/ds051
bids-examples/eeg_rishikesh
bids-examples/asl004
bids-examples/asl003
bids-examples/ds003
bids-examples/eeg_cbm
Indexing datasets from the command line
Indexing datasets is done with b2t2 index. Here we index a single example dataset, saving the output as a parquet file.
(bids2table) clane$ b2t2 index -o ds102.parquet bids-examples/ds102
ds102: 100%|███████████████████████████████████████| 26/26 [00:00<00:00, 154.12it/s, sub=26, N=130]
You can also index a list of datasets. Note that each iteration in the progress bar represents one dataset.
(bids2table) clane$ b2t2 index -o bids-examples.parquet bids-examples/*
100%|████████████████████████████████████████████| 87/87 [00:00<00:00, 113.59it/s, ds=None, N=9727]
You can pipe the output of b2t2 find to b2t2 index to create an index of all datasets under a root directory.
(bids2table) clane$ b2t2 find bids-examples | b2t2 index -o bids-examples.parquet
97it [00:01, 96.05it/s, ds=ieeg_filtered_speech, N=10K]
The resulting index will include both top-level datasets (as in the previous command) as well as nested derivatives datasets.
Indexing datasets hosted on S3
bids2table supports indexing datasets hosted on S3 via cloudpathlib. To use this functionality, install bids2table with the cloud extra (the deprecated s3 extra also still works for now), or install cloudpathlib directly:
pip install cloudpathlib[s3]
As an example, here we index all datasets on OpenNeuro:
(bids2table) clane$ b2t2 index -o openneuro.parquet \
-j 8 --use-threads s3://openneuro.org/ds*
100%|█████████████████████████████████████| 1408/1408 [12:25<00:00, 1.89it/s, ds=ds006193, N=1.2M]
Using 8 threads, we can index all ~1400 OpenNeuro datasets (1.2M files) in less than 15 minutes.
Indexing datasets from Python
You can also index datasets using the Python API.
import bids2table as b2t2
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# Index a single dataset.
tab = b2t2.index_dataset("bids-examples/ds102")
# Find and index a batch of datasets.
tabs = b2t2.batch_index_dataset(
    b2t2.find_bids_datasets("bids-examples"),
)
tab = pa.concat_tables(tabs)
# Index a dataset on S3.
tab = b2t2.index_dataset("s3://openneuro.org/ds000224")
# Save as parquet.
pq.write_table(tab, "ds000224.parquet")
# Convert to a pandas dataframe.
df = tab.to_pandas(types_mapper=pd.ArrowDtype)
""".. include:: ../README.md"""  # noqa: D415

__all__ = [
    "index_dataset",
    "batch_index_dataset",
    "find_bids_datasets",
    "get_arrow_schema",
    "get_column_names",
    "parse_bids_entities",
    "validate_bids_entities",
    "set_bids_schema",
    "get_bids_schema",
    "get_bids_entity_arrow_schema",
    "format_bids_path",
    "load_bids_metadata",
    "cloudpathlib_is_available",
]

import importlib.util

if importlib.util.find_spec("pandas"):
    __all__.append("pybids")

from ._entities import (
    format_bids_path,
    get_bids_entity_arrow_schema,
    get_bids_schema,
    parse_bids_entities,
    set_bids_schema,
    validate_bids_entities,
)
from ._indexing import (
    batch_index_dataset,
    find_bids_datasets,
    get_arrow_schema,
    get_column_names,
    index_dataset,
)
from ._metadata import load_bids_metadata
from ._pathlib import cloudpathlib_is_available
from ._version import *
def index_dataset(
    root: str | PathT,
    include_subjects: str | list[str] | None = None,
    show_progress: bool = False,
) -> pa.Table:
    """Index a BIDS dataset.

    Args:
        root: BIDS dataset root directory.
        include_subjects: Glob pattern or list of patterns for matching subjects to
            include in the index.
        show_progress: Show progress bar.

    Returns:
        An Arrow table index of the BIDS dataset.
    """
    root = as_path(root)

    schema = get_arrow_schema()

    dataset, _ = _get_bids_dataset(root)
    if dataset is None:
        _logger.warning(f"Path {root} is not a valid BIDS dataset directory.")
        return pa.Table.from_pylist([], schema=schema)

    subject_dirs = _find_bids_subject_dirs(root, include_subjects)
    subject_dirs = sorted(subject_dirs, key=lambda p: p.name)
    if len(subject_dirs) == 0:
        _logger.warning(f"Path {root} contains no matching subject dirs.")
        return pa.Table.from_pylist([], schema=schema)

    tables = []
    file_count = 0
    for sub in subject_dirs:
        _, table = _index_bids_subject_dir(sub, schema=schema, dataset=dataset)
        tables.append(table)
        file_count += len(table)
    table = pa.concat_tables(tables).combine_chunks()
    return table
Index a BIDS dataset.
Arguments:
- root: BIDS dataset root directory.
- include_subjects: Glob pattern or list of patterns for matching subjects to include in the index.
- show_progress: Show progress bar.
Returns:
An Arrow table index of the BIDS dataset.
def batch_index_dataset(
    roots: list[str | PathT],
    max_workers: int | None = 0,
    executor_cls: type[Executor] = ProcessPoolExecutor,
    show_progress: bool = False,
) -> Generator[pa.Table, None, None]:
    """Index a batch of BIDS datasets.

    Args:
        roots: List of BIDS dataset root directories.
        max_workers: Number of indexing processes to run in parallel. Setting
            `max_workers=0` (the default) uses the main process only. Setting
            `max_workers=None` starts as many workers as there are available CPUs. See
            `concurrent.futures.ProcessPoolExecutor` for details.
        executor_cls: Executor class to use for parallel indexing.
        show_progress: Show progress bar.

    Yields:
        An Arrow table index for each BIDS dataset.
    """
    file_count = 0
    for dataset, table in (
        pbar := tqdm(
            _pmap(_batch_index_func, roots, max_workers, executor_cls=executor_cls),
            total=len(roots) if isinstance(roots, Sequence) else None,
            disable=show_progress not in {True, "dataset"},
        )
    ):
        file_count += len(table)
        pbar.set_postfix(dict(ds=dataset, N=_hfmt(file_count)), refresh=False)
        yield table
Index a batch of BIDS datasets.
Arguments:
- roots: List of BIDS dataset root directories.
- max_workers: Number of indexing processes to run in parallel. Setting max_workers=0 (the default) uses the main process only. Setting max_workers=None starts as many workers as there are available CPUs. See concurrent.futures.ProcessPoolExecutor for details.
- executor_cls: Executor class to use for parallel indexing.
- show_progress: Show progress bar.
Yields:
An Arrow table index for each BIDS dataset.
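The max_workers convention described above (0 runs everything in the calling process, any other value hands the work to an executor) can be sketched with the standard library alone. The helper pmap_sketch and the toy work function below are hypothetical names used only for illustration; this is not bids2table's internal _pmap, and it defaults to a thread pool so the snippet runs anywhere.

```python
from __future__ import annotations

from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Iterable, Iterator


def pmap_sketch(
    func: Callable[[str], int],
    items: Iterable[str],
    max_workers: int | None = 0,
    executor_cls: type[Executor] = ThreadPoolExecutor,
) -> Iterator[int]:
    """Lazily map func over items, optionally in parallel (hypothetical helper)."""
    if max_workers == 0:
        # Serial path: no executor is created at all.
        yield from map(func, items)
        return
    # Parallel path: max_workers=None lets the executor choose its own default.
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map yields results in input order.
        yield from pool.map(func, items)


def work(name: str) -> int:
    # Stand-in for indexing one dataset; returns a fake "file count".
    return len(name)


roots = ["ds001", "ds002", "ds000224"]
serial = list(pmap_sketch(work, roots, max_workers=0))
threaded = list(pmap_sketch(work, roots, max_workers=2))
print(serial, threaded)
```

Because the helper is a generator, results can be consumed one dataset at a time, which matches how batch_index_dataset yields one table per dataset.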
def find_bids_datasets(
    root: str | PathT,
    exclude: str | list[str] | None = None,
    maxdepth: int | None = None,
) -> Generator[PathT, None, None]:
    """Find all BIDS datasets under a root directory.

    Args:
        root: Root path to begin search.
        exclude: Glob pattern or list of patterns matching sub-directory names to
            exclude from the search.
        maxdepth: Maximum depth to search.

    Yields:
        Root paths of all BIDS datasets under `root`.
    """
    root = as_path(root)

    if isinstance(exclude, str):
        exclude = [exclude]
    elif exclude is None:
        exclude = []
    exclude = [re.compile(fnmatch.translate(pat)) for pat in exclude]

    entry_count = 1
    ds_count = 0

    if _is_bids_dataset(root):
        ds_count += 1
        yield root

    # Tuple of path, depth
    stack = [(root, 0)]

    while stack:
        top, depth = stack.pop()

        inside_bids = _is_bids_dataset(top)
        depth += 1

        for entry in top.iterdir():
            entry_count += 1

            if any(re.fullmatch(pat, entry.name) for pat in exclude):
                continue

            if _is_bids_dataset(entry):
                ds_count += 1
                yield entry

            # Checks if we should descend into this directory.
            # Check not reached final depth.
            descend = maxdepth is None or depth < maxdepth
            # Heuristic checks whether the filename looks like a (visible) directory.
            descend = descend and not (entry.suffix or entry.name.startswith("."))
            # Only descend into specific subdirectories of BIDS directories.
            descend = descend and (
                not inside_bids or entry.name in _BIDS_NESTED_PARENT_DIRNAMES
            )
            # Finally, check if actually a directory (which is slow so we want to
            # short-circuit as much as possible).
            if descend and entry.is_dir():
                stack.append((entry, depth))
Find all BIDS datasets under a root directory.
Arguments:
- root: Root path to begin search.
- exclude: Glob pattern or list of patterns matching sub-directory names to exclude from the search.
- maxdepth: Maximum depth to search.
Yields:
Root paths of all BIDS datasets under root.
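To make the exclude argument concrete, here is a minimal standalone sketch of how glob patterns can be compiled to regular expressions and matched against directory names, following the same fnmatch.translate approach that appears in the source listing. The helper names compile_excludes and is_excluded and the sample directory names are made up for illustration.

```python
import fnmatch
import re


def compile_excludes(exclude):
    """Normalize a pattern, list of patterns, or None into compiled regexes."""
    if isinstance(exclude, str):
        exclude = [exclude]
    elif exclude is None:
        exclude = []
    # fnmatch.translate turns a shell-style glob into a regex string.
    return [re.compile(fnmatch.translate(pat)) for pat in exclude]


def is_excluded(name, patterns):
    """Return True if the entry name fully matches any exclude pattern."""
    return any(pat.fullmatch(name) for pat in patterns)


patterns = compile_excludes(["sourcedata", ".*", "tmp_*"])
names = ["sub-01", "sourcedata", ".git", "tmp_scratch", "derivatives"]
kept = [name for name in names if not is_excluded(name, patterns)]
print(kept)
```

Note that patterns are matched against the entry name only, not the full path, so "tmp_*" excludes a directory called tmp_scratch at any depth.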
def get_arrow_schema() -> pa.Schema:
    """Get Arrow schema of the BIDS dataset index."""
    entity_schema = get_bids_entity_arrow_schema()
    index_fields = {
        name: pa.field(name, cfg["dtype"], metadata=cfg["metadata"])
        for name, cfg in _INDEX_ARROW_FIELDS.items()
    }
    fields = [
        index_fields["dataset"],
        *entity_schema,
        index_fields["extra_entities"],
        index_fields["root"],
        index_fields["path"],
    ]

    metadata = {
        **entity_schema.metadata,
        "bids2table_version": importlib.metadata.version(__package__),
    }
    schema = pa.schema(fields, metadata=metadata)
    return schema
Get Arrow schema of the BIDS dataset index.
def get_column_names() -> enum.StrEnum:
    """Get an enum of the BIDS index columns."""
    # TODO: It might be nice if the column names were statically available. One option
    # would be to generate a static _schema.py module at install time (similar to how
    # _version.py is generated) which defines the static default schema and column
    # names.
    schema = get_arrow_schema()
    items = []
    for f in schema:
        name = f.metadata["name".encode()].decode()
        items.append((name, name))

    BIDSColumn = enum.StrEnum("BIDSColumn", items)
    BIDSColumn.__doc__ = "Enum of BIDS index column names."
    return BIDSColumn
Get an enum of the BIDS index columns.
def parse_bids_entities(path: str | Path) -> dict[str, str]:
    """Parse entities from BIDS file path.

    Parses all BIDS filename `"{key}-{value}"` entities as well as special entities:
    datatype, suffix, ext (extension). Does not validate entities or cast to types.

    Args:
        path: BIDS path to parse.

    Returns:
        A dict mapping BIDS entity keys to values.
    """
    if isinstance(path, str):
        path = Path(path)
    entities = {}

    filename = path.name
    parts = filename.split("_")

    datatype = _parse_bids_datatype(path)

    # Get suffix and extension.
    suffix_ext = parts.pop()
    suffix, dot, ext = suffix_ext.partition(".")
    ext = dot + ext if ext else None

    # Suffix is actually an entity, put back in list.
    if "-" in suffix:
        parts.append(suffix)
        suffix = None

    # Split entities, skipping any that don't contain a '-'.
    for part in parts:
        if "-" in part:
            key, val = part.split("-", maxsplit=1)
            entities[key] = val

    for k, v in zip(["datatype", "suffix", "ext"], [datatype, suffix, ext]):
        if v is not None:
            entities[k] = v
    return entities
Parse entities from BIDS file path.
Parses all BIDS filename "{key}-{value}" entities as well as special entities:
datatype, suffix, ext (extension). Does not validate entities or cast to types.
Arguments:
- path: BIDS path to parse.
Returns:
A dict mapping BIDS entity keys to values.
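As a worked illustration of the filename parsing described above, the following condensed, standalone sketch handles only the filename portion. The name parse_entities_sketch is hypothetical, and the datatype entity is deliberately left out because the helper that derives it from parent directories is not shown here. Values stay as strings, since parsing does no validation or casting.

```python
from pathlib import Path


def parse_entities_sketch(path):
    """Parse key-value entities, suffix, and extension from a BIDS filename."""
    parts = Path(path).name.split("_")

    # The last underscore-separated chunk holds the suffix and extension.
    suffix, dot, ext = parts.pop().partition(".")

    entities = {}
    # If the last chunk is itself a key-value entity, put it back.
    if "-" in suffix:
        parts.append(suffix)
        suffix = None

    # Split "key-value" chunks, skipping anything without a dash.
    for part in parts:
        if "-" in part:
            key, val = part.split("-", maxsplit=1)
            entities[key] = val

    if suffix:
        entities["suffix"] = suffix
    if ext:
        entities["ext"] = dot + ext
    return entities


parsed = parse_entities_sketch(
    "sub-01/ses-02/func/sub-01_ses-02_task-rest_run-1_bold.nii.gz"
)
print(parsed)
```

Splitting the extension at the first dot keeps compound extensions such as .nii.gz intact.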
def validate_bids_entities(
    entities: dict[str, Any],
) -> tuple[dict[str, BIDSValue], dict[str, Any]]:
    """Validate BIDS entities.

    Validates the type and allowed values of each entity against the BIDS schema.

    Args:
        entities: dict mapping BIDS keys to unvalidated entities

    Returns:
        A tuple of `(valid_entities, extra_entities)`, where `valid_entities` is a
        mapping of valid BIDS keys to type-casted values, and `extra_entities` a
        mapping of any leftover entity mappings that didn't match a known entity or
        failed validation.
    """
    valid_entities = {}
    extra_entities = {}

    for name, value in entities.items():
        if name in _BIDS_NAME_ENTITY_MAP:
            entity = _BIDS_NAME_ENTITY_MAP[name]
            cfg = _BIDS_ENTITY_SCHEMA[entity]
            typ = _BIDS_FORMAT_PY_TYPE_MAP[cfg["format"]]

            # Cast to target type.
            try:
                value = typ(value)
            except ValueError:
                _logger.warning(
                    f"Unable to coerce {repr(value)} to type {typ} for entity '{name}'.",
                )
                extra_entities[name] = value
                continue

            # Check allowed values.
            if "enum" in cfg and value not in cfg["enum"]:
                _logger.warning(
                    f"Value {value} for entity '{name}' isn't one of the "
                    f"allowed values: {cfg['enum']}.",
                )
                extra_entities[name] = value
                continue

            valid_entities[name] = value
        else:
            extra_entities[name] = value

    return valid_entities, extra_entities
Validate BIDS entities.
Validates the type and allowed values of each entity against the BIDS schema.
Arguments:
- entities: dict mapping BIDS keys to unvalidated entities
Returns:
A tuple of (valid_entities, extra_entities), where valid_entities is a mapping of valid BIDS keys to type-casted values, and extra_entities a mapping of any leftover entity mappings that didn't match a known entity or failed validation.
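The split into valid and extra entities can be illustrated with a toy example. The three-entry TOY_SCHEMA below is a made-up stand-in for the real BIDS schema, and validate_sketch is a hypothetical name; the sketch only mirrors the cast-then-check-allowed-values flow described above.

```python
# Toy stand-in for the BIDS entity schema (an assumption for illustration only).
TOY_SCHEMA = {
    "sub": {"type": str},
    "run": {"type": int},
    "hemi": {"type": str, "enum": ["L", "R"]},
}


def validate_sketch(entities):
    """Split entities into (valid, extra) using the toy schema."""
    valid, extra = {}, {}
    for name, value in entities.items():
        cfg = TOY_SCHEMA.get(name)
        if cfg is None:
            # Unknown entity: keep it, but outside the validated set.
            extra[name] = value
            continue
        try:
            # Cast to the target type, e.g. "2" -> 2 for index entities.
            value = cfg["type"](value)
        except ValueError:
            extra[name] = value
            continue
        if "enum" in cfg and value not in cfg["enum"]:
            # Known entity, but the value is not one of the allowed values.
            extra[name] = value
            continue
        valid[name] = value
    return valid, extra


valid, extra = validate_sketch({"sub": "01", "run": "2", "hemi": "X", "foo": "bar"})
print(valid, extra)
```

Nothing is discarded: entities that fail validation are moved to the extra mapping, so the original information remains recoverable.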
def set_bids_schema(path: str | Path | None = None) -> None:
    """Set the BIDS schema."""
    global _BIDS_SCHEMA, _BIDS_ENTITY_SCHEMA, _BIDS_NAME_ENTITY_MAP
    global _BIDS_ENTITY_ARROW_SCHEMA

    schema = bidsschematools.schema.load_schema(path)
    entity_schema = {
        entity: schema.objects.entities[entity].to_dict()
        for entity in schema.rules.entities
    }
    # Also include special extra entities (datatype, suffix, extension).
    entity_schema.update(_BIDS_SPECIAL_ENTITY_SCHEMA)
    name_entity_map = {cfg["name"]: entity for entity, cfg in entity_schema.items()}

    _BIDS_SCHEMA = schema
    _BIDS_ENTITY_SCHEMA = entity_schema
    _BIDS_NAME_ENTITY_MAP = name_entity_map

    _BIDS_ENTITY_ARROW_SCHEMA = _bids_entity_arrow_schema(
        entity_schema,
        bids_version=schema["bids_version"],
        schema_version=schema["schema_version"],
    )
Set the BIDS schema.
def get_bids_schema() -> Namespace:
    """Get the current BIDS schema."""
    return _BIDS_SCHEMA
Get the current BIDS schema.
def get_bids_entity_arrow_schema() -> pa.Schema:
    """Get the current BIDS entity schema in Arrow format."""
    return _BIDS_ENTITY_ARROW_SCHEMA
Get the current BIDS entity schema in Arrow format.
def format_bids_path(entities: dict[str, Any], int_format: str = "%d") -> Path:
    """Construct a formatted BIDS path from entities dict.

    Args:
        entities: dict mapping BIDS keys to values.
        int_format: format string for integer (index) BIDS values.

    Returns:
        A formatted `Path` instance.
    """
    special = {"datatype", "suffix", "ext"}

    # Formatted key-value entities.
    entities_fmt = []
    for name, value in entities.items():
        if name not in special:
            if isinstance(value, int):
                value = int_format % value
            entities_fmt.append(f"{name}-{value}")
    name = "_".join(entities_fmt)

    # Append suffix and extension.
    if suffix := entities.get("suffix"):
        name += f"_{suffix}"
    if ext := entities.get("ext"):
        name += ext

    # Prepend parent directories.
    path = Path(name)
    if datatype := entities.get("datatype"):
        path = datatype / path
    if ses := entities.get("ses"):
        path = f"ses-{ses}" / path
    path = f"sub-{entities['sub']}" / path
    return path
Construct a formatted BIDS path from entities dict.
Arguments:
- entities: dict mapping BIDS keys to values.
- int_format: format string for integer (index) BIDS values.
Returns:
A formatted Path instance.
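To show what int_format does to the resulting path, here is a small standalone sketch that follows the same steps (join key-value entities, append suffix and extension, prepend the datatype, session, and subject directories). The name format_path_sketch is hypothetical, and PurePosixPath is used instead of Path only so the printed result is the same on every platform.

```python
from pathlib import PurePosixPath


def format_path_sketch(entities, int_format="%d"):
    """Build a relative BIDS-style path from an entities dict."""
    special = {"datatype", "suffix", "ext"}

    # Key-value entities, in dict order; integers go through int_format.
    chunks = []
    for name, value in entities.items():
        if name in special:
            continue
        if isinstance(value, int):
            value = int_format % value
        chunks.append(f"{name}-{value}")
    filename = "_".join(chunks)

    # Append suffix and extension when present.
    if entities.get("suffix"):
        filename += f"_{entities['suffix']}"
    if entities.get("ext"):
        filename += entities["ext"]

    # Prepend parent directories: sub-*/[ses-*/][datatype/]filename
    path = PurePosixPath(filename)
    if entities.get("datatype"):
        path = entities["datatype"] / path
    if entities.get("ses"):
        path = f"ses-{entities['ses']}" / path
    return f"sub-{entities['sub']}" / path


example = format_path_sketch(
    {
        "sub": "01",
        "ses": "02",
        "task": "rest",
        "run": 3,
        "datatype": "func",
        "suffix": "bold",
        "ext": ".nii.gz",
    },
    int_format="%02d",
)
print(example)
```

With int_format="%02d" the integer run index 3 is rendered as run-03, while string values such as sub and ses are written unchanged.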
def load_bids_metadata(path: str | PathT, inherit: bool = True) -> dict[str, Any]:
    """Load the full JSON sidecar metadata for a BIDS file.

    Sidecar files are loaded according to the inheritance principle in top-down order.

    Args:
        path: BIDS file path
        inherit: Load the full metadata according to inheritance. Otherwise, load only
            the first JSON sidecar found in the bottom-up search.

    Returns:
        A sidecar metadata dictionary.
    """
    path = as_path(path)
    entities = _cache_parse_bids_entities(path)
    query = dict(entities, ext=".json")

    metadata = {}

    parent = path.parent
    if inherit:
        sidecars = reversed(list(_find_bids_parents(parent, query)))
    else:
        sidecars = [next(_find_bids_parents(parent, query))]

    for path in sidecars:
        try:
            data = _load_json(path)
            metadata.update(data)
        except (json.JSONDecodeError, TypeError):
            continue
    return metadata
Load the full JSON sidecar metadata for a BIDS file.
Sidecar files are loaded according to the inheritance principle in top-down order.
Arguments:
- path: BIDS file path
- inherit: Load the full metadata according to inheritance. Otherwise, load only the first JSON sidecar found in the bottom-up search.
Returns:
A sidecar metadata dictionary.
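The top-down merge can be illustrated without the library. In this minimal sketch the sidecar search itself is skipped: the caller passes the sidecar paths already ordered from the dataset root down to the file, which is an assumption made for illustration. The name merge_sidecars, the file layout, and the metadata values are all made up.

```python
import json
import tempfile
from pathlib import Path


def merge_sidecars(paths):
    """Merge JSON sidecars in the given order; later files override earlier ones."""
    metadata = {}
    for path in paths:
        try:
            metadata.update(json.loads(Path(path).read_text()))
        except (json.JSONDecodeError, TypeError):
            # Skip sidecars that are not valid JSON objects.
            continue
    return metadata


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # A dataset-level sidecar and a more specific subject-level sidecar.
    top = root / "task-rest_bold.json"
    leaf = root / "sub-01" / "func" / "sub-01_task-rest_bold.json"
    leaf.parent.mkdir(parents=True)
    top.write_text(json.dumps({"RepetitionTime": 2.0, "TaskName": "rest"}))
    leaf.write_text(json.dumps({"RepetitionTime": 1.5}))

    # Top-down order: the sidecar closest to the data file is applied last.
    merged = merge_sidecars([top, leaf])

print(merged)
```

Applying the most specific sidecar last means its values win on conflicts, while keys defined only at higher levels are still inherited.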
def cloudpathlib_is_available() -> bool:
    """Check if cloudpathlib is available."""
    return _CLOUDPATHLIB_AVAILABLE
Check if cloudpathlib is available.