pyomop-migrate¶
The CdmGenericLoader lets you read data from any SQLAlchemy-compatible
source database and write it into an OMOP CDM target database, guided by a
JSON mapping file.
Overview¶
| Aspect | Detail |
|---|---|
| Source | Any database accessible via an async SQLAlchemy engine (SQLite, PostgreSQL, MySQL, …) |
| Target | Any OMOP CDM database initialised via CdmEngineFactory |
| Mapping | JSON file describing which source tables/columns map to which OMOP tables/columns |
| Post-processing | Automatic person_id FK normalisation, birth-date backfill, gender concept mapping, concept-code lookups |
Mapping File Format¶
The mapping file is a JSON document. The bundled example
src/pyomop/mapping.generic.example.json
demonstrates the full structure.
{
"tables": [
{
"source_table": "patients",
"name": "person",
"filters": [
{"column": "active", "equals": 1}
],
"columns": {
"person_id": "id",
"gender_concept_id": {"const": 0},
"gender_source_value": "gender",
"birth_datetime": "date_of_birth",
"year_of_birth": {"const": 0},
"race_concept_id": {"const": 0},
"race_source_value": "race",
"ethnicity_concept_id": {"const": 0},
"ethnicity_source_value": "ethnicity",
"person_source_value": "patient_key"
}
}
],
"concept": [
{
"table": "condition_occurrence",
"mappings": [
{
"source": "condition_source_value",
"target": "condition_concept_id"
}
]
}
],
"force_text_fields": ["drug_source_value"]
}
Top-level keys¶
| Key | Required | Description |
|---|---|---|
tables |
✅ | List of table-mapping objects (see below) |
concept |
☐ | Concept-code lookup definitions (same format as mapping.default.json) |
force_text_fields |
☐ | Column names that are always stored as plain text |
Table-mapping object¶
| Key | Required | Description |
|---|---|---|
source_table |
✅ | Table name in the source database |
name |
✅ | Table name in the target OMOP CDM database |
filters |
☐ | List of row-level filter objects (see below) |
columns |
✅ | {target_col: source_col} or {target_col: {"const": value}} |
Filter object¶
| Key | Description |
|---|---|
column |
Source column name |
equals |
Row is included only when the column equals this value |
not_empty |
If true, row is included only when the column is non-null and non-empty |
Column mapping values¶
| Value | Meaning |
|---|---|
"some_column" |
Copy the value from the named source column |
{"const": 0} |
Use a constant value (any JSON scalar) |
Post-load Steps¶
After inserting all rows the loader automatically performs the same cleanup
steps as CdmCsvLoader:
fix_person_id– Resolves any rows whereperson_idcontains a string source-value (e.g. UUID) and replaces it with the integer PK assigned to that person.backfill_person_birth_fields– Derivesyear_of_birth,month_of_birth,day_of_birthfrombirth_datetimewhere those fields are NULL or 0.update_person_gender_concept_id– Mapsgender_source_value(male/female/other) to the standard OMOP concept IDs (8507/8532/0).apply_concept_mappings– Looks upconcept.concept_idfor each source-value code defined in the"concept"section of the mapping.
Command-line Usage (--migrate)¶
CdmGenericLoader is also available directly from the pyomop CLI via the
--migrate flag. Source-database connection details are provided with
--src-* options; the target OMOP CDM database uses the standard --dbtype,
--host, --port, --user, --pw, --name, and --schema options.
Environment variables for source connection¶
As an alternative to CLI flags, all source-database connection parameters can be set as environment variables. This is recommended for production use because credentials are never part of the shell history:
| Environment variable | Equivalent CLI option |
|---|---|
SRC_DB_HOST |
--src-host |
SRC_DB_PORT |
--src-port |
SRC_DB_USER |
--src-user |
SRC_DB_PASSWORD |
--src-pw |
SRC_DB_NAME |
--src-name |
CLI flags take precedence over environment variables.
export SRC_DB_HOST=db.hospital.org
export SRC_DB_USER=readonly
export SRC_DB_PASSWORD=secret
export SRC_DB_NAME=ehr_db
pyomop-migrate --migrate \
--src-dbtype pgsql \
--dbtype sqlite --name omop.sqlite \
--mapping ehr_to_omop.json
pyomop-migrate --migrate [OPTIONS]
Required options¶
| Option | Description |
|---|---|
--mapping FILE / -m FILE |
Path to the JSON mapping file |
Target database options (OMOP CDM)¶
| Option | Default | Description |
|---|---|---|
--dbtype |
sqlite |
Target DB type: sqlite, mysql, or pgsql |
--host |
localhost |
Target DB host |
--port |
5432 |
Target DB port |
--user |
root |
Target DB user |
--pw |
pass |
Target DB password |
--name |
cdm.sqlite |
Target DB name or SQLite file path |
--schema |
(empty) | Target DB schema (PostgreSQL) |
--version |
cdm54 |
OMOP CDM version (cdm54 or cdm6) |
Source database options¶
| Option | Env var | Default | Description |
|---|---|---|---|
--src-dbtype |
— | sqlite |
Source DB type: sqlite, mysql, or pgsql |
--src-host |
SRC_DB_HOST |
localhost |
Source DB host |
--src-port |
SRC_DB_PORT |
5432 |
Source DB port |
--src-user |
SRC_DB_USER |
root |
Source DB user |
--src-pw |
SRC_DB_PASSWORD |
pass |
Source DB password |
--src-name |
SRC_DB_NAME |
source.sqlite |
Source DB name or SQLite file path |
--src-schema |
— | (empty) | Source DB schema (PostgreSQL) |
Tuning¶
| Option | Default | Description |
|---|---|---|
--batch-size |
1000 |
Rows per INSERT batch |
Examples¶
SQLite → SQLite
pyomop-migrate --migrate \
--src-dbtype sqlite --src-name /data/hospital_ehr.sqlite \
--dbtype sqlite --name /data/omop.sqlite \
--mapping /etc/pyomop/ehr_to_omop.json
PostgreSQL source → PostgreSQL OMOP target
pyomop-migrate --migrate \
--src-dbtype pgsql --src-host srchost --src-port 5432 \
--src-user readonly --src-pw secret --src-name ehr_db \
--dbtype pgsql --host omophost --port 5432 \
--user omop_writer --pw secret --name omop_db --schema cdm \
--mapping ehr_to_omop.json --batch-size 500
MySQL source → SQLite OMOP target
pyomop-migrate --migrate \
--src-dbtype mysql --src-host 192.168.1.10 --src-user reader --src-pw pass --src-name clinic \
--dbtype sqlite --name omop.sqlite \
--mapping clinic_to_omop.json
Schema Extraction (--extract-schema)¶
Before writing a mapping file you need to understand the structure of your
source database. The --extract-schema option introspects the source database
and writes a Markdown document containing:
- A summary table listing all tables with row counts and primary keys.
- Per-table column details: name, data type, nullable, default value, and PK/FK annotations.
- Foreign key relationship sections.
You can feed this Markdown directly to an AI assistant and ask it to generate the appropriate mapping JSON.
Usage¶
pyomop-migrate --extract-schema \
--src-dbtype sqlite --src-name hospital.sqlite \
--schema-output hospital_schema.md
The same SRC_DB_* environment variables apply here too:
export SRC_DB_HOST=db.hospital.org
export SRC_DB_USER=readonly
export SRC_DB_PASSWORD=secret
export SRC_DB_NAME=ehr_db
pyomop-migrate --extract-schema \
--src-dbtype pgsql \
--schema-output ehr_schema.md
CLI options¶
| Option | Default | Description |
|---|---|---|
--src-dbtype |
sqlite |
Source DB type |
--src-host / SRC_DB_HOST |
localhost |
Source DB host |
--src-port / SRC_DB_PORT |
5432 |
Source DB port |
--src-user / SRC_DB_USER |
root |
Source DB user |
--src-pw / SRC_DB_PASSWORD |
pass |
Source DB password |
--src-name / SRC_DB_NAME |
source.sqlite |
Source DB name |
--schema-output |
schema.md |
Output Markdown file path |
Supported Databases¶
| Database | Async driver | Example URL |
|---|---|---|
| SQLite | aiosqlite |
sqlite+aiosqlite:///path/to/db.sqlite |
| PostgreSQL | asyncpg |
postgresql+asyncpg://user:pass@host/db |
| MySQL | aiomysql |
mysql+aiomysql://user:pass@host/db |
Both the source and target can independently use any of the above backends.
Differences from CdmCsvLoader¶
| Feature | CdmCsvLoader |
CdmGenericLoader |
|---|---|---|
| Source | CSV file (via pandas) | Any SQL database |
| Filters | Pandas boolean masks | SQL WHERE clauses |
| Source identification | csv_key field |
source_table field per mapping entry |
| Batching | Per-table chunked INSERT | Per-table chunked INSERT |
| Post-processing steps | ✅ identical | ✅ identical |
Error Handling¶
- Missing source tables are warned and skipped rather than raising an exception, so a partial mapping can still load valid tables.
- Missing target tables produce a warning and are skipped.
- Individual value-coercion failures fall back to string representation.
- All errors are logged via the standard Python
loggingmodule (logger namepyomop.generic_loader).
Example: PostgreSQL source → SQLite OMOP target¶
import asyncio
from pyomop import CdmEngineFactory
from pyomop.cdm54 import Base
from pyomop.migrate.pyomop_migrate import CdmGenericLoader, create_source_engine
async def main():
source = create_source_engine(
"postgresql+asyncpg://readonly_user:secret@db.hospital.org/ehr"
)
target = CdmEngineFactory(db="sqlite", name="omop_ehr.sqlite")
_ = target.engine
await target.init_models(Base.metadata)
loader = CdmGenericLoader(source, target)
await loader.load("ehr_to_omop_mapping.json", batch_size=1000)
asyncio.run(main())