pyomop-migrate¶

The CdmGenericLoader lets you read data from any SQLAlchemy-compatible source database and write it into an OMOP CDM target database, guided by a JSON mapping file.

Overview¶

Aspect	Detail
Source	Any database accessible via an async SQLAlchemy engine (SQLite, PostgreSQL, MySQL, …)
Target	Any OMOP CDM database initialised via `CdmEngineFactory`
Mapping	JSON file describing which source tables/columns map to which OMOP tables/columns
Post-processing	Automatic `person_id` FK normalisation, birth-date backfill, gender concept mapping, concept-code lookups

Mapping File Format¶

The mapping file is a JSON document. The bundled example src/pyomop/mapping.generic.example.json demonstrates the full structure.

{
  "tables": [
    {
      "source_table": "patients",
      "name": "person",
      "filters": [
        {"column": "active", "equals": 1}
      ],
      "columns": {
        "person_id":              "id",
        "gender_concept_id":      {"const": 0},
        "gender_source_value":    "gender",
        "birth_datetime":         "date_of_birth",
        "year_of_birth":          {"const": 0},
        "race_concept_id":        {"const": 0},
        "race_source_value":      "race",
        "ethnicity_concept_id":   {"const": 0},
        "ethnicity_source_value": "ethnicity",
        "person_source_value":    "patient_key"
      }
    }
  ],
  "concept": [
    {
      "table": "condition_occurrence",
      "mappings": [
        {
          "source": "condition_source_value",
          "target": "condition_concept_id"
        }
      ]
    }
  ],
  "force_text_fields": ["drug_source_value"]
}

Top-level keys¶

Key	Required	Description
`tables`	✅	List of table-mapping objects (see below)
`concept`	☐	Concept-code lookup definitions (same format as `mapping.default.json`)
`force_text_fields`	☐	Column names that are always stored as plain text

Table-mapping object¶

Key	Required	Description
`source_table`	✅	Table name in the source database
`name`	✅	Table name in the target OMOP CDM database
`filters`	☐	List of row-level filter objects (see below)
`columns`	✅	`{target_col: source_col}` or `{target_col: {"const": value}}`

Filter object¶

Key	Description
`column`	Source column name
`equals`	Row is included only when the column equals this value
`not_empty`	If `true`, row is included only when the column is non-null and non-empty

Column mapping values¶

Value	Meaning
`"some_column"`	Copy the value from the named source column
`{"const": 0}`	Use a constant value (any JSON scalar)

Post-load Steps¶

After inserting all rows the loader automatically performs the same cleanup steps as CdmCsvLoader:

fix_person_id – Resolves any rows where person_id contains a string source-value (e.g. UUID) and replaces it with the integer PK assigned to that person.
backfill_person_birth_fields – Derives year_of_birth, month_of_birth, day_of_birth from birth_datetime where those fields are NULL or 0.
update_person_gender_concept_id – Maps gender_source_value (male/female/other) to the standard OMOP concept IDs (8507/8532/0).
apply_concept_mappings – Looks up concept.concept_id for each source-value code defined in the "concept" section of the mapping.

Command-line Usage (`--migrate`)¶

CdmGenericLoader is also available directly from the pyomop CLI via the --migrate flag. Source-database connection details are provided with --src-* options; the target OMOP CDM database uses the standard --dbtype, --host, --port, --user, --pw, --name, and --schema options.

Environment variables for source connection¶

As an alternative to CLI flags, all source-database connection parameters can be set as environment variables. This is recommended for production use because credentials are never part of the shell history:

Environment variable	Equivalent CLI option
`SRC_DB_HOST`	`--src-host`
`SRC_DB_PORT`	`--src-port`
`SRC_DB_USER`	`--src-user`
`SRC_DB_PASSWORD`	`--src-pw`
`SRC_DB_NAME`	`--src-name`

CLI flags take precedence over environment variables.

export SRC_DB_HOST=db.hospital.org
export SRC_DB_USER=readonly
export SRC_DB_PASSWORD=secret
export SRC_DB_NAME=ehr_db

pyomop-migrate --migrate \
  --src-dbtype pgsql \
  --dbtype sqlite --name omop.sqlite \
  --mapping ehr_to_omop.json

pyomop-migrate --migrate [OPTIONS]

Required options¶

Option	Description
`--mapping FILE` / `-m FILE`	Path to the JSON mapping file

Target database options (OMOP CDM)¶

Option	Default	Description
`--dbtype`	`sqlite`	Target DB type: `sqlite`, `mysql`, or `pgsql`
`--host`	`localhost`	Target DB host
`--port`	`5432`	Target DB port
`--user`	`root`	Target DB user
`--pw`	`pass`	Target DB password
`--name`	`cdm.sqlite`	Target DB name or SQLite file path
`--schema`	(empty)	Target DB schema (PostgreSQL)
`--version`	`cdm54`	OMOP CDM version (`cdm54` or `cdm6`)

Source database options¶

Option	Env var	Default	Description
`--src-dbtype`	—	`sqlite`	Source DB type: `sqlite`, `mysql`, or `pgsql`
`--src-host`	`SRC_DB_HOST`	`localhost`	Source DB host
`--src-port`	`SRC_DB_PORT`	`5432`	Source DB port
`--src-user`	`SRC_DB_USER`	`root`	Source DB user
`--src-pw`	`SRC_DB_PASSWORD`	`pass`	Source DB password
`--src-name`	`SRC_DB_NAME`	`source.sqlite`	Source DB name or SQLite file path
`--src-schema`	—	(empty)	Source DB schema (PostgreSQL)

Tuning¶

Option	Default	Description
`--batch-size`	`1000`	Rows per INSERT batch

Examples¶

SQLite → SQLite

pyomop-migrate --migrate \
  --src-dbtype sqlite --src-name /data/hospital_ehr.sqlite \
  --dbtype sqlite --name /data/omop.sqlite \
  --mapping /etc/pyomop/ehr_to_omop.json

PostgreSQL source → PostgreSQL OMOP target

pyomop-migrate --migrate \
  --src-dbtype pgsql --src-host srchost --src-port 5432 \
  --src-user readonly --src-pw secret --src-name ehr_db \
  --dbtype pgsql --host omophost --port 5432 \
  --user omop_writer --pw secret --name omop_db --schema cdm \
  --mapping ehr_to_omop.json --batch-size 500

MySQL source → SQLite OMOP target

pyomop-migrate --migrate \
  --src-dbtype mysql --src-host 192.168.1.10 --src-user reader --src-pw pass --src-name clinic \
  --dbtype sqlite --name omop.sqlite \
  --mapping clinic_to_omop.json

Schema Extraction (`--extract-schema`)¶

Before writing a mapping file you need to understand the structure of your source database. The --extract-schema option introspects the source database and writes a Markdown document containing:

A summary table listing all tables with row counts and primary keys.
Per-table column details: name, data type, nullable, default value, and PK/FK annotations.
Foreign key relationship sections.

You can feed this Markdown directly to an AI assistant and ask it to generate the appropriate mapping JSON.

Usage¶

pyomop-migrate --extract-schema \
  --src-dbtype sqlite --src-name hospital.sqlite \
  --schema-output hospital_schema.md

The same SRC_DB_* environment variables apply here too:

export SRC_DB_HOST=db.hospital.org
export SRC_DB_USER=readonly
export SRC_DB_PASSWORD=secret
export SRC_DB_NAME=ehr_db

pyomop-migrate --extract-schema \
  --src-dbtype pgsql \
  --schema-output ehr_schema.md

CLI options¶

Option	Default	Description
`--src-dbtype`	`sqlite`	Source DB type
`--src-host` / `SRC_DB_HOST`	`localhost`	Source DB host
`--src-port` / `SRC_DB_PORT`	`5432`	Source DB port
`--src-user` / `SRC_DB_USER`	`root`	Source DB user
`--src-pw` / `SRC_DB_PASSWORD`	`pass`	Source DB password
`--src-name` / `SRC_DB_NAME`	`source.sqlite`	Source DB name
`--schema-output`	`schema.md`	Output Markdown file path

Supported Databases¶

Database	Async driver	Example URL
SQLite	`aiosqlite`	`sqlite+aiosqlite:///path/to/db.sqlite`
PostgreSQL	`asyncpg`	`postgresql+asyncpg://user:pass@host/db`
MySQL	`aiomysql`	`mysql+aiomysql://user:pass@host/db`

Both the source and target can independently use any of the above backends.

Differences from `CdmCsvLoader`¶

Feature	`CdmCsvLoader`	`CdmGenericLoader`
Source	CSV file (via pandas)	Any SQL database
Filters	Pandas boolean masks	SQL WHERE clauses
Source identification	`csv_key` field	`source_table` field per mapping entry
Batching	Per-table chunked INSERT	Per-table chunked INSERT
Post-processing steps	✅ identical	✅ identical

Error Handling¶

Missing source tables are warned and skipped rather than raising an exception, so a partial mapping can still load valid tables.
Missing target tables produce a warning and are skipped.
Individual value-coercion failures fall back to string representation.
All errors are logged via the standard Python logging module (logger name pyomop.generic_loader).

Example: PostgreSQL source → SQLite OMOP target¶

import asyncio
from pyomop import CdmEngineFactory
from pyomop.cdm54 import Base
from pyomop.migrate.pyomop_migrate import CdmGenericLoader, create_source_engine

async def main():
    source = create_source_engine(
        "postgresql+asyncpg://readonly_user:secret@db.hospital.org/ehr"
    )
    target = CdmEngineFactory(db="sqlite", name="omop_ehr.sqlite")
    _ = target.engine
    await target.init_models(Base.metadata)

    loader = CdmGenericLoader(source, target)
    await loader.load("ehr_to_omop_mapping.json", batch_size=1000)

asyncio.run(main())