samedi 25 janvier 2020

Source to Stage SQL Information Schema validation: Compare 'IS_NULLABLE', using 'COLUMN_NAME' columns, from two data-frames in python (Pandas)

Trying to validate source to stage (ETL) SQL Information schema. I did connect two servers/databases through pyodbc+pandas and assigned those Information Schemas as 'sourceDF' and 'StageDF'. The problem is 'COLUMN_NAME' is not identically matched in two dataframe.

Ex: 'Account_ID' and 'AccountId'

But I want to check 'IS_NULLABLE' with source and stage relevant to each row.

Source

Database: Source_Database / Server: Source_Server >>> sourceDF

<table><tbody><tr><th>|TABLE_SCHEMA|</th><th>|TABLE_NAME|</th><th>|COLUMN_NAME|</th><th>|IS_NULLABLE|</th></tr><tr><td>Stage</td><td>AccountDetails</td><td>Account_ID</td><td>Yes</td></tr><tr><td>Stage</td><td>AccountDetails</td><td>Engagement_ID</td><td>Yes</td></tr><tr><td>Stage</td><td>AccountDetails</td><td>Client_ID</td><td>NO</td></tr></tbody></table>

Database: Stage_Databse / Server: Stage_Server >>> stageDF

<table><tbody><tr><th>|TABLE_SCHEMA|</th><th>|TABLE_NAME|</th><th>|COLUMN_NAME|</th><th>|IS_NULLABLE|</th></tr><tr><td>Source</td><td>AccountDetails</td><td>AccountID</td><td>NO</td></tr><tr><td>Source</td><td>AccountDetails</td><td>EngagementID</td><td>NO</td></tr><tr><td>Source</td><td>AccountDetails</td><td>ClientID</td><td>NO</td></tr></tbody></table>

python code to connect two servers/Database:

import pyodbc
import pandas as pd
import time
from termcolor import colored, cprint

server1 = "serverSource.net"
database1 = "SourceDBneu01"
username1 = "SourceAdmin"
password1 = "Sourcepassword"
conn = pyodbc.connect(
    "DRIVER={ODBC Driver};SERVER="
    + server1
    + ";DATABASE="
    + database1
    + ";UID="
    + username1
    + ";PWD="
    + password1
)

SQL_QuerySource = pd.read_sql_query("""SELECT [TABLE_SCHEMA], [TABLE_NAME], [COLUMN_NAME ], [IS_NULLABLE] 
 FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'AccountDetails'""", conn)


sourceDF = pd.DataFrame(SQL_QuerySource, columns=('TABLE_SCHEMA'
, 'TABLE_NAME'
, 'COLUMN_NAME'
, 'IS_NULLABLE''))

print(sourceDF)

server1 = "Stageserver1.net"
database1 = "StageDBneu01"
username1 = "StageAdmin"
password1 = "Stagepassword"
conn = pyodbc.connect(
    "DRIVER={ODBC Driver};SERVER="
    + server1
    + ";DATABASE="
    + database1
    + ";UID="
    + username1
    + ";PWD="
    + password1
)

SQL_QueryStage = pd.read_sql_query("""SELECT [TABLE_SCHEMA], [TABLE_NAME], [COLUMN_NAME ], [IS_NULLABLE]
 FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'AccountDetails' AND TABLE_SCHEMA = 'Stage'""", conn)


stageDF = pd.DataFrame(SQL_QueryStage, columns=('TABLE_SCHEMA'
, 'TABLE_NAME'
, 'COLUMN_NAME'
, 'IS_NULLABLE''))

print(stageDF)

Note: Tying to prepare automated database validation code for the 'SQL infomation schema' which focusing source to Stage validation (ETL test validation). Stuck due to different column name

Aucun commentaire:

Enregistrer un commentaire