Metadata-Version: 2.4
Name: kyd_dataspec_gen
Version: 0.1.0
Summary: Generate data specification from data profile
Project-URL: Homepage, https://github.com/KYD-Analytics/kyd_dataspec_gen
Project-URL: Issue Tracker, https://github.com/KYD-Analytics/kyd_dataspec_gen/issues
Author-email: KYD Analytics <sales@kyd.ai>
License-Expression: MIT
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.10
Requires-Dist: dataprofiler
Requires-Dist: detect-delimiter
Requires-Dist: fsspec
Requires-Dist: gcsfs
Requires-Dist: glom
Requires-Dist: google-genai
Requires-Dist: jsonschema
Requires-Dist: mock
Requires-Dist: polars
Requires-Dist: pydantic
Requires-Dist: rapidfuzz
Requires-Dist: toml
Description-Content-Type: text/markdown

# KYD_DataSpec_Gen

[![Version](https://img.shields.io/badge/Python%20Version-3.11-green)](https://pypi.org/project/kyd_dataspec_gen)
[![Coverage](./coverage.svg)](./README.md)

The KYD DataSpec Gen is a KYD component for creating data specification files for data in different formats. It runs Data Profiler on the input data and generates a data specification file from the resulting profile.

## High-level Architecture

```mermaid
---
config:
  flowchart:
    subGraphTitleMargin:
      top: 10
      bottom: 30
---
flowchart TB
 dataInput[(Data Input)]
 dataProfiler[Data 
 Profiler]
 kydDataSpecification[("KYD Data Specification(JSON)")]
 reportGeneration["Report Generation"]
 dataSpecReport[("Data Specification Report(md)")]
 subgraph subGraph1["AI Generation"]
        descriptionGeneration["Description Generation"]
        locationCoverageDetection["Location Coverage Detection"]
        datasetRelationshipshipsIdentification["Dataset Relationships Identification"]
        foreignAndCompoundPrimaryKeysDetection["Foreign and Compound Primary Keys Detection"]
        dataClassification["Data 
        Classification"]
        dataAnonymisation["Data 
        Anonymisation"]
        dataDictionaryMatching["Data Dictionary Matching"]
  end
 subgraph subGraph0["Data Specification Generation"]
        schemaMapping["Statistical Insights to KYD Schema Mapping"]
        fullDataProfiling["Full Data Profiling"]
        identifyPrimaryKeys["Identify 
        Primary Keys"]
        referenceDatasetMatching["Reference Dataset Matching"]
        subGraph1
        verifyPrimaryKeys["Verify 
        Primary Keys"]
  end
    dataInput --> dataProfiler
    dataProfiler --> schemaMapping & fullDataProfiling
    fullDataProfiling --> schemaMapping
    schemaMapping --> identifyPrimaryKeys
    identifyPrimaryKeys --> referenceDatasetMatching & subGraph1
    referenceDatasetMatching --> subGraph1
    subGraph1 --> verifyPrimaryKeys
    verifyPrimaryKeys --> kydDataSpecification & reportGeneration
    reportGeneration --> dataSpecReport
```

## Installation

For development purposes, the package can be installed locally in either regular or editable mode.

### Editable Local Install

```bash
pip install -e path/to/kyd_dataspec_gen
```

Omit `-e` for a regular (non-editable) install.

As part of the install, the following utilities are additionally installed:

| Utility            | Description                                     |
| ------------------ | ----------------------------------------------- |
| kyd_dataspec_gen   | Generates data specification files for datasets |

## Build Package

This builds the package in an isolated environment, generating a source distribution and a wheel in the `dist/` directory. See the `build` documentation for full information.

Change into the directory where `pyproject.toml` is located and run:

```bash
python3 -m build
```

If that does not work, try:

```bash
python -m build
```

## Build Standards

The KYD DataSpec Gen uses the following tools to ensure the code is consistently formatted and tested:

- [`ruff`](https://docs.astral.sh/ruff/) - performs Python linting and formatting

## Features

### Identifying Correlated Columns Using Fuzzy Matching and Category Statistics

The `kyd_dataspec_gen` package identifies correlated columns in datasets by leveraging fuzzy matching techniques and category statistics. This functionality is particularly beneficial when working with datasets that have inconsistent column naming conventions or when detecting relationships between columns with related content.

Fuzzy matching evaluates the similarity between strings (e.g., column names) without requiring an exact match. This approach is useful for identifying correlations between columns with slight variations in their names or formats.
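
As a rough illustration of fuzzy matching on column names, the sketch below scores every pair of names and keeps those above a threshold. It is not the package's implementation: it uses the standard-library `difflib` as a stand-in for a fuzzy-matching library, and the `normalise` helper, the `similar_column_pairs` function, and the `0.8` threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(name: str) -> str:
    """Lower-case a column name and drop separators, so 'Cust_ID' becomes 'custid'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similar_column_pairs(
    columns: list[str], threshold: float = 0.8
) -> list[tuple[str, str, float]]:
    """Return column pairs whose normalised names score at or above `threshold`."""
    pairs = []
    for left, right in combinations(columns, 2):
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        score = SequenceMatcher(None, normalise(left), normalise(right)).ratio()
        if score >= threshold:
            pairs.append((left, right, round(score, 2)))
    return pairs


# Hypothetical column names, for illustration only.
columns = ["customer_id", "Customer ID", "order_date", "country"]
matches = similar_column_pairs(columns)
print(matches)
```

Here only `customer_id` and `Customer ID` are paired, because they become identical once separators and case are ignored.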

Category statistics comparison analyzes sampled categories ranked by sample size to compare values between columns. Columns that share the same number of categories and identical sample sizes for each category are considered correlated.
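
The category statistics rule can be sketched as follows, assuming each column is available as a plain list of sampled values. The function names and sample data are hypothetical, and the real package may sample or rank categories differently.

```python
from collections import Counter


def category_signature(values: list[str]) -> list[int]:
    """Category sample sizes ranked largest first, ignoring the labels themselves."""
    return sorted(Counter(values).values(), reverse=True)


def looks_correlated(left: list[str], right: list[str]) -> bool:
    """True when both columns have the same number of categories
    and the same ranked sample size for each category."""
    return category_signature(left) == category_signature(right)


# Hypothetical sampled values, for illustration only.
country_code = ["GB", "GB", "FR", "GB", "DE", "FR"]
country_name = ["United Kingdom", "United Kingdom", "France",
                "United Kingdom", "Germany", "France"]
status = ["open", "closed", "open", "closed", "open", "closed"]

print(looks_correlated(country_code, country_name))  # same signature: [3, 2, 1]
print(looks_correlated(country_code, status))        # [3, 2, 1] vs [3, 3]
```

Comparing only the ranked counts lets columns with different labels, such as a code and its display name, still be recognised as related.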

### Integrating AI to Summarise and Enhance Data Descriptions

The `generate_schema_information` function leverages AI to enhance the interpretability of datasets by generating and appending meaningful metadata based on profiling data.

This function automatically generates concise and meaningful descriptions for data sources, datasets, and columns, and for the locations covered in the data source. It provides:

- **Column Descriptions**: Summarises the purpose or meaning of the column values.
- **Statistical Insights**: Summarises key statistics (e.g., value distributions) for each column.
- **Date Format Identification**: For date columns, it identifies the date format.
- **Primary and Foreign Keys Identification**: Automatically detects primary and foreign keys in datasets using a combination of rule-based logic and AI-driven analysis.
- **Data Classification**: Classifies columns into names, addresses, sensitive data and identifier data categories, enhancing the understanding of the data's nature and sensitivity.
- **Data Dictionary Matching**: Matches data elements against a provided data dictionary, labeling them as 'Matched', 'Multi-matched', or 'New/Missing'. If no exact match is found, it proposes a closest potential match.
- **Reference Data Matching**: Matches datasets against a published reference dataset to identify existing references.
- **Sensitive Data Anonymisation**: Anonymises sensitive data in the samples, ensuring that sensitive information is not exposed in the data specification.
- **Location Coverage**: Checks all columns in each dataset and lists all countries covered in the data source.
- **Relationships**: Identifies relationships between datasets based on primary and foreign keys.

The generated descriptions and location coverage overview are seamlessly integrated into the data specification file, enhancing its usability.
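
To make the Data Dictionary Matching labels concrete, here is a minimal sketch of how a data element might be labelled. It assumes that 'Multi-matched' means more than one dictionary entry equals the element after case and whitespace normalisation; that interpretation, the function name, and the use of `difflib` for the closest-match suggestion are assumptions, not the package's documented behaviour.

```python
from difflib import get_close_matches


def match_against_dictionary(
    element: str, dictionary_terms: list[str]
) -> tuple[str, list[str]]:
    """Label a data element and return the matching (or closest) dictionary terms."""
    key = element.strip().lower()
    exact = [term for term in dictionary_terms if term.strip().lower() == key]
    if len(exact) == 1:
        return "Matched", exact
    if len(exact) > 1:
        return "Multi-matched", exact
    # No exact match: propose the single closest term, if any is similar enough.
    lowered = {term.strip().lower(): term for term in dictionary_terms}
    closest = get_close_matches(key, list(lowered), n=1, cutoff=0.6)
    return "New/Missing", [lowered[name] for name in closest]


# Hypothetical data dictionary terms, for illustration only.
terms = ["Customer ID", "customer id", "Order Date", "Country"]
print(match_against_dictionary("Country", terms))
print(match_against_dictionary("customer id", terms))
print(match_against_dictionary("order_dt", terms))
```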

## Setup

The project uses [`dotenv`](https://pypi.org/project/python-dotenv/) to keep API secrets out of the repository. To set up, copy `.env-TEMPLATE` to a `.env` file. The `.env` file is excluded by `.gitignore`, so the API secrets are not checked into the repository.

The Google API key must be named `GOOGLE_AI_KEY` in the `.env` file.
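
Before running the tool, it can help to confirm that the expected variable is actually set. The sketch below assumes the `.env` file has already been loaded into the process environment (for example with `load_dotenv()` from `python-dotenv`); the `missing_env_vars` helper is hypothetical and not part of the package.

```python
import os
from collections.abc import Mapping


def missing_env_vars(
    required: list[str], env: Mapping[str, str] = os.environ
) -> list[str]:
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# GOOGLE_AI_KEY is the variable name this README asks for; the value is a placeholder.
print(missing_env_vars(["GOOGLE_AI_KEY"], env={"GOOGLE_AI_KEY": "placeholder"}))  # []
print(missing_env_vars(["GOOGLE_AI_KEY"], env={}))  # ['GOOGLE_AI_KEY']
```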

### Config

#### File Location

The `config.toml` file is located in the `src/kyd_dataspec_gen/` directory of the package.

#### Configuration Options

`[dataspec]`
This section defines settings related to data specification generation.

| Key                        | Type    | Default Value | Description                                                                 |
|----------------------------|---------|---------------|-----------------------------------------------------------------------------|
| `separator`                | String  | `";"`         | The separator used to split values in columns.                             |
| `address_column_keyword`   | String  | `"address"`   | The keyword used to identify address columns in the dataset.               |
| `identification_column_keyword` | String | `"id"`      | The keyword used to identify ID columns in the dataset.                    |
| `categorical_limit`        | Integer | `250`         | The maximum number of unique values a column can have to be considered categorical. |
| `data_profile_shapes_limit` | Integer | `20` | The upper limit for data profile shape to be shown in the report. |
| `published_ref_id_prefix`  | String  | `"PREF-"`     | The prefix used for generating unique IDs for entries in the published reference dataset. |
| `ref_data_fuzzy_threshold` | Integer | `90`          | The threshold for fuzzy matching reference data name when comparing reference datasets. |
| `sample_values_limit`      | Integer | `5`           | The maximum number of sample values to include for each column in the published reference dataset. |

#### Example `config.toml`

```toml
[dataspec]
separator=";"
address_column_keyword="address"
identification_column_keyword="id"
categorical_limit=250
data_profile_shape_limit=20
published_ref_id_prefix="PREF-"
ref_data_fuzzy_threshold=90
sample_values_limit=5
```

### Solidatus Load

To use the `load_kyd_dataspec` functionality, you must provide the following environment variables:

- **SOLIDATUS_API_KEY**: Your Solidatus API key.
- **SOLIDATUS_API_URL**: The URL for the Solidatus API.

Ensure these variables are set in your `.env` file so that the data specification files can be loaded.

## Usage

### kyd_dataspec_gen

This script generates a data specification file and a dataset report for the datasets of a specific data source, and writes them to the `data_spec` folder. Each data source has a single data specification file and a single dataset report, containing the specifications for all of its datasets. Below are the arguments for running `kyd_dataspec_gen`:

| Argument | Description                     | Required | Default | Example         |
|----------|---------------------------------|----------|---------|-----------------|
| `-r` | Specifies which script or workflow to execute.<br>Available options: <ul><li>`"full"`: Executes all steps, including data profiling, data specification generation, and report creation.</li><li>`"data_profile"`: Performs only data profiling.</li><li>`"data_spec_gen"`: Generates the data specification JSON file. Requires previously profiled data files.</li><li>`"report_gen"`: Creates a human-readable dataset report. Requires an existing data specification JSON file.</li></ul><br>**Note:** For `"data_spec_gen"` and `"report_gen"` modes, prerequisite files must exist. If not, run `"full"` or the necessary preceding step(s). | Yes | / | `"full"` |
| `-ds`    | Data source                     | Yes      | None    | `"icij"`        |
| `-dt`    | Input data type                 | Yes, when `-r` is `"full"`      | None    | `"csv"`         |
| `-v`    | Verify AI detected compound primary key. If enabled, when there are datasets that do not have a primary key identified by rule-based logic and a compound primary key is generated by AI, it will run a verification on the compound primary key. | No      | False    | /         |
| `-rd`    | Input raw data files location   | Yes, when `-r` is `"full"` or `"data_profile"`, or when AI-detected compound primary key verification (`-v`) or full data profiling (`-f`) is enabled  | None    | `"/raw/icij/"`  |
| `-o`    | Output directory for processed files i.e. profiled data, data spec json and md files | No  | `output/`    | `"/dataspec_output/"`  |
| `-c`    | Config toml file location   | Yes, when `-r` is `"full"` or `"data_spec_gen"`    | None    | `"../kyd_dataspec_gen/config.toml"`  |
| `-f`    | Full data profiling. If enabled, it will re-profile datasets that do not have complete data statistics in the data spec generation step. Requires `-rd` to be set. | No    | False    | /  |
| `-a`    | Enable anonymisation of sensitive data in the profiling and data specification generation steps. | No    | False    | /  |
| `-pr`    | The path to the published reference dataset file in CSV format to match against reference datasets in the data spec generation step. | No    | /    | `"/path/to/published_reference_dataset.csv"`  |
| `-nr`    | Publish new reference dataset from the global schema. Specify the new reference dataset name and the output directory path to publish the new reference dataset CSV file. | No    | `"published_reference_dataset" "output/published_reference_dataset/"`    | `"published_reference_1" "published_reference_dataset/"`  |
| `-dd`    | The path to the data dictionary in CSV format to match data elements against data dictionary. | No    | /    | `"/path/to/data_dictionary.csv"`  |
| `-u`     | The file path of an existing dataspec JSON file. This allows you to reuse previously generated content, such as column descriptions or metadata, and integrate it into the new dataspec file. For example, if you have an existing file at `"data_spec/icij_data_spec.json"`, it will be used to populate the new dataspec file.<br>**Note:** This does not apply when `-r` is `"full"`. | No | `""` | `"data_spec/icij_data_spec.json"` |
| `-tr`     | <ol><li>Dataset report template file name. The default template markdown file is stored in `kyd_dataspec_gen/src/kyd_dataspec_gen/templates/`.</li> <li>The output file name for the rendered report</li></ol> | No     | `["template.md", "{data_source_name}_report.md"]` | `"template.md" "{data_source_name}_report.md"` |

Example usage:

```bash
kyd_dataspec_gen -r "full" -ds "icij" -dt "csv" -rd "/raw/icij" -c "documents/config.toml" -tr "template.md" "report.md"
```

This runs the tool in `full` mode for the data source `icij` with the CSV data type, using the raw data files located in `/raw/icij`, and renders the report from the template `template.md` to `report.md`.
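
The conditional "Required" rules in the argument table can be summarised in a short sketch. This is not the tool's actual argument parser: `missing_arguments` is a hypothetical helper, and it assumes that enabling either `-v` or `-f` on its own is enough to require `-rd`.

```python
def missing_arguments(args: dict) -> list[str]:
    """Return the flags that the argument table marks as required but are absent.

    `args` maps flag names without the leading dash (e.g. "rd") to their values.
    """
    mode = args.get("r")
    # -r and -ds are always required.
    missing = [flag for flag in ("r", "ds") if not args.get(flag)]
    # -dt is required only for the full workflow.
    if mode == "full" and not args.get("dt"):
        missing.append("dt")
    # -rd is required for profiling runs, or when -v or -f is enabled (assumption).
    needs_raw = mode in ("full", "data_profile") or args.get("v") or args.get("f")
    if needs_raw and not args.get("rd"):
        missing.append("rd")
    # -c is required whenever the data specification is generated.
    if mode in ("full", "data_spec_gen") and not args.get("c"):
        missing.append("c")
    return ["-" + flag for flag in missing]


full_run = {"r": "full", "ds": "icij", "dt": "csv",
            "rd": "/raw/icij", "c": "documents/config.toml"}
print(missing_arguments(full_run))                                       # []
print(missing_arguments({"r": "data_spec_gen", "ds": "icij", "f": True}))  # ['-rd', '-c']
```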

### load_kyd_dataspec

This script loads the data specification file generated by the above script into Solidatus. It calls the Solidatus API dataspec loading script. Below are the arguments for running `load_kyd_dataspec`:

| Argument | Description                     | Required | Default | Example         |
|----------|---------------------------------|----------|---------|-----------------|
| `-mn`    | Model name                      | No      | `"dataspec_<today's date>"`   | `"dataspec_model"`    |
| `-mt`    | Solidatus model type            | Yes      | None    | `"LineageModel"`         |
| `-specs` | The location of the data specification files to be loaded to Solidatus  | Yes      | `"data_spec/"` | `"data_spec/"`  |
| `-c`     | Flag to create a new Solidatus JSON file locally. Use this flag to generate a new model file.  | No      | False    |   |

Example usage:

Create a Solidatus JSON file locally:

```bash
load_kyd_dataspec -mt "LineageModel" -specs "output/data_spec/" -c
```

Load the Solidatus model only:

```bash
load_kyd_dataspec -mt "LineageModel" -specs "output/data_spec/"
```

## License

`kyd_dataspec_gen` is distributed under the terms of the [GPL 3](https://www.gnu.org/licenses/gpl-3.0.html) license.
