Automating Machine Learning Model Metadata Creation¶
Automate Metadata Creation using JSON
files
This guide provides a step-by-step approach to automate the creation of metadata for machine learning models, using a Python script. It includes an example script that trains a linear regression model, saves it, and generates a corresponding JSON metadata file.
Overview¶
- The script demonstrates training a linear regression model using Scikit-Learn, saving the model as a
.pkl
file, and generating a metadata file in JSON format. - The metadata includes model parameters, performance metrics, and data description.
Example Script¶
import json
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import load_boston
import joblib
# Load sample data
data = load_boston()
X = data.data
y = data.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict using the model
y_pred = model.predict(X_test)
# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Save the model
model_filename = 'linear_regression_model.pkl'
joblib.dump(model, model_filename)
# Metadata
metadata = {
'model_name': 'Linear Regression',
'timestamp': '20240123',
'model_parameters': model.get_params(),
'performance_metrics': {
'mean_squared_error': mse,
'r2_score': r2
},
'data_description': 'Boston housing dataset',
'feature_names': data.feature_names.tolist(),
'target_name': 'Housing Price'
}
# Save metadata to a JSON file
metadata_filename = 'service_sage_v1.2.0_linearReg_20240123_metadata.json'
with open(metadata_filename, 'w') as f:
json.dump(metadata, f, indent=4)
print(f"Model and metadata saved as {model_filename} and {metadata_filename} respectively.")
Script Explanation¶
- The Boston housing dataset is used for demonstration purposes.
- A linear regression model is trained on the dataset.
- Performance metrics like Mean Squared Error (MSE) and R-squared (R2) are calculated.
- The model is saved as a
.pkl
file, and metadata is saved in a.json
file.
Hypothetical Output¶
{
"model_name": "Linear Regression",
"timestamp": "20240123",
"model_parameters": {
"copy_X": true,
"fit_intercept": true,
"n_jobs": null,
"normalize": false
},
"performance_metrics": {
"mean_squared_error": 24.291119474973684,
"r2_score": 0.6687594935356314
},
"data_description": "Boston housing dataset",
"feature_names": [
"CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"
],
"target_name": "Housing Price"
}