{"id":89,"date":"2020-05-05T13:36:52","date_gmt":"2020-05-05T20:36:52","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/azure-sdk\/?p=89"},"modified":"2020-06-12T21:09:33","modified_gmt":"2020-06-13T04:09:33","slug":"forecasting-service-scale-out-with-jupyter-notebooks-in-visual-studio-code","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/azure-sdk\/forecasting-service-scale-out-with-jupyter-notebooks-in-visual-studio-code\/","title":{"rendered":"Forecasting Service scale out with Jupyter Notebooks in Visual Studio Code"},"content":{"rendered":"<p>Visual Studio Code has <a href=\"https:\/\/code.visualstudio.com\/docs\/python\/python-tutorial\">an extension for running Jupyter Notebooks<\/a>, which is a great tool for those of us interested in data analytics as it simplifies our workflows. In this article, I will show how to consume Azure data in a Jupyter Notebook using the Azure SDK. The problem I will be demonstrating builds a predictive model to anticipate service scale-up, which is a common task for optimizing cloud spend and anticipating scale requirements.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-content\/uploads\/sites\/58\/2020\/06\/20200505-image1.png\" alt=\"An example plot\" \/><\/p>\n<p>To create the predictive model, I&#8217;ll need some data. My application is logging to Azure Application Insights, which then stores the data in Azure Blob Storage as a series of time-series log files.<\/p>\n<p>To start, you will need <a href=\"https:\/\/code.visualstudio.com\">Visual Studio Code<\/a> with the <a href=\"https:\/\/code.visualstudio.com\/docs\/python\/python-tutorial\">VSCode Python extension<\/a> installed. When you run a Jupyter notebook for the first time, VSCode will prompt you to install necessary modules.<\/p>\n<p>You can <a href=\"https:\/\/azure.github.io\/azure-sdk\/images\/posts\/ForecastingInVSCodeWithBlob.ipynb\">download the demo notebook<\/a> and open it directly in VSCode. Jupyter notebooks intermesh code and documentation seamlessly, allowing for an interactive documentation experience.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-content\/uploads\/sites\/58\/2020\/06\/20200505-image2.png\" alt=\"Run a Jupyter Notebook in VSCode\" \/><\/p>\n<p>The rest of this article walks through the Jupyter notebook. First, let&#8217;s bring in the packages we will need, including the Azure SDK that allows us to read the data, data manipulation, forecasting, and visualization packages.<\/p>\n<pre><code>import sys\n\n# Azure SDK for storage and identity\n!\"{sys.executable}\" -m pip install azure-storage-blob azure-identity\n# Data manipulation\n!\"{sys.executable}\" -m pip install pandas numpy sklearn\n# Visualization\n!\"{sys.executable}\" -m pip install matplotlib\n# Tooling to perform ARIMA forecasts\n!\"{sys.executable}\" -m pip install pmdarima\n# Needed to run the notebook\n!\"{sys.executable}\" -m pip jupyter notebook\n<\/code><\/pre>\n<p>This uses an odd quoting to ensure that the application supports paths with spaces in them. If you bump into permissions issues, you may be using a system installed version of Python and do not have permissions to install packages for global use. You can install Python in a user-controlled location instead, and set your Jupyter kernel accordingly.<\/p>\n<h2>Loading the data<\/h2>\n<p>Next, we need to load the data from Azure Blob Storage, which means dealing with authentication. 
## Loading the data

Next, we need to load the data from Azure Blob Storage, which means dealing with authentication. Azure Identity provides a handy class called `DefaultAzureCredential` that simplifies authentication. It tries a number of different mechanisms to log in to Azure, including (if enabled) an interactive browser.

```python
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential(exclude_interactive_browser_credential=False)
```

We can now access the storage account where the logs are stored. In this example, I'm using an Azure App Service that logs to Application Insights. Application Insights is configured to [continuously export](https://docs.microsoft.com/azure/azure-monitor/app/export-telemetry) log and telemetry data to Azure Blob Storage. The methodology can be applied to any similarly shaped data.

First, create a connection to the storage account:

```python
# Name of the Azure Storage account
azure_storage_account = 'REPLACE_ME'
# Name of the Container holding the data
azure_storage_container = 'REPLACE_ME'
# Base path for the blobs within the container
azure_storage_path = 'REPLACE_ME'

from azure.storage.blob import BlobServiceClient

storage_account_url = "https://{}.blob.core.windows.net".format(azure_storage_account)
storage_client = BlobServiceClient(storage_account_url, credential)
```

We can now use the `storage_client` to enumerate and fetch the logs stored in blobs within the container. If you receive an authentication error in this section, check that the user or service principal being used has the "Storage Blob Data Owner" role on the storage account; "Owner" is not sufficient by itself. Refer to the [Azure documentation](https://docs.microsoft.com/azure/storage/common/storage-auth-aad-rbac-portal) for more information.
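Before writing the full extraction routine, it can be worth a quick sanity check that the credential and container names resolve. This is a small hedged snippet of my own (not part of the original notebook) that just prints the first few blob names:

```python
# List a handful of blob names to confirm the credential and container work
container = storage_client.get_container_client(azure_storage_container)
for i, blob in enumerate(container.list_blobs(name_starts_with=azure_storage_path)):
    print(blob.name)
    if i >= 4:
        break
```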
```python
import json
import pandas
from datetime import datetime, timedelta

def extract_requests_from_container(client, blob_path, container_name, start_time=None, end_time=None):
    '''App Insights stores data in a series of folders (Metrics, Requests, etc.) within a container.
    This function enumerates the blobs within the Requests folder, extracting the JSON formatted
    request logs and storing their counts and timestamps to a dataframe.'''

    data = pandas.DataFrame(columns=['count'])
    container = client.get_container_client(container_name)
    blob_list = container.list_blobs(name_starts_with=blob_path + '/Requests/')
    for blob in blob_list:
        body = container.download_blob(blob.name).readall().decode('utf8')
        for request_string in body.split('\n'):
            try:
                request = json.loads(request_string)
                # Convert from string to date
                event_time = datetime.strptime(request['context']['data']['eventTime'][:-2], r'%Y-%m-%dT%H:%M:%S.%f')
                if (event_time < start_time or event_time > end_time):
                    continue
                count = sum(r['count'] for r in request['request'])
                data.loc[event_time] = count
            except (ValueError, KeyError, TypeError):
                # Skip blank lines, malformed records, and out-of-range comparisons
                continue
    return data

data = extract_requests_from_container(storage_client, azure_storage_path, azure_storage_container,
    datetime.utcnow() - timedelta(hours=3), datetime.utcnow())
```

This code iterates through all the blobs in the 'Requests' folder, reads the JSON from each file, and creates a dataframe indexed by event time with the count of requests.

## Preparing the data

Now that we have some raw data, we can aggregate it to a granularity that is useful for forecasting. The initial data set is per-event. It is more useful to have the data bucketed by a timespan that reveals the underlying pattern we're hoping to model. For our data, we'll use 2-minute buckets. The bucket size will naturally differ for other datasets, but the goal is the same: produce a continuous and non-sparse representation of the desired load trend, smoothing over short-term variance without losing too much signal.

```python
grouped_data = data.groupby(pandas.Grouper(freq='2Min')).agg({'count': 'sum'})
grouped_data.plot(legend=False)
```

Note how easy it is to embed visualizations within a Jupyter Notebook!
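Depending on your pandas version, buckets with no events can come through as missing values rather than zeros, which would undermine the "continuous and non-sparse" goal. As a hedged alternative (my own variant, not from the original notebook), the same bucketing can be expressed with `resample`, making the zero-fill explicit:

```python
# Ensure the index is a proper DatetimeIndex, then resample into fixed
# 2-minute buckets, zero-filling any empty buckets so the series stays
# continuous for the forecaster.
data.index = pandas.DatetimeIndex(data.index)
grouped_data = data.resample('2Min').sum().fillna(0)
grouped_data.plot(legend=False)
```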
We&#8217;ll utilize an <code>auto-arima<\/code> package that attempts to determine the optimal structure of the model for us.<\/p>\n<pre><code>from pmdarima import auto_arima\nfrom pmdarima.model_selection import train_test_split\nfrom matplotlib import pyplot\nimport numpy\n\ntrain, test = train_test_split(grouped_data)\nmodel = auto_arima(train, suppress_warnings=True)\n\n# Visualize the results\nforecast = model.predict(test.shape[0])\nx = numpy.arange(grouped_data.shape[0])\npyplot.plot(x[:len(train)], train, c='blue')\npyplot.plot(x[:len(train):], forecast, c='green')\n# to compare vs. actual results, uncomment next line\n# pyplot.plot(x[len(train):], test, c='orange')\npyplot.show()\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/azure-sdk\/wp-content\/uploads\/sites\/58\/2020\/06\/20200505-image3.png\" alt=\"The final plot\" \/><\/p>\n<p>The forecast captures the primary data trend nicely. We can add the left-out test data to the chart to compare between the forecast and actual data.<\/p>\n<blockquote>\n<p>It is normal to not include all historical data when training your model. Including all historical data may result in overfitting if the behavior of the data changes subtlely. Finding a balance between &#8220;enough data to capture the trend&#8221; and &#8220;not enough to overfit&#8221; is an important point to consider. Jupyter notebooks can help by easily providing an environment to do this analysis, and can give inspiration for automation techniques.<\/p>\n<\/blockquote>\n<h2>Use Azure Storage to iterate and automate<\/h2>\n<p>Now that we have a promising model, we need to ensure it does not regress. A common pattern for this is to store the hyperparameters and model outcomes in a persisted data store such as Azure Storage. We&#8217;ll reuse the client that we used earlier. The credential you are using will need <a href=\"https:\/\/docs.microsoft.com\/azure\/storage\/common\/storage-auth-aad-rbac-portal\">the &#8220;Blob Data Writer&#8221; permission to write back to Azure Storage<\/a>.<\/p>\n<p>First, let&#8217;s capture some metrics to denote the current state of the model:<\/p>\n<pre><code>from sklearn.metrics import mean_squared_error\n\n# Root mean squared error - a common method for observing the delta between forecast and actual\nrmse = numpy.sqrt(mean_squared_error(test, forecast))\n# Akaike's information criterion, a measure that also folds the \"simplicity\" of the model into the score\naic = model.aic()\n# When the last forecast was made\nforecast_time = test.index[-1]\n# Parameters for the model\nparams = model.params()\n<\/code><\/pre>\n<p>Now, store these in an Azure Storage blob. 
Now, store these in an Azure Storage blob. The same information could be stored in any data store; for example, Cosmos DB, Table storage, or a SQL database.

```python
from azure.core.exceptions import ResourceExistsError

# The container that will hold the output data
output_container_name = 'REPLACE_ME'

# Create the container if it doesn't exist
try:
    storage_client.create_container(output_container_name)
except ResourceExistsError:
    print("Warning: Container already exists")

# Upload the metrics to a blob, converting numpy types to plain floats so they serialize
import json
blob_client = storage_client.get_blob_client(output_container_name, str(forecast_time))
try:
    blob_client.upload_blob(json.dumps({
        'rmse': float(rmse),
        'aic': float(aic),
        'forecast_time': str(forecast_time),
        'params': [float(p) for p in params]
    }))
except ResourceExistsError:
    print("Warning: Blob already exists")
```

We can then take steps similar to fetching the initial log data above to pull the model logs for inspection; for instance, to watch for a regression in model performance:

```python
container = storage_client.get_container_client(output_container_name)
blob_list = container.list_blobs()
for blob in blob_list:
    body = json.loads(container.download_blob(blob.name).readall().decode('utf8'))
    print(body)
```
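To close the loop, these records can drive a simple automated regression check. The following is a minimal sketch (the sort key and the 1.5x threshold are illustrative choices of mine, not from the original post):

```python
# Flag a possible regression if the newest RMSE is well above the average of
# earlier runs. forecast_time strings sort chronologically in this format.
records = sorted(
    (json.loads(container.download_blob(b.name).readall().decode('utf8'))
     for b in container.list_blobs()),
    key=lambda r: r['forecast_time'])

if len(records) > 1:
    baseline = sum(r['rmse'] for r in records[:-1]) / (len(records) - 1)
    if records[-1]['rmse'] > 1.5 * baseline:
        print("Possible regression: RMSE {:.2f} vs baseline {:.2f}".format(
            records[-1]['rmse'], baseline))
```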
## Conclusion

In this article, we've demonstrated how to consume semi-structured data, transform it into a useful form, perform analytics, and publish the results for further use. While this is a tightly scoped example, the pattern is quite close to the structure of commonly used systems for understanding and acting on time-series data. We hope we have also shown the utility of Jupyter Notebooks for interactive data exploration and communication, coupled with the capabilities of Azure Storage for data persistence.

Follow us on Twitter at [@AzureSDK](https://twitter.com/AzureSDK). We'll be covering more best practices in cloud-native development as well as providing updates on our progress in developing the next generation of Azure SDKs.