{"id":380773,"date":"2023-12-11T15:14:29","date_gmt":"2023-12-11T21:14:29","guid":{"rendered":"http:\/\/tomaztsql.wordpress.com\/?p=9622"},"modified":"2023-12-11T15:14:29","modified_gmt":"2023-12-11T21:14:29","slug":"advent-of-2023-day-11-starting-data-science-with-microsoft-fabric","status":"publish","type":"post","link":"https:\/\/www.r-bloggers.com\/2023\/12\/advent-of-2023-day-11-starting-data-science-with-microsoft-fabric\/","title":{"rendered":"Advent of 2023, Day 11 \u2013 Starting data science with Microsoft Fabric"},"content":{"rendered":"<!-- \r\n<div style=\"min-height: 30px;\">\r\n[social4i size=\"small\" align=\"align-left\"]\r\n<\/div>\r\n-->\r\n\r\n<div style=\"border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 12px;\">\r\n[This article was first published on  <strong><a href=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/11\/advent-of-2023-day-11-starting-data-science-with-microsoft-fabric\/\"> R \u2013 TomazTsql<\/a><\/strong>, and kindly contributed to <a href=\"https:\/\/www.r-bloggers.com\/\" rel=\"nofollow\">R-bloggers<\/a>].  
(You can report issue about the content on this page <a href=\"https:\/\/www.r-bloggers.com\/contact-us\/\">here<\/a>)\r\n<hr>Want to share your content on R-bloggers?<a href=\"https:\/\/www.r-bloggers.com\/add-your-blog\/\" rel=\"nofollow\"> click here<\/a> if you have a blog, or <a href=\"http:\/\/r-posts.com\/\" rel=\"nofollow\"> here<\/a> if you don't.\r\n<\/div>\n\n<p>In this Microsoft Fabric series:<\/p>\n\n\n\n<ol>\n<li>Dec 01:\u00a0<a href=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/01\/advent-of-2023-day-1-what-is-microsoft-fabric\/\" rel=\"nofollow\" target=\"_blank\">What is Microsoft Fabric?<\/a><\/li>\n\n\n\n<li>Dec 02:\u00a0<a href=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/02\/advent-of-2023-day-2-getting-started-with-microsoft-fabric\/\" rel=\"nofollow\" target=\"_blank\">Getting started with Microsoft Fabric<\/a><\/li>\n\n\n\n<li>Dec 03:\u00a0<a href=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/03\/advent-of-2023-day-3-what-is-lakehouse-in-fabric\/\" rel=\"nofollow\" target=\"_blank\">What is lakehouse in\u00a0Fabric?<\/a><\/li>\n\n\n\n<li>Dec 04:\u00a0<a href=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/04\/advent-of-2023-day-4-delta-lake-and-delta-tables-in-microsoft-fabric\/\" rel=\"nofollow\" target=\"_blank\">Delta lake and delta tables in Microsoft\u00a0Fabric<\/a><\/li>\n\n\n\n<li>Dec 05:\u00a0<a href=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/05\/advent-of-2023-day-5-getting-data-into-lakehouse\/\" rel=\"nofollow\" target=\"_blank\">Getting data into lakehouse<\/a><\/li>\n\n\n\n<li>Dec 06:\u00a0<a href=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/06\/advent-of-2023-day-6-sql-analytics-endpoint\/\" rel=\"nofollow\" target=\"_blank\">SQL Analytics\u00a0endpoint<\/a><\/li>\n\n\n\n<li>Dec 07:\u00a0<a href=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/07\/advent-of-2023-day-7-sql-commands-in-sql-analytics-endpoint\/\" rel=\"nofollow\" target=\"_blank\">SQL commands in SQL Analytics\u00a0endpoint<\/a><\/li>\n\n\n\n<li>Dec 08: <a 
href=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/08\/advent-of-2023-day-8-using-lakehouse-rest-api\/\" rel=\"nofollow\" target=\"_blank\">Using Lakehouse REST\u00a0API<\/a><\/li>\n\n\n\n<li>Dec 09: <a href=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/09\/advent-of-2023-day-9-building-custom-environments-and-spark-job-definitions\/\" rel=\"nofollow\" target=\"_blank\">Building custom environments<\/a><\/li>\n\n\n\n<li>Dec 10: <a href=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/10\/advent-of-2023-day-10-creating-job-spark-definition\/\" rel=\"nofollow\" target=\"_blank\">Creating Job Spark definition<\/a><\/li>\n<\/ol>\n\n\n\n<p>We have looked into creating the lakehouse, checked the delta lake and delta tables, got some data into the lakehouse, and created a custom environment and Spark job definition. And now we need to see, how to start working with the data.<\/p>\n\n\n\n<p>There are four capabilities that one can explore, and since we have covered the Environment, let\u2019s go with the notebook capability.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-47.png\" rel=\"nofollow\" target=\"_blank\"><img loading=\"lazy\" data-attachment-id=\"9626\" data-permalink=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/11\/advent-of-2023-day-11-starting-data-science-with-microsoft-fabric\/image-47\/\" data-orig-file=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-47.png\" data-orig-size=\"1728,208\" data-comments-opened=\"1\" data-image-meta=\"{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}\" data-image-title=\"image-47\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-47.png?w=300\" 
data-large-file=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-47.png?w=605\" src=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-47.png?w=450\" alt=\"\" class=\"wp-image-9626\" srcset_temp=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-47.png?w=450 1024w, https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-47.png?w=150 150w, https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-47.png?w=300 300w, https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-47.png?w=768 768w, https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-47.png 1728w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" data-recalc-dims=\"1\" \/><\/a><\/figure>\n\n\n\n<p>Once you create a new notebook (or open an existing one), add the lakehouse. I will add the existing one we created on Day 3, called Advent2023. You can also choose the preferred language of the notebook \u2013 PySpark (Python), Spark (Scala), Spark SQL, or SparkR (R). On top of the language, you can also choose the environment: the default workspace or any custom-prepared environment (with custom libraries and compute).<\/p>\n\n\n\n<p>Get the data into the notebook by reading the delta table:<\/p>\n\n\n<pre>\n# PySpark - show data\ndf = spark.read.load('Tables\/iris_data',\n    format='delta',\n    header=True\n)\ndisplay(df.limit(10))\n<\/pre>\n\n\n<p>You can also get the data using SQL wrapped in a PySpark command:<\/p>\n\n\n<pre>\ndf = spark.sql(\n    &quot;&quot;&quot;\n    SELECT\n         CAST(`Sepal.Length` AS DECIMAL(5,1)) AS SepalLength\n        ,CAST(`Sepal.Width`  AS DECIMAL(5,1)) AS SepalWidth\n        ,CAST(`Petal.Length` AS DECIMAL(5,1)) AS PetalLength\n        ,CAST(`Petal.Width`  AS DECIMAL(5,1)) AS PetalWidth\n        ,Species\n    FROM\n    Advent2023.iris_data LIMIT 3000\n    &quot;&quot;&quot;\n    )\n\ndf.show()\n<\/pre>\n\n\n<p>You can always visualize the data:<\/p>\n\n\n<pre>\nimport pandas as pd\nimport warnings\nwarnings.filterwarnings(&quot;ignore&quot;)\nimport 
seaborn as sns\nimport matplotlib.pyplot as plt\n\ndf2 = df.toPandas()\n\ndf2.plot(kind=&quot;scatter&quot;, x=&quot;SepalLength&quot;, y=&quot;SepalWidth&quot;)\nplt.show()\n\n# seaborn renamed the old size argument to height in version 0.9\nsns.jointplot(x=&quot;SepalLength&quot;, y=&quot;SepalWidth&quot;, data=df2, height=5)\nplt.show()\n<\/pre>\n\n\n<p>And the visual perspective with the great Seaborn jointplot:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-48.png\" rel=\"nofollow\" target=\"_blank\"><img loading=\"lazy\" data-attachment-id=\"9633\" data-permalink=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/11\/advent-of-2023-day-11-starting-data-science-with-microsoft-fabric\/image-48\/\" data-orig-file=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-48.png\" data-orig-size=\"777,735\" data-comments-opened=\"1\" data-image-meta=\"{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}\" data-image-title=\"image-48\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-48.png?w=300\" data-large-file=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-48.png?w=605\" src=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-48.png?w=450\" alt=\"\" class=\"wp-image-9633\" srcset_temp=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-48.png 777w, https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-48.png?w=150 150w, https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-48.png?w=300 300w, https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-48.png?w=768 768w\" sizes=\"(max-width: 777px) 100vw, 777px\" data-recalc-dims=\"1\" \/><\/a><\/figure>\n\n\n\n<p>And now do some feature engineering. 
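Conceptually, the assembler used for this step just packs the four numeric columns into a single 'features' vector per row. A minimal plain-Python sketch of that idea (toy values; this is not the pyspark API itself, which follows below):

```python
# Toy rows standing in for the four numeric iris columns from the SQL query above
rows = [
    {"SepalLength": 5.1, "SepalWidth": 3.5, "PetalLength": 1.4, "PetalWidth": 0.2, "Species": "setosa"},
    {"SepalLength": 7.0, "SepalWidth": 3.2, "PetalLength": 4.7, "PetalWidth": 1.4, "Species": "versicolor"},
]

input_cols = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]

# Conceptually what VectorAssembler(inputCols=input_cols, outputCol='features')
# does: pack the listed input columns into one dense feature vector per row
for row in rows:
    row["features"] = [row[c] for c in input_cols]

print(rows[0]["features"])  # [5.1, 3.5, 1.4, 0.2]
```

In Spark the packing happens lazily across the distributed DataFrame; the sketch only illustrates the shape of the resulting column.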
Create a VectorAssembler (importing the needed classes from pyspark.ml.feature):<\/p>\n\n\n<pre>\nfrom pyspark.ml.feature import VectorAssembler, StringIndexer\n\nvectorAssembler = VectorAssembler(inputCols = ['SepalLength','SepalWidth','PetalLength','PetalWidth'], outputCol = 'features')\nv_iris_df = vectorAssembler.transform(df)\nv_iris_df.show(5)\n<\/pre>\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-49.png\" rel=\"nofollow\" target=\"_blank\"><img loading=\"lazy\" data-attachment-id=\"9636\" data-permalink=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/11\/advent-of-2023-day-11-starting-data-science-with-microsoft-fabric\/image-49\/\" data-orig-file=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-49.png\" data-orig-size=\"715,231\" data-comments-opened=\"1\" data-image-meta=\"{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}\" data-image-title=\"image-49\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-49.png?w=300\" data-large-file=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-49.png?w=605\" src=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-49.png?w=450\" alt=\"\" class=\"wp-image-9636\" srcset_temp=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-49.png 715w, https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-49.png?w=150 150w, https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-49.png?w=300 300w\" sizes=\"(max-width: 715px) 100vw, 715px\" data-recalc-dims=\"1\" \/><\/a><\/figure>\n\n\n\n<p>And encode the Species string column as a numeric label:<\/p>\n\n\n<pre>\nindexer = StringIndexer(inputCol = 'Species', outputCol = 'label')\ni_v_iris_df = indexer.fit(v_iris_df).transform(v_iris_df)\ni_v_iris_df.show(5)\n<\/pre>\n\n\n<p>Now, do the split:<\/p>\n\n\n<pre>\nsplits = i_v_iris_df.randomSplit([0.6,0.4],1)\ntrain_df = 
splits[0]\ntest_df = splits[1]\ntrain_df.count(), test_df.count(), i_v_iris_df.count()\n<\/pre>\n\n\n<p>Load the multilayer perceptron classifier and the evaluator for assessing the model:<\/p>\n\n\n<pre>\nfrom pyspark.ml.classification import MultilayerPerceptronClassifier\nfrom pyspark.ml.evaluation import MulticlassClassificationEvaluator\n<\/pre>\n\n\n<p>Now, let\u2019s play with the neural network and create the layer definition. We are using two hidden layers of 5 nodes each, so our layers array is [4,5,5,3] (4 input nodes, two hidden layers of 5 nodes, and 3 output nodes). And do the fitting:<\/p>\n\n\n<pre>\nlayers = [4,5,5,3]\nmlp = MultilayerPerceptronClassifier(layers = layers, seed = 1)\n# fit the model on the training data\nmlp_model = mlp.fit(train_df)\n<\/pre>\n\n\n<p>Once the training is completed, use the transform method on the test data frame with the model object from the previous step. We store the results in a data frame called <em>pred_df<\/em>, review some of the columns, and check the probabilities:<\/p>\n\n\n<pre>\npred_df = mlp_model.transform(test_df)\npred_df.select('features','label','rawPrediction','probability','prediction').show(10)\n<\/pre>\n\n\n<p>And finally, evaluate the model:<\/p>\n\n\n<pre>\nevaluator = MulticlassClassificationEvaluator(labelCol = 'label', predictionCol = 'prediction', metricName = 'accuracy')\nmlpacc = evaluator.evaluate(pred_df)\nmlpacc\n<\/pre>\n\n\n<p>And the results are pretty good:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-50.png\" rel=\"nofollow\" target=\"_blank\"><img loading=\"lazy\" width=\"408\" height=\"177\" data-attachment-id=\"9644\" data-permalink=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/11\/advent-of-2023-day-11-starting-data-science-with-microsoft-fabric\/image-50\/\" data-orig-file=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-50.png\" data-orig-size=\"408,177\" data-comments-opened=\"1\" 
data-image-meta=\"{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}\" data-image-title=\"image-50\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-50.png?w=300\" data-large-file=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-50.png?w=408&#038;resize=408%2C177\" src=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-50.png?w=408&#038;resize=408%2C177\" alt=\"\" class=\"wp-image-9644\" srcset_temp=\"https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-50.png 408w, https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-50.png?w=150 150w, https:\/\/tomaztsql.files.wordpress.com\/2023\/12\/image-50.png?w=300 300w\" sizes=\"(max-width: 408px) 100vw, 408px\" data-recalc-dims=\"1\" \/><\/a><\/figure>\n\n\n\n<p>Tomorrow we will continue with data science!<\/p>\n\n\n\n<p>The complete set of code, documents, notebooks, and all of the materials will be available in the GitHub repository:\u00a0<a href=\"https:\/\/github.com\/tomaztk\/Microsoft-Fabric\" rel=\"nofollow\" target=\"_blank\">https:\/\/github.com\/tomaztk\/Microsoft-Fabric<\/a><\/p>\n\n\n\n<p>Happy Advent of 2023! 
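For intuition, the accuracy metric computed above is just the share of rows where the predicted class equals the label. A minimal plain-Python sketch with toy labels (illustrative values, not the real model output):

```python
# Toy stand-ins for pred_df's 'label' and 'prediction' columns
labels      = [0, 0, 1, 1, 2, 2, 2, 1]
predictions = [0, 0, 1, 2, 2, 2, 2, 1]

# metricName='accuracy' in MulticlassClassificationEvaluator reduces to this ratio
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.875 (7 of the 8 toy rows match)
```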
<img src=\"https:\/\/i1.wp.com\/s0.wp.com\/wp-content\/mu-plugins\/wpcom-smileys\/twemoji\/2\/72x72\/1f642.png?w=578&#038;ssl=1\" alt=\"\ud83d\ude42\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" data-recalc-dims=\"1\" \/><\/p>\n\n<div style=\"border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;\">\r\n<div style=\"text-align: center;\">To <strong>leave a comment<\/strong> for the author, please follow the link and comment on their blog: <strong><a href=\"https:\/\/tomaztsql.wordpress.com\/2023\/12\/11\/advent-of-2023-day-11-starting-data-science-with-microsoft-fabric\/\"> R \u2013 TomazTsql<\/a><\/strong>.<\/div>\r\n<hr \/>\r\n<a href=\"https:\/\/www.r-bloggers.com\/\" rel=\"nofollow\">R-bloggers.com<\/a> offers <strong><a href=\"https:\/\/feedburner.google.com\/fb\/a\/mailverify?uri=RBloggers\" rel=\"nofollow\">daily e-mail updates<\/a><\/strong> about <a title=\"The R Project for Statistical Computing\" href=\"https:\/\/www.r-project.org\/\" rel=\"nofollow\">R<\/a> news and tutorials about <a title=\"R tutorials\" href=\"https:\/\/www.r-bloggers.com\/how-to-learn-r-2\/\" rel=\"nofollow\">learning R<\/a> and many other topics. <a title=\"Data science jobs\" href=\"https:\/\/www.r-users.com\/\" rel=\"nofollow\">Click here if you're looking to post or find an R\/data-science job<\/a>.\r\n<\/div>","protected":false},"excerpt":{"rendered":"<div style = \"width:60%; display: inline-block; float:left; \"> In this Microsoft Fabric series: We have looked into creating the lakehouse, checked the delta lake and delta tables, got some data into the lakehouse, and created a custom environment and Spark job definition. 
And now we need to see,\u2026Read more \u203a<\/div>\n<div style = \"width: 40%; display: inline-block; float:right;\"><\/div>\n<div style=\"clear: both;\"><\/div>\n","protected":false},"author":1281,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[4],"tags":[],"aioseo_notices":[],"jetpack-related-posts":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/posts\/380773"}],"collection":[{"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/users\/1281"}],"replies":[{"embeddable":true,"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/comments?post=380773"}],"version-history":[{"count":7,"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/posts\/380773\/revisions"}],"predecessor-version":[{"id":385281,"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/posts\/380773\/revisions\/385281"}],"wp:attachment":[{"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/media?parent=380773"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/categories?post=380773"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.r-bloggers.com\/wp-json\/wp\/v2\/tags?post=380773"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}