How to preprocess data for Machine Learning in .NET
Jacob Malling-Olesen is part of our Software Craftsman Program in which we ask our graduates and residents to write about different topics as part of their training to become Software Craftsmen.
Previously, doing ML usually meant using Python or R. Now, Microsoft has released ML.NET, a framework that enables ML natively in .NET. To test this framework, we did a project comparing ML in .NET to ML in Python. The project is based on historical weather data, and an ML model is trained to predict the future temperature.
Figure 1: The steps in the ML process. Part 1 of this article covers steps 1-3 about preparing data for ML, while part 2 covers steps 4-6 about training and evaluating an ML model.
Using ML.NET for temperature forecasting
Today, weather forecasting is done using complicated meteorological models and large amounts of collected data, such as temperature and atmospheric pressure from many different locations around the world. Using this data, the models produce simulations of what could happen based on what has happened in the past under the same or similar conditions. This approach is similar to how ML works, as ML uses previous data to try to predict future data. Simplifying the task to forecasting only the temperature, the prediction is done using ML. More specifically, ML.NET is used in order to test how to do ML with this framework. Python and scikit-learn are also used for the same task, and the two ML frameworks are then compared.
Data gathering
For weather prediction, free-to-use data from Silkeborg, Denmark was found. The data consists of per-minute values such as temperature, barometric pressure, humidity, rain and wind. This seems to be reasonable data for local temperature prediction. To increase forecasting precision, if needed, data such as whether the weather is cloudy or sunny would be nice to have, and perhaps also the weather of surrounding cities to know what weather to expect when taking the wind into account.
Data cleaning
Like most data, this weather data is not immediately ready for use, so some cleaning is needed. The raw data from Silkeborg comes in one Excel file per year, with each new row being the next one-minute timeslot. First, the files are manually converted from .xlsx to .csv files, all containing a header that explains the columns. The data is then checked for faulty/missing values. During this process the 2008 data set is discarded, as half the values had been partly shifted into the next value field. Some parts of the kept data use a comma as the decimal separator while others use a dot, and the barometer column uses a dot as the thousands separator, which does not work well when loading the data. All of this needs to be handled in the data cleaning step.
Since ML.NET has limited support for data cleaning, this is done in plain C#. The cleaning parses the raw string data to floats. The only special handling needed was normalizing the decimal separator in the temperature column, which was done with a standard string replace. The headers are also removed, as ML.NET does not use them when reading the data from disk.
public MinuteData CleanSinglePoint(RawData input)
{
    try
    {
        return new MinuteData(
            day: float.Parse(input.Day),
            month: float.Parse(input.Month),
            year: float.Parse(input.Year),
            hour: float.Parse(input.Hour),
            minute: float.Parse(input.Minute),
            temperature: float.Parse(input.Temperature.Replace('.', ',')),
            humidity: float.Parse(input.Humidity),
            dewpoint: float.Parse(input.DewPoint),
            barometer: float.Parse(input.Barometer),
            windspeed: float.Parse(input.WindSpeed),
            gustspeed: float.Parse(input.GustSpeed),
            winddirection: float.Parse(input.WindDirection),
            rainminutely: float.Parse(input.RainMinutely),
            raindaily: float.Parse(input.RainDaily),
            rainmonthly: float.Parse(input.RainMonthly),
            rainyearly: float.Parse(input.RainYearly),
            heatindex: float.Parse(input.HeatIndex));
    }
    catch (FormatException)
    {
        // Records with faulty/missing values cannot be parsed and are skipped.
        return null;
    }
}
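The headers can be dropped because ML.NET identifies columns by position rather than by name when reading from disk. As a minimal sketch of how the cleaned, headerless file could later be loaded, assuming a semicolon-separated file and an illustrative input class (the class name, file layout and LoadColumn indices are assumptions, not the project's actual code):

using Microsoft.ML;
using Microsoft.ML.Data;

// Columns are identified by position, so no header row is needed.
// One property per CSV column; only a few are shown here for illustration.
public class MinuteDataInput
{
    [LoadColumn(0)] public float Day { get; set; }
    [LoadColumn(1)] public float Month { get; set; }
    [LoadColumn(5)] public float Temperature { get; set; }
    // ... remaining columns ...
}

public static class DataLoading
{
    public static IDataView LoadCleanedData(MLContext mlContext, string path)
    {
        // Assumes the cleaned file is semicolon-separated and has no header row.
        return mlContext.Data.LoadFromTextFile<MinuteDataInput>(
            path, separatorChar: ';', hasHeader: false);
    }
}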
In Python the cleaning is done a little differently, as the Pandas library is used to hold the data because it works nicely with scikit-learn. Pandas data frames can be loaded directly from disk and use a header to know where each column fits. The data frames are not needed for the cleaning process itself, so plain dictionaries are used here. As Python is dynamically typed, values with a dot as decimal separator are read as floats, while values with a comma are read as strings. The first cleaning step is therefore to change all the commas to dots, again using a replace function. Since the barometer column has a thousands separator, this also has to be removed in the Python version, as there is no explicit parsing to floats. Lastly, because one of the headers changed name in the data, this has to be handled as well, and all the headers in the dictionary are changed to consistent English names for the further processing.
def CleanSinglePoint(self, record):
    if 'Heatindex' in record.keys():
        return {'Day': record['Dag'],
                'Month': record['Måned'],
                'Year': record['År'],
                'Hour': record['Time'],
                'Minute': record['Minut'],
                'Temperature': record['Temperatur'].replace(",", "."),
                'Humidity': record['Fugtighed'].replace(",", "."),
                'Dewpoint': record['Dugpunkt'].replace(",", "."),
                'Barometer': record['Barometer'].replace(".", "").replace(",", "."),
                'Windspeed': record['Vindhastighed'].replace(",", "."),
                'Gustspeed': record['Vindstød'].replace(",", "."),
                'Winddirection': record['Vindretning'].replace(",", "."),
                'Rain - minutely': record['Regn - minut'].replace(",", "."),
                'Rain - hourly': record['Regn - daglig'].replace(",", "."),
                'Rain - monthly': record['Regn - månedlig'].replace(",", "."),
                'Rain - yearly': record['Regn - år'].replace(",", "."),
                'Heatindex': record['Heatindex'].replace(",", ".")}
Data transformation
Now that the data is cleaned, it needs to be transformed. Since the weather does not change much from one minute to the next, moving from a minute scale to an hour scale reduces the computational requirements while keeping the information needed from the data.
In C# this is done using an aggregate function on a list of minute data in the same hour:
HourData result = input.Aggregate(
    new HourData(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    (a, b) =>
    {
        return new HourData(
            day: day,
            month: month,
            year: year,
            hour: hour,
            temperature: (float)((decimal)a.Temperature + (decimal)b.Temperature),
            humidity: (float)((decimal)a.Humidity + (decimal)b.Humidity),
            dewpoint: (float)((decimal)a.DewPoint + (decimal)b.DewPoint),
            barometer: (float)((decimal)a.Barometer + (decimal)b.Barometer),
            windspeed: (float)((decimal)a.WindSpeed + (decimal)b.WindSpeed),
            gustspeed: (float)((decimal)a.GustSpeed + (decimal)b.GustSpeed),
            winddirection: (float)((decimal)a.WindDirection + (decimal)b.WindDirection),
            rain: (float)((decimal)a.Rain + (decimal)b.RainMinutely));
    },
    (a) => new HourData(
        day: a.Day,
        month: a.Month,
        year: a.Year,
        hour: a.Hour,
        temperature: a.Temperature / inputSize,
        humidity: a.Humidity / inputSize,
        dewpoint: a.DewPoint / inputSize,
        barometer: a.Barometer / inputSize,
        windspeed: a.WindSpeed / inputSize,
        gustspeed: a.GustSpeed / inputSize,
        winddirection: a.WindDirection / inputSize,
        rain: a.Rain)); // Summing rain instead of averaging it.
In Python this is done using an accumulating for loop going through a list of minute data in the same hour:
for record in recordsOfUpTo60:
    hourRecord['Temperature'] += ast.literal_eval(record['Temperature']) / len(recordsOfUpTo60)
    hourRecord['Humidity'] += ast.literal_eval(record['Humidity']) / len(recordsOfUpTo60)
    hourRecord['Dewpoint'] += ast.literal_eval(record['Dewpoint']) / len(recordsOfUpTo60)
    hourRecord['Barometer'] += ast.literal_eval(record['Barometer']) / len(recordsOfUpTo60)
    hourRecord['Windspeed'] += ast.literal_eval(record['Windspeed']) / len(recordsOfUpTo60)
    hourRecord['Gustspeed'] += ast.literal_eval(record['Gustspeed']) / len(recordsOfUpTo60)
    hourRecord['Winddirection'] += ast.literal_eval(record['Winddirection']) / len(recordsOfUpTo60)
    hourRecord['Rain'] += ast.literal_eval(record['Rain - minutely'])
return hourRecord
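Both versions above operate on one hour's worth of minute records at a time. One way to produce those per-hour lists in C# is to group the cleaned minute data on its timestamp; a minimal sketch, assuming MinuteData and HourData expose the same properties used in the aggregation above and taking the per-hour aggregation as a delegate:

using System;
using System.Collections.Generic;
using System.Linq;

public static class HourAggregation
{
    // Groups the minute records by (year, month, day, hour) and lets a caller-supplied
    // function (e.g. a wrapper around the Aggregate call above) reduce each group
    // to a single HourData record.
    public static List<HourData> ToHourlyData(
        List<MinuteData> minuteData,
        Func<List<MinuteData>, HourData> aggregateHour)
    {
        return minuteData
            .GroupBy(m => (m.Year, m.Month, m.Day, m.Hour))
            .Select(group => aggregateHour(group.ToList()))
            .ToList();
    }
}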
Next, the data is transformed into time series data. This means each data point is expanded to contain the temperature to predict together with the weather data from the five preceding hours.
In C# this is done by creating a TimeseriesData class, which contains a float value and a list of HourData. The transformed TimeseriesData is then written to a CSV file on disk, separating each of the values with a ";". Since each HourData contains 12 values and each data point holds five hours of history, a time series data point ends up having 60+1 values per row in our CSV file. The transformation code in C#:
for (int i = hoursBack; i < input.Count(); i++)
{
    List<HourData> previousHours = input.Skip(i - hoursBack).Take(hoursBack).ToList<HourData>();
    output.Add(new TimeseriesData(input.ElementAt(i).Temperature, previousHours));
}
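The write-to-disk step mentioned above is not part of the transformation loop. Below is a minimal sketch of how each TimeseriesData row could be flattened into a ";"-separated line; the Temperature and PreviousHours property names are assumptions about how TimeseriesData exposes its values, while the HourData fields match those used in the aggregation earlier:

using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

public static class TimeseriesCsvWriter
{
    // Flattens each record (1 target temperature + 5 hours x 12 values = 61 values)
    // into one ";"-separated line.
    public static void Write(string path, IEnumerable<TimeseriesData> records)
    {
        var lines = records.Select(r =>
            string.Join(";",
                new[] { r.Temperature }
                    .Concat(r.PreviousHours.SelectMany(h => new[]
                    {
                        h.Day, h.Month, h.Year, h.Hour, h.Temperature, h.Humidity,
                        h.DewPoint, h.Barometer, h.WindSpeed, h.GustSpeed,
                        h.WindDirection, h.Rain
                    }))
                    .Select(v => v.ToString(CultureInfo.InvariantCulture))));

        File.WriteAllLines(path, lines);
    }
}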
In Python, since the headers have to be kept, the new header names are generated in a function:
def AddPreviousFeatures(self, recordsBack, currentRecord):
    timeseriesRecord = {'FutureTemperature': currentRecord['Temperature']}
    recordsBack.reverse()
    for i in range(0, len(recordsBack)):
        for key in recordsBack[i].keys():
            timeseriesRecord["{}_{}".format(key, i)] = recordsBack[i][key]
    return timeseriesRecord

def Transform(self, records, hoursBack):
    timeseriesRecords = []
    for i in range(0, len(records) - hoursBack):
        timeseriesRecords.append(self.AddPreviousFeatures(records[i:i+hoursBack], records[i+hoursBack]))
    return timeseriesRecords
Now that the data is cleaned and transformed it is ready for training. Read more in the article How to make predictions using Machine Learning.
About the author
Jacob Malling-Olesen is a computer scientist specialised in Machine Learning. He is now a graduate attending our Software Craftsman Program. We encourage our graduates to write about the things they learn as we believe that software craftsmanship is based on continuous learning.