Thursday, March 15, 2012

Follow-up: Weather Index in Excel

This exercise was originally posted at:http://maritimeorca0.blogspot.com/2012/02/weather-index-in-excel.html
 

The Task

Create a weather index using spreadsheet software such as Microsoft Excel.

Grab weather data online and use that data to develop a seemingly objective measure of how nice the weather is each day, month, and/or year.


My Solution

My idea for a weather index started when I was exposed to someone else's weather index. That index was favorable to warm and sunny. Since I hate warm and sunny, I wanted something that made cool and wet weather look nicer. Cool and wet weather is common in Washington, so I decided to look at averages.

My Excel spreadsheet through January 31, 2012 can be found at https://skydrive.live.com/redir.aspx?cid=2e6efb80915ed5bd&resid=2E6EFB80915ED5BD!3775&parid=2E6EFB80915ED5BD!3774. Needless to say, my explanation will not make very much sense if you are not looking at the spreadsheet.

I took data from http://komonews.s3.amazonaws.com/seatac_daily.txt and http://www.beautifulseattle.com/clisumm.htm then corrected a few problems with the data. I also formatted the data to line up properly and created a table named TblDaily.

I labeled the high temperature column "High" and the low temperature column "Low." I then took their averages and found that they were pretty close to my preferences. I would base my index off of these averages. I also decided that the scale should reflect the extremes. In short, I was going to use percentages with extremes scoring 0% and averages scoring 100%. Below the columns for high and low, I placed structured formulas (I hadn’t used them before and wanted to give them a try) to calculate the minimum, average, and maximum.
Calculation Formula Description
Minimum High =MIN(TblDaily[High]) In TblDaily, find the values in the "High" column and return the minimum value.
Average High =AVERAGE(TblDaily[High]) In TblDaily, find the values in the "High" column and return the average value.
Maximum High =MAX(TblDaily[High]) In TblDaily, find the values in the "High" column and return the maximum value.
Minimum Low =MIN(TblDaily[Low]) In TblDaily, find the values in the "Low" column and return the minimum value.
Average Low =AVERAGE(TblDaily[Low]) In TblDaily, find the values in the "Low" column and return the average value.
Maximum Low =MAX(TblDaily[Low]) In TblDaily, find the values in the "Low" column and return the maximum value.
In addition to the highs and lows, I determined the average ("Avg" column) between the high and low for each day. The same minimum, average, and maximum was also calculated. These were not used in my final score. These calculations were primarily out of curiosity.

Of course, temperatures aren't everything. The evergreen state wouldn't be so evergreen without refreshing showers. Each day includes rainfall totals ("Rain" column). I also wanted to include cloudy days since that is another thing that is typical in Washington and something that I enjoy. Unfortunately, there is no reliable data for cloudiness. Instead, I included whether or not there was any rain. This convinced me to create two more columns. These columns added consistency to "Trace" measurement ("Tr" column) and placed a numerical value for rainfall for each day, even those with a trace ("RainTot" Column). I have included the same minimum, average, and maximums. I have also calculated the number of wet and dry days to show that we are typically wet.
Calculation Formula Description
Determine if Trace =IF(OR([@Rain]="0.00T",[@Rain]="Trace "),"TRACE","") For each day, determine if the value for rain is either of the two values provided from the two sources used to indicate trace amounts. If it is, return "TRACE." If not, return an empty string.
Numerical Representation of Rainfall =IF([@Tr]="TRACE",0,[@Rain]) For each day, whether or not rain is presented as a number depends on whether or not there was trace amount. If it was trace, I will treat it as zero for the rainfall (but it will still count as a wet day). If it isn't trace, the total in the "Rain" column should already be in the proper numerical format.
Minimum Rainfall =MIN(TblDaily[RainTot]) In TblDaily, find the values in the "RainTot" column and return the minimum value.
Average Rainfall =AVERAGE(TblDaily[RainTot]) In TblDaily, find the values in the "RainTot" column and return the average value.
Maximum Rainfall =MAX(TblDaily[RainTot]) In TblDaily, find the values in the "RainTot" column and return the maximum value.
Dry Days =COUNTIF(TblDaily[Rain],0) Count the cells in the "Rain" column that had zero rain recorded. Since trace amounts are not recorded as zero, they will not be counted.
Wet Days =COUNTIF(TblDaily[Rain],"<>0") Count the cells that show anything that is not equal to zero in the "Rain" column. Since "0.00T" and "Trace" are not the same as 0, they will be counted.
Now, we can start to make some real calculations. I have decided to generate a score for high temperature, low temperature, rainfall, and whether or not the day was wet. These scores will use a consistent system and then find an average. I start by looking for how far from average each number is. I will also use the extremes as the furthest from average that can be recorded. This is a modified percentage. The average will be recorded as 0, the extremes will be recorded as 1, and immediately in between the average and either extreme should be recorded as 0.5. Since I will be using the averages and extremes, I have named the cells where these values are located.
Minimum Average Maximum
High MinHigh AvgHigh MaxHigh
Low MinLow AvgLow MaxLow
Rain MinRain AvgRain MaxRain
Let me explain the rain formula. This can be found for each day in the "RainDif" column:

=IF([@RainTot]>AvgRain,([@RainTot]-AvgRain)/(MaxRain-AvgRain),(AvgRain-[@RainTot])/(AvgRain-MinRain))

Usually, it's easiest to explain Excel formulas in Excel by starting in the middle and working your way out. For this formula, I'm going to start in the opposite direction. Let me just write it in a different way:

=IF([@RainTot]>AvgRain,Formula1,Formula2)

The way that I'm approaching this value, the formula has to vary depending on whether the rainfall is above or below average. For each day, I'm determining if the rainfall for that day ([@RainTot]) is above or below average (AvgRain) to determine which formula to use. If it's above average, we will use Formula1. If it isn't above average (below or equal to the average), we will use Formula2.

Now let's look at Formula1:

([@RainTot]-AvgRain)/(MaxRain-AvgRain)

(MaxRain-AvgRain) reperesnts the difference between average rainfall and the maximum rainfall recorded. ([@RainTot]-AvgRain) represents how many inches above average rainfall was recorded for that day. By dividing the amount a rain recorded above average to the highest difference currently possible, we will generate a value of 0 to 1 that we will be able to convert to a percentage later.

Formula2 is next:

(AvgRain-[@RainTot])/(AvgRain-MinRain)

This is the same concept, but we are now looking at the rainfall below the average for each day and finding a value between 0 and 1. Both Formula1 and Formula2 will provide a value of 0 if we enter the average. Both formulas will also provide a value of 1 for an extreme (Maximum or minimum rainfall). This means that we can rely on the value calculated regardless of whether we are above or below average.

To avoid a lot of boring repetition, I'm not going to teach people math that they should already know or repeat the explanation that I just provided. For high and low temperatures, I essentially used the same formulas. These formulas will provide the values in the same 0 to 1 range with 0 being average and 1 being extreme.

Rather than going straight to the final percentages, I am going to separate the scores into temperature and rainfall. Temperature is easy because I have already figured out every component to include. To consolidate high and low temperature scores, we are going to take the average. There's just one slight problem. Our earlier calculations have 0 for average and 1 for extremes. It's not very intuitive that a score of 0% will be ideal and 100% being the worst scenario possible. To find the overall score for the temperature, here's my formula:

=1-AVERAGE(TblDaily[@[HighDif]:[LowDif]])

This takes the average between the high and low scores, then subtracts the result from 1. I have also formatted the cells to display percentages. This will give a day with the worst case score of 1 for both high and low a score of 0%, and matching the averages perfectly will see two zeros converted to 100%.

Rain is a little trickier simply because I had only calculated values based on rainfall totals. I also wanted to use whether or not there was any rainfall. Rather than make a separate calculation, I decided to wait for the final score for rain. In other words, I included it with the following formula:

=IF([@Rain]=0,0,(1+(1-[@RainDif])*2)/3)

Let's start with the obvious. As you saw with the temperatures, I have subtracted the previous rain value from 1 to give the higher score to the more desirable weather. Then I decided to provide a weighted average for the score. Since whether or not it rained did not match my initial idea, I cut it to half of the weight of the rainfall totals. Okay, maybe that's not technically accurate. I doubled the weight of the rainfall totals. If there is any rain at all, it will be given the 100% for that specific element of the weather. That's why I am adding the 1 to the doubled weight of the rainfall amounts. If it rains, this is an average between the 100% for any rain, the rainfall amounts, and the rainfall amounts again. For the average to be calculated correctly, we don't need to use the average function. We can simply divide by 3.

Obviously, the average only works if there was rain. If there wasn't, we will match the minimum rainfall of 0. With the calculations that we have been using, that 0 would be converted to the score of 1 and then back to 0% (perhaps I could have streamlined some of my earlier formulas when I was still trying to figure out what I was doing). There would also be a 0% score for a dry day. If we took the same average that we figured for wet days, we would average out to 0% for any dry day. There's no need to make Excel repeat the calculation, so we can simply say that dry days score 0 and wet days are calculated averages.

Next, I had to combine the temperature and rainfall scores into a final score. Similar to my rain scores, I felt that I should do a weighted average. After all, I have one legitimate rainfall component (rainfall amounts) and two legitimate temperature components (high and low). I gave temperatures the double weight in the following formula:

=([@TempScore]*2+[@RainScore])/3

After finishing the core calculations, I calculated the minimums, averages, and maximum for the final scores. As with the average temperature column, this was mostly out of curiosity. Not surprisingly, July 29, 2009 scored 0%. This was the day that we recorded our highest high, our highest low, and no rainfall. These are all extremes, so they all averaged to the worst score possible.

Since the number of days in a month is variable, averages seem to be the best way to rate the months. For these monthly ratings, I turned away from structured formulas. Part of the reason for this is due to the inability to use absolute references in structured formulas. I also didn't want to have to complicate the formulas too much. Structured formulas really are best used when everything in a column uses the same formula. For days 29, 30, and 31 for each month, not entering a formulas is simpler than creating a formula to determine whether or not data should be produced.

I created a sheet called "Months" with a table called "tblMonthly." Each month should have a single identifier that includes the month and year. Excel stores dates as day, month, and year. I simply used the first of each month in my "Date" column. I also used a custom format. That format effectively hides the data.

While I wanted the unique identifier, I also wanted quick references to the month and year for the purposes of sorting and filtering. I didn't think of using a custom format (mmmm) and wanted separate columns for month and year, so I used a slightly more complex formula instead. The basic idea is simple. I could use the month function to convert my date in the first column to a value representing the month:

=MONTH(A2)

Personally, I would rather see the actual name of the month rather than the numerical representation. I decided to use a table to convert these values into the appropriate months. I created a new sheet called "Tables" where I would place this table. At that point, I simply had 1 through 12 listed right next to their appropriate months. I then named the table "tblMonths" and adjusted the formula for the month to read:

=VLOOKUP(MONTH(A2),tblMonths,2)

This formula takes the value for the month, finds it in the table that I created, then returns the value from the second column of the table where the names of the months are located.

Calculating the year was a little easier since the numerical value would not need to be converted from a numerical representation of the year. All I needed was the following:

=YEAR(A2)

The next step was to put the information from each day into each month. This is a little complicated. I had to find the day, month, and year for each value that I wanted to place then use that information to pull the appropriate value from tblDaily. I have the value for the days of month as a header for the table. With the value of 1 column D, I can identify the the first day of the month as D$1. When I copy the formula, the relative column will change to the header representing that cell's day. The row, however, will always refer to the header. That's why I have the dollar sign in front of the 1.

The year has already been pulled from the original date identifier and placed in column C. That allows us to use $C2 as the reference for the first entry's year. The year will always be in column C, so I have added the dollar sign. Since the year for each value will be associated with that row's date, the row will remain relative.

The month isn't quite so easy since I had already converted it from a value usable to Excel to text that is easier to be read by humans, we have to go back to the original formula for the month. MONTH($A2). Like the year, the reference is mixed with the column being absolute (always A) and the row being relative (row changes for each date).

The next step is to put this information back together to create a value for the date that we want to find values for. This date must be readable by Excel. The following formula should work:

=DATE($C2,MONTH($A2),D$1)

Of course, this formula would only post the date. It seems that a single formula for each date would be better than multiple formulas to reach the final calculation. Sometimes it's better to start with what you already have to avoid making mistakes. For that reason, I will simply refer to the date that we just established as "DateValue" in this text for now. To ensure that the formula works, a cell can be used as a temporary location for this formula and referenced by the formula that you are trying to build. The DateValue should match exactly 1 value from tblDaily. We can use that value, look in tblDaily, ensure that we find a perfect match (in case they are not sorted properly), and return the value that we calculated for the day's score:

=VLOOKUP(DateValue,TblDaily,13,FALSE)

In case you didn't already know, the false at the end of this formula indicates that an exact match is used rather than the first value that is the same or above. Also, the score that we are looking for is in the 13th column of tblDaily. Since this works, we can simply replace DateValue with the formula that it represents (if temporarily contained in a cell, click on that cell and copy everything from the formula bar except for the equal sign then paste over the temporary cell reference). This results in the following:

=VLOOKUP(DATE($C2,MONTH($A2),D$1),TblDaily,13,FALSE)

The first date that I have data for is June 7, 1948. I am starting from the beginning, and there is no data for June 1, 1948. Excel will return #N/A for this formula. A formula could be created to address this error as well as issues with months that shouldn't draw anything for the 29th, 30th, and 31st. Instead, I filled the formula, then deleted the formula for each date that should be empty. To simplify deletion, I sorted by month. This allowed me to delete days at the end from the short months (less than 31 days) as a block rather than individually.

After the columns for each day of the month, I added a column for averages. These averages will be viewed as the score for each month. This is a simple formula (filled down) that takes the averages from the 1st and 31st of each month:

=AVERAGE(D2:AH2)

Since I created this list as a table, I can filter and sort quite easily. For example, I can sort to only include March scores and sort by averages to find that 2011 was our best March ever (It definitely was a good March by my standards, so the formulas seems to work well enough).

In addition to monthly scores, I wanted yearly scores. I added another sheet called "Years" with a table called "tblYearly." I started with a column for the year starting in 1948 and filled down. I also gave the next twelve columns headings reflecting each of the twelve months. I used a date format with the month and year. For the year, I used '08 since that is when I first played around with this project. I have to use a VLOOKUP function again, so I will start by explaining the parameters. The date is the unique identifier to use, so I will have to generate a value for the month. Since the months in tblMonthly are defined by a month, a year, and a day value of 1, these are the values that have to be used. The year can be found in the first column of each row:

=$A2

The month can be pulled from the column heading:

=MONTH(B$1)

The date value can be calculated with the following formula:

=DATE($A2,MONTH(B$1),1)

This will be the lookup value for our VLOOKUP formula. We will also use tblMonthly for our range of data. The average is located in the 35th column of the table. Since I set up tblMonthly to be sorted in different ways and I have a unique identifier for each month, we will set the final parameter to FALSE to look for an exact match. This creates the following formula:

=VLOOKUP(DATE($A2,MONTH(B$1),1),tblMonthly,35,FALSE)

Before I continue with the yearly scores, I decided to add a little more to the table. I update monthly and want to be able to check after each month how we are doing so far in the year. In other words, I wanted to add a year to date column. To get this to work and calculate the average properly, I have to take into account the different lengths of the months. Of course, I would have had to do this for yearly anyway. For starters, I made a quick note of how many days each February has. I manually added 28 or 29 for each year in the "Feb Days" column.

For each month, I had to calculate a value that can be used for the final averages. I manually entered the formula for each month by multiplying the score for the month by the number of days. For example, here's the January formula:

=B2*31

Since February is variable, I multiplied that month's averages by the value that I had inserted in the "Feb Days" column:

=C2*N2

After entering the formula for each month, I filled down. I also edited to address missing data from 1948. I deleted that year’s formulas from January to May for both the score pulled from tblMonthly and the totals calculated. I also had to modify the formula for the June total to:

=G2*24

I added notes to the edited cell to indicate that I only had data for 24 days. Then I got to work on the year to date totals. Since I have been calculating monthly totals, finding the averages required me to sum those totals to the right months and divide those totals by the number of days in that time frame. The more complicated part of this would be finding out how many days in each period. I decided to edit tblMonths (Where I pulled the month name from the numerical value) to include the year to date totals for each month and include a separate column of the same for leap years. The first column that I added to this table shows the number of days for each month which I entered manually, including "28 or 29" for February. I entered 31 in both YTD columns for January. For the YTD – Feb 28 column, I enter the formula for February as:

=E2+28

For the YTD – Feb 29 column, 28 was replaced by 29 and E2 was replaced by F2. The remaining days, I replaced the 28 and 29 with [@Days] and filled. I probably would have used the more typical references if I had entered it manually, but I let Excel do the work for me on those formulas. The numbers for December resulted in 365 and 366. Those values are correct, so I can use them.

Once again, I used the VLOOKUP function to find these values. There's just one slight problem. I want to label these cells by the month range. I can't use those column headings to find the right values. There is a way to cheat, however. I already entered the months in another series of columns. If I reference those column headings, I could still fill instead of having to manually enter the month for each column. For example, January can be referenced by:

=MONTH(B$1)

Unlike previous use of the VLOOKUP function, the column that we will be drawing the value from will vary depending on whether or not it's a leap year. If it's a leap year, we will look in the sixth column. In the normal years, we will look in the fifth column. This can be represented as:

=IF($N2=28,5,6)

To pull the number of days in the time frame, we use the value of the month to look at the fifth or sixth column in tblMonths as in the following formula:

=VLOOKUP(MONTH(B$1),tblMonths,IF($N2=28,5,6))

For the calculated year to date totals that we will eventually divide by this number, we can use a SUM formula where January is always the first in the range and the final value is relative and changes with the formula:

=SUM($O2:O2)

Filling this formula would put the following for January-December:

=SUM($O2:Z2)

Since this is just part of the final calculation, the actual formula for January-January is:

=SUM($O2:O2)/VLOOKUP(MONTH(B$1),tblMonths,IF($N2=28,5,6))

Like before, I had to edit 1948 and place a comment to indicate missing data. In short, I calculated the days that I have missing and subtracted. This allowed me to fill the formula within that year. Here's the formula for January-June:

=SUM($O2:T2)/(VLOOKUP(MONTH(G$1),tblMonths,IF($N2=28,5,6))-158)

The primary purpose of this yearly table was to calculate yearly scores. My year to date approach does just that. January-December makes up a full year and can be used as the yearly score. This value can be sorted to find the best and worse years on record. You can also hide the "Feb Days" and monthly total columns since they were primarily meant as intermediate calculations.
 

Notes

I had planned on creating a Google Spreadsheets variant, but I bumped into their limitations and couldn't complete the spreadsheet. I also bumped into limitations in WebApps. The file can be viewed, but you will need Excel installed on your computer to open the file with the formulas. The particular spreadsheet that I have shared will not likely be updated. I will continue updating a personal copy. This spreadsheet could be revisited for future challenges.

I should also point out that the nature of my approach could theoretically be used to check how other climates compare to typical Washingtonian weather. If you had the data handy (and referenced the averages and extremes determined for Washington), you could check the Washingtonian Weather Index for cities such as New York or Los Angeles. Similarly, a related index cold be made for another location and applied to Washington's weather.
 

Feel Free to Comment

Now that this exercise has concluded, feel free to share any thoughts that you had regarding results, your approach, and any other ideas that popped in your head regarding this exercise.I do not provide storage to those who visit my blog. If you would like to share any files, you are going to have to find your own online storage. At that point, you can provide links with your comments. I would like to point out that macros can create security risks. If possible, save all documents as a standard workbook (not macro-enabled) unless macros are vital to your solution. If you are going to download any solutions, make sure that you are not loading macros by default. If a spreadsheet includes macros, check the code before running them.
 

About Sharing

Anyone referring to my Washingtonian Weather Index should give me credit. I will not demand credit for any other localized variations. You can even take credit for your own spreadsheet for the Washingtonian Weather Index as long as you credit me for developing the concept. Unless otherwise specified, similar guidelines should apply to solutions from other contributors.
 

Upon completion of this exercise, please fill out the following form:

https://docs.google.com/spreadsheet/viewform?formkey=dEZWUjgxaVpNVWs1NzhRWGhJVGNsd3c6MQ

Responses to the above form are collected and stored in a spreadsheet that can be viewed at:

https://docs.google.com/spreadsheet/pub?key=0Agcdh--NHOS0dEZWUjgxaVpNVWs1NzhRWGhJVGNsd3c&output=html

No comments:

Post a Comment