## Association Rule Learning Pt 1

Association rule learning is a common technique for rule discovery.  It predicts the occurrence of an event or item from the occurrences of other items or events.  The most common example is market analysis, literally, at the market.  If we look inside people’s shopping carts to determine which items are likely to be bought together, association rule learning answers that by counting how often items occur together.  Each rule has the form A -> B, where A and B can each contain one or more items.

In this case, I’ve decided to try it out to look for appliances and end-use devices that are generally linked.  Two parameters need to be specified. Their values can be arbitrary to set, but the parameters themselves make intuitive sense.  The support answers the question, “How common is it for A & B to occur together?”  The higher the support, the more opportunities there are to act on that itemset.  The confidence indicates how exclusive the relationship between A & B is, which helps in predicting the consumption of one end use given information about another’s usage.

## Definitions

Itemset: a collection of one or more items.

Support: the fraction of all transactions that contain both A and B.

$$S = \frac{\#(A, B)}{\#\,\text{transactions}}$$

Confidence: the fraction of the transactions containing A that also contain B.

$$C = \frac{\#(A, B)}{\#(A)}$$

Here #(X) counts the transactions that contain X.

Frequent itemset: an itemset whose support is greater than or equal to a threshold.

Candidate set: the set of itemsets that require testing.
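
To make the two measures concrete, here is a minimal sketch that computes them on a toy set of transactions. The device names and the five transactions are made up for illustration and are not taken from the dataset.

```python
# Toy transactions: each set holds the devices that were "on" in one time step.
# Device names and values are invented for illustration.
transactions = [
    {"tv", "roku", "subwoofer"},
    {"tv", "roku"},
    {"tv"},
    {"lamp"},
    {"roku", "subwoofer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of A and B together divided by the support of A alone."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(both, transactions) / support(a, transactions)

s = support({"roku", "subwoofer"}, transactions)       # 2 of 5 transactions
c = confidence({"roku"}, {"subwoofer"}, transactions)  # 2 of the 3 with roku
```

In this toy set the rule roku -> subwoofer has a support of 2/5 and a confidence of 2/3.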

## Application

The dataset I am using for this is the Smart* dataset produced by UMass Amherst. It contains power traces from 3 smart homes, but only Home A has individual meter data, along with labeled on/off events and voltage/frequency data. There are 47 separately measured devices and circuits in the house, which we can treat as our available items. First, we notice that the data is very sparse and skewed towards a few end-uses that dominate the entire set. The split between baseload and non-baseload devices is pretty apparent from how often each appears (except for the TV, whose count may not be accurate since most of its data occurred on one day of the months examined).

Expressed as a fraction of the highest occurring item, for May the top 5 are:

1. living room tv: 1.0
2. living room lamp: .157
3. living room lamp2: .155
4. living room dvd: .154
5. living room receiver: .153

This is different for June, where the top 5 are:

1. living room tv: 1.0
2. basement dehumidifier: .230
3. master bedroom mac mini: .099
4. master bedroom ac: .032
5. basement freezer: .032

Given the large difference, for now I decided to limit the algorithm to devices that account for no more than 1.5% of the entire set, effectively cutting out devices that are more likely than not to be baseload. However, I can try other ways of preprocessing the power readings to address what appear to be phantom loads or disparities between the sensing intervals of each device.
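
As a rough sketch of this kind of frequency screen, the snippet below counts on-observations per device, expresses them as a fraction of the most frequent device, and keeps only devices under a cutoff. The event log, the device counts, and the choice to measure the cutoff as a share of all on-observations are assumptions for illustration; the cutoff value here is a toy number, not the 1.5% used above.

```python
from collections import Counter

# Hypothetical on-event log: one (timestamp, device) pair per observation in
# which the device was on. The names and counts are invented.
on_events = (
    [(t, "livingroom:tv") for t in range(100)]
    + [(t, "livingroom:lamp") for t in range(16)]
    + [(t, "basement:freezer") for t in range(3)]
    + [(t, "bedroom:noise") for t in range(1)]
)

counts = Counter(device for _, device in on_events)
top = max(counts.values())
total = sum(counts.values())

# Occurrences expressed as a fraction of the most frequent device.
relative_to_top = {d: n / top for d, n in counts.items()}

# Share of all on-observations, used here for the cutoff (an assumption).
share_of_total = {d: n / total for d, n in counts.items()}

def keep_devices(share, max_share):
    """Return the devices whose share of all observations is at most max_share."""
    return sorted(d for d, s in share.items() if s <= max_share)

kept = keep_devices(share_of_total, max_share=0.15)
```

With these toy counts the TV dominates and is dropped, while the three less frequent devices are kept.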

Anyway, for the purposes of trying out an algorithm to see what we can find, an Apriori search over the candidate sets revealed some interesting rules which confirm what I would suspect. For each month, rules were extracted and ranked in descending order of support and of confidence.  (Min Support = .02, Min Confidence = .3)
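
For readers unfamiliar with the search, here is a minimal pure-Python sketch of a level-wise Apriori pass followed by rule extraction. It is not the code used for the results in this post, and the toy transactions and thresholds are invented; on the real data the thresholds would be the Min Support and Min Confidence values given above.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for every frequent itemset (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and keep those meeting the support threshold.
        level = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, pruning any candidate
        # that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent

def rules(frequent, min_confidence):
    """Yield (antecedent, consequent, support, confidence) for each strong rule."""
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = s / frequent[ante]
                if conf >= min_confidence:
                    yield ante, itemset - ante, s, conf

# Toy transactions with invented device names.
toy = [
    {"roku", "subwoofer", "wii"},
    {"roku", "subwoofer"},
    {"nitelight", "noise"},
    {"nitelight", "noise"},
    {"lamp"},
]
frequent = apriori(toy, min_support=0.4)
found = sorted(
    (sorted(a), sorted(b), round(s, 2), round(c, 2))
    for a, b, s, c in rules(frequent, min_confidence=0.3)
)
```

The pruning step is what keeps the candidate set manageable: an itemset is only tested if all of its smaller subsets were already found to be frequent.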

For May

•First 10 rules by support:
•Rule   (Support, Confidence)

livingroom:subwoofer -> livingroom:wii (7.8085%, 84.8601%)
livingroom:wii -> livingroom:subwoofer (7.8085%, 98.9614%)
livingroom:roku -> livingroom:subwoofer (7.7734%, 99.2526%)
livingroom:subwoofer -> livingroom:roku (7.7734%, 84.4784%)
livingroom:roku -> livingroom:wii (7.7382%, 98.8042%)
livingroom:wii -> livingroom:roku (7.7382%, 98.0712%)
livingroom:roku -> livingroom:subwoofer,livingroom:wii (7.6914%, 98.2063%)
livingroom:subwoofer -> livingroom:roku,livingroom:wii (7.6914%, 83.5878%)
livingroom:wii -> livingroom:roku,livingroom:subwoofer (7.6914%, 97.4777%)
livingroom:roku,livingroom:subwoofer -> livingroom:wii (7.6914%, 98.9458%)

•First 10 rules by confidence:
•Rule   (Support, Confidence)

bedroom:nitelight -> bedroom:noise (4.2028%, 99.7222%)
livingroom:roku,livingroom:wii -> livingroom:subwoofer (7.6914%, 99.3949%)
livingroom:roku -> livingroom:subwoofer (7.7734%, 99.2526%)
livingroom:wii -> livingroom:subwoofer (7.8085%, 98.9614%)
livingroom:roku,livingroom:subwoofer -> livingroom:wii (7.6914%, 98.9458%)
livingroom:roku -> livingroom:wii (7.7382%, 98.8042%)
bedroom:noise -> bedroom:nitelight (4.2028%, 98.6264%)
livingroom:subwoofer,livingroom:wii -> livingroom:roku (7.6914%, 98.5007%)
livingroom:roku -> livingroom:subwoofer,livingroom:wii (7.6914%, 98.2063%)
livingroom:wii -> livingroom:roku (7.7382%, 98.0712%)

For June

•First 10 rules by support:
•Rule   (Support, Confidence)
•master:desklamp -> master:nightstand2  (36.3531%, 97.5625%)
•master:nightstand2 -> master:desklamp  (36.3531%, 96.2096%)
•bedroom:nitelight -> bedroom:noise  (10.1188%, 97.4215%)
•bedroom:noise -> bedroom:nitelight  (10.1188%, 97.8604%)
•bedroom:lamp1 -> bedroom:noise  (9.1407%, 93.0095%)
•bedroom:noise -> bedroom:lamp1  (9.1407%, 88.4009%)
•bedroom:lamp1 -> bedroom:nitelight  (8.9544%, 91.1137%)
•bedroom:nitelight -> bedroom:lamp1  (8.9544%, 86.2108%)
•livingroom:subwoofer -> master:nightstand2  (5.7406%, 40.7775%)
•livingroom:subwoofer -> master:desklamp  (5.6707%, 40.2812%)

•First 10 rules by confidence:
•Rule   (Support, Confidence)
•livingroom:wii -> livingroom:subwoofer  (3.2371%, 99.2857%)
•basement:hrv,master:desklamp -> master:nightstand2  (4.285%, 98.9247%)
•bedroom:noise -> bedroom:nitelight  (10.1188%, 97.8604%)
•master:desklamp -> master:nightstand2  (36.3531%, 97.5625%)
•bedroom:nitelight -> bedroom:noise  (10.1188%, 97.4215%)
•master:nightstand2 -> master:desklamp  (36.3531%, 96.2096%)
•basement:hrv,master:nightstand2 -> master:desklamp  (4.285%, 95.8333%)
•bedroom:ac,bedroom:lamp1 -> bedroom:nitelight  (2.8412%, 94.5736%)
•bedroom:lamp1 -> bedroom:noise  (9.1407%, 93.0095%)
•bedroom:lamp1 -> bedroom:nitelight  (8.9544%, 91.1137%)
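
One quick consistency check on the tables above: because confidence is the joint support divided by the support of the antecedent, dividing a rule's support by its confidence recovers how often the antecedent itself appears. The sketch below does this for the May wii/subwoofer pair, using the percentages listed above; the function name is just for illustration.

```python
# (support %, confidence %) copied from the May rule tables above.
may = {
    ("wii", "subwoofer"): (7.8085, 98.9614),
    ("subwoofer", "wii"): (7.8085, 84.8601),
}

def antecedent_support(support_pct, confidence_pct):
    """Support of A implied by a rule A -> B, since confidence = S(A,B) / S(A)."""
    return 100.0 * support_pct / confidence_pct

wii_share = antecedent_support(*may[("wii", "subwoofer")])
sub_share = antecedent_support(*may[("subwoofer", "wii")])
```

The implied shares are roughly 7.9% for the wii and 9.2% for the subwoofer, which is why the same pair gives a higher confidence in one direction than the other: the subwoofer is on somewhat more often than the wii.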

## Observations

Some devices have a seasonality (e.g. AC, dehumidifier) and thus can become nearly baseload-like devices that probably have little sensitivity to the other activities in the home. Given the sparsity of the data, a low support threshold is necessary to have most of the devices considered. Confidence appears fairly high for many of the rules found, although this should be confirmed by examining how the synchronization of the data affects it.

This method only captures the co-occurrence of items, so many very obvious connections are easily found, such as the subwoofer being connected to the other entertainment devices in the living room. At least for the first dozen or so rules, nothing appears out of the ordinary: sensors in the same room are linked together more so than with sensors across different living spaces. The next step would be to adapt the algorithm or explore others that can look for sequential patterns as well.

• #### marioberges 12:18 pm on July 29, 2013

Thanks for the writeup. Although I think that this is a step in the right direction, it is far from being enough to produce a paper or a chapter of your proposal to show some preliminary results, unfortunately.

Here are a few questions that came to my mind when reading this:

(a) How is association rule learning different from learning the joint and/or conditional probabilities? How exactly are the rules learned?

(b) How do you compute the “occurrences” for creating the top 5 lists that you show?

(c) Why was it useful to split the analysis by months?

• #### leneveo 6:52 pm on July 29, 2013

Good points so I’ll try to clarify. To address your questions:

(a) In this basic form they are related — the confidence level is in fact conditional probability: Given A, what is the probability that B occurs? The support value is also the joint probability that the antecedent and consequent occur together. The rules discovered therefore can be ranked according to these two metrics, which is a nice simple way of making association rules practical to consider.

(b) This leads me to how events or items are determined to “occur” together. Association rule mining treats the problem as a binary problem, so the data was transformed to binary values. In this data I expected some of the devices to have a background consumption, so I subtracted the median of a set of the lowest non-zero values per device, then rounded it to the nearest integer, but only if this value was less than 10% of the max value.

This appeared to avoid false positives, at least well enough for this little experiment. One issue is that the sensors don’t appear to report values at the same regular rate, so hopefully synchronization did not skew the values too much.
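
A minimal sketch of one way to read that thresholding step is below. The number of lowest values used for the median (`n_lowest`) and the example trace are assumptions for illustration, since they are not specified above.

```python
from statistics import median

def binarize(power, n_lowest=5, max_fraction=0.10):
    """Turn a power trace (watts) into 0/1 on-off flags.

    The baseline is the median of the n_lowest smallest non-zero readings and
    is subtracted only when it is below max_fraction of the peak reading.
    n_lowest is an assumed parameter.
    """
    nonzero = sorted(p for p in power if p > 0)
    if not nonzero:
        return [0] * len(power)
    baseline = median(nonzero[:n_lowest])
    if baseline >= max_fraction * max(power):
        baseline = 0.0
    # A reading counts as "on" if it still rounds to a positive integer
    # after the baseline is removed.
    return [1 if round(p - baseline) > 0 else 0 for p in power]

# Invented trace: about 1 W of background draw with one real on-event.
trace = [0.0, 1.2, 1.1, 1.3, 60.0, 62.5, 1.2, 0.0]
flags = binarize(trace)
```

On this trace only the two readings near 60 W are flagged as on; the small background draw is treated as off.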

(c) Mostly, I wanted to see how consistent the rules found were across the board. If the timeframe for data collection in the real world was limited, which is often the case, exploring the range of rules is probably one way of testing how useful this technique may be. So that is one downside: It’s impossible to find rules if certain rare events just don’t happen (then again, we might not care about rare events). I could have run it by the entire set, but splitting it (albeit arbitrarily) helped me remember that I should incorporate weather data as well.

• #### marioberges 7:06 pm on July 30, 2013

Thanks.

Regarding (a), it would be good for you to describe the process of estimating the parameters for those probability distributions.

Regarding (b), I still don’t quite understand how you determine occurrences. Do you convert the power time series p[t] into a binary time series p_b[t] where p_b[t] in [0, 1] forall t?

## Relating DR to Occupancy

How can we begin to understand individual occupant behavior in a building? In what context is modeling individual behavior needed, and when is aggregated behavior more practical? The literature suggests that in households, modeling an individual’s activity stream can capture the logical structure of energy behavior. Averaged schedules and consumption behaviors are insufficient and may not incorporate covariant end uses. In the case of low-energy or passive buildings especially, occupants completely drive the internal gains and hence energy demands. Simulations of daily behavior (extendable to multiple occupants) can then replicate daily household energy consumption.

For defining a DR potential model, Hae Young and I talked about how to distinguish “flexible” from “inflexible” behaviors, and how the data we can currently obtain might reflect that. Because “flexibility” is more of a spectrum and can be altered by the price, the relationships are not necessarily a hierarchy. Rather, I think putting it in this chart can help organize how we can think about the ends of the spectrum, understand sensor data, and interpret some meaning.

To explain the chart, the energy consumption data collected from every measured load could show a range of activity, from strong, fixed patterns to appearing completely random in time and duration. Splitting these into two extreme categories, we can look at each one’s relationship to occupancy information. How strongly the energy consumption correlates with the occupancy level or status could mean many things about the flexibility of that load. Correlation could mean that active occupancy is linked to the activity, or the response could be delayed, where an activity or action does not consume energy until some time later or persists after the occupant(s) have left, such as running a laundry machine. Therefore, different ways of correlating activity are necessary.

In the first category, fixed consumption patterns could be a result of either a strong preference to use that device or appliance at some fixed time or because the load is really a baseload that is essentially inflexible. Of course, if such a device or appliance were inefficient, modifying its operation would be possible and would be a low-cost source of DR potential. In a perfect system, these inefficiencies may be assumed to have already been removed. In reality, though, there are many barriers that may make temporarily reducing inefficiencies more viable than permanent reductions. Therefore, controlling loads with fixed consumption patterns would appear to translate into a function of cost. In the correlated case, that cost would be related to the user’s cost of comfort, whereas in the other case, the cost is tied to long-term, user-aggregated, or overall building costs. An example would be the cycling of the refrigerator, which has potential long-term maintenance costs and distributed impacts on all who share its usage.

In the second category, seemingly random consumption of an appliance or device can similarly be divided by its relationship to occupancy. Correlation with occupant behavior indicates that it depends on user demand (e.g. lighting), and hence its flexibility will be price-driven. The discomfort cost is higher for some than others and will depend on the value of the activity itself. However, in the uncorrelated case, the usage of the device or appliance may not matter at all to the occupant, and hence is likely to have a low impact on comfort and should be highly flexible. To be careful, though, the costs may not be linear: although doing laundry in 1 hour vs 5 hours makes little difference, not having it done by a certain time the next day may in fact incur high discomfort costs. Capturing and predicting such preferences will be a challenge given that current appliances generally do not offer a way to express them. The main challenge in assessing DR potential in the uncorrelated case is predicting when such random activities occur so that they can be shifted.

As a next task, can experimental data fit such a representation?  Would such a representation be rational and useful down the road?

## Another article about writing

Today I read an old article that a friend had sent me on technical writing. Given all the journal writing ahead, I thought it was extremely relevant. Like the previous links we had gotten about the necessity of practicing writing, I think my writing muscle has indeed gotten lazy and atrophied. It’s not as if I hadn’t previously cranked out reports and such recently, but I definitely did not go over them with as much intricate thought and detail as I might have attempted to in the past. Anyway, thought it would be useful to share.

• #### Mario Bergés 1:52 pm on September 18, 2012

Looks like a very useful article. I’ll add it to my to-read list.

## Introduction to Distribution Planning Costing

I think we often take for granted the work that occurs on the ground when it comes to electricity service. Ever wonder who pays for the crew to come out to build new lines, hook up new connections, or fidget with the substations? We aren’t sent itemized bills with the cost of trimming the two trees that blocked the way or the long line that had to be pulled from the street to get to the house with the large setback. How does this massive infrastructure get funded despite all the situational complexity?

It is no surprise that marginal T&D capacity costs are area-specific and vary greatly. A costing methodology study by Energy & Environmental Economics, Inc. and Pacific Energy Associates looked at four utilities and obtained estimates from $73 to $556 per kW, and these are just averages of much larger ranges for each utility. An important reason is that costs are driven by peak loads in areas that need congestion relief. In other words, T&D costs are essentially price-driven since they translate into the rates charged for energy that generate income.

Due to the peaky nature of loads, utilities must be sensitive to location, time of year, and time of day when evaluating planning decisions. Demand response is ideally suited as a potential solution, given its similar dependencies. The difficult task is finding the proper match between demand and potential resource that will provide net benefits for the utility, resource owner, and any other involved parties. On the utility side, current practice for T&D planning generally involves (1) identifying problem areas, (2) developing and evaluating potential solutions, and (3) allocating budget for the best projects. For a demand response (DR) program vs. a traditional technical solution (e.g. capital budgeting for a new line), understanding and possibly modifying the methodology for evaluating alternatives should be considered.

There are generally five cost tests that can be applied to utility projects. The Utility Cost Test (UCT) differs from the Rate Impact Measure (RIM) in that it measures as costs all the expenses that affect the ratebase, rather than the impact on the rate itself. The rate is the ratebase divided by sales, so projects that affect sales will produce different results between the two tests. The Total Resource Cost (TRC) considers the cost to the utility and customer as a whole, so transfers of cost between them aren’t recognized. The Societal Cost Test (SCT) extends the TRC to consider externalities. Finally, the participant alone can be considered, who would see the incentives, revenue, and net costs of whatever program is being subscribed to.
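
To see why the UCT and RIM can disagree, here is a small sketch using the simplified definition above (rate = ratebase / sales). All of the figures are hypothetical, and the real tests compare present values of costs and benefits over a program's life rather than a single-year snapshot.

```python
def average_rate(ratebase, sales_kwh):
    """Rate as defined above: ratebase divided by sales."""
    return ratebase / sales_kwh

# Hypothetical utility and project figures, for illustration only.
ratebase = 100_000_000.0     # dollars
sales = 1_000_000_000.0      # kWh

program_cost = 1_000_000.0   # dollars added to the ratebase
avoided_cost = 3_000_000.0   # dollars removed from the ratebase
lost_sales = 50_000_000.0    # kWh no longer sold because of the program

new_ratebase = ratebase + program_cost - avoided_cost
new_sales = sales - lost_sales

# UCT-style view: did the expenses carried in the ratebase go down?
ratebase_change = new_ratebase - ratebase

# RIM-style view: did the rate paid per kWh go down?
old_rate = average_rate(ratebase, sales)
new_rate = average_rate(new_ratebase, new_sales)
rate_change = new_rate - old_rate
```

In this example the ratebase falls by $2 million, yet the rate rises because the remaining costs are spread over fewer kWh, so the same project looks good under one view and bad under the other.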

The elements that go into cost evaluations can be difficult to quantify. Environmental effects are usually monetized into cost per pound of air emissions or per kWh of energy. Power quality is translated into outage costs, repair costs, and/or value if quality-differentiated rates exist. Power reliability is usually treated as a constraint, so meeting reliability index criteria becomes important. Even more difficult to quantify are risk management (i.e. low-frequency, high-impact events, probability estimation, cost estimation) and option/strategic value, which comes from the flexibility to respond to a situation with limited information. Advanced methods that can be used include dynamic programming, game theory, contingent claims analysis, financial derivatives, and decision analysis.

The study also lists these as major cost drivers: location, load growth, load shape, equipment characteristics, operational details, financial parameters, synergies, environmental considerations, PQR, uncertainty, and intangibles (e.g. public relations, learning experience). They are not mutually exclusive of one another and can share some dependencies.

In terms of ranking and selecting projects, criteria and feasibility constraints must be met. Criteria refer to the usual financial measures of PV of cost, NPV, levelized cost, IRR, payback time, and benefit-cost ratio, as well as engineering standards, incremental measures of those technical aspects, and utility function. Feasibility refers to constraints placed by technology, budget, regulation, social and political factors, and the participants. Evaluating these with respect to an individual project, a simple portfolio, an interdependent portfolio, or, even better, a dynamically programmed plan provides valuable decision support. Still, senior management decision-making based simply on experience and judgement is common.
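
A few of the financial measures named above can be sketched in a handful of lines. The cash flows and discount rate below are made up, and the payback calculation is the simple undiscounted version.

```python
def present_value(cash_flows, rate):
    """Discount a list of yearly amounts; index 0 is today (undiscounted)."""
    return sum(cf / (1.0 + rate) ** year for year, cf in enumerate(cash_flows))

def payback_year(costs, benefits):
    """First year in which cumulative benefits cover cumulative costs, else None."""
    net = 0.0
    for year, (c, b) in enumerate(zip(costs, benefits)):
        net += b - c
        if net >= 0:
            return year
    return None

# Hypothetical project: all numbers are invented.
costs = [1000.0, 0.0, 0.0, 0.0]
benefits = [0.0, 400.0, 400.0, 400.0]
rate = 0.08

pv_cost = present_value(costs, rate)
pv_benefit = present_value(benefits, rate)
npv = pv_benefit - pv_cost
bc_ratio = pv_benefit / pv_cost
payback = payback_year(costs, benefits)
```

For this toy project the NPV is slightly positive and the benefit-cost ratio is just above 1, while simple payback is reached in year 3. Different criteria can rank the same set of projects differently, which is part of why the choice of criteria matters.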

Costs should be allocated by location and time, and methods exist primarily for measuring area- and time-specific marginal costs (ATSMC). Area-specific analysis is applicable for expansion plans, where budget is allocated based on engineering-defined boundaries, if possible. If not, zones are used to differentiate approximate costs. Facility sharing can be dealt with based on how load is allocated, for which there are several indices used in the literature, but this requires hourly load data and is thus historical and not necessarily predictive. As for time dependency, attributing costs to “peak block” shares and applying allocation factors are the methods used. A peak period or block can be the top XX hours or all hours above a threshold. Allocation factors are more advanced and divide costs into each hour of the year. Two key examples are the loss-of-load-probability and peak capacity above a threshold level.
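
The two time-allocation ideas can be sketched as follows: picking a peak block as the top N hours, and building allocation factors proportional to load above a threshold. The loads, threshold, and cost figure are hypothetical, and proportional-to-excess is only one plausible reading of "peak capacity above a threshold level".

```python
def peak_block_hours(hourly_load, top_n):
    """Indices of the top_n highest-load hours (a 'top XX hours' block)."""
    ranked = sorted(range(len(hourly_load)),
                    key=lambda h: hourly_load[h], reverse=True)
    return sorted(ranked[:top_n])

def allocation_factors(hourly_load, threshold):
    """Per-hour share of cost, proportional to load above the threshold."""
    excess = [max(0.0, load - threshold) for load in hourly_load]
    total = sum(excess)
    if total == 0:
        return [0.0] * len(hourly_load)
    return [e / total for e in excess]

# Hypothetical hourly loads in MW for an 8-hour window.
load = [40.0, 42.0, 55.0, 70.0, 80.0, 65.0, 50.0, 41.0]

block = peak_block_hours(load, top_n=3)
factors = allocation_factors(load, threshold=60.0)

capacity_cost = 120_000.0  # dollars, invented
hourly_cost = [f * capacity_cost for f in factors]
```

Here the peak block is hours 3 to 5, and the hour with the largest excess over the threshold carries the largest share of the capacity cost.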

Finally, the study recognizes that costing is not able to address some related issues if the aim is to incorporate DR. For example, some DR alternatives need longer lead times to implement. Shortening the review process for evaluating programs would thus enable more alternatives. Load forecasting affects some aspects of costing, but because methods differ there can be biases that planners may need to be aware of. Also, the growing awareness of DG has resulted in creative proposals from customers and customer-utility partnerships that offer risks and benefits that may not be easy to incorporate into the costing. Ensuring that DR alternatives are actually clean is another issue. For public policy, the goal should be to incentivize DR by internalizing differences between stakeholders so that costing can effect socially desirable outcomes. Costing in itself is subject to public policy and thus only evaluates utility projects on existing guidelines.

The common costing framework and methodology of where costs are derived, allocated, and evaluated is an interesting process that varies from utility to utility. Depending on existing practices, there are many aspects where improvements can be made, or challenges to be found with the expanding array of alternatives for distribution planning.

Sources:
[1] The Energy Foundation. Prepared by Energy & Environmental Economics, Inc. and Pacific Energy Associates. “Costing Methodology for Electric Distribution System Planning.” Nov. 9 2000.

[2] Chernick, Paul and Patrick Mehr. “Electricity Distribution Costs: Comparisons of Urban and Suburban Areas.” Lexington Electric Utility Committee. Oct. 28 2003.

[3] Filippini, Massimo and Jörg Wild. “Regional Differences in Electricity Distribution Costs and their Consequences for Yardstick Regulation of Access Prices.” May 2000.

## The Value of Residential Power Quality

For industrial and larger commercial customers, it is easier to measure why power quality is important. Machine downtime directly affects productivity, increased maintenance and replacement impacts the bottom line, and small inefficiencies multiply into huge wastes of energy at the meter. One of the selling points of demand response is improved power quality, which benefits both the customer and the utility. LaCommare and Eto from LBNL estimated the cost of power quality disturbances was $79 billion in 2004, with 67% due to momentary interruptions and 33% due to sustained interruptions. Of the $79 billion, nearly three-quarters was attributed to commercial customers and a quarter to industrial customers. This leaves very little to the residential sector, whose costs made up 2% of the total, or about $2 billion. Nearly a decade of grid R&D has since passed, but has the impact of power quality on the residential customer changed significantly? With a greater focus on household consumption, companies and utilities are looking for ways to increase acceptance of new technologies and programs. Understanding the value of power quality may provide impetus to customer adoption, or highlight continuing challenges.

The LBNL report recognizes the difficulty of quantifying residential costs and does its best to attribute costs to not just physical goods and appliances but to the experience of a disruption in service. The report states that: “…the other “costs” borne by residential customers are experiential in nature, such as resetting clocks, changing plans, and coping with inconvenience, fear, anxiety, etc. Analytical techniques to estimate these costs typically involve contingent valuation, which includes so-called “willingness to pay” and “willingness to accept” approaches as a means of addressing experiential costs in deriving outage costs for residential customers. The findings developed through application of contingent valuation methods have been controversial due to concerns regarding bias in the responses provided by customers to the hypothetical nature of situations they must rely on.”

The report also attempted to integrate multiple utility studies using a Tobit regression model to form cost of interruption functions, known as customer damage functions. These functions represent outage costs based on “outage duration, season, time of day, annual electricity use, and depending on the customer class, household income or number of employees.” Surveys can ask for a customer’s willingness-to-pay to avoid a certain outage scenario, rank scenarios/payment options, or estimate costs for an itemized list of mitigating actions. Other attempts to quantify residential costs generally involve surveys, but these can be plagued by non-response and by customers knowing too little about the electric system to provide adequate responses.

It is recognized that outage and power quality costs are non-homogenous. The distribution of these costs across different customer categories, times, service interruption types, and other characteristics is still an active field of study. Another LBNL study uses a newer set of utility survey studies to create a two-step model based on GLM rather than loglinear regression. For comparison, they demonstrate that the earlier Tobit model underestimates costs dramatically and that a Heckman two-step model underestimates C&I costs and overestimates residential costs. Nonetheless, the lack of consistent and relevant data limits the conclusions such mega-studies can draw.

The issue of contention focuses on how to treat the multitude of zero-valued responses. Regardless of the model, the data continues to show that residential customers often do not place a cost or WTP on many categories of reductions in power quality or service. This acts to reduce the economic justification for improving reliability standards. A 2011 customer satisfaction survey showed that residential electric utility customers were more satisfied in every category examined except power quality, reliability and price. These declined by less than 10 points on a 1,000-point scale. At what point do satisfaction levels translate into a real cost? On the flip side, the findings suggest that there is a buffer in case disruption events occur more frequently under some circumstances. Thus, from an economic point of view there is a non-technological resiliency in the system that already exists. It would be interesting to see if this could be leveraged for promoting grid development.

Sources:

[1] LaCommare, Kristina Hamachi and Joseph H. Eto, “Understanding the Cost of Power Interruptions to U.S. Electricity Consumers,” Ernest Orlando Lawrence Berkeley National Laboratory, Sep 2004.

[2] Scarpa, Ricardo and Anna Alberini, Applications of simulation methods in environmental and resource economics, 2005.

[3] Sullivan, Michael J., Ph.D., Matthew Mercurio, Ph.D., Josh Schellenberg, M.A, Freeman, Sullivan & Co. “Estimated Value of Service Reliability for Electric Utility Customers in the United States,” Jun 2009.

[4] J.D. Power and Associates, “2011 Electric Utility Residential Customer Satisfaction Study,” Press Release, 13 July 2011. http://www.jdpower.com/news/pressRelease.aspx?ID=2011101
