A Methodology for Two-Level Product Partition Model Estimation of Normal Means

Friday, April 28, 2017
Paul Diver 4:00 Clark 102

Abstract: In many instances, a collection of items can be thought of as having two levels: an individual-level in which each item is uniquely identified and a group-level defined by a set of known group membership labels. A two-level mean estimation and clustering problem using probability models is addressed where each item has an independent observation following a normal distribution with an item specific unknown mean and constant variance. The two-level structure allows the individual-level item observations to be aggregated and averaged by group membership index. This implies that each group index has an associated mean equal to the average of its members’ means. This two-level setting is studied adapting probability models which allow means to be equal at both the individual and group-levels. The possibility of equal means at the group-level implies a two-level mean condition which restricts the possible values of the means to be estimated and thus is necessarily incorporated into the model. These probability models, called product partition models, provide a logical and flexible framework to the problem and permit the use of popular computational tools. Given a set of items, a partition is an arrangement of these items into a collection of non-empty, non-overlapping subsets. Items in the same set have observations with the same mean. Similarly, the group indices may also be partitioned into group-level sets. Random partitions at the group and individual levels jointly possess a probability distribution which is updated through the information contained in the data. This posterior distribution allows for the estimation of the unknown means. Markov sampling adapted for the two-level setting assists in providing two estimates: a two-level product estimate computed via a weighted average of posterior means summed over all possible pairs of group and individual-level partitions, and an estimate via the posterior mode, the maximum a posteriori clustering of the data.
The posterior mode provides a clustering structure at both the individual and group-levels, thereby automatically selecting the number of clusters at both levels and avoiding the need to preset it, as other methods require. This dissertation extends the insightful work of Crowley (1997) to this two-level setting with the two-level mean condition. Her important work focuses on estimating the means of normally distributed observations with known variance equal to 1 in the one-level setting, where no group-level index information is utilized, whether or not it is known. The incorporation of this two-level mean condition provides not only logically consistent analysis regarding the means across levels, but also superior mean estimation results, including when applied to the Major League Baseball batting average data studied in Crowley (1997) and elsewhere.
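
The two-level mean condition described in the abstract can be illustrated with a small sketch. This is not the dissertation's estimation procedure; it only shows, under assumed toy inputs, how an individual-level partition assigns means to items, how those means are averaged by group label, and how one might check that groups placed in the same group-level set share a mean. All function names and data values here are hypothetical.

```python
from collections import defaultdict


def means_from_partition(partition, block_means, n_items):
    """Expand an individual-level partition into one mean per item.

    Items in the same block share the same mean (hypothetical helper).
    """
    means = [None] * n_items
    for block, value in zip(partition, block_means):
        for i in block:
            means[i] = value
    return means


def group_means(item_means, group_labels):
    """Average the item-level means within each group label."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for mean, label in zip(item_means, group_labels):
        totals[label] += mean
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


def satisfies_two_level_condition(item_means, group_labels,
                                  group_partition, tol=1e-12):
    """Check that groups in the same group-level set have equal means."""
    gm = group_means(item_means, group_labels)
    for block in group_partition:
        values = [gm[g] for g in block]
        if max(values) - min(values) > tol:
            return False
    return True


# Toy example (made-up values): four items in two groups.
labels = ["A", "A", "B", "B"]

# Items 0 and 2 share one mean, items 1 and 3 share another, so both
# groups average to the same value and may sit in one group-level set.
mixed = means_from_partition([[0, 2], [1, 3]], [0.2, 0.4], 4)
ok_together = satisfies_two_level_condition(mixed, labels, [["A", "B"]])

# Here the individual-level partition coincides with the groups, so the
# group means differ and the groups cannot share a group-level set.
split = means_from_partition([[0, 1], [2, 3]], [0.2, 0.4], 4)
bad_together = satisfies_two_level_condition(split, labels, [["A", "B"]])
ok_apart = satisfies_two_level_condition(split, labels, [["A"], ["B"]])
```

In this sketch a pair of individual-level and group-level partitions is admissible only when the check passes, which is the sense in which the condition restricts the possible values of the means.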