Title: | Nonparametrics with Clustered Binary and Multinomial Data |
Description: | Implements non-parametric analyses for clustered binary and multinomial data. The elements of each cluster are assumed exchangeable, and the joint distribution is assumed to be identical for clusters of different sizes (marginal compatibility, also known as reproducibility). A trend test based on stochastic ordering is implemented. Szabo A, George EO (2010) <doi:10.1093/biomet/asp077>; George EO, Cheon K, Yuan Y, Szabo A (2016) <doi:10.1093/biomet/asw009>. |
Authors: | Aniko Szabo [aut, cre, cph] |
Maintainer: | Aniko Szabo <[email protected]> |
License: | GPL (>=2) |
Version: | 1.6.2 |
Built: | 2025-02-26 05:01:02 UTC |
Source: | https://github.com/anikoszabo/corrbin |
This package implements nonparametric methods for analyzing exchangeable binary and multinomial data with variable cluster sizes, with an emphasis on trend testing. The input should specify the treatment group, cluster size, and the number of responses (i.e., the number of cluster elements with the outcome of interest) for each cluster.
The CBData/CMData and read.CBData/read.CMData functions create a ‘CBData’ or ‘CMData’ object used by the analysis functions. ran.CBData and ran.CMData can be used to generate random binary or multinomial data using a variety of distributions. mc.test.chisq tests the assumption of marginal compatibility underlying all the methods, while mc.est estimates the distribution of the number of responses under marginal compatibility. Finally, trend.test performs three different tests for trend along the treatment groups for binomial data.
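As a quick orientation, the sketch below strings these steps together on the shelltox data set shipped with the package; it only uses the functions listed above with their default settings (the package is assumed to load as CorrBin).

library(CorrBin)                # assumed package name
data(shelltox)
sh <- CBData(shelltox, trt = "Trt", clustersize = "ClusterSize", nresp = "NResp")
mc.test.chisq(sh)               # check the marginal compatibility assumption
est <- mc.est(sh)               # estimate the response distribution under marginal compatibility
trend.test(sh, test = "RS")     # Rao-Scott adjusted trend test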
Aniko Szabo
Maintainer: Aniko Szabo <[email protected]>
Szabo A, George EO. (2010) On the Use of Stochastic Ordering to Test for Trend with Clustered Binary Data. Biometrika 97(1), 95-108.
Stefanescu, C. & Turnbull, B. W. (2003) Likelihood inference for exchangeable binary data with varying cluster sizes. Biometrics, 59, 18-24
Pang, Z. & Kuk, A. (2007) Test of marginal compatibility and smoothing methods for exchangeable binary data with unequal cluster sizes. Biometrics, 63, 218-227
The CBData function creates an object of class CBData that is used in further analyses. It identifies the variables that define treatment group, cluster size, and the number of responses.
CBData(x, trt, clustersize, nresp, freq = NULL)
x |
a data frame with one row representing a cluster or potentially a set of clusters of the same size and number of responses |
trt |
the name of the variable that defines treatment group |
clustersize |
the name of the variable that defines cluster size |
nresp |
the name of the variable that defines the number of responses in the cluster |
freq |
the name of the variable that defines the number of clusters represented by the data row. If NULL, each row is assumed to represent one cluster. |
A data frame with each row representing all the clusters with the same trt/size/number of responses, and standardized variable names:
Trt |
factor, the treatment group |
ClusterSize |
numeric, the cluster size |
NResp |
numeric, the number of responses |
Freq |
numeric, number of clusters with the same values |
Aniko Szabo
read.CBData for creating a CBData object directly from a file.
data(shelltox)
sh <- CBData(shelltox, trt="Trt", clustersize="ClusterSize", nresp="NResp")
str(sh)
The CMData function creates an object of class CMData that is used in further analyses. It identifies the variables that define treatment group, cluster size, and the number of responses for each outcome type.
CMData(x, trt, nresp, clustersize = NULL, freq = NULL)
x |
a data frame with one row representing a cluster or potentially a set of clusters of the same size and number of responses for each outcome |
trt |
the name of the variable that defines treatment group |
nresp |
either a character vector with the names or a numeric vector with the indices of the variables that define the number of responses in the cluster for each outcome type. If clustersize is NULL, the counts should cover all outcome types, so that the cluster size can be computed as their sum. |
clustersize |
the name or index of the variable that defines cluster size, or NULL. If NULL, the cluster size is computed as the sum of the response counts given in nresp. |
freq |
the name or index of the variable that defines the number of clusters represented by the data row. If NULL, each row is assumed to represent one cluster. |
A data frame with each row representing all the clusters with the same trt/size/number of responses, and standardized variable names:
Trt |
factor, the treatment group |
ClusterSize |
numeric, the cluster size |
NResp.1--NResp.K+1 |
numeric, the number of responses for each of the K+1 outcome types |
Freq |
numeric, number of clusters with the same values |
Aniko Szabo
read.CMData for creating a CMData object directly from a file.
data(dehp)
dehp <- CMData(dehp, trt="Trt", nresp=c("NResp.1","NResp.2","NResp.3"))
str(dehp)
This data set is based on a National Toxicology Program study on diethylhexyl phthalate, DEHP. Pregnant CD-1 mice were randomly assigned to receive 0, 250, 500, 1000, or 1500 ppm of DEHP in their feed during gestational days 0-17. The uterine contents of the mice were examined for toxicity endpoints prior to normal delivery. The possible outcomes are 1) malformation, 2) death or resorption, 3) no adverse event.
data(dehp)
A 'CMData' object, that is a data frame with the following variables
Trt | factor giving treatment group |
ClusterSize | the size of the litter |
NResp.1 | the number of fetuses with a type 1 outcome (malformation) |
NResp.2 | the number of fetuses with a type 2 outcome (death or resorption) |
NResp.3 | the number of fetuses with a type 3 outcome (normal) |
Freq | the number of litters with the given ClusterSize/NResp.1-NResp.3 combination |
National Toxicology Program, NTP Study TER84064.
Tyl, R. W., Price, C. J., Marr, M. C., and Kimmel, C. A. (1988). Developmental toxicity evaluation of dietary di(2-ethylhexyl)phthalate in Fischer 344 rats and CD-1 mice. Fundamental and Applied Toxicology 10, 395-412.
data(dehp)
library(lattice)
pl <- xyplot(NResp.1/ClusterSize + NResp.2/ClusterSize + NResp.3/ClusterSize ~ Trt,
             data=dehp, outer=TRUE, type=c("p","a"), jitter.x=TRUE)
pl$condlevels[[1]] <- c("Malformed", "Dead", "Normal")
print(pl)
The data set is based on a developmental toxicity experiment on the effect of ethylene glycol diethyl ether (EGDE) on fetal development of New Zealand white rabbits. In the study, four groups of pregnant does were randomly assigned to dose levels $0, 25, 50$, and $100$ milligrams per kilogram body weight of EGDE. For each litter and at each dose level, the adverse response used is the combined number of fetal malformation and fetal death.
data(egde)
A 'CBData' object, that is a data frame with the following variables
Trt | factor giving treatment group |
ClusterSize | the size of the litter |
NResp | the number of affected fetuses |
Freq | the number of litters with the given ClusterSize/NResp combination |
Krewski, D., Zhu, Y., and Fung, K. (1995). Statistical analysis of overdispersed multinomial data from developmental toxicity studies. In Statistics in Toxicology, Ed. B. Morgan, pp. 151-179. New York: Oxford University Press.
data(egde)
stripchart(I(NResp/ClusterSize)~Trt, cex=sqrt(egde$Freq), data=egde, pch=1,
           method="jitter", vertical=TRUE, ylab="Proportion affected")
The extracting syntax works as for [.data.frame, and in general the returned object is not a CBData or CMData object. However, if the columns are not modified, then the result is still a CBData or CMData object with the appropriate attributes preserved, and the unused levels of treatment groups dropped.
## S3 method for class 'CBData'
x[i, j, drop]

## S3 method for class 'CMData'
x[i, j, drop]
x |
a CBData or CMData object |
i |
numeric, row index of extracted values |
j |
numeric, column index of extracted values |
drop |
logical. If TRUE the result is coerced to the lowest possible dimension. The default is the same as for [.data.frame. |
a CBData or CMData object
Aniko Szabo
CBData, CMData
data(shelltox)
str(shelltox[1:5,])
str(shelltox[1:5, 2:4])

data(dehp)
str(dehp[1:5,])
str(dehp[1:5, 2:4])
GEE.trend.test implements a GEE-based test for linear increasing trend for correlated binary data.
GEE.trend.test(cbdata, scale.method = c("fixed", "trend", "all"))
cbdata |
a CBData object |
scale.method |
character string specifying the assumption about the change in the overdispersion (scale) parameter across the treatment groups: "fixed" - constant scale parameter (default); "trend" - linear trend for the log of the scale parameter; "all" - separate scale parameter for each group. |
The actual work is performed by the geese function of the geepack package, which is required for this feature to work. This function only provides a convenient wrapper to obtain the results in the same format as RS.trend.test and SO.trend.test.
The implementation aims for testing for increasing trend, and a one-sided p-value is reported. The test statistic is asymptotically normally distributed, and a two-sided p-value can be easily computed if needed.
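For example, treating the reported statistic as (asymptotically) standard normal, a two-sided p-value could be computed along these lines; this is only a sketch, and the statistic and p.val components are the ones documented below.

data(shelltox)
if (require(geepack)) {
  res <- GEE.trend.test(shelltox, scale.method = "fixed")
  # two-sided p-value from the asymptotically normal statistic (assumed standardized)
  c(one.sided = res$p.val, two.sided = 2 * pnorm(-abs(res$statistic)))
}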
A list with components
statistic |
numeric, the value of the test statistic |
p.val |
numeric, asymptotic one-sided p-value of the test |
Aniko Szabo, [email protected]
RS.trend.test, SO.trend.test for alternative tests; CBData for constructing a CBData object.
data(shelltox)
if (require(geepack)){
  GEE.trend.test(shelltox, "trend")
}
An exchangeable multinomial distribution with $K+1$ categories $O_1,\ldots,O_{K+1}$ can be parameterized by the joint probabilities of events
$$\tau_{r_1,\ldots,r_K \mid n} = P[X_1=\cdots=X_{r_1}=O_1,\ \ldots,\ X_{r_1+\cdots+r_{K-1}+1}=\cdots=X_{r_1+\cdots+r_K}=O_K],$$
where $r_i \geq 0$ and $r_1+\cdots+r_K \leq n$. The jointprobs function estimates these probabilities under various settings. Note that when some of the $r_i$'s equal zero, no restriction is imposed on the number of outcomes of the corresponding type, so the resulting probabilities are marginal.
jointprobs(cmdata, type = c("averaged", "cluster", "mc"))
cmdata |
a CMData object |
type |
character string describing the desired type of estimate: "averaged" - averaged over the cluster sizes within each treatment group; "cluster" - estimated separately for each cluster size; "mc" - estimated under the assumption of marginal compatibility. |
a list with an array of estimates for each treatment. For a multinomial distribution with $K+1$ categories the arrays will have either $K+1$ or $K$ dimensions, depending on whether cluster-size specific estimates (type="cluster") or pooled estimates (type="averaged" or type="mc") are requested. For the cluster-size specific estimates the first dimension is the cluster size. Each additional dimension is a possible outcome.
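The dimension convention can be verified directly on the dehp data (three outcome categories, so K = 2); a short sketch:

data(dehp)
tau.ave <- jointprobs(dehp, type = "averaged")
tau.cs  <- jointprobs(dehp, type = "cluster")
dim(tau.ave[[1]])   # K = 2 dimensions for pooled estimates
dim(tau.cs[[1]])    # K + 1 = 3 dimensions; the first one is the cluster size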
mc.est for estimating the distribution under marginal compatibility, uniprobs and multi.corr for extracting the univariate marginal event probabilities and the within-multinomial correlations from the joint probabilities.
data(dehp)
# averaged over cluster-sizes
tau.ave <- jointprobs(dehp, type="ave")
# averaged P(X1=X2=O1, X3=O2) in the 1500 dose group
tau.ave[["1500"]]["2","1"]  # there are two type-1, and one type-2 outcome

# plot P(X1=O1) - the marginal probability of a type-1 event over cluster-sizes
tau <- jointprobs(dehp, type="cluster")
ests <- as.data.frame(lapply(tau, function(x)x[,"1","0"]))
matplot(ests, type="b")
The mc.est function estimates the distribution of the number of responses in a cluster under the assumption of marginal compatibility: information from all cluster sizes is pooled. The estimation is performed independently for each treatment group.
## S3 method for class 'CMData'
mc.est(object, eps = 1e-06, ...)

## S3 method for class 'CBData'
mc.est(object, ...)

mc.est(object, ...)
object |
a CBData or CMData object |
eps |
numeric; the EM iterations proceed until the sum of squared changes falls below eps |
... |
other potential arguments; not currently used |
The EM algorithm given by Stefanescu and Turnbull (2003) is used for the binary data.
For CMData: A data frame giving the estimated pdf for each treatment and cluster size. The probabilities add up to 1 for each Trt/ClusterSize combination. It has the following columns:
Prob |
numeric, the probability of observing the given combination of response counts in a cluster of size ClusterSize |
Trt |
factor, the treatment group |
ClusterSize |
numeric, the cluster size |
NResp.1 - NResp.K |
numeric, the number of responses of each type |
For CBData: A data frame giving the estimated pdf for each treatment and cluster size. The probabilities add up to 1 for each Trt/ClusterSize combination. It has the following columns:
Prob |
numeric, the probability of NResp responses in a cluster of size ClusterSize |
Trt |
factor, the treatment group |
ClusterSize |
numeric, the cluster size |
NResp |
numeric, the number of responses |
For multinomial data, the implementation is currently written in R, so it is not very fast.
Aniko Szabo
George EO, Cheon K, Yuan Y, Szabo A (2016) On Exchangeable Multinomial Distributions. Biometrika 103(2), 397-408.
Stefanescu, C. & Turnbull, B. W. (2003) Likelihood inference for exchangeable binary data with varying cluster sizes. Biometrics, 59, 18-24
data(dehp)
dehp.mc <- mc.est(subset(dehp, Trt=="0"))
subset(dehp.mc, ClusterSize==2)

data(shelltox)
sh.mc <- mc.est(shelltox)

if (require(lattice)){
  xyplot(Prob~NResp|factor(ClusterSize), groups=Trt, data=sh.mc, subset=ClusterSize>0,
         type="l", as.table=TRUE, auto.key=list(columns=4, lines=TRUE, points=FALSE),
         xlab="Number of responses", ylab="Probability P(R=r|N=n)")
}
mc.test.chisq tests whether the assumption of marginal compatibility is violated in the data.
## S3 method for class 'CMData'
mc.test.chisq(object, ...)

## S3 method for class 'CBData'
mc.test.chisq(object, ...)

mc.test.chisq(object, ...)
object |
a CBData or CMData object |
... |
other potential arguments; not currently used |
The assumption of marginal compatibility (also known as reproducibility or interpretability) implies that the marginal probability of response does not depend on the cluster size. Stefanescu and Turnbull (2003) and Pang and Kuk (2007) developed a Cochran-Armitage type test for trend in the marginal probability of success as a function of the cluster size. mc.test.chisq implements a generalization of that test, extending it to multiple treatment groups.
A list with the following components:
overall.chi |
the test statistic; sum of the statistics for each group |
overall.p |
p-value of the test |
individual |
a list of the results of the test applied to each group separately; each element contains the test statistic and p-value for that group |
Aniko Szabo
Stefanescu, C. & Turnbull, B. W. (2003) Likelihood inference for exchangeable binary data with varying cluster sizes. Biometrics, 59, 18-24
Pang, Z. & Kuk, A. (2007) Test of marginal compatibility and smoothing methods for exchangeable binary data with unequal cluster sizes. Biometrics, 63, 218-227
mc.est for estimating the distribution under marginal compatibility.
data(dehp)
mc.test.chisq(dehp)

data(shelltox)
mc.test.chisq(shelltox)
Calculates the within- and between-outcome correlation coefficients for exchangeable correlated multinomial data, based on joint probability estimates calculated by the jointprobs function. These determine the variance inflation due to the cluster structure.
multi.corr(jp, type = attr(jp, "type"))
jp |
the output of jointprobs |
type |
one of c("averaged", "cluster", "mc") - the type of joint probability. By default, the type attribute of jp is used. |
If $R_i$ and $R_j$ denote the number of events of type $i$ and $j$, respectively, in a cluster of size $n$, then their variances and covariances are determined by the marginal event probabilities $p_i$ and $p_j$ and the correlation coefficients $\varphi_{ij}$ computed by multi.corr.
a list of estimated correlation matrices by treatment group. If cluster-size specific estimates were requested (type="cluster"), then each list element is a list of these matrices for each cluster size.
jointprobs for calculating the joint probability arrays
data(dehp)
tau <- jointprobs(dehp, type="averaged")
multi.corr(tau)
These are built-in functions to be used by ran.CMData for generating random multinomial data.
mg.Resample(n, clustersizes, param)

mg.DirMult(n, clustersizes, param)

mg.LogitNorm(n, clustersizes, param)

mg.MixMult(n, clustersizes, param)
n |
number of independent clusters to generate |
clustersizes |
an integer vector specifying the sizes of the clusters |
param |
a list of parameters for each specific generator |
For mg.Resample: the param list should be list(data=...), in which the CMData object to be resampled is passed.
For mg.DirMult: the param list should be list(shape=...), in which the parameter vector of the Dirichlet distribution is passed (see rdirichlet).
For mg.LogitNorm: the param list should be list(mu=..., sigma=...), in which the mean vector and covariance matrix of the underlying normal distribution are passed. If sigma is NULL (or missing), then an identity matrix is assumed. They should have K-1 dimensions for a K-variate multinomial.
For mg.MixMult: the param list should be list(q=..., m=...), in which the vector of mixture probabilities q and the matrix m of logit-transformed means of each component are passed. For a K-variate multinomial, the matrix m should have K-1 columns and length(q) rows.
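Although these generators are normally called internally by ran.CMData, the usage above suggests they can also be invoked directly; a small sketch with arbitrary Dirichlet parameters (any packages mg.DirMult relies on, e.g. for rdirichlet, are assumed to be installed):

# five clusters of size 8 from a Dirichlet-Multinomial with 3 categories;
# the result should be a 5 x 3 matrix of counts, one row per cluster
mg.DirMult(5, clustersizes = rep(8, 5), param = list(shape = c(2, 3, 1)))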
The NOSTASOT dose is the No-Statistical-Significance-Of-Trend dose – the largest dose at which no trend in the rate of response has been observed. It is often used to determine a safe dosage level for a potentially toxic compound.
NOSTASOT(
  cbdata,
  test = c("RS", "GEE", "GEEtrend", "GEEall", "SO"),
  exact = test == "SO",
  R = 100,
  sig.level = 0.05,
  control = soControl()
)
cbdata |
a CBData object |
test |
character string defining the desired test statistic. See trend.test for details. |
exact |
logical, should an exact permutation test be performed. See trend.test for details. |
R |
integer, number of permutations for the exact test |
sig.level |
numeric between 0 and 1, significance level of the test |
control |
an optional list of control settings for the stochastic order ("SO") test, usually a call to soControl(). |
A series of hypotheses about the presence of an increasing trend overall, with all but the last group, all but the last two groups, etc., are tested. Since this set of hypotheses forms a closed family, one can test these hypotheses in a step-down manner with the same sig.level type I error rate at each step and still control the family-wise error rate.
The NOSTASOT dose is the largest dose at which the trend is not statistically significant. If the trend test is not significant with all the groups included, the largest dose is the NOSTASOT dose. If the testing sequence goes down all the way to two groups, and a significant trend is still detected, the lowest dose is the NOSTASOT dose. This assumes that the lowest dose is a control group, and this convention might not be meaningful otherwise.
a list with two components
NOSTASOT |
character string identifying the NOSTASOT dose. |
p |
numeric vector of the p-values of the tests actually performed. |
The last element corresponds to all doses included, and will not be missing. p-values for tests that were not actually performed due to the procedure stopping are set to NA.
Aniko Szabo, [email protected]
Tukey, J. W.; Ciminera, J. L. & Heyse, J. F. (1985) Testing the statistical certainty of a response to increasing doses of a drug. Biometrics 41, 295-301.
trend.test
for details about the available trend
tests.
data(shelltox)
NOSTASOT(shelltox, test="RS")
qpower.pdf and betabin.pdf calculate the probability distribution function for the number of responses in a cluster for the q-power and beta-binomial distributions, respectively.
betabin.pdf(p, rho, n)

qpower.pdf(p, rho, n)
p |
numeric, the probability of success. |
rho |
numeric between 0 and 1 inclusive, the within-cluster correlation. |
n |
integer, cluster size. |
The pdf of the q-power distribution is
$$P(X=x) = \binom{n}{x} \sum_{j=0}^{x} (-1)^j \binom{x}{j} q^{(n-x+j)^\gamma}, \quad x=0,\ldots,n,$$
where $q = 1-p$, and the intra-cluster correlation is
$$\rho = \frac{q^{2^\gamma} - q^2}{q - q^2}.$$
The pdf of the beta-binomial distribution is
$$P(X=x) = \binom{n}{x} \frac{B(\alpha+x,\, n+\beta-x)}{B(\alpha,\beta)},$$
where $\alpha = p\,\frac{1-\rho}{\rho}$ and $\beta = (1-p)\,\frac{1-\rho}{\rho}$.
a numeric vector of length $n+1$ giving the value of $P(X=x)$ for $x = 0, 1, \ldots, n$.
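As a quick sanity check (a sketch), both pdfs should sum to one over x = 0, ..., n, and the expected number of responses should be close to n*p:

p <- 0.3; rho <- 0.4; n <- 10
x <- 0:n
sum(betabin.pdf(p, rho, n))        # should be numerically 1
sum(qpower.pdf(p, rho, n))         # should be numerically 1
sum(x * betabin.pdf(p, rho, n))    # should be close to n * p = 3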
Aniko Szabo, [email protected]
Kuk, A. (2004) A litter-based approach to risk assessment in developmental toxicity studies via a power family of completely monotone functions. Applied Statistics, 52, 51-61.
Williams, D. A. (1975) The Analysis of Binary Responses from Toxicological Experiments Involving Reproduction and Teratogenicity. Biometrics, 31, 949-952.
ran.CBData for generating an entire dataset using these functions
# the distributions have quite different shapes
# with q-power assigning more weight to the "all affected" event than other distributions
plot(0:10, betabin.pdf(0.3, 0.4, 10), type="o", ylim=c(0,0.34),
     ylab="Density", xlab="Number of responses out of 10")
lines(0:10, qpower.pdf(0.3, 0.4, 10), type="o", col="red")
ran.CBData generates a random CBData object from a given two-parameter distribution.
ran.CBData(
  sample.sizes,
  p.gen.fun = function(g) 0.3,
  rho.gen.fun = function(g) 0.2,
  pdf.fun = qpower.pdf
)
sample.sizes |
a dataset with variables Trt, ClusterSize and Freq giving the number of clusters to be generated for each Trt/ClusterSize combination. |
p.gen.fun |
a function of one parameter that generates the value of the first parameter (p) of pdf.fun for each treatment group |
rho.gen.fun |
a function of one parameter that generates the value of the second parameter (rho) of pdf.fun for each treatment group |
pdf.fun |
a function of three parameters (p, rho, n) giving the PDF of the number of responses in a cluster given the two parameters (p, rho) and the cluster size (n). Functions implementing two common distributions are provided in the package: the beta-binomial (betabin.pdf) and the q-power (qpower.pdf). |
p.gen.fun and rho.gen.fun are functions that generate the parameter values for each treatment group; pdf.fun is a function generating the pdf of the number of responses given the two parameters p and rho, and the cluster size n.
p.gen.fun
and rho.gen.fun
expect the parameter value of 1 to
represent the first group, 2 - the second group, etc.
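The pdf.fun contract also accommodates user-written distributions. As a sketch, a plain binomial pdf that ignores the correlation parameter could be plugged in; binom.pdf here is a hypothetical helper, not part of the package.

# hypothetical helper: independent responses, rho is ignored
binom.pdf <- function(p, rho, n) dbinom(0:n, size = n, prob = p)
ss <- expand.grid(Trt = 0:2, ClusterSize = 5, Freq = 4)
rd.indep <- ran.CBData(ss, p.gen.fun = function(g) 0.2 + 0.1*g, pdf.fun = binom.pdf)
rd.indep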
a CBData object with randomly generated number of responses with sample sizes specified in the call.
Aniko Szabo, [email protected]
set.seed(3486)
ss <- expand.grid(Trt=0:3, ClusterSize=5, Freq=4)
# Trt is converted to a factor
rd <- ran.CBData(ss, p.gen.fun=function(g) 0.2+0.1*g)
rd
Generates random exchangeably correlated multinomial data based on a parametric distribution or using resampling. The Dirichlet-Multinomial, Logistic-Normal multinomial, and discrete mixture multinomial parametric distributions are implemented. All observations will be assigned to the same treatment group, and there is no guarantee of a specific order of the observations in the output.
ran.CMData(n, ncat, clustersize.gen, distribution)
n |
number of independent clusters to generate |
ncat |
number of response categories |
clustersize.gen |
either an integer vector specifying the sizes of the clusters, which will be recycled to achieve the target number of clusters n; or a function with one argument that returns an integer vector of cluster sizes when the target number of clusters is passed to it |
distribution |
a list with two components, "multinom.gen" and "param", that specifies the generation process for each cluster. The "multinom.gen" component should be a function of three parameters (number of clusters, vector of cluster sizes, and parameter list) that returns a matrix of response counts, where each row is a cluster and each column is the number of responses of a given type. The "param" component should specify the list of parameters needed by the multinom.gen function. |
a CMData object with randomly generated number of responses, with sample sizes specified in the call
Aniko Szabo
CMData for details about CMData objects; multinom.gen for the built-in generating functions
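Because the multinom.gen component is just a function following the (n, clustersizes, param) contract described above, a user-written generator can also be supplied. A minimal sketch follows; mg.EqualProb is a hypothetical example, not part of the package.

# hypothetical generator: multinomial sampling with equal category probabilities
mg.EqualProb <- function(n, clustersizes, param){
  t(sapply(clustersizes, function(cs)
      rmultinom(1, size = cs, prob = rep(1/param$ncat, param$ncat))))
}
ran.eq <- ran.CMData(8, ncat = 3, clustersize.gen = 5,
                     distribution = list(multinom.gen = mg.EqualProb, param = list(ncat = 3)))
ran.eq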
# Resample from the dehp dataset
data(dehp)
ran.dehp <- ran.CMData(20, 3, 10, list(multinom.gen=mg.Resample, param=list(data=dehp)))

# Dirichlet-Multinomial distribution with two treatment groups and random cluster sizes
binom.cs <- function(n){rbinom(n, p=0.3, size=10)+1}
dm1 <- ran.CMData(15, 4, binom.cs, list(multinom.gen=mg.DirMult, param=list(shape=c(2,3,2,1))))
dm2 <- ran.CMData(15, 4, binom.cs, list(multinom.gen=mg.DirMult, param=list(shape=c(1,1,4,1))))
ran.dm <- rbind(dm1, dm2)

# Logit-Normal multinomial distribution
ran.ln <- ran.CMData(13, 3, 3, list(multinom.gen=mg.LogitNorm,
                                    param=list(mu=c(-1, 1), sigma=matrix(c(1,0.8,0.8,2), nr=2))))

# Mixture of two multinomial distributions
unif.cs <- function(n){sample(5:9, size=n, replace=TRUE)}
ran.mm <- ran.CMData(6, 3, unif.cs, list(multinom.gen=mg.MixMult,
                                         param=list(q=c(0.8,0.2), m=rbind(c(-1,0), c(0,2)))))
A convenience function to read data from a specially structured file directly into a CBData object.
read.CBData(file, with.freq = TRUE, ...)
file |
name of file with data. The first column should contain the treatment group, the second the size of the cluster, the third the number of responses in the cluster. Optionally, a fourth column could give the number of times the given combination occurs in the data. |
with.freq |
logical indicator of whether a frequency variable is present in the file |
... |
additional arguments passed to read.table |
a CBData object
Aniko Szabo
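A small sketch of the expected file layout, using made-up values written to a temporary file (the columns are treatment group, cluster size, number of responses, and frequency):

tmp <- tempfile(fileext = ".txt")
writeLines(c("Control 10 1 2",
             "Control  9 0 1",
             "High    10 4 3"), tmp)
cb <- read.CBData(tmp, with.freq = TRUE)
str(cb)
unlink(tmp)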
A convenience function to read data from a specially structured file directly into a CMData object. There are two basic data format options: either the counts of responses of all categories are given (and the cluster size is the sum of these counts), or the total cluster size is given with the counts of all but one category. The first column should always give the treatment group, then either the counts for each category (first option, chosen by setting with.clustersize = FALSE), or the size of the cluster followed by the counts for all but one category (second option, chosen by setting with.clustersize = TRUE). Optionally, a last column could give the number of times the given combination occurs in the data.
read.CMData(file, with.clustersize = TRUE, with.freq = TRUE, ...)
file |
name of file with data. The data in the file should be structured as described above. |
with.clustersize |
logical indicator of whether a cluster size variable is present in the file |
with.freq |
logical indicator of whether a frequency variable is present in the file |
... |
additional arguments passed to read.table |
a CMData object
Aniko Szabo
RS.trend.test implements the Rao-Scott adjusted Cochran-Armitage test for linear increasing trend with correlated data.
RS.trend.test(cbdata)
cbdata |
a CBData object |
The test is based on calculating a design effect for each cluster by dividing the observed variability by the one expected under independence. The number of responses and the cluster size are then divided by the design effect, and a Cochran-Armitage type test statistic is computed based on these adjusted values.
The implementation aims for testing for increasing trend, and a one-sided p-value is reported. The test statistic is asymptotically normally distributed, and a two-sided p-value can be easily computed if needed.
A list with components
statistic |
numeric, the value of the test statistic |
p.val |
numeric, asymptotic one-sided p-value of the test |
Aniko Szabo, [email protected]
Rao, J. N. K. & Scott, A. J. (1992) A Simple Method for the Analysis of Clustered Data. Biometrics, 48, 577-586.
SO.trend.test, GEE.trend.test for alternative tests; CBData for constructing a CBData object.
data(shelltox)
RS.trend.test(shelltox)
This is a classical developmental toxicology data set. Pregnant banded Dutch rabbits were treated with one of four levels of a chemical. The actual doses are not known, instead the groups are designated as Control, Low, Medium, and High. Before term the animals were sacrificed, and the total number of fetuses, as well as the number affected by the treatment was recorded.
data(shelltox)
A 'CBData' object, that is a data frame with the following variables
Trt | factor giving treatment group |
ClusterSize | the size of the litter |
NResp | the number of affected fetuses |
Freq | the number of litters with the given ClusterSize/NResp combination |
Paul, S. R. (1982) Analysis of proportions of affected foetuses in teratological experiments. Biometrics, 38, 361-370.
This data set has been analyzed (and listed) in numerous papers, including
Rao, J. N. K. & Scott, A. J. (1992) A Simple Method for the Analysis of Clustered Data. Biometrics, 48, 577-586.
George, E. O. & Kodell, R. L. (1996) Tests of Independence, Treatment Heterogeneity, and Dose-Related Trend With Exchangeable Binary Data. Journal of the American Statistical Association, 91, 1602-1610.
Lee, S. (2003) Analysis of the Binary Littermate Data in the One-Way Layout. Biometrical Journal, 45, 195-206.
data(shelltox)
stripchart(I(NResp/ClusterSize)~Trt, cex=sqrt(shelltox$Freq), data=shelltox, pch=1,
           method="jitter", vertical=TRUE, ylab="Proportion affected")
SO.LRT computes the likelihood ratio test statistic for stochastic ordering against equality, assuming marginal compatibility under both alternatives. Note that this statistic does not have a $\chi^2$ distribution, so the p-value computation is not straightforward. The SO.trend.test function implements a permutation-based evaluation of the p-value for the likelihood ratio test.
SO.LRT(cbdata, control = soControl())
cbdata |
a CBData object |
control |
an optional list of control settings, usually a call to soControl(). |
The value of the likelihood ratio test statistic is returned with two attributes:
ll0 |
the log-likelihood under H0 (equality) |
ll1 |
the log-likelihood under H1 (stochastic ordering) |
Aniko Szabo
data(shelltox)
LRT <- SO.LRT(shelltox, control=soControl(max.iter = 100, max.directions = 50))
LRT
SO.mc.est computes the nonparametric maximum likelihood estimate of the distribution of the number of responses in a cluster under a stochastic ordering constraint. Umbrella ordering can be specified using the turn parameter.
SO.mc.est(cbdata, turn = 1, control = soControl())
cbdata |
an object of class CBData |
turn |
integer specifying the peak of the umbrella ordering (see Details). The default corresponds to a non-decreasing order. |
control |
an optional list of control settings, usually a call to soControl(). |
Two different algorithms are implemented: EM and ISDM. In general, ISDM (the default) should be faster, though its performance depends on the tuning parameter max.directions: values that are too low or too high slow the algorithm down.
SO.mc.est allows extension to an umbrella ordering $D_1 \geq^{st} \cdots \geq^{st} D_k \leq^{st} \cdots \leq^{st} D_n$ by specifying the index $k$ of the peak group as the turn parameter. This is an experimental feature, and at this point none of the other functions can handle umbrella orderings.
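As a sketch of this experimental option, an umbrella ordering that peaks at the second group can be requested by setting turn = 2:

data(shelltox)
ml.umbrella <- SO.mc.est(shelltox, turn = 2, control = soControl(eps = 0.01, method = "ISDM"))
attr(ml.umbrella, "converge")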
A list with components (components Q and D are unlikely to be needed by the user):
MLest |
data frame with the maximum likelihood estimates of the distribution of the number of responses, for each treatment group and cluster size |
Q |
numeric matrix; estimated weights for the mixing distribution |
D |
numeric matrix; directional derivative of the log-likelihood |
loglik |
the achieved value of the log-likelihood |
converge |
a 2-element vector with the achieved relative error and the performed number of iterations |
Aniko Szabo, [email protected]
Szabo A, George EO. (2010) On the Use of Stochastic Ordering to Test for Trend with Clustered Binary Data. Biometrika 97(1), 95-108.
data(shelltox)
ml <- SO.mc.est(shelltox, control=soControl(eps=0.01, method="ISDM"))
attr(ml, "converge")

require(lattice)
panel.cumsum <- function(x,y,...){
  x.ord <- order(x)
  panel.xyplot(x[x.ord], cumsum(y[x.ord]), ...)}

xyplot(Prob~NResp|factor(ClusterSize), groups=Trt, data=ml, type="s",
       panel=panel.superpose, panel.groups=panel.cumsum, as.table=TRUE,
       auto.key=list(columns=4, lines=TRUE, points=FALSE),
       xlab="Number of responses", ylab="Cumulative Probability P(R>=r|N=n)",
       ylim=c(0,1.1), main="Stochastically ordered estimates\n with marginal compatibility")
Performs a likelihood ratio test of stochastic ordering versus equality, using permutations to estimate the null distribution and the p-value. If only the value of the test statistic is needed, use SO.LRT instead.
SO.trend.test(cbdata, R = 100, control = soControl())
cbdata |
a CBData object |
R |
an integer – the number of random permutations for estimating the null distribution. |
control |
an optional list of control settings, usually a call to soControl(). |
The test is valid only under the assumption that the cluster-size distribution does not depend on group. During the estimation of the null-distribution the group assignments of the clusters are permuted keeping the group sizes constant; the within-group distribution of the cluster-sizes will vary randomly during the permutation test.
The default value of R is probably too low for the final data analysis, and should be increased.
A list with the following components
LRT |
the value of the likelihood ratio test statistic. It has two attributes, ll0 and ll1, the log-likelihood under H0 and H1, respectively. |
p.val |
the estimated one-sided p-value. |
boot.res |
an object of class "boot" with the detailed results of the permutations. See the boot package for details. |
Aniko Szabo, [email protected]
Szabo A, George EO. (2010) On the Use of Stochastic Ordering to Test for Trend with Clustered Binary Data. Biometrika 97(1), 95-108.
SO.LRT for calculating only the test statistic, soControl
data(shelltox)
set.seed(45742)
sh.test <- SO.trend.test(shelltox, R=10, control=soControl(eps=0.1, max.directions=25))
sh.test

# a plot of the resampled LRT values
# would look better with a reasonable value of R
null.vals <- sh.test$boot.res$t[,1]
hist(null.vals, breaks=10, freq=FALSE, xlab="Test statistic", ylab="Density",
     main="Simulated null-distribution", xlim=range(c(0,20,null.vals)))
points(sh.test$LRT, 0, pch="*", col="red", cex=3)
The values supplied in the function call replace the defaults, and a list with all possible arguments is returned. The returned list is used as the control argument to the SO.mc.est, SO.LRT, and SO.trend.test functions.
soControl(
  method = c("ISDM", "EM"),
  eps = 0.005,
  max.iter = 5000,
  max.directions = 0,
  start = ifelse(method == "ISDM", "H0", "uniform"),
  verbose = FALSE
)
method |
a string specifying the maximization method |
eps |
a numeric value giving the maximum absolute error in the log-likelihood |
max.iter |
an integer specifying the maximal number of iterations |
max.directions |
an integer giving the maximal number of directions considered at one step of the ISDM method. If zero or negative, it is set to the number of non-empty cells. A value of 1 corresponds to the VDM algorithm. |
start |
a string specifying the starting setup of the mixing distribution; "H0" puts weight only on constant vectors (corresponding to the null hypothesis of no change), "uniform" puts equal weight on all elements. Only a "uniform" start can be used for the "EM" algorithm. |
verbose |
a logical value; if TRUE details of the optimization are shown. |
a list with components for each of the possible arguments.
Aniko Szabo [email protected]
# decrease the maximum number of iterations and
# request the "EM" algorithm
soControl(method="EM", max.iter=100)
The trend.test function provides a common interface to the trend tests implemented in this package: SO.trend.test, RS.trend.test, and GEE.trend.test. The details of each test can be found on its help page.
trend.test(
  cbdata,
  test = c("RS", "GEE", "GEEtrend", "GEEall", "SO"),
  exact = test == "SO",
  R = 100,
  control = soControl()
)
cbdata |
a CBData object |
test |
character string defining the desired test statistic. "RS" performs the Rao-Scott test (RS.trend.test); "GEE", "GEEtrend", and "GEEall" perform the GEE-based test (GEE.trend.test) with scale.method set to "fixed", "trend", and "all", respectively; "SO" performs the stochastic ordering test (SO.trend.test). |
exact |
logical, should an exact permutation test be performed. Only an exact test can be performed for "SO". The default is to use the asymptotic p-values except for "SO". |
R |
integer, number of permutations for the exact test |
control |
an optional list of control settings for the stochastic order ("SO") test, usually a call to soControl(). |
A list with two components and an optional "boot" attribute that contains the detailed results of the permutation test as an object of class boot if an exact test was performed.
statistic |
numeric, the value of the test statistic |
p.val |
numeric, asymptotic one-sided p-value of the test |
Aniko Szabo, [email protected]
SO.trend.test, RS.trend.test, and GEE.trend.test for details about the available tests.
data(shelltox)
trend.test(shelltox, test="RS")

set.seed(5724)
# R=50 is too low to get a good estimate of the p-value
trend.test(shelltox, test="RS", R=50, exact=TRUE)
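When an exact test is requested, the permutation details mentioned above can be pulled out of the "boot" attribute; continuing the example (a sketch):

exact.res <- trend.test(shelltox, test = "RS", R = 50, exact = TRUE)
perm <- attr(exact.res, "boot")   # an object of class "boot"; NULL for asymptotic tests
if (!is.null(perm)) hist(perm$t[,1], main = "Permutation distribution", xlab = "Test statistic")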
Calculates the marginal probability of each event type for exchangeable correlated multinomial data, based on joint probability estimates calculated by the jointprobs function.
uniprobs(jp, type = attr(jp, "type"))
jp |
the output of jointprobs |
type |
one of c("averaged", "cluster", "mc") - the type of joint probability. By default, the type attribute of jp is used. |
a list of estimated probabilities of each outcome by treatment group. The elements are either matrices or vectors depending on whether cluster-size specific estimates were requested (type="cluster") or not.
jointprobs for calculating the joint probability arrays
data(dehp)
tau <- jointprobs(dehp, type="averaged")
uniprobs(tau)

# separately for each cluster size
tau2 <- jointprobs(dehp, type="cluster")
uniprobs(tau2)
unwrap is a utility function that reformats a CBData or CMData object so that each row is one observation (instead of one or more clusters). A new ‘ID’ variable is added to indicate clusters. This form can be useful for setting up the data for a different package.
## S3 method for class 'CBData'
unwrap(object, ...)

## S3 method for class 'CMData'
unwrap(object, ...)

unwrap(object, ...)
object |
a CBData or CMData object |
... |
other potential arguments; not currently used |
For unwrap.CMData: a data frame with one row for each cluster element (having a multinomial outcome) with the following standardized column names
Trt |
factor, the treatment group |
ClusterSize |
numeric, the cluster size |
ID |
factor, each level representing a different cluster |
Resp |
numeric with integer values giving the response type of the cluster element |
For unwrap.CBData: a data frame with one row for each cluster element (having a binary outcome) with the following standardized column names
Trt |
factor, the treatment group |
ClusterSize |
numeric, the cluster size |
ID |
factor, each level representing a different cluster |
Resp |
numeric with 0/1 values, giving the response of the cluster element |
Aniko Szabo
data(dehp)
dehp.long <- unwrap(dehp)
head(dehp.long)

data(shelltox)
ush <- unwrap(shelltox)
head(ush)
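As an illustration of the "different package" remark above, the unwrapped data can be handed to, for example, geepack (a sketch, assuming geepack is installed; the rows are sorted by ID first because geeglm expects clusters to be contiguous):

if (require(geepack)) {
  ush <- ush[order(ush$ID), ]
  fit <- geeglm(Resp ~ Trt, id = ID, data = ush, family = binomial, corstr = "exchangeable")
  summary(fit)
}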