Comparing hazard rates of two populations for left-truncated data

The data come from section 1.16 on .  462 people (97 males and 365 females) entered a retirement center during the study period.  The information recorded includes their age at entry, age at death or at leaving the retirement center, gender, and whether they died.  I am interested in whether male residents die faster than female residents in this retirement center.

This is an example of left-truncated data. As in the case of right-censored data, we define t1<t2<…<tD as the distinct event times and di as the number of events at each ti. But we need to redefine the number of people at risk, Yi. For left-truncated data, each individual j enters the study at a random time Lj and either dies or is censored at Tj. We redefine Yi as the number of individuals with Lj<ti<=Tj, that is, those who entered before ti and were still under observation at ti.  The following graph shows the number of people at risk Yi at each event time.
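A minimal sketch in Python of how the risk set Yi can be counted under left truncation. The function name at_risk and the ages (in months) are made up for illustration; they are not the retirement-center data.

```python
def at_risk(entry, exit_, t):
    """Count individuals j with entry[j] < t <= exit_[j]."""
    return sum(1 for l, x in zip(entry, exit_) if l < t <= x)

# Hypothetical entry ages (Lj) and exit ages (Tj), in months
entry = [700, 720, 750, 800]
exit_ = [760, 790, 820, 850]

for t in [760, 790, 820]:
    print(t, at_risk(entry, exit_, t))
```

Unlike the right-censored case, Yi does not have to decrease over time here, because people keep entering the study late.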


Now I state this as a statistical problem. Ho: the hazard rate of females equals that of males, versus Ha: the hazard rate of females is smaller than that of males.

I use the log-rank test, which is the case where the following test statistic has w(ti) = 1 for all time points. Under Ho, this test statistic approximately follows a standard normal distribution. Because the alternative is one-sided, I reject Ho when Z is less than -Z alpha, the lower alpha percentage point of a standard normal distribution. It turns out that Z equals -1.83 with a one-sided p-value of 0.03, which provides evidence that male residents die faster than female residents.
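As a rough sketch of the calculation (not my actual code), the two-sample log-rank statistic with weights of 1 can be written in Python as below. I assume the usual textbook form of the statistic, with the risk sets counted as Lj<ti<=Tj; the function names and the small data set are hypothetical, and the sign of Z depends on which group is coded as sample 1.

```python
import math

def logrank_z(entry, exit_, event, group):
    """Two-sample log-rank Z (weights of 1) for left-truncated data.

    entry/exit_: entry and exit ages; event: 1 = death, 0 = censored;
    group: 1 for the first sample, anything else for the second.
    """
    death_times = sorted({x for x, e in zip(exit_, event) if e})
    num, var = 0.0, 0.0
    for t in death_times:
        y = y1 = d = d1 = 0
        for l, x, e, g in zip(entry, exit_, event, group):
            if l < t <= x:              # at risk at time t
                y += 1
                if g == 1:
                    y1 += 1
                if e and x == t:        # death exactly at t
                    d += 1
                    if g == 1:
                        d1 += 1
        num += d1 - y1 * d / y          # observed minus expected in sample 1
        if y > 1:
            var += (y1 / y) * (1 - y1 / y) * ((y - d) / (y - 1)) * d
    return num / math.sqrt(var)

def normal_cdf(z):
    """Standard normal CDF, used for the lower-tail p-value."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical data, not the retirement-center data
entry = [60, 62, 65, 61, 63, 66]
exit_ = [70, 68, 75, 72, 80, 78]
event = [1, 1, 0, 1, 0, 1]
group = [1, 1, 1, 2, 2, 2]
z = logrank_z(entry, exit_, event, group)
print(z, normal_cdf(z))

# Lower-tail p-value for the Z reported above
print(round(normal_cdf(-1.83), 2))  # 0.03
```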


For the detailed code, please check my GitHub.



Is a mushroom poisonous or edible?

Kaggle provides well-defined problems and clean, easy-to-use data to work on directly. I started with an easy task: predicting whether a mushroom is poisonous or edible based on 22 characteristic features.

Even though 22 features are not many, they still make it difficult to construct statistical models. The first thing that came to mind was reducing dimensions. Principal component analysis (PCA) is a great method for this task. First, I removed four variables with only one level from the data.  From the above figure, 11 principal components explain more than 90% of the variation in the data. I thus reduced the 22 features to 11 principal components (PCs).
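The 90% rule can be sketched in Python as follows: given the variance of each component, take the smallest number of components whose cumulative share exceeds the threshold. The function name and the variances below are hypothetical, not the values from my mushroom analysis.

```python
def components_needed(variances, threshold=0.90):
    """Smallest k such that the top-k components explain more than `threshold`."""
    total = sum(variances)
    running = 0.0
    for k, v in enumerate(sorted(variances, reverse=True), start=1):
        running += v
        if running / total > threshold:
            return k
    return len(variances)

# Hypothetical component variances
variances = [5.0, 2.0, 1.5, 1.0, 0.5]
print(components_needed(variances))
```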

I predict the class from these 11 PCs using two methods: a decision tree and logistic regression. For logistic regression, I successively removed three insignificant components: PC5, PC6, and PC12.

The sensitivity of the decision tree is 0.98. This means that among 100 poisonous mushrooms, 98 are correctly identified but 2 are misclassified as edible. The sensitivity of logistic regression is 0.97, slightly lower than that of the tree.  Both methods have a specificity of 0.67. This means that among 100 edible mushrooms, 67 are classified as edible but 33 are classified as poisonous. Though the two models don’t perform well in identifying edible mushrooms as such, they do a good job of recognizing poisonous mushrooms.
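To make the two definitions concrete, here is a small Python sketch that treats "poisonous" as the positive class. The counts are just the per-100 illustration from the paragraph above, not the real confusion matrix.

```python
def sensitivity(true_pos, false_neg):
    """Share of poisonous mushrooms that are flagged as poisonous."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Share of edible mushrooms that are classified as edible."""
    return true_neg / (true_neg + false_pos)

# Per-100 illustration: 98 of 100 poisonous caught, 67 of 100 edible kept
print(sensitivity(98, 2))   # 0.98
print(specificity(67, 33))  # 0.67
```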

Stat Graduate School

OK. I retook the GRE, took ST 512 and MA 425, and then I finally got admitted to Statistics programs. 

This semester, I am taking ST 520 (statistical clinical trial principles), ST 521 (statistical inference), ST 711 (Design of Experiments), and ST 730 (Applied Time Series). Before classes started, I thought I was smart. But after one week, I think everybody is smart. Yesterday I searched the web for how to become smart. I kept two things in mind. One is being with smart people. Great, I am now surrounded by many smart PhD candidates who are excellent in math. The other is writing a blog. Out of this motivation, and also encouraged by David Smith’s interview, in which he said that blogging can keep track of what you learn every day, I reset my forgotten password for this blog, and I hope I can write something down every day.

ST 520: covers two areas, epidemiology and clinical trials.

Epidemiology studies what causes a disease. Since you can only observe what happens in a population and cannot control the treatment, epidemiology relies on observational studies, so you can only reach a conclusion about the association between disease and exposure. To determine whether there is an association between disease and exposure, we can use Pearson’s chi-square test of independence. To quantify the association, we have the relative risk or the odds ratio.

Generally, there are cross-sectional studies, prospective studies, retrospective (or case-control) studies, and matched case-control studies. A cross-sectional study takes a random sample representing the population at one point in time and then collects data from the sample, so you can estimate the prevalence of the disease. Prevalence is the proportion of individuals in the population who have the disease, counting all existing cases. A prospective study finds a group of people, records their relevant factors, especially the one you want to study (for example, smokers versus nonsmokers), follows them for a period of time, and then sees how many die or survive. Of course, there are problems, such as people dropping out in the middle. A retrospective study takes two groups of people of the same size, one with the disease and one without, and then asks them about their habits or exposures; for lung cancer, for example, you ask whether they are smokers. Since some confounding variables (variables related to both the primary variable and the response), such as age or gender, affect one’s habits or exposure (men tend to smoke more, for example), a matched case-control study controls for those confounding variables to separate the effect of the main exposure on the disease from theirs. 
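The association measures mentioned above can be sketched in Python for a 2x2 table of exposure by disease. The function name and the counts are hypothetical. Note that the relative risk only makes sense when the sampling lets you estimate risk (as in a prospective study), while the odds ratio can also be computed from case-control data.

```python
def two_by_two(a, b, c, d):
    """a: exposed with disease, b: exposed without disease,
    c: unexposed with disease, d: unexposed without disease."""
    relative_risk = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    n = a + b + c + d
    # Pearson chi-square statistic for a 2x2 table (1 degree of freedom)
    chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return relative_risk, odds_ratio, chi_square

# Hypothetical counts: 100 smokers (30 diseased), 100 nonsmokers (10 diseased)
rr, odds, chi2 = two_by_two(30, 70, 10, 90)
print(rr, odds, chi2)
```

With 1 degree of freedom, a chi-square statistic above 3.84 rejects independence at the 5% level.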

Clinical trials control the treatments in order to study the causal relationship between treatments (drugs) and diseases, so they are experimental studies. Before a drug is given to human beings, the pre-clinical phase applies the drug to animals to assess efficacy and test toxicity. Phase I is then the first time the drug is given to humans. In this phase, the maximum tolerable dose (MTD, the dose at which about one out of three patients experiences a toxic problem) needs to be determined. 

ST 521: Statistical Inference

One abstract idea I recently learned is that of an algebra and a sigma-algebra. But the good thing is the idea of a random variable, Y(w), which maps any element of the sample space onto a real value. For example, the sample space of rolling a die is {one dot, two dots,…six dots}. Each element of the sample space is mapped to a real number, the number of dots. 

(S, A, P) -> Y(w) -> (R, B, F(.)) The sample space S is replaced by the real numbers R. The sigma-algebra A is replaced by the Borel sets B. The probability P is carried over to the random variable, whose cumulative distribution function is F(.). 
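For the die example, the mapping can be sketched in Python. I assume a fair die, which the notes above do not state; the names sample_space, P, Y, and F simply mirror the notation.

```python
from fractions import Fraction

# Sample space S: the faces of the die, described in words
sample_space = ["one dot", "two dots", "three dots",
                "four dots", "five dots", "six dots"]

# Probability P on S (assumption: a fair die)
P = {w: Fraction(1, 6) for w in sample_space}

def Y(w):
    """Random variable: map an outcome w to the real number of dots."""
    return float(sample_space.index(w) + 1)

def F(y):
    """Cumulative distribution function F(y) = P(Y <= y)."""
    return sum(p for w, p in P.items() if Y(w) <= y)

print(Y("three dots"))  # 3.0
print(F(3))             # 1/2
print(F(2.5))           # 1/3
```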



HRP258: R Deducer

First, install R, Deducer, and Java all together. Click on the Deducer icon wherever you installed it. Then you see a console and a data view window. Deducer is a shortcut that types out the code for you using the correct R syntax. It is an interface, so you can always check the R console window for notifications and/or error statements from R regarding your tasks. Here are some handy tasks.

Click on Data–subset or sort in the console window.

Click on Plots–Plot Builder–Templates(simple) or Geometric Elements(changing color), e.g. Histograms, Boxplot, Group Boxplot.

Click on Analysis–Frequencies or Descriptives in the console window. You can select the variables that you want to analyze (e.g. coffee, optimism by coffee). Next, you can choose which descriptives you want to display (like mean, median).  You can also customize ‘Descriptives’ by clicking on ‘Custom’ (e.g. ‘inter-quartile-range’).

The content above comes from HRP258, Statistics in Medicine.

Boxplot: the box runs from the 25th percentile to the 75th percentile; the interquartile range (IQR) is the length from the 25th percentile to the 75th percentile of the data; special points (outliers) are those lying more than 1.5 times that range beyond the box.
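The same quantities can be computed by hand, sketched here in Python rather than Deducer. The data are made up, and quartile conventions differ between programs, so the numbers may not match R exactly; I use the "inclusive" method of the standard library.

```python
import statistics

def box_stats(data):
    """Return Q1, Q3, the IQR, and the points beyond the 1.5 * IQR fences."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return q1, q3, iqr, outliers

# Hypothetical data with one unusually large value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print(box_stats(data))  # (3.25, 7.75, 4.5, [30])
```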

Met John

Today, I met John at Career Pro Inc.

He introduced me to some ideas about shaping yourself like a brand.

First, sharpen your resume; keep it clean and tidy.

Second, build your network and leave others with a clean, specific image of you; you can also learn from people in the network. How can I build the network? Talk with people, volunteer, and use LinkedIn. More professionally and thoroughly, write journals and reports.