pokoljitef2

Answered question

2022-06-15

I've analysed newspapers by counting the language distributions of the articles.
The results look like this:
               Day 1   Day 2   Day 3
Economy
  Language 1:  0.35    0.30    0.90
  Language 2:  0.11    0.10    0.00
  Language 3:  0.54    0.60    0.10
Sport
  Language 1:  0.40    0.30    1.00
  Language 2:  0.20    0.20    0.00
  Language 3:  0.40    0.50    0.00
I have already posted another question on that topic (Remove statistical outliers), but here comes my second problem. First of all, I want to remove all statistical outliers from the data (e.g. day 3) to make it "clean" (see my other post). After that, I want to determine which changes in my data are just "noise" and which are significant changes. But I'm not sure how to do it.
I was thinking of the following approach:
I could calculate the standard deviation (like in my other post) and treat every value more than one standard deviation from the mean as a "significant change". But I think this will give wrong results if all my values are slightly increasing or decreasing.
Is there any mathematical technique to find the significant changes in my data?
Thanks in advance.

Answer & Explanation

svirajueh

Beginner | 2022-06-16 | Added 29 answers

Removing outliers does not necessarily "make the data clean". The outliers may belong in the data; they could be a sign of a real phenomenon. Deleting them automatically is like throwing the baby out with the bath water. Finding outliers in order to study them, and to determine whether they are errors or something interesting, is a different story. That is useful statistics. But statistics cannot tell you that a data point is bad just because it is an outlier. I have mentioned on the Cross Validated site that Dixon's or Grubbs' test can be used to find a single outlier, or a modified Dixon test to find two or more. Let's suppose for the moment that your original data had no outliers, so nothing got deleted. Even then, your question as posed has no answer.
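To make the Grubbs test mentioned above a bit more concrete, here is a minimal sketch that computes only the test statistic, G = max|x_i - mean| / s, for the Economy / Language 1 row from the question. The function names are my own, not from any library. The critical value is left to the caller: `grubbs_critical` assumes you supply the appropriate Student's t quantile (upper alpha/(2n) point with n - 2 degrees of freedom) from a table or a statistics package.

```python
import math
import statistics


def grubbs_statistic(values):
    """Return (G, index): the Grubbs statistic and the position of the most extreme value."""
    if len(values) < 3:
        raise ValueError("Grubbs' test needs at least three observations")
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("all values are identical, so G is undefined")
    index = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[index] - mean) / s, index


def grubbs_critical(n, t_quantile):
    """Two-sided critical value for G, given a t quantile supplied by the caller.

    t_quantile is assumed to be the upper alpha/(2n) point of Student's t
    with n - 2 degrees of freedom.
    """
    t_sq = t_quantile ** 2
    return (n - 1) / math.sqrt(n) * math.sqrt(t_sq / (n - 2 + t_sq))


# Economy, Language 1, days 1 to 3 (values taken from the question).
economy_lang1 = [0.35, 0.30, 0.90]
g, suspect = grubbs_statistic(economy_lang1)
print(f"G = {g:.4f}, most extreme observation is day {suspect + 1}")
```

Note that G can never exceed (n - 1) / sqrt(n), which is about 1.155 for n = 3, so with only three days per series the statistic has very little room to separate an outlier from ordinary variation. A flagged point is still only a candidate to investigate, not proof that the value is wrong.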
Whether an observation is pure noise or a signal with possible added noise cannot be tested statistically without an underlying model. For example, if your data were a time series, you could compare models: is the series signal + noise, or just noise? If the process is stationary, you could test this using the autocorrelation function (testing that all lagged correlations are 0). But you are trying to do this sort of thing to isolated observations, outside the context of a statistical model.
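As a rough illustration of the autocorrelation check, here is a sketch that computes the sample autocorrelations of a series, the approximate 95% band of ±1.96/sqrt(n) expected under pure noise, and the Ljung-Box statistic, which is one common way (my choice here, not the only one) to test all lags jointly. The series is hypothetical: 20 days alternating between two levels, since three days from the question are far too few for this. All function names are my own.

```python
import math


def autocorrelations(series, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def ljung_box_q(series, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_k r_k^2 / (n - k)."""
    n = len(series)
    acf = autocorrelations(series, max_lag)
    return n * (n + 2) * sum(r * r / (n - k) for k, r in enumerate(acf, start=1))


def white_noise_band(n, z=1.96):
    """Approximate 95% band for each r_k if the series were pure noise."""
    return z / math.sqrt(n)


# Hypothetical daily shares of one language (NOT data from the question).
shares = [0.4, 0.6] * 10
acf = autocorrelations(shares, max_lag=3)
band = white_noise_band(len(shares))
outside = [k for k, r in enumerate(acf, start=1) if abs(r) > band]
q = ljung_box_q(shares, max_lag=3)
print("lags outside the noise band:", outside)
print(f"Ljung-Box Q over 3 lags: {q:.2f}")
```

Under the "just noise" hypothesis, Q is compared with a chi-square distribution with as many degrees of freedom as lags tested (the 95% point for 3 degrees of freedom is about 7.81). Lags far outside the band, or a large Q, suggest structure beyond noise, but only within the assumed model of a stationary series.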
