A test is said to be valid if it measures accurately what it is intended to measure. This seems simple enough. When closely examined, however, the concept of validity reveals a number of aspects, each of which deserves our attention.


3.1.1   Content validity

A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned. It is obvious that a grammar test, for instance, must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. Just what the relevant structures are will depend, of course, upon the purpose of the test. We would not expect an achievement test for intermediate learners to contain just the same set of structures as one for advanced learners. In order to judge whether or not a test has content validity, we need a specification of the skills or structures, etc. that it is meant to cover. Such a specification should be made at a very early stage in test construction. It is not to be expected that everything in the specification will always appear in the test; there may simply be too many things for all of them to appear in a single test.
What is the importance of content validity? First, the greater a test’s content validity, the more likely it is to be an accurate measure of what it is supposed to measure. Secondly, a test that lacks content validity is likely to have a harmful backwash effect: areas that are not tested are likely to become areas ignored in teaching and learning. Too often the content of tests is determined by what is easy to test rather than by what is important to test; the best safeguard against this is to write full test specifications and to ensure that the test content is a fair reflection of these.
The effectiveness of a content validity strategy can be enhanced by making sure that the experts are truly experts in the appropriate field and that they have adequate and appropriate tools in the form of rating scales so that their judgments can be sound and focused. However, testers should never rest on their laurels. Once they have established that a test has adequate content validity, they must immediately explore other kinds of validity of the test in terms related to the specific performances of the types of students for whom the test was designed in the first place.


3.1.2  Criterion-related validity / Empirical validity

There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time. To exemplify this kind of validation in achievement testing, let us consider a situation where course objectives call for an oral component as part of the final achievement test. The objectives may list a large number of ‘functions’ which students are expected to perform orally, to test all of which might take 45 minutes for each student. This could well be impractical. A shorter oral test might therefore be constructed; its concurrent validity would be checked by comparing candidates’ scores on it with their scores on the full 45-minute test, administered at about the same time.
The second kind of criterion-related validity is predictive validity. This concerns the degree to which a test can predict candidates’ future performance. An example would be how well a proficiency test could predict a student’s ability to cope with a graduate course at a British university. The criterion measure here might be an assessment of the student’s English as perceived by his or her supervisor at the university, or it could be the outcome of the course (pass/fail, etc.).
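In practice, criterion-related validity is usually reported as a validity coefficient: the correlation between candidates’ scores on the test and their scores on the criterion measure. A minimal sketch is given below; the scores are invented purely for illustration, and the function name is not from any standard library.

```python
import math

def validity_coefficient(test_scores, criterion_scores):
    """Pearson correlation between test scores and criterion scores.

    The closer the coefficient is to 1, the stronger the evidence
    of criterion-related validity.
    """
    n = len(test_scores)
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, criterion_scores))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in test_scores))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in criterion_scores))
    return cov / (sd_x * sd_y)

# Hypothetical data: six students' scores on a short proficiency test
# and their supervisors' later ratings (the criterion measure).
short_test = [62, 75, 48, 88, 70, 55]
supervisor_rating = [60, 78, 50, 85, 72, 58]
print(f"validity coefficient: {validity_coefficient(short_test, supervisor_rating):.2f}")
```

A coefficient near 1 would support the test’s predictive or concurrent validity; a coefficient near 0 would suggest the test and the criterion measure different things.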


3.1.3  Construct validity

A test, part of a test, or a testing technique is said to have construct validity if it can be demonstrated that it measures just the ability which it is supposed to measure. The word ‘construct’ refers to an underlying ability (or trait) which is hypothesized in a theory of language ability. One might hypothesize, for example, that the ability to read involves a number of sub-abilities, such as the ability to guess the meaning of unknown words from the context in which they are met. It would be a matter of empirical research to establish whether or not such a distinct ability existed and could be measured. If we attempted to measure that ability in a particular test, then that part of the test would have construct validity only if we were able to demonstrate that we were indeed measuring just that ability.
Construct validity is the most important form of validity because it asks the fundamental validity question: what is this test really measuring? We have seen that all variables derive from constructs and that constructs are non-observable traits, such as intelligence, anxiety, and honesty, “invented” to explain behavior. Constructs underlie the variables that researchers measure. You cannot see a construct; you can only observe its effects. “Why does this person act one way and that person a different way? Because one is intelligent and one is not, or one is dishonest and the other is not.” We cannot prove that constructs exist, just as we cannot perform brain surgery on a person to “see” his or her intelligence, anxiety, or honesty.


3.1.4  Face validity

A test is said to have face validity if it looks as if it measures what it is supposed to measure. For example, a test which pretended to measure pronunciation ability but which did not require the candidate to speak (and there have been such tests) might be thought to lack face validity. This would be true even if the test’s construct and criterion-related validity could be demonstrated. Face validity is hardly a scientific concept, yet it is very important. A test which does not have face validity may not be accepted by candidates, teachers, education authorities or employers. It may simply not be used; and if it is used, the candidates’ reaction to it may mean that they do not perform on it in a way that truly reflects their ability.


3.1.5  The use of validity

What use is the reader to make of the notion of validity? First, every effort should be made in constructing tests to ensure content validity. Where possible, the tests should be validated empirically against some criterion. Particularly where it is intended to use indirect testing, reference should be made to the research  literature to confirm that  measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be used.


3.2  Reliability

Reliability is a necessary characteristic of any good test: for it to be valid at all, a test must first be reliable as a measuring instrument. If a test is administered to the same candidates on different occasions (with no language practice work taking place between these occasions), then, to the extent that it produces differing results, it is not reliable. Reliability measured in this way is commonly referred to as test/re-test reliability, to distinguish it from mark/re-mark reliability. In short, in order to be reliable, a test must be consistent in its measurements.
Factors affecting the reliability of a test are:
  1. the extent of the sample of material selected for testing: whereas validity is concerned chiefly with the content of the sample, reliability is concerned with its size. The larger the sample (i.e. the more tasks the testees have to perform), the greater the probability that the test as a whole is reliable; hence the favoring of objective tests, which allow a wide field to be covered.
  2. the administration of the test: is the same test administered to different groups under different conditions or at different times? Clearly, this is an important factor in deciding reliability, especially in tests of oral production and listening comprehension.
One method of measuring the reliability of a test is to re-administer the same test after a lapse of time. It is assumed that all candidates have been treated in the same way in the interval: that either all of them have been taught or that none of them has.
Another means of estimating the reliability of a test is by administering parallel forms of the test to the same group. This assumes that two similar versions of a particular test can be constructed; such tests must be identical in the nature of their sampling, difficulty, length, rubrics, etc.


3.2.1  How to make tests more reliable

As we have seen, there are two components of test reliability: the performance of candidates from occasion to occasion, and the reliability of the scoring.
Take enough samples of behavior.   Other things being equal, the more items you have on a test, the more reliable that test will be. This seems intuitively right. While it is important to make a test long enough to achieve satisfactory reliability, it should not be made so long that the candidates become so bored or tired that the behavior they exhibit becomes unrepresentative of their ability. At the same time, it may often be necessary to resist pressure to make a test shorter than is appropriate. The usual argument for shortening a test is that it is not practical.
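The effect of test length on reliability can be estimated with the Spearman-Brown prophecy formula, a standard result from classical test theory (it is not part of the text above, and it assumes that any added items are comparable in quality to the existing ones):

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened (or shortened)
    by the given factor, assuming added items are comparable
    (Spearman-Brown prophecy formula)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# A test with reliability 0.70, doubled in length: predicted 0.82.
print(round(spearman_brown(0.70, 2), 2))    # → 0.82
# The same test halved in length: predicted 0.54.
print(round(spearman_brown(0.70, 0.5), 2))  # → 0.54
```

This makes the trade-off concrete: doubling a test raises reliability only so far, while halving it can cost more reliability than expected.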
Do not allow candidates too much freedom.   In some kinds of language test there is a tendency to offer candidates a choice of questions and then to allow them a great deal of freedom in the way that they answer the ones they have chosen. Such a procedure is likely to have a depressing effect on the reliability of the test. The more freedom that is given, the greater the likely difference between a candidate’s performances on different occasions.
Write unambiguous items.   It is essential that candidates are not presented with items whose meaning is unclear, or to which there is an acceptable answer which the test writer has not anticipated.
Provide clear and explicit instructions.   This applies both to written and to oral instructions. If it is possible for candidates to misinterpret what they are asked to do, then on some occasions some of them certainly will. Test writers should not rely on the students’ powers of telepathy to elicit the desired behavior.
Ensure that tests are well laid out and perfectly legible.   Too often, institutional tests are badly typed (or handwritten), have too much text in too small a space, and are poorly reproduced. As a result, students are faced with additional tasks which are not ones meant to measure their language ability. Their variable performance on these unwanted tasks will lower the reliability of the test.
Candidates should be familiar with format and testing techniques.   If any aspect of a test is unfamiliar to candidates, they are likely to perform less well than they would otherwise (on subsequently taking a parallel version, for example). For this reason, every effort must be made to ensure that all candidates have the opportunity to learn just what will be required of them.
Provide uniform and non-distracting conditions of administration.   The greater the differences between one administration of a test and another, the greater the differences one can expect between a candidate’s performance on the two occasions. Great care should be taken to ensure uniformity.
Use items that permit scoring which is as objective as possible.   This may appear to be a recommendation to use multiple choice items, which permit completely objective scoring. An alternative is the open-ended item which has a unique, possibly one-word, correct response which the candidates produce themselves. This too should ensure objective scoring, but in fact problems with such matters as spelling, which may make a candidate’s meaning unclear, often make demands on the scorer’s judgment. The longer the required response, the greater the difficulties of this kind.
Make comparisons between candidates as direct as possible.   This reinforces the suggestion already made that candidates should not be given a choice of items and that they should be limited in the way that they are allowed to respond. Scoring compositions all on one topic will be more reliable than if the candidates are allowed to choose from six topics, as has been the case in some well-known tests. The scoring should be all the more reliable if the compositions are guided. In this connection, do not allow candidates too much freedom.
Provide a detailed scoring key.   This should specify acceptable answers and assign points for partially correct responses. For high scorer reliability the key should be as detailed as possible in its assignment of points.
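A detailed key can be made completely explicit. The sketch below shows one way to encode full- and partial-credit answers for a single hypothetical gap-fill item (the item, the answers, and the point values are all invented for illustration):

```python
# Scoring key for one hypothetical item: acceptable answers are
# listed per point value, so scorers need not improvise.
KEY = {
    "q1": {
        2: {"went", "travelled"},  # fully correct: 2 points
        1: {"goed", "travel"},     # recognizable attempt: 1 point
    },
}

def score_item(item, response):
    """Look the response up in the key; unlisted answers score 0."""
    response = response.strip().lower()
    for points, answers in KEY[item].items():
        if response in answers:
            return points
    return 0

print(score_item("q1", "Went"))    # → 2
print(score_item("q1", "goed"))    # → 1
print(score_item("q1", "jumped"))  # → 0
```

Because every anticipated answer is written down in advance, two scorers using this key cannot disagree, which is exactly the objectivity the paragraph above asks for.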
Train scorers.   This is especially important where scoring is most subjective. The scoring of compositions, for example, should not be assigned to anyone who has not learned to score compositions from past administrations accurately. After each administration, patterns of scoring should be analyzed. Individuals whose scoring deviates markedly and inconsistently from the norm should not be used again.
Identify candidates by number, not name.   Scorers inevitably have expectations of candidates that they know. Except in purely objective testing, this will affect the way that they score. Studies have shown that even where the candidates are unknown to the scorers, the name on a script (or a photograph) will make a significant difference to the scores given. For example, a scorer may be influenced by the gender or nationality of a name into making predictions which can affect the score given. Identifying candidates only by number will reduce such effects.
Employ multiple, independent scoring.   As a general rule, and certainly where testing is subjective, all scripts should be scored by at least two independent scorers. Neither scorer should know how the other has scored a test paper. Scores should be recorded on separate score sheets and passed to a third, senior colleague, who compares the two sets of scores and investigates discrepancies.
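The senior colleague’s comparison can be sketched as a simple discrepancy check. The candidate numbers, scores, and tolerance below are all invented for illustration:

```python
# Hypothetical scores awarded independently by two scorers to the
# same scripts, identified by candidate number rather than name.
scorer_a = {101: 14, 102: 18, 103: 9, 104: 16}
scorer_b = {101: 13, 102: 18, 103: 15, 104: 17}

TOLERANCE = 2  # maximum acceptable difference before review

def discrepancies(a, b, tolerance=TOLERANCE):
    """Scripts whose two independent scores differ by more than the
    tolerance; these would go to a third, senior scorer."""
    return {cand: (a[cand], b[cand])
            for cand in a
            if abs(a[cand] - b[cand]) > tolerance}

print(discrepancies(scorer_a, scorer_b))  # → {103: (9, 15)}
```

Only candidate 103’s script is flagged for investigation; the other pairs of scores fall within the agreed tolerance.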
A test must be practicable; in other words, it must be fairly straightforward to administer. It is only too easy to become so absorbed in the actual construction of the test items that the most obvious practical considerations concerning the test are overlooked. The length of time available for the administration of the test is frequently misjudged even by experienced test writers, especially when the complete test consists of a number of sub-tests. In such cases sufficient time may not be allowed for the administration of each sub-test; a pilot administration (i.e. a try-out of the test with a small but representative group of testees) will help to reveal such problems.
Another practical consideration concerns the answer sheets and the stationery used. Many tests require the testees to enter their answers on the actual question paper (e.g. circling the letter of the correct option), thereby unfortunately reducing the speed of the scoring and preventing the question paper from being used a second time. In some tests the candidates are presented with a separate answer sheet, but too often insufficient thought has been given to possible errors arising from the (mental) transfer of answers from the question paper to the answer sheet.

A final point concerns the presentation of the test paper itself. Where possible, it should be printed or typewritten, and should appear neat, tidy and aesthetically pleasing. Nothing is more disconcerting to the testee than an untidy test paper, full of misspellings, omissions and corrections.