Preparing Test Data
Representative sample data selection
This section describes the TED notice data samples and the methods used to generate them. Sampling was first performed on the notices from 2021, and then on a wider set of notices published between 2014 and 2022.
Sample TED notices from 2021
This data sample (test_data/sampling_2021) contains TED notices selected according to two criteria: maximise representativeness and minimise the number of selected documents. The selected notices are guaranteed to cover all XPath configurations available in the data. The sampling was performed automatically.
- The sampling exercise was performed on all records collected in 2021.
- The data space was partitioned twice, based on two criteria: the notice form number (e.g. F03, F06) and the eForms subtype (e.g. cn-standard, can-utilities).
- The minimal set of the most representative notices has been selected for each form_number and for each eforms_subtype.
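The selection procedure above can be sketched as a set-cover problem within each partition. The sketch below is illustrative only: it assumes each notice has already been reduced to the set of XPaths it contains, and it uses a greedy heuristic, which yields a small covering set but is not guaranteed to be the true minimum. The record fields (id, xpaths) and the function names are hypothetical and are not taken from the actual sampling tool.

```python
from collections import defaultdict


def select_representative(notices):
    """Greedily pick notice ids until every XPath in the partition is covered.

    `notices` maps a notice id to the set of XPaths found in that notice.
    Greedy set cover is an approximation, not an exact minimum.
    """
    uncovered = set().union(*notices.values()) if notices else set()
    selected = []
    while uncovered:
        # Pick the notice covering the most still-uncovered XPaths;
        # iterating over sorted ids makes tie-breaking deterministic.
        best = max(sorted(notices), key=lambda n: len(notices[n] & uncovered))
        selected.append(best)
        uncovered -= notices[best]
    return selected


def sample_by_partition(records, key):
    """Partition records by `key` (e.g. form_number) and sample each partition."""
    partitions = defaultdict(dict)
    for record in records:
        partitions[record[key]][record["id"]] = set(record["xpaths"])
    return {value: select_representative(group) for value, group in partitions.items()}


# Hypothetical toy data: three F03 notices and one F06 notice.
records = [
    {"id": "n1", "form_number": "F03", "xpaths": {"a", "b"}},
    {"id": "n2", "form_number": "F03", "xpaths": {"b"}},
    {"id": "n3", "form_number": "F03", "xpaths": {"c"}},
    {"id": "n4", "form_number": "F06", "xpaths": {"a"}},
]
print(sample_by_partition(records, "form_number"))
# {'F03': ['n1', 'n3'], 'F06': ['n4']}
```

Running the same function with the eForms subtype as the key would give the second partitioning; notice n2 is dropped here because its only XPath is already covered by n1.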
Sampling result statistics
Below are two tables with statistics. For each value of the selection criterion they give the total number of notices, the percentage relative to the total number of considered notices (i.e. the total in 2021), and the number of most representative notices selected.
Sampling based on form_number:
form_number | nr. of notices | % of total number of notices | nr. of most representative notices |
---|---|---|---|
F03 | 249618 | 37.35 | 20 |
F02 | 223430 | 33.43 | 18 |
F14 | 77564 | 11.6 | 11 |
F20 | 30244 | 4.52 | 7 |
F06 | 27746 | 4.15 | 21 |
F05 | 22452 | 3.36 | 29 |
F01 | 11483 | 1.72 | 23 |
F15 | 7788 | 1.17 | 43 |
F21 | 5844 | 0.87 | 20 |
F18 | 2035 | 0.3 | 25 |
F08 | 2034 | 0.3 | 10 |
F24 | 1764 | 0.26 | 19 |
F17 | 1680 | 0.25 | 27 |
F12 | 1546 | 0.23 | 16 |
F25 | 1115 | 0.17 | 19 |
F13 | 957 | 0.14 | 13 |
F04 | 932 | 0.14 | 20 |
F22 | 93 | 0.01 | 13 |
F16 | 56 | 0.01 | 11 |
F23 | 0 | 0 | 0 |
F07 | 0 | 0 | 0 |
F19 | 0 | 0 | 0 |
Total | 668381 | 100 | 365 |
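As a quick check of how the percentage column relates to the notice counts, the following minimal sketch recomputes it for the three largest form numbers. The counts and the total are copied from the table above; the variable names are illustrative, and rounding to two decimals is an assumption about how the published figures were produced.

```python
# Notice counts per form_number, copied from the table above.
notice_counts = {"F03": 249618, "F02": 223430, "F14": 77564}
total_notices = 668381  # total number of notices considered in 2021

# Share of each form number, as a percentage rounded to two decimals.
shares = {
    form_number: round(100 * count / total_notices, 2)
    for form_number, count in notice_counts.items()
}
print(shares)
# {'F03': 37.35, 'F02': 33.43, 'F14': 11.6}
```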
Sampling based on eforms_subtype:
eforms_subtype | nr. of notices | % of total number of notices | nr. of most representative notices |
---|---|---|---|
29 | 249618 | 37.35 | 19 |
16 | 223430 | 33.43 | 19 |
41 | 77564 | 11.6 | 11 |
30 | 27746 | 4.15 | 21 |
17 | 22452 | 3.36 | 27 |
38 | 19151 | 2.87 | 8 |
39 | 10797 | 1.62 | 7 |
4 | 7177 | 1.07 | 10 |
25 | 7006 | 1.05 | 17 |
20 | 5771 | 0.86 | 18 |
7 | 3872 | 0.58 | 21 |
31 | 2035 | 0.3 | 24 |
1 | 1839 | 0.28 | 8 |
19 | 1764 | 0.26 | 19 |
18 | 1680 | 0.25 | 26 |
23 | 1502 | 0.22 | 15 |
32 | 1115 | 0.17 | 19 |
36 | 930 | 0.14 | 13 |
26 | 499 | 0.07 | 15 |
5 | 460 | 0.07 | 11 |
10 | 434 | 0.06 | 17 |
8 | 411 | 0.06 | 14 |
40 | 296 | 0.04 | 6 |
27 | 235 | 0.04 | 17 |
2 | 157 | 0.02 | 4 |
21 | 93 | 0.01 | 13 |
12 | 73 | 0.01 | 5 |
11 | 61 | 0.01 | 12 |
6 | 56 | 0.01 | 11 |
28 | 48 | 0.01 | 8 |
24 | 44 | 0.01 | 10 |
3 | 38 | 0.01 | 5 |
37 | 27 | 0 | 4 |
13 | 0 | 0 | 0 |
15 | 0 | 0 | 0 |
22 | 0 | 0 | 0 |
33 | 0 | 0 | 0 |
34 | 0 | 0 | 0 |
35 | 0 | 0 | 0 |
Total | 668381 | 100 | 454 |
Sampling of TED notices based on XSD versions
A separate sampling was also performed on all the TED notices published between 2014 and 2022. The samples were split according to the XSD version under which the notices were created. They can be found in the following folder structure:
- test_data
  - sampling_2014_2022
    - R2.0.9.S01.E01 (702 notices)
    - R2.0.9.S02.E01 (808 notices)
    - R2.0.9.S03.E01 (809 notices)
    - R2.0.9.S04.E01 (778 notices)
    - R2.0.9.S05.E01 (662 notices)
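To verify the per-version notice counts listed above against a local checkout, a small helper along the following lines could be used. It is a sketch under one assumption: each notice is stored as a single .xml file somewhere under its XSD version folder. The function name is hypothetical.

```python
from pathlib import Path


def count_notices_per_version(sample_root):
    """Return {version_folder_name: number of .xml files found beneath it}.

    Assumes one .xml file per notice; files directly under `sample_root`
    are ignored, only sub-folders are counted.
    """
    counts = {}
    for version_dir in sorted(Path(sample_root).iterdir()):
        if version_dir.is_dir():
            # rglob searches the folder recursively for notice files.
            counts[version_dir.name] = sum(1 for _ in version_dir.rglob("*.xml"))
    return counts
```

If the assumption holds, calling count_notices_per_version("test_data/sampling_2014_2022") should report 702 notices for R2.0.9.S01.E01, 808 for R2.0.9.S02.E01, and so on.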