An information-fusion method to identify pattern of spatial heterogeneity for improving the accuracy of estimation

  • Original Paper
  • Published in: Stochastic Environmental Research and Risk Assessment

Abstract

While spatial autocorrelation is used in spatial sampling surveys to improve the precision of estimates of a population feature at areal units, spatial heterogeneity, used as the stratification frame of a survey, often also has a considerable effect on precision. In the context of increasingly rich spatiotemporal data, this paper proposes an information-fusion method to identify the pattern of spatial heterogeneity, which can then serve as an informative stratification for improving estimation accuracy. Data mining methods are the major analysis components of our method: multivariate statistics, association analysis, decision trees and rough sets are used for data filtering, identification of contributing factors, and examination of relationships; classification and clustering are used to identify the pattern of spatial heterogeneity from the auxiliary variables relevant to the goal variable, and thus to stratify the samples. These methods are illustrated and examined in a case study of the cultivable land survey in Shandong Province, China. Unlike many stratification schemes, which use only the goal variable to stratify and are therefore oversimplified, our method fuses information from multiple sources to identify the pattern of spatial heterogeneity, stratifies the samples at geographical units as an informative polygon map, and thereby increases the precision of estimates in the sampling survey, as demonstrated in our case study.

References

  • Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD conference on management of data, WA, USA

  • Alexander H, Daniel AK (1998) An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of the 4th international conference on knowledge discovery and data mining, AAAI Press

  • Aleksander Ø (1999) Discernibility and rough sets in medicine: tools and applications (Dissertation). Norwegian University of Science and Technology, Norway

  • Anselin L (1992) Spacestat: a program for statistical analysis of spatial data, NCGIA, Santa Barbara

  • Bonham-Carter FG (1994) Geographic information systems for geoscientists: modelling with GIS. Pergamon, Ottawa

  • Bergen KM, Brown DG, Rutherford JF, Gustafson EJ (2005) Change detection with heterogeneous data using ecoregional stratification, statistical summaries and a land allocation algorithm. Remote Sens Environ 97:434–446

  • Cochran WG (1977) Sampling techniques. Wiley, New York

  • Cressie N (1991) Statistics for spatial data. Wiley, New York

  • Ertoz L, Steinbach M, Kumar V (2003) Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings of third SIAM international conference on data mining, San Francisco, CA, USA

  • ESRI (2004) ArcGIS desktop help. ESRI, Redlands

  • Gallego FJ (2005) Stratified sampling of satellite images with a systematic grid of points. ISPRS J Photogramm Remote Sens 59:369–376

  • Gediga G, Duntsch I (2000) Statistical techniques for rough set data analysis. In: Polkowski L, Tsumoto S, Lin T (eds) Rough set methods and applications. Physica Verlag, Heidelberg, pp 545–565

  • Giarratano J, Riley G (1998) Expert systems: principles and programming. PWS Publishing Company, Boston

  • Goovaerts P, Jacquez GM, Marcus WA (2005) Geostatistical and local cluster analysis of high resolution hyperspectral imagery for detection of anomalies. Remote Sens Environ 95:351–367

  • Haining R (2003) Spatial data analysis: theory and practice. Cambridge University Press, Cambridge

  • Komorowski J, Pawlak Z, Polkowski L, Skowron A (1999) Rough sets: a tutorial. In: Rough fuzzy hybridization. Springer, Heidelberg

  • Kong W, Ou M (2006) Research on the change of the cultivated land’s areas and its driving factors of Shandong Province, (in Chinese). Agric Econ 28:74–76

  • Lawrence R, Wright A (2001) Rule-based classification systems using classification and regression tree (CART) analysis. Photogramm Eng Remote Sens 67:1137–1142

  • Lawrence R, Bunn A, Powell S, Zambon M (2004) Classification of remotely sensed imagery using stochastic gradient boosting as a refinement of classification tree analysis. Remote Sens Environ 90:331–336

  • Li D, Wang S, Li D, Wang X (2002) Theories and technologies of spatial data mining and knowledge discovery (in Chinese). Geomat Inf Sci Wuhan Univ 27(3):221–233

  • Li L, Wang J, Liu J (2005) Optimal decision-making model of spatial sampling for survey of China’s land with remotely sensed data. Sci China Ser D 48(6):752–764

  • Liu J, Zheng X (2004) Correlate analysis between the dynamic changes of cultivated lands and grain total yield (in Chinese). Areal Res Dev 12(6):102–105

  • Liu M, Zhuang D, Hu W (2001) On current cultivated land change based on geomorphology and spatial differentiation characteristics (in Chinese). Resour Sci 23(5):11–16

  • McRoberts RE, Holden GR, Nelson MD, Liknes GC, Gormanson DD (2006) Using satellite imagery as ancillary data for increasing the precision of estimates for the Forest Inventory and Analysis program of USDA Forest Service. Can J For Res 36:2968–2980

  • Michalski R, Bratko I, Kubat M (1998) Machine learning and data mining: methods and applications. Wiley, London

  • Miller HJ, Han J (2001) Geographic data mining and knowledge discovery. Taylor & Francis, New York

  • Mitchell TM (1997) Machine learning. McGraw-Hill, New York

  • Mjolsness E, DeCoste D (2001) Machine learning for science: state of the art and future prospects. Science 293(5537):2051–2055

  • Pal SK, Ghosh A, Shankar BU (2000) Segmentation of remotely sensed images with fuzzy thresholding and quantitative evaluation. Int J Remote Sens 21(11):2269–2300

  • Rijsbergen CV (1979) Information retrieval, 2nd edn. Butterworths, London

  • Ripley BD (1981) Spatial Statistics. John Wiley & Sons, New York

  • Rodriguez-Iturbe I, Mejia JM (1974) The design of rainfall networks in time and space. Water Resour Res 10:713–728

  • Steinbach M, Klooster S, Potter C (2003) Discovery of climate indices using clustering, KDD 2003 Washington, DC, http://www-users.cs.umn.edu/∼kumar/papers/kdd03_nasa.pdf

  • Tan PN, Steinbach M, Vipin K (2006) Introduction to data mining. Pearson Education, Inc., New York

  • Tobler W (1979) In: Gale, Olsson (eds) Cellular geography, philosophy in geography. Reidel, Dordrecht

  • Wang G (2001) Theory and knowledge acquirement of rough set (in Chinese). Press of Xian Transportation University, Xian

  • Wang J, Liu J, Zhuang D, Li L, Ge Y (2002) Spatial sampling design for monitoring the area of cultivated land. Int J Remote Sens 23(2):263–284

  • Witten IH, Frank E (2000) Data mining, practical machine learning tools and techniques with JAVA implementations, Elsevier, Singapore

  • Zhang Y, Hang Y, Chen H, Xue F, Wang J, Sun G (2001) The impact of dimensions of sampling geographic cells of statistical indicators on the distribution of disease. Literature and Information of Preventative Medicine (in Chinese), 7(6):613–615

  • Zhang P, Steinbach M, Kumar V, Shekhar S, Tan PN, Klooster S, Pot C (2005) Discovery of patterns in earth science data using data mining. In: New generation of data mining applications, vol 4. Wiley, New York

  • Zeng Z (2004) Research on computer classification of satellite images and application in geoscience (in Chinese). Science Press, Beijing

Acknowledgments

This research was supported by grants 40601077/D0120 and 40471111/D0120 from the Natural Science Foundation of China, and by grant 2007AA12Z233 from the Hi-tech Research and Development Program of China (863).

Author information

Corresponding author

Correspondence to Lianfa Li.

Appendices

Appendix 1: The equation of k-nearest neighbor for a nominal variable

$$ \hat{f}(x_i) \leftarrow \mathop{\arg\max}\limits_{v \in V} \sum\limits_{i=1}^{k} \delta\bigl(v, f(s_i \mid s_i \in N(x_i))\bigr) $$
(5)

where \( \hat{f}(x_i) \) is the estimate for the grid unit \( x_i \), \( V \) is the finite set \( \{v_1, \ldots, v_s\} \) of values of the band variable, \( k \) is the number of nearest neighbors, \( f(s_i \mid s_i \in N(x_i)) \) is the categorical or discrete value of the ith nearest neighbor unit \( s_i \), which belongs to the neighborhood \( N(x_i) \) of \( x_i \), and \( \delta(a, b) = 1 \) if \( a = b \) and \( \delta(a, b) = 0 \) otherwise.
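As an illustration of Eq. (5), the majority vote over the k nearest units can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `knn_nominal`, the use of Euclidean distance on (x, y) coordinates, and the toy land-cover labels are all assumptions made for the example.

```python
from collections import Counter
from math import hypot


def knn_nominal(target, known_units, k):
    """Estimate the nominal value at `target` by majority vote (Eq. 5).

    target      -- (x, y) coordinates of the grid unit to estimate
    known_units -- list of ((x, y), label) pairs with known nominal values
    k           -- number of nearest neighbors to use
    """
    # Rank known units by Euclidean distance to the target (assumed metric).
    ranked = sorted(
        known_units,
        key=lambda u: hypot(u[0][0] - target[0], u[0][1] - target[1]),
    )
    # Summing delta(v, f(s_i)) over the k nearest units is a count per label v.
    votes = Counter(label for _, label in ranked[:k])
    # arg max over v; ties go to the label seen first among the neighbors.
    return votes.most_common(1)[0][0]


# Hypothetical toy data: four labelled grid units.
units = [((0, 0), "forest"), ((1, 0), "crop"), ((0, 1), "crop"), ((5, 5), "urban")]
estimate = knn_nominal((0.4, 0.4), units, k=3)
```

With k = 3 the three closest units carry the labels forest, crop and crop, so the vote returns "crop".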

Appendix 2: A brief introduction to rough set

Rough set theory assumes that an information system is a 4-tuple \( S = \langle U, B, V, f \rangle \), where \( U \) is a finite set of objects (here, each grid unit \( x_i \) in our dataset), \( B \) is a finite set of attributes (here, each band in the dataset), \( V = \bigcup_{b \in B} V_b \), where \( V_b \) is the value domain of the band attribute \( b \), and \( f: U \times B \to V \) is a total function, called the information function, such that \( f(x_i, b) \in V_b \) for every \( b \in B \) and \( x_i \in U \). Any pair \( (b, v) \) with \( b \in B \) and \( v \in V_b \) is called a descriptor in \( S \). In rough set theory, the attributes used in classification are called conditional variables (here, the auxiliary variables \( X_k \), where \( k \) is the index of the band variable), and the classification variable is called the decision variable, \( Y \). Conditional variables are used to classify the objects \( x_i \) in the system. If two objects \( x_i \) and \( x_j \) in \( U \) satisfy \( f(x_i, b) = f(x_j, b) \) for every \( b \in B \), we call \( x_i \) and \( x_j \) indiscernible, and all such mutually indiscernible objects compose a class set of the goal variable. Each conditional (auxiliary) variable contributes to a different degree to classifying the objects; this degree is called the significance of attribute (SA) with respect to the decision variable. For more details, see Aleksander (1999), Komorowski et al. (1999) and Wang (2001).
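The indiscernibility relation described above can be made concrete with a small Python sketch. The appendix does not give a formula for SA, so the sketch assumes one common rough-set definition: the drop in the dependency degree (the share of objects whose indiscernibility class has a single decision value) when an attribute is removed. The attribute names `slope` and `soil`, the decision column `Y`, and the four toy objects are hypothetical.

```python
from collections import defaultdict


def indiscernibility_classes(table, attrs):
    """Group object ids whose values agree on every attribute in `attrs`."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[b] for b in attrs)].add(obj)
    return list(classes.values())


def dependency(table, attrs, decision):
    """Share of objects whose indiscernibility class has one decision value."""
    consistent = 0
    for cls in indiscernibility_classes(table, attrs):
        if len({table[obj][decision] for obj in cls}) == 1:
            consistent += len(cls)
    return consistent / len(table)


def significance(table, attrs, decision, b):
    """Assumed SA: loss of dependency when attribute `b` is dropped."""
    rest = [a for a in attrs if a != b]
    return dependency(table, attrs, decision) - dependency(table, rest, decision)


# Hypothetical information system: four grid units, two conditional attributes.
table = {
    "x1": {"slope": "low", "soil": "loam", "Y": "high"},
    "x2": {"slope": "low", "soil": "loam", "Y": "high"},
    "x3": {"slope": "low", "soil": "sand", "Y": "low"},
    "x4": {"slope": "steep", "soil": "sand", "Y": "low"},
}
classes = indiscernibility_classes(table, ["slope", "soil"])
sa_soil = significance(table, ["slope", "soil"], "Y", "soil")
sa_slope = significance(table, ["slope", "soil"], "Y", "slope")
```

Here x1 and x2 are indiscernible on both attributes. Dropping `soil` leaves x1, x2 and x3 in one class with mixed decision values, so `soil` is the more significant attribute in this toy table, while dropping `slope` changes nothing.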

Appendix 3: Modeling the stratified survey of the areal proportion of cultivable land

Given:

\( N \): the number of all aerial photos that cover the whole study region;

\( n \): the number of photo units sampled;

\( L \): the number of strata;

\( N_h \): the total number of units in the hth stratum;

\( n_h \): the number of units sampled for analysis in the hth stratum;

\( \beta_{ih} \): the areal proportion of cultivable land in the ith aerial photo of the hth stratum;

  1. If a sample unit (aerial photo) is overlapped by several polygons belonging to different strata, the sample unit is split into smaller sub-sample units, one within each stratum. The areal proportion of cultivable land in each sub-sample unit remains the same, and each sample unit's weight in its stratum is proportional to its area: \( w_{ih} = S_{ih}/S_h \), where \( S_{ih} \) is the area of the sample unit in the hth stratum and \( S_h \) is the total area of all units sampled in the hth stratum.

  2. Within each stratum, units are sampled at random following simple random sampling (SRS). The estimation equations for spatial proportion sampling, however, are derived from Ripley (1981) and Wang et al. (2002).

    The sampling proportion:

    $$ f_{h} = n_{h} /N_{h} ; $$
    (6)

    The number of units sampled:

    $$ a = f_{h} N_{h} ; $$
    (7)

    The proportion:

    $$ \hat{\beta }_{h} (a) = {\sum\limits_{i = 1}^{n_{h} } {\beta _{{ih}} w_{{ih}} } } $$
    (8)

    The variance:

    $$ \begin{aligned} \hat{\sigma}_{\hat{\beta}_h(a)}(n_h)^2 &= E_h\left[\hat{\beta}_h(a) - \beta_h(a)\right]^2 = E\left[\frac{1}{n_h}\sum\limits_{a=1}^{n_h} n_h \beta_h(a) w_h(a) - \frac{1}{N_h}\int\limits_{N_h} n_h \beta_h(a) w_h(a)\,\text{d}a\right]^2 \\ &= \frac{1}{n_h}\{1 - E_h[r(a - a')]\}\,\hat{\sigma}^2_{\hat{\beta}_h(N_h)} = F(n_h)\,\hat{\sigma}^2_{\hat{\beta}_h(N_h)} \end{aligned} $$
    (9)

    where \( \beta_h(a) = \beta_{ah} \), \( w_h(a) = w_{ah} \), and \( F(n_h) = (1/n_h)\{1 - E_h[r(a - a') \mid R]\} \); \( E_h[r(a - a') \mid R] \) is the expected value of the spatial correlation structure of the target variable in the study region \( R \) (Rodriguez-Iturbe and Mejia 1974; Ripley 1981):

    $$ \hat{\sigma }_{{\hat{\beta }_{h} (a)}} (N_{h} ) \equiv {\sum\limits_{a = 1}^{N_{h} } {{\left\{ {{\left[ {\beta _{h} (a)w_{h} (a) - {\sum\limits_{a = 1}^{N_{h} } {\beta _{h} (a)w_{h} (a)} }} \right]}^{2} w_{h} (a)} \right\}}} } $$
    (10)
  3. The population mean is estimated mainly following Cochran (1977). The estimate of \( \bar{\beta}_{\text{STR}} \) is

    $$ \hat{\bar{\beta}}_{\text{STR}} = \frac{\sum\limits_{h=1}^{L} n_h \hat{\bar{\beta}}_h}{n} = \sum\limits_{h=1}^{L} w_h \hat{\bar{\beta}}_h \quad \text{where}\; n = n_1 + n_2 + \cdots + n_L $$
    (11)

    The sampling proportion in each stratum:

    $$ f_h = \frac{n_h}{N_h} = f = \frac{\sum\nolimits_{h=1}^{L} n_h}{N} $$
    (12)

    The variance:

    $$ \hat{V}(\hat{\bar{\beta}}) = \sum\limits_{h=1}^{L} w_h^2 \hat{V}_h(\hat{\bar{\beta}}) $$
    (13)

    where \( \hat{V}_h(\hat{\bar{\beta}}) \) is the variance of the stratum h.
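To show how Eqs. (8), (11) and (13) fit together, the following Python sketch computes the area-weighted proportion per stratum and then combines the strata. It is a sketch under stated assumptions rather than the authors' code: the function names, the dictionary layout, and all numbers are hypothetical; the expected spatial correlation (`mean_corr`) and the population variance term (`sigma2`) are taken as given inputs instead of being estimated from data; and the stratum weight is taken as n_h / n, as in Eq. (11).

```python
def stratum_proportion(betas, areas):
    """Eq. (8): sum of beta_ih * w_ih, with weights w_ih = S_ih / S_h."""
    total_area = sum(areas)
    return sum(b * s / total_area for b, s in zip(betas, areas))


def variance_factor(n_h, mean_corr):
    """F(n_h) = (1 / n_h) * (1 - E[r]); mean_corr is an assumed, given E[r]."""
    return (1.0 - mean_corr) / n_h


def stratified_estimate(strata):
    """Eqs. (11) and (13), using stratum weights w_h = n_h / n."""
    n = sum(len(s["betas"]) for s in strata)
    mean, var = 0.0, 0.0
    for s in strata:
        n_h = len(s["betas"])
        w_h = n_h / n
        beta_h = stratum_proportion(s["betas"], s["areas"])
        # Stratum variance as F(n_h) times the (assumed known) variance term.
        v_h = variance_factor(n_h, s["mean_corr"]) * s["sigma2"]
        mean += w_h * beta_h
        var += w_h ** 2 * v_h
    return mean, var


# Hypothetical two-stratum sample: proportions, unit areas, E[r], variance term.
strata = [
    {"betas": [0.8, 0.6], "areas": [1.0, 1.0], "mean_corr": 0.5, "sigma2": 0.04},
    {"betas": [0.2, 0.4], "areas": [3.0, 1.0], "mean_corr": 0.0, "sigma2": 0.02},
]
mean, var = stratified_estimate(strata)
```

In this toy case the stratum proportions are 0.70 and 0.25, giving a stratified mean of 0.475. A positive expected correlation within a stratum shrinks F(n_h) and hence that stratum's contribution to the variance.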

About this article

Cite this article

Li, L., Wang, J., Cao, Z. et al. An information-fusion method to identify pattern of spatial heterogeneity for improving the accuracy of estimation. Stoch Environ Res Risk Assess 22, 689–704 (2008). https://doi.org/10.1007/s00477-007-0179-1
