Structural Knowledge Discovery Used to Analyze Earthquake Activity


Jesus A. Gonzalez, Lawrence B. Holder and Diane J. Cook

Department of Computer Science and Engineering

University of Texas at Arlington
Box 19015, Arlington, TX 76019-0015
{gonzalez,holder,cook@cse.uta.edu}

Abstract

The Subdue structural discovery system is being used as the Data Mining tool to study the "Orizaba Fault" located in Mexico, as part of a research project of the geologist Dr. Burke Burkart. We analyze the information in the Earthquake Database to discover whether the earthquake activity in the area is related to the fault. We experimented with different samples of data, mainly using two heuristics to guide Subdue through the substructure discovery process. We also added some spatio-temporal information as background knowledge. The results show how Subdue can successfully be used as a Data Mining tool in real-world domains.

Introduction

The advancement of technology has allowed not only the automation of complex processes but also the accumulation of large amounts of process information in databases. But having the information is useless if we do not take advantage of it and learn from it, by extracting knowledge that helps to improve a process or to identify a possible failure manifested in the stored information. However, this is a difficult task to achieve using standard tools, due to the large amount and complexity of the data.

That is why different approaches in the field of Knowledge Discovery (Fayyad et al. 1996) have been developed to extract hidden information from those databases. In this project we use the Knowledge Discovery process (Fayyad et al. 1996) and a specific Data Mining tool applied to a real-world domain problem. The Data Mining tool is the Subdue program, and the domain is the Earthquake database, which consists of reports of earthquakes. In this domain we worked with a geology expert, Dr. Burke Burkart, who helped us to analyze the results and to guide the geology-related research.

We experimented with different samples of data, mainly using two heuristics to guide Subdue through the substructure discovery process. We also added some geographical and time knowledge to connect earthquakes that occurred close to each other in time and distance. The results show how Subdue was able to effectively find patterns with a logical interpretation, and how it can be used as a research tool in the geological domain.

Substructure Discovery Using Subdue

Subdue (Cook, Holder and Djoko 1995) is a Data Mining tool that performs clustering using an algorithm categorized as an example-based and relational learning method. This tool was first developed in 1990 and has been expanded and optimized to generate better results. It is a general tool that can be applied to any domain that can be represented as a graph. Subdue has been successfully used on several domains such as CAD circuit analysis, chemical compound analysis, and scene analysis (Cook, Holder and Djoko 1996, Cook, Holder and Djoko 1995, Cook and Holder 1994, Chittimoori, Gonzalez and Holder 1999, and Djoko, Cook and Holder 1995).

Subdue implements two model evaluation criteria to decide which patterns are chosen as important knowledge or structures. The first model evaluation method, called “Minimum Encoding”, is a technique derived from the minimum description length (MDL) principle (Cook and Holder 1994). It chooses as best substructures those that minimize the description length metric, that is, the length in number of bits of the graph representation. The number of bits is calculated based on the size of the adjacency matrix representation of the graph. Accordingly, the best substructure is the one that minimizes I(S) + I(G|S), where I(S) is the number of bits required to describe substructure S, and I(G|S) is the number of bits required to describe graph G after being compressed by substructure S. The second method chooses the substructures according to how well they compress the graph in terms of its size in number of vertices and edges. Another method used consists of finding large substructures in spite of their low number of instances.

The main discovery algorithm is a computationally constrained beam search. The algorithm begins with the substructure matching a single vertex in the graph. In each iteration the algorithm selects the best substructure and incrementally expands the instances of the substructure. The algorithm searches for the best substructure until all possible substructures have been considered or the total amount of computation exceeds a given limit. Each substructure is evaluated by how well it compresses the input graph according to the heuristic being used (MDL or graph size). The best substructure found by Subdue can be used to compress the input graph, which can then be input to another iteration of Subdue. After several iterations, Subdue builds a hierarchical description of the input data where later substructures are defined in terms of substructures discovered on previous iterations.
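To make the graph size heuristic concrete, the following Python sketch scores a candidate substructure by how much it shrinks the graph, counting vertices plus edges. It is only an illustration under stated assumptions, not Subdue's implementation: the ratio size(G) / (size(S) + size(G|S)) is one plausible way to express the comparison, instances are assumed not to overlap, and the names `GraphSize`, `compressed_size` and `graph_size_value` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphSize:
    """Vertex and edge counts of a graph or of a substructure."""
    vertices: int
    edges: int

    @property
    def total(self) -> int:
        # Size is measured as number of vertices plus number of edges.
        return self.vertices + self.edges


def compressed_size(graph: GraphSize, sub: GraphSize, instances: int) -> GraphSize:
    """Size of G|S when every instance of S collapses into one new vertex.

    Assumption: instances do not overlap. Edges between an instance and the
    rest of the graph are kept and reattached to the new vertex.
    """
    vertices = graph.vertices - instances * sub.vertices + instances
    edges = graph.edges - instances * sub.edges
    if vertices < 0 or edges < 0:
        raise ValueError("instances do not fit inside the graph")
    return GraphSize(vertices, edges)


def graph_size_value(graph: GraphSize, sub: GraphSize, instances: int) -> float:
    """Assumed score: size(G) / (size(S) + size(G|S)); larger is better."""
    after = compressed_size(graph, sub, instances)
    return graph.total / (sub.total + after.total)


# Toy numbers, invented for illustration only.
graph = GraphSize(vertices=20, edges=24)
square = GraphSize(vertices=4, edges=4)   # found 4 times
pair = GraphSize(vertices=2, edges=1)     # found 5 times

print(round(graph_size_value(graph, square, 4), 3))  # 1.833
print(round(graph_size_value(graph, pair, 5), 3))    # 1.189
```

With these toy numbers the four-vertex substructure scores higher than the two-vertex one, so a beam search guided by this value would keep it as the better candidate.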

There are other components that make Subdue more powerful. We can specify predefined substructures that Subdue looks for in the data. This allows Subdue to use previous knowledge as a starting point and guide the discovery process. Subdue uses an inexact graph match technique so that instances of substructures that are slightly different can be matched. We can also iterate Subdue’s discovery process in order to find more substructures in new iterations that might contain substructures found in previous iterations. Figure 1 shows a simple example of Subdue’s operation. Subdue finds four instances of the triangle-on-square substructure in the geometric figure. The graph representation used to describe the substructure, as well as the input graph, is shown in the middle.

Figure 1: Subdue’s Example (vertices: objects or attributes; edges: relationships; 4 instances of the substructure)

The Earthquake Database

The earthquake database contains information collected from several catalogs (gs.gov). These catalogs were provided by sources like the National Geophysical Data Center of the National Oceanic and Atmospheric Administration (NOAA). The database has records of earthquakes from 2000 B.C. through the current week. An earthquake record consists of 35 fields: source catalog, date, time, latitude, longitude, magnitude, intensity and seismic related information such as cultural effects, isoseismal map, geographic region and stations used for the computations. Earthquakes of magnitude below 1.0 are not stored in the database; most of the magnitudes of earthquakes range from 2.5 to 9.5.

There are some differences between catalogs, e.g. it is possible to find the same earthquake with a slightly different epicenter or magnitude in two catalogs. This is due to the methods and instruments used to compute the data. For example, epicenters and magnitudes are currently calculated with computer programs using seismographic data. The problem is that the computer programs contain assumptions about the earth in the formulae they use. If those assumptions are violated, the results can differ.

The size of the Earthquake database is extremely large (e.g. 2.2 MB only for 1995 data), so we could not use all the information in our experiments; we just used subsets of information corresponding to periods of time between 6 months and 1 year. We created a relational database containing the earthquake information (the 35 fields). This eased the extraction of information for the experiments, because we can use SQL queries to extract the desired subset of the database. We use the Data Mining approach instead of queries because we do not pre-set the information to be included in the result. A query, by contrast, returns only what it was written to return, so we cannot prepare a query that uncovers novel structural patterns in the same way as the Subdue system.
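The extraction step can be sketched with Python's built-in `sqlite3` module. The table name `earthquakes`, its columns, and the four rows below are hypothetical stand-ins chosen for illustration; the actual schema has 35 fields and is not reproduced here.

```python
import sqlite3

# In-memory database with a hypothetical, heavily reduced schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE earthquakes ("
    " id INTEGER PRIMARY KEY, year INTEGER, month INTEGER,"
    " latitude REAL, longitude REAL, magnitude REAL)"
)
# Toy rows, invented for illustration only.
conn.executemany(
    "INSERT INTO earthquakes (year, month, latitude, longitude, magnitude)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        (1995, 2, 17.2, -99.8, 4.1),
        (1995, 7, 18.4, -96.9, 3.6),
        (1995, 11, 17.9, -95.2, 5.0),
        (1996, 3, 17.5, -100.1, 4.4),
    ],
)


def extract_subset(conn, year, first_month, last_month):
    """Return the events of one year whose month lies in the given range."""
    cursor = conn.execute(
        "SELECT id, year, month, latitude, longitude, magnitude"
        " FROM earthquakes"
        " WHERE year = ? AND month BETWEEN ? AND ?"
        " ORDER BY month, id",
        (year, first_month, last_month),
    )
    return cursor.fetchall()


first_half_1995 = extract_subset(conn, 1995, 1, 6)
all_of_1995 = extract_subset(conn, 1995, 1, 12)
print(len(first_half_1995), len(all_of_1995))  # 1 3
```

The query only selects which records are handed to the graph construction step; the pattern discovery itself is left to Subdue.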

Earthquake Database Knowledge Representation

Every record in the database represents an earthquake event. In this domain we used two kinds of edges to connect the events (earthquakes). The first type of edge is the “near_in_distance” edge, which is set between two events if the distance between them is less than or equal to 75 kilometers. The second type of edge is the “near_in_time” edge, which is set between two events if they happened within 36 hours of each other. We chose those parameters for two reasons. First, they were a good combination that generates enough edges for the system to find them, but not so many that they overload the graph and become the only substructures found. Second, our geology specialist told us that 75 kilometers was reasonable for the size of the area of study and that the effects of one earthquake on another are usually shown within 36 hours. An earthquake event in graph form is shown in figure 2. All the fields of the Earthquake database are included except for the empty fields, which would bias the system because of their large number.
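The two edge rules can be sketched in Python as follows. The thresholds (75 km and 36 hours) come from the text; everything else is an assumption made for illustration: the distance is taken to be the great-circle (haversine) distance between epicenters, which the text does not specify, and the names `Event`, `haversine_km` and `build_edges` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

MAX_DISTANCE_KM = 75.0             # threshold for near_in_distance
MAX_TIME_GAP = timedelta(hours=36)  # threshold for near_in_time
EARTH_RADIUS_KM = 6371.0           # mean Earth radius


@dataclass(frozen=True)
class Event:
    event_id: int
    time: datetime
    latitude: float   # degrees, north positive
    longitude: float  # degrees, east positive


def haversine_km(a: Event, b: Event) -> float:
    """Great-circle distance between two epicenters (assumed distance measure)."""
    lat1, lon1, lat2, lon2 = map(
        radians, (a.latitude, a.longitude, b.latitude, b.longitude)
    )
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def build_edges(events):
    """Return undirected (id, id, label) edges for every qualifying pair."""
    edges = []
    for a, b in combinations(events, 2):
        if haversine_km(a, b) <= MAX_DISTANCE_KM:
            edges.append((a.event_id, b.event_id, "near_in_distance"))
        if abs(a.time - b.time) <= MAX_TIME_GAP:
            edges.append((a.event_id, b.event_id, "near_in_time"))
    return edges


# Toy events, invented for illustration only.
events = [
    Event(1, datetime(1995, 6, 1, 0, 0), 17.0, -100.0),
    Event(2, datetime(1995, 6, 2, 6, 0), 17.3, -100.0),   # 30 h later, about 33 km away
    Event(3, datetime(1995, 6, 10, 0, 0), 18.9, -94.5),   # far in both time and distance
]
edges = build_edges(events)
print(edges)
```

Comparing every pair is quadratic in the number of events, which is acceptable for a sketch; a graph with thousands of events would probably call for a spatial or temporal index.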

Figure 2: Earthquake Knowledge Representation

Earthquake Database Experimental Results

We chose only a subset of the database to run the experiments. For example, we took 6 months of information and ran Subdue on it, so the query to extract the information from the database included the year and month of the earthquakes that we wanted. We started using all the fields of the database, but the year field affected our results because the values were all the same, so we decided to exclude that field.
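The field handling described so far (skip empty fields, exclude the constant year field) might be sketched like this. The helper names `constant_fields` and `attribute_edges`, the field names, and the sample records are hypothetical; the star-shaped layout, with one edge from the event to each field value, is our reading of the event representation rather than a specification of Subdue's input format.

```python
def constant_fields(records):
    """Fields that hold one identical, non-empty value in every record."""
    if not records:
        return set()
    result = set()
    for field in records[0]:
        values = {record.get(field) for record in records}
        if len(values) == 1 and values != {None} and values != {""}:
            result.add(field)
    return result


def attribute_edges(event_id, record, excluded=frozenset()):
    """One (event, field, value) edge per field that is neither empty nor excluded."""
    return [
        (event_id, field, value)
        for field, value in record.items()
        if field not in excluded and value not in (None, "")
    ]


# Toy records, invented for illustration only.
records = [
    {"year": 1995, "month": 5, "depth": 33.0, "intensity": ""},
    {"year": 1995, "month": 6, "depth": 10.0, "intensity": None},
]

print(constant_fields(records))  # {'year'}
print(attribute_edges(1, records[0], excluded={"year"}))
# [(1, 'month', 5), (1, 'depth', 33.0)]
```

Reporting the constant fields, rather than dropping them automatically, leaves the decision to exclude a field with the analyst, as was done for the year field.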

We wanted to take a random sample from the database (from the 5 years of information and keeping the same graph size), but that would affect the “near_in_time” edges, because the sampled earthquakes would be spread over a larger range of time, causing a loss of important information (there would be fewer near_in_time related records). So we just randomly sampled from the information collected in one year, creating a graph with 10,135 events, 136,077 vertices, 125,941 attribute edges and 757,417 undirected “near_in_distance” and “near_in_time” connections, and a size of 26,963,605 bytes.

Minimum Encoding Heuristic Results

With the minimum encoding heuristic Subdue was able to find structures that linked events with the “near_in_time” and “near_in_distance” edges. The first substructure (substructure 1, not shown) linked one event to four others with “near_in_time” edges and to a fifth event with a “near_in_distance” edge. The second substructure (substructure 2, not shown) linked one event to three other events with “near_in_distance” edges and to the category field “PDE-W”, which corresponds to the source of an earthquake’s catalog entry. The third substructure (substructure 3, not shown) linked one event to another event and to one instance of substructure 2 with “near_in_distance” edges. The fourth substructure is more complex and is shown in figure 3. This substructure is interesting because several earthquakes happened in a short period of time and could be related to a fault placement.

The interesting issue here is the potential to find important relations between earthquakes that happened in a localized region within a short period of time.

Figure 3: Substructure 4, 90 Instances (vertices: Event, Sub_1; edges: Near_in_time)

Graph Size Heuristic Results

With the graph size heuristic we found more substructures in the Earthquake database. This is because it works faster, so we could go deeper in the number of iterations. Subdue found relations between events and substructures with the “near_in_time” and “near_in_distance” edges, but it also found relations that included some other fields like “Catalog”, “Month”, “Mag1 Scale”, and “Depth”. Here, it was possible to conclude that the earthquakes related by the substructure were provided by the “PDE-W” catalog, which lists events from the most recent weeks, and the “PDE-Q” catalog, which lists the most recent events that are still subject to change. It was also possible to conclude from the data that more earthquakes occurred in the months of “June” and “May” and that frequent depths for the related earthquakes were “33.0000” and “10.0000” kilometers. The fact that Subdue found the depth characteristic of “33.0000” kilometers is validated in the Earthquake database description, where it is mentioned that this is the most common depth for an earthquake. As an example, figure 4 shows how in the eighth iteration Subdue found that 140 of the instances of substructure 1 happened at a depth of 33 kilometers. Substructure 1 in figure 4 has 9465 instances and connects an earthquake event to the category value “PDE-W”. Substructure 7, with 141 instances, connects an event to substructures found in previous iterations with “near_in_distance” and “near_in_time” edges and also contains the “PDE-Q” attribute.

Figure 4: Substructure 8, 140 Instances

Determining Earthquake Activity

We have already described how we used Subdue to find patterns in the earthquake database. We now describe a project in which we used Subdue to determine the earthquake activity of a specific area of Mexico. Dr. Burke Burkart, a geologist at the University of Texas at Arlington who has studied Mexican geology and seismology for years, is interested in the study of the seismic activity caused by the Orizaba Fault (Burkart 1994, Burkart and Self 1985). This fault runs from the volcano “Pico de Orizaba”, located in the state of Veracruz, through the “Istmo de Tehuantepec” in the state of Oaxaca.

A fault is defined as a fracture in a surface along which a displacement of rocks has also happened. Faults are caused by forces acting on the rock bodies. When a rupture occurs, two walls form the fault. Faults receive different names according to the rocks’ movement (Hamblin and Christiansen 1998).

When the movement among the rocks happens in the vertical plane, the fault is called a Dip-Slip Fault, where the Hanging-wall is the one above the fault and the Foot-wall is the one below the fault. Dip-Slip Faults are classified according to the direction of the rocks’ movement. A Normal Dip-Slip Fault is created when a pulling force generates the fracture; gravity then displaces the hanging wall downwards. Reverse Dip-Slip Faults are created when a compression force forms the fracture. In this case the hanging wall moves upward due to the compression force. A Thrust Fault is a reverse fault with an inclination of less than 45°.

If the movement among the rocks happens in the horizontal plane, the fault is called a Strike-Slip Fault. This type of fault is described as a Left-Lateral Fault or a Right-Lateral Fault.

Oblique Faults are those with the characteristics of both Dip-Slip and Strike-Slip Faults, that is, the rocks move in both planes, the vertical and the horizontal. The “Orizaba Fault” is a Strike-Slip Fault. We want to know the location of the active zone of earthquakes, which will be located at the weakest point of the fault. This is more complex than it appears, because the fault is not continuous. It is interrupted in some locations, changes direction and is probably connected to other faults. This means that the earthquakes might take place at a location off the fault, but still as a consequence of this fault.

Figure 5: Area of Study of the Orizaba Fault.

This study started with the identification of the area most likely to be affected by the fault. We started by selecting two rectangles. The first has coordinates 94.5W Longitude through 101.0W Longitude and 17.0N Latitude through 18.0N Latitude. The second has coordinates 94.0W Longitude through 98.0W Longitude and 18.0N Latitude through 19.0N Latitude. The area includes parts of the states of Guerrero, Oaxaca, Puebla and Veracruz. We can see this area in figure 5.
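Selecting the events that fall inside the two rectangles amounts to a simple bounding-box test, sketched below. Longitudes are written in degrees West, as in the text, so larger values lie farther west. Treating the boundaries as inclusive is an assumption, and the names `Rect`, `STUDY_AREA` and `in_study_area` are hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    west: float   # degrees West (larger value = farther west)
    east: float   # degrees West
    south: float  # degrees North
    north: float  # degrees North

    def contains(self, lat_n: float, lon_w: float) -> bool:
        # Boundaries are treated as inclusive (an assumption).
        return self.east <= lon_w <= self.west and self.south <= lat_n <= self.north


# The two rectangles given in the text.
STUDY_AREA = (
    Rect(west=101.0, east=94.5, south=17.0, north=18.0),
    Rect(west=98.0, east=94.0, south=18.0, north=19.0),
)


def in_study_area(lat_n: float, lon_w: float) -> bool:
    """True if the epicenter lies in either rectangle of the study area."""
    return any(rect.contains(lat_n, lon_w) for rect in STUDY_AREA)


print(in_study_area(17.5, 99.8))  # True  (first rectangle)
print(in_study_area(18.5, 96.0))  # True  (second rectangle)
print(in_study_area(18.5, 99.8))  # False (north of the first, west of the second)
```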

We ran Subdue over the graph representation of the earthquakes in these two rectangles. Subdue helped us to find not only a subarea with a high concentration of earthquakes, but also some of the area’s characteristics. The most representative substructures found with Subdue are shown in figure 6. In Substructure 1 we can see that an earthquake Event is related to another earthquake Event with a Near_in_distance edge. We also see that one of the earthquake Events is linked to a node representing the region number “59” and to another node representing the Catalog “PDE.” This substructure tells us that region number 59, which is located in the state of Guerrero, is the one with the most earthquake activity (it has more occurrences than other regions), in this case with 556 earthquakes. Finally, the substructure tells us that there is a distance relation between some of the earthquakes, identified by the Near_in_distance edges, meaning that the events are less than 75 km apart. Dr. Burkart identified this area as very active. However, the cause of these earthquakes is not related to the fault under study, or at least the relation is not yet clear. Substructure 2 links two instances of Substructure 1 with a “Near_in_distance” edge. It also links one of the Substructure 1 instances to a vertex describing the depth of one of its events (Substructure 1 contains two events, as can be seen in figure 6) at 33 km. This substructure tells us about a common depth among some of the earthquakes in the area of study.

Substructure 1, 278 instances.

Substructure 2, 138 instances.

Figure 6: Substructures Found in the Whole Area of Study.

Next, we decided to divide the area and study the sub-areas. We divided the rectangles into small pieces of one half of a degree in both longitude and latitude. For example, one of those rectangles has coordinates 101.0W to 100.5W of Longitude and 17.0N to 17.5N of Latitude. We divided the total area into 44 sub-areas. After we divided the area of study, we got all the available information about earthquakes in each sub-area from the Earthquake Database described before (the database contains earthquake information from 1973 up to the present date). Table 1 shows how many earthquakes we found in each of these sub-areas.
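The half-degree subdivision can be illustrated with a small grid generator. The function name `half_degree_cells` and the ordering of the cells (west to east, then south to north) are assumptions; the numbering of the sub-areas in Table 1 may follow a different order. The demonstration uses a one-degree square containing the example cell from the text.

```python
def half_degree_cells(west, east, south, north, step=0.5):
    """Split a rectangle into step-by-step degree cells.

    Longitudes are in degrees West (west > east), latitudes in degrees North.
    Each cell is returned as (west, east, south, north).
    """
    n_lon = round((west - east) / step)
    n_lat = round((north - south) / step)
    cells = []
    for j in range(n_lat):          # south to north
        for i in range(n_lon):      # west to east
            cells.append(
                (
                    west - i * step,
                    west - (i + 1) * step,
                    south + j * step,
                    south + (j + 1) * step,
                )
            )
    return cells


# A one-degree square containing the example cell from the text.
cells = half_degree_cells(west=101.0, east=100.0, south=17.0, north=18.0)
print(len(cells))  # 4
print(cells[0])    # (101.0, 100.5, 17.0, 17.5)
```

Each cell can then be paired with a bounding-box test to count the earthquakes that fall inside it.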

Table 1: Sub-Areas of Study for the Orizaba Fault.

Once we collected the information about earthquakes registered in each sub-area, we were ready to study its characteristics (e.g. common depth and intensity per region). Here Subdue takes part in this research again. We took the information of each sub-area where more than ten earthquakes were registered and converted it into the graph representation used by Subdue. Then we ran Subdue to find out the characteristics of the earthquakes in that sub-area. As an example, let us take sub-area 26 of table 1, labeled with the name “Ver1.” Figure 7 shows the first two substructures found by Subdue in that sub-area. Substructure 1 in the figure shows that the events happened in the region number “61,” which corresponds to the selected sub-area. We can also see that the events’ information was taken from the “PDE” catalog. In substructure 2 we find a pattern of some of the events at a depth of “33 Km.” This is a very interesting pattern, because it might give us information about the cause of those earthquakes. If the earthquake is not caused by subduction (a force caused by the Pacific plate, which affects depth based on the closeness to the Pacific Ocean), then it is more likely to be related to the fault. However, we first have to evaluate and determine the depth of earthquakes caused by subduction in that zone.

Substructure 1, 19 instances.

Substructure 2, 8 instances.

Figure 7: Substructures Found in Sub-Area 26 from Table 1.

As we see in this study, Subdue is capable of finding not only the shared characteristics of the events, but also spatial relations between them. In the case of the identification of shared characteristics, we used the pattern containing the region number specification to recognize the area being studied. The pattern containing the depth node at 33 km gave us information that the geology specialist Dr. Burkart is studying so that he can use it to give direction to this research. In the case of the spatial relations, we expect to find patterns that represent parts of the paths of the involved fault. The time relations (“near_in_time” edges) were not considered by Subdue, because the earthquakes in the area are not close in time. However, there are other areas with different characteristics where “near_in_time” connections provide important information, and we hope to uncover these relations in future studies.

Conclusions

In this research, we showed that Subdue was able to successfully analyze the real-world earthquake database when applied as the Data Mining tool of the Knowledge Discovery process. It was found that Subdue can be used to find interesting patterns that might represent new knowledge or that might lead to new knowledge.

It was also shown how Subdue used prior knowledge to guide the search with temporal and spatial relations provided by the “near_in_time” and “near_in_distance” edges. Subdue was able to find substructures that included those edges. Using this knowledge representation, the system not only found repetitive patterns in the data, but also provided temporal and distance relations that made possible the discovery of more interesting patterns. As an example in the Earthquake database, spatial relations were incorporated through the “near_in_distance” edges. Subdue was able to find substructures containing these edges, and these substructures are being used to help study the “Orizaba Fault” in Mexico.

A very important aspect of the temporal and spatial relations is the definition of the “near_in_time” and “near_in_distance” edges. We need to establish the meaning of “near” in both cases. This is not a simple task, because it depends directly on the domain and the semantics of the relation to be represented.

In our future work we will develop a concept learning approach that will learn substructures distinguishing two sets of sub-areas so that we can study their geological behavior based on earthquake activity. We will continue the analysis of earthquake activity in collaboration with Dr. Burkart. We have also used the spatio-temporal relation annotations to study the Aviation Safety Reporting System Database (Chittimoori, Gonzalez and Holder 1999), and we plan to work with other domains including a graph representation of program source code. We are also working on a theoretical analysis of Subdue based on the PAC learning theory (Jappy and Nock 1998) and conceptual graphs (Sowa 1984).

References

Burkart, Burke 1994. Geology of northern Central America, Book chapter for Geology of the Caribbean, Jamaican Geological Society, Kingston, S. Donovan Ed., p. 265-284.

Burkart, Burke and Self, S. 1985. Extension and rotation of crustal blocks in northern Central America and its effect upon the volcanic arc, Geology, v. 13, p. 22-26.

Cook, Diane J. and Holder, Lawrence B. 1994. Substructure Discovery Using Minimum Description Length and Background Knowledge, Journal of Artificial Intelligence Research, Vol. 1, pp. 231-255.

Cook, Diane J.; Holder, Lawrence B.; and Djoko, Surnjani 1994. Knowledge Discovery from Structural Data, Journal of Intelligence and Information Sciences, Vol. 5, Number 3, pp. 229-245.

Cook, Diane J.; Holder, Lawrence B.; and Djoko, Surnjani 1996. Scalable Discovery of Informative Structural Concepts Using Domain Knowledge, IEEE Expert, Vol. 11, Number 5, pp. 59-68, October.

Sowa, J. F. 1984. Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley.

Jappy, Pascal and Nock, Richard 1998. PAC Learning Conceptual Graphs, Proceedings of the 6th International Conference on Conceptual Structures, pp. 303-315.

Chittimoori, Ravindra N.; Gonzalez, Jesus A.; and Holder, Lawrence B. 1999. Structural Knowledge Discovery in Chemical and Spatio-Temporal Databases, Proceedings of the Sixteenth National Conference on Artificial Intelligence, pp. 959.

Djoko, Surnjani; Cook, Diane J.; and Holder, Lawrence B. 1995. Analyzing the Benefits of Domain Knowledge in Substructure Discovery, Proceedings of the First Int. Conf. on Knowledge Discovery and Data Mining, pp. 75-80.

Fayyad, Usama M.; Piatetsky-Shapiro, Gregory; Smyth, Padhraic; and Uthurusamy, Ramasamy 1996. Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, Menlo Park, California.

Hamblin, W. Kenneth and Christiansen, Eric H. 1998. Earth’s Dynamic Systems, Prentice Hall.
