xml -> php -> mysql tutorial

chris (2004-05-02 23:50:16)
0 replies
I have spent the last few weeks exploring all (well, most) of the XML tools within PHP and have gotten quite a good feel for them.

Now that I have worked it out, I am able to parse an XML document and plop it into a MySQL database. The hard bit now is working out a way of explaining it to you that makes sense.

Firstly, I did experiment with the DOMXML extension, but found it to be poorly supported and experimental to say the least. Many of the functions listed in the API docs are deprecated and there isn't a whole lot of info out there. However, it'll be worth keeping an eye on what goes on there, especially with the advent of PHP5 (which also brings libxml2 into the equation). Watch this space.

Anyway, the most common way of doing your parse, and also by far the most efficient, is to use PHP's SAX-style XML parser, which is based on expat. The parse is event-based and relies on a set of 'handlers' or 'callback functions' which are called on the occurrence of each event.

Okay, so what are these 'events'? Well, as the parser churns through the XML, it will come across different bits and pieces: usually a <start> tag, an </end> tag, or a bit of CDATA (character data), which is the stuff between the tags. So we will need 3 different functions to tell the parser what to do in each of the 3 circumstances.
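To see those events for yourself, here is a minimal sketch that just records each one as the parser reports it. The tiny document and the $events array are made up for illustration, and I've used closures (PHP 5.3 or later) so the names don't clash with the handlers defined further down.

PHP Code:
```php
<?php
// Record every event the parser reports for a tiny, made-up document.
$events = array();

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$events) {
        $events[] = "start: $name";
    },
    function ($parser, $name) use (&$events) {
        $events[] = "end: $name";
    }
);
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$events) {
        $events[] = "cdata: $data";
    }
);

xml_parse($parser, '<article><source>BBC</source></article>', true);

// Start and end events arrive in document order, with the text in between.
print_r($events);
```

The list should open with the start of article and close with its end, which is exactly the nesting the three handlers below have to keep track of.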

That's why you usually see 3 functions defined in these parsing examples, usually named as:
function start_element($parser, $element_name, $element_attrs){
    // $element_name is the name of the start element which is being handled here
    // $element_attrs contains the attributes pulled out of this start tag
}

function end_element($parser, $element_name){
    // $element_name contains the name of the end element we are dealing with here
}

function character_data($parser, $data){
    // $data contains the data which we have stumbled across.
}

Now, all you have to do is think very clearly about what your data looks like, and define some rules to put in these functions. You will definitely need some global arrays (in your example you will need an $article array), and you will need a global variable to store the name of the element currently being parsed. That way your start_element() function always updates $currentelement with the $element_name, so that your character_data() function knows which key of the $article hash is being filled in.

The end_element() handler is handed the name of the closing tag in $element_name, so when it is processing the end tag of an article it can tell, and it can then plop the now-completed $article array into a bigger container (also a global) called $collection (or something equally meaningful), which is where all your articles are stored.

One other thing you will need to do is create a parser and tell it which functions you are defining as your 'start', 'end' and 'cdata' event handlers. So your next lines would look something like:

# an array to store the current article
$article = array();
# another to store all the articles
$collection = array();
# Initialize Parser for getting out the articles' data
$parser = xml_parser_create();
# specify some little handlers for this parse
xml_set_element_handler($parser, "start_element", "end_element");
# and a cdata handler for this parse
xml_set_character_data_handler($parser, "character_data");
# Disable case folding - so that your text case isn't messed up.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

okay, so now we can run all this by calling the parse as follows:
PHP Code:
if (!xml_parse($parser, $xmlsource, true)) {
    die(sprintf("XML Error : %s in parser at line %d",
        xml_error_string(xml_get_error_code($parser)), xml_get_current_line_number($parser)));
}
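If you want to see what that error message looks like without killing the script, something along these lines might help. It is only a sketch: parse_or_error() is a name I made up, and the two input strings are invented test documents.

PHP Code:
```php
<?php
// Parse a string and return null on success, or a readable error message.
// parse_or_error() is a hypothetical helper name, not part of PHP.
function parse_or_error($xmlsource) {
    $parser = xml_parser_create();
    if (xml_parse($parser, $xmlsource, true)) {
        return null;
    }
    // Ask the parser what went wrong and where.
    $error = xml_error_string(xml_get_error_code($parser));
    $line  = xml_get_current_line_number($parser);
    return "XML Error: $error in parser at line $line";
}

// Well-formed input: nothing to report.
var_dump(parse_or_error("<list><item>ok</item></list>"));

// The closing tag below does not match <item>, so this one should fail.
echo parse_or_error("<list>\n<item>broken</list>"), "\n";
```

Returning the message instead of calling die() makes it easier to log the problem and carry on with the next feed.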

However, this isn't going to do much at the moment, because you haven't specified any rules in your callback functions.

I guess your start_element() would look a bit like this:
PHP Code:
function start_element($parser, $element_name, $element_attrs){
    global $currentelement;
    $currentelement = $element_name;
}

and then your end_element would look like:
PHP Code:
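function end_element($parser, $element_name){
    global $currentelement, $article, $collection;
    switch ($element_name) {
        case "article":
            # end of article, so store it
            $collection[] = $article;
            $article = array();
            break;
        case "moreovernews":
            # end of list, so do db work
            register_articles();
            break;
    }
    $currentelement = ""; # so whitespace between tags isn't stored
}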


and your character_data() handler would look like:
PHP Code:
function character_data($parser, $data){
    global $currentelement, $article;
    switch ($currentelement) {
        case "headline_text":
        case "source":
        case "cluster":
            # store this as part of this article (append, since the
            # text of one element can arrive in several pieces)
            $article[$currentelement] = ($article[$currentelement] ?? "") . $data;
            break;
    }
}

I hope this is making some sense at least..

Anyway, all you have to do then is create the register_articles() function which is called at the end of the parse (see the end_element() callback).
PHP Code:
function register_articles(){
    global $collection;
    foreach($collection as $article){
        $source  = $article['source'];
        $cluster = $article['cluster'];

        # build your sql:
        $query = "insert into bla bla bla";
        # run it
        // usual mysql stuff here
    }
}
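For the 'build your sql' step, one option is to build a statement with placeholders and keep the values separate, so that a quote inside a headline can't break the query. This is only a sketch: build_insert() is a made-up helper, and the table name articles and its column names are assumptions, so swap in your own schema.

PHP Code:
```php
<?php
// Build one parameterised INSERT per article instead of pasting values
// into the SQL string. Table and column names here are hypothetical.
function build_insert($article) {
    $columns = array("headline_text", "source", "cluster");
    $params = array();
    foreach ($columns as $column) {
        // Missing fields become NULL; surrounding whitespace is trimmed.
        $params[] = isset($article[$column]) ? trim($article[$column]) : null;
    }
    $placeholders = implode(", ", array_fill(0, count($columns), "?"));
    $sql = "INSERT INTO articles (" . implode(", ", $columns) . ") VALUES ($placeholders)";
    return array($sql, $params);
}

// A stand-in for the $collection filled by the parse.
$collection = array(
    array("headline_text" => "It's a headline", "source" => "Example Times", "cluster" => "tech"),
);

foreach ($collection as $article) {
    list($sql, $params) = build_insert($article);
    echo $sql, "\n";
    print_r($params);
    // With a PDO connection in $db (an assumption, not shown here):
    // $db->prepare($sql)->execute($params);
}
```

The commented-out line shows where the pair would be handed to PDO; with another database library you would pass the statement and the values to its own prepared-statement calls.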

I have typed a lot and I haven't checked any of it, so let me know if it's any use and I might post back with something more advanced in a while :)