Fun with the tokenizer...

2008-Sep-15

I was reminded, this past week, of how cool the tokenizer is.

One of the guys who works in the same office as I do had what seemed to be a simple problem: he had a PHP file that contained ~50 functions, and he wanted to summarize the API without manually going through the file and cutting out the function declarations.

We introduced him to inline phpDoc blocks, but the 50-function library in question didn't have any docblocks. (He's a junior-level PHP developer who works in the same office but for a different company, so he doesn't have to follow our coding standards, but I digress.)

Sure, he could (and did) pull up a list of function NAMES with get_defined_functions (I assume by using array_diff against a before-and-after capture), but this didn't give him the argument names, or even the number of arguments for a given function. So I broke out some old tokenizer code I'd written.
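
I never saw his code, so this is only a guess at the before-and-after approach. The demo_* functions are made-up stand-ins for the library, defined through eval so that they don't exist yet when the first snapshot is taken; in real use that step would be a require of the library file.

```php
<?php
// Snapshot the defined functions before the library is loaded.
$before = get_defined_functions();

// Hypothetical stand-ins for the library. In real use this would be
// something like: require '/path/to/library.php';
eval('function demo_add($a, $b) { return $a + $b; }
      function demo_greet(&$name, $greeting = "hi") { return "$greeting $name"; }');

// Snapshot again; the difference is whatever the library defined.
$after = get_defined_functions();

// Names only: nothing here says how many arguments each function takes.
$newFuncs = array_values(array_diff($after['user'], $before['user']));
sort($newFuncs);
print_r($newFuncs);
```

Note that get_defined_functions reports user function names in lowercase, so even the original capitalization is lost.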

In case you aren't familiar with the tokenizer, the PHP manual defines it as:

“[an interface to let you write] your own PHP source analyzing or modification tools without having to deal with the language specification at the lexical level.”

The extension (which has been part of the PHP core distribution since 4.3.0) consists of just two functions, token_get_all and token_name, plus a boatload of constants.
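
To get a feel for what those two functions return, here is a minimal sketch that dumps the tokens of a one-line snippet. The snippet in $source is made up for illustration; the exact token names you see for some characters (such as "&") can differ between PHP versions.

```php
<?php
// A hypothetical one-line input to tokenize.
$source = '<?php function add($a, &$b = 1) { return $a + $b; }';

$names = array();
foreach (token_get_all($source) as $token)
{
  if (is_array($token))
  {
    // Array tokens look like array(id, text, line number).
    list($id, $text) = $token;
    if ($id == T_WHITESPACE)
    {
      continue;
    }
    $names[] = token_name($id);
    printf("%-24s %s\n", token_name($id), $text);
  }
  else
  {
    // Single-character tokens such as "(" or ";" come back as plain strings.
    printf("%-24s %s\n", '(plain string)', $token);
  }
}
```

The mix of array tokens and plain strings is why the code further down keeps checking is_array before looking at a token's id.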

Enough babble, though; let's get to the meat. I pulled out this code I'd written for PEARClops (on EFNet #PEAR) that parses PHP source files and figures out what classes, functions/methods and associated parameters are included.

<?php

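// Takes a file path or a string of PHP source and returns one array per
// function/method found: its 'name', its 'class' ('' outside a class) and
// its 'params' (each with 'name', 'byRef' and, where given, 'default').
function get_protos($in)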
{
  if (is_file(realpath($in)))
  {
    $in = file_get_contents($in);
  }
  $tokens = token_get_all($in);
  $funcs = array();
  $currClass = '';
  $classDepth = 0;

  // Count once up front rather than on every pass through the loop
  for ($i = 0, $numTokens = count($tokens); $i < $numTokens; $i++)
  {
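    if (is_array($tokens[$i]) && $tokens[$i][0] == T_CLASS
      && isset($tokens[$i + 2]) && is_array($tokens[$i + 2]) && $tokens[$i + 2][0] == T_STRING)
    {
      // A declaration is "class", whitespace, then the name; requiring the name
      // skips Foo::class constants and anonymous classes in newer PHP code
      $i += 2;
      $currClass = $tokens[$i][1];
      // Skip "extends"/"implements" clauses up to the brace that opens the body
      while (isset($tokens[$i + 1]) && $tokens[++$i] != '{') {}
      // $i now sits on "{"; the for loop's own $i++ steps past it
      $classDepth = 1;
      continue;
    }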
    elseif (is_array($tokens[$i]) && $tokens[$i][0] == T_FUNCTION)
    {
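      $nextByRef = FALSE;
      $thisFunc = array();

      // Walk the signature up to the ")" that closes the parameter list
      while (isset($tokens[++$i]) && $tokens[$i] != ')')
      {
        // Newer PHP versions can return "&" as an array token, so compare on text
        $text = is_array($tokens[$i]) ? $tokens[$i][1] : $tokens[$i];

        if (is_array($tokens[$i]) && $tokens[$i][0] == T_STRING && !$thisFunc)
        {
          // The first T_STRING after "function" is the function name
          $thisFunc = array(
            'name'   => $text,
            'class'  => $currClass,
            'params' => array(),
          );
          // A "&" before the name means return-by-reference, not a by-ref parameter
          $nextByRef = FALSE;
        }
        elseif (is_array($tokens[$i]) && $tokens[$i][0] == T_VARIABLE && $thisFunc)
        {
          // Only variables count as parameters; type hints are skipped
          $thisFunc['params'][] = array(
            'byRef'   => $nextByRef,
            'name'    => $text,
          );
          $nextByRef = FALSE;
        }
        elseif ($text == '&')
        {
          $nextByRef = TRUE;
        }
        elseif ($tokens[$i] == '=' && !empty($thisFunc['params']))
        {
          // Collect the default value up to the next top-level "," or ")",
          // so that a default like array(1, 2) doesn't end the signature early
          $default = '';
          $depth = 0;
          while (isset($tokens[$i + 1]))
          {
            $next = $tokens[$i + 1];
            if ($depth == 0 && ($next == ',' || $next == ')'))
            {
              break;
            }
            if ($next == '(' || $next == '[')
            {
              ++$depth;
            }
            elseif ($next == ')' || $next == ']')
            {
              --$depth;
            }
            $default .= is_array($next) ? $next[1] : $next;
            ++$i;
          }
          $thisFunc['params'][count($thisFunc['params']) - 1]['default'] = trim($default);
        }
      }
      // Closures have no name, so nothing was collected for them
      if ($thisFunc)
      {
        $funcs[] = $thisFunc;
      }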
    }
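    elseif ($tokens[$i] == '{'
      || (is_array($tokens[$i]) && in_array($tokens[$i][0], array(T_CURLY_OPEN, T_DOLLAR_OPEN_CURLY_BRACES))))
    {
      // "{$var}" and "${var}" inside strings open with an array token but
      // close with a plain "}", so count them here to keep the depth balanced
      ++$classDepth;
    }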
    elseif ($tokens[$i] == '}')
    {
      --$classDepth;
    }

    if ($classDepth == 0)
    {
      $currClass = '';
    }
  }

  return $funcs;
}

// Turns the arrays from get_protos() into strings like "Class::method(&$a, $b)"
function parse_protos($funcs)
{  
  $protos = array();
  foreach ($funcs AS $funcData)
  {
    $proto = '';
    if ($funcData['class'])
    {
      $proto .= $funcData['class'];
      $proto .= '::';
    }
    $proto .= $funcData['name'];
    $proto .= '(';
    if (!empty($funcData['params'])) // the key may be missing when there are no parameters
    {
      $isFirst = TRUE;
      foreach ($funcData['params'] AS $param)
      {
        if ($isFirst)
        {
          $isFirst = FALSE;
        }
        else
        {
          $proto .= ', ';
        }

        if ($param['byRef'])
        {
          $proto .= '&';
        }
        $proto .= $param['name'];
      }
    }
    $proto .= ")";
    $protos[] = $proto;
  }
  return $protos;
}

echo "Functions in {$_SERVER['argv'][1]}:\n";
foreach (parse_protos(get_protos($_SERVER['argv'][1])) AS $proto)
{
  echo "  $proto\n";
}
?>

Save it as "parse_funcs.php" (or whatever you like) and call it like so: php parse_funcs.php /path/to/php_file

For instance (my own copy happens to be saved as token_funcs_cli.php):

sean@iconoclast:~/php/scripts$ php token_funcs_cli.php ~/php/cvs/Mail_Mime/mime.php
Functions in /home/sean/php/cvs/Mail_Mime/mime.php:
  Mail_mime::Mail_mime($crlf)
  Mail_mime::__wakeup()
  Mail_mime::setTXTBody($data, $isfile, $append)
  Mail_mime::setHTMLBody($data, $isfile)
  Mail_mime::addHTMLImage($file, $c_type, $name, $isfilename)
  Mail_mime::addAttachment($file, $c_type, $name, $isfilename, $encoding)
  Mail_mime::_file2str(&$file_name)
  Mail_mime::_addTextPart(&$obj, $text)
  Mail_mime::_addHtmlPart(&$obj)
  Mail_mime::_addMixedPart()
  Mail_mime::_addAlternativePart(&$obj)
  Mail_mime::_addRelatedPart(&$obj)
  Mail_mime::_addHtmlImagePart(&$obj, $value)
  Mail_mime::_addAttachmentPart(&$obj, $value)
  Mail_mime::get(&$build_params)
  Mail_mime::headers(&$xtra_headers)
  Mail_mime::txtHeaders($xtra_headers)
  Mail_mime::setSubject($subject)
  Mail_mime::setFrom($email)
  Mail_mime::addCc($email)
  Mail_mime::addBcc($email)
  Mail_mime::_encodeHeaders($input)
  Mail_mime::_setEOL($eol)

Not bad, huh?

There are some not-so-obvious bugs (inheritance, mostly), but for a relatively short script, it does a pretty good job.
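
If the inheritance problem you run into is that methods inherited from a parent class never show up under the child, one possible starting point is to record which class extends which. This is only a rough sketch: get_parents is a made-up helper name, it is not part of the script above, and it ignores namespaced parent names, which are tokenized differently across PHP versions.

```php
<?php
// Map each class declared in $source to the class it extends
// ('' when it has no parent). Hypothetical helper, for illustration only.
function get_parents($source)
{
  $parents = array();
  $currClass = '';
  $want = 0; // id of the keyword whose operand we are waiting for

  foreach (token_get_all($source) as $token)
  {
    if (!is_array($token))
    {
      $want = 0; // punctuation ends any pending keyword
      continue;
    }
    if ($token[0] == T_WHITESPACE)
    {
      continue;
    }

    if ($token[0] == T_CLASS)
    {
      $currClass = '';
      $want = T_CLASS;
    }
    elseif ($token[0] == T_EXTENDS)
    {
      $want = T_EXTENDS;
    }
    elseif ($token[0] == T_INTERFACE)
    {
      // Interfaces can extend too; don't credit that to the last class
      $currClass = '';
      $want = 0;
    }
    elseif ($want == T_CLASS && $token[0] == T_STRING)
    {
      $currClass = $token[1];
      $parents[$currClass] = '';
      $want = 0;
    }
    elseif ($want == T_EXTENDS && $currClass != '')
    {
      $parents[$currClass] = $token[1];
      $want = 0;
    }
    else
    {
      $want = 0;
    }
  }
  return $parents;
}

// Hypothetical input: Child inherits hello() from Base.
$demo = '<?php class Base { function hello() {} } class Child extends Base { function bye() {} }';
$parents = get_parents($demo);
print_r($parents);
```

With a map like this, a caller could walk up from a child class and list the parent's prototypes alongside the child's own, as long as the parent is defined in a file that was also parsed.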