
The Paul Davis Moment

Each week at our computational linguistics discussion group, Clippers, we spend the first few minutes discussing interesting, time-saving, or just plain nifty tools that we've found. Here is a listing of recently shared resources:

  • Mendeley. "iTunes for research papers" Mendeley provides a nice GUI for interacting with your collection of research papers. The parallel to a playlist is a collection: create collections sorted howsoever you choose, and put papers in multiple collections. Mendeley does an alright job of importing the correct metadata for many papers, especially if they exist in a public archive or are text-based (as opposed to scanned pages). It is possible to take notes on the PDFs and in a separate comment, and it is also possible to highlight text. You can also sync your library with the Mendeley website to have access to it from anywhere. (January 2011)

  • PGF and TikZ. "A TeX macro package for generating graphics" When you need to show an MT model or generate any other graphics for your papers or presentations, PGF and TikZ will help you out. See the TeXample.net page to see some example images generated using PGF and TikZ. (October 2010)

  • beamerposter. A LaTeX package for creating scientific posters. beamerposter allows you to create beautiful posters for your conference presentations. (October 2010)

  • Looglefight. A tool to help you find the right phrasing for your comp ling papers. Takes two words or phrases input by the user and returns their frequencies in the ACL Reference Corpus to help you determine which phrasing works "better". (October 2010)
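
    The comparison Looglefight performs can be approximated on any text you have locally. The following is only a rough sketch, not Looglefight's actual implementation: the function names and the sample text are invented for illustration, and it counts whole-word, case-insensitive matches in a string rather than querying the ACL Reference Corpus.

```python
import re

def phrase_count(text, phrase):
    """Count case-insensitive, whole-word occurrences of phrase in text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def fight(text, first, second):
    """Return both phrases with their counts, most frequent first."""
    counts = {first: phrase_count(text, first), second: phrase_count(text, second)}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample text standing in for a real corpus.
sample = "We use a language model. The language model is smoothed. A LM is used."
result = fight(sample, "language model", "LM")
print(result)
```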

  • PDFMiner. Extracts meaningful information out of PDF documents. PDFMiner is written in Python, has support for preserving layout, and could be useful the next time you're processing PDFs. (January 2010)

  • Zotero. Collect, manage, and cite your research resources. Watch the video featured on the main page for a quick introduction to the tool. Their beta product is web based for easy accessibility. (October 2009)

  • Recaptcha. Know how when you buy from Ticketmaster, you have to type in the words that appear all squirrely in the picture? Now you can use that same technology to hide your own email address on your webpage. This can help stop spam. (February 09)

  • BibTeX Citations. IEEE Xplore, which is available to OSU students on campus or if you log in to the library from off campus, now allows you to download the BibTeX citation of the articles you're looking at. Check for the option on the menu on the left. (February 09)

  • IR Systems. Two IR systems that are available for research purposes are Galago and Terrier. Each has its ups and downs, and both are worth exploring. Talk to Chris for more info. (Feb 09)

  • Version Control. A few newer programs improve on software like Subversion by offering distributed version control. For instance, Git and Mercurial offer features that make collaboration easier, along with better branching capabilities and more intuitive command-line incantations. For advice, contact Jon Dehdari. (Feb 2009)

  • Digital Union. For teachers and students at OSU, check out the Digital Union for all kinds of technologically enhanced classroom supplies or study aids. Computers, speakers, cameras, and pedagogical tools you'd never thought of. (Feb 2009)

  • Hadoop. For those wishing to try out MapReduce code on a computing cluster, a la Google, Hadoop is set up on the Slate machines. You can learn about MapReduce by reading this paper, or, if you want something more in-depth, you can watch this lecture series. The distributed file system is called HDFS (the Hadoop Distributed File System). When you're ready to get started, Chris or Ilana can show you where to go on our system. (January 09)
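
    For a feel of the MapReduce programming model before touching the cluster, here is a minimal single-process sketch of the classic word-count example. It does not use the Hadoop API, and all function names are invented for illustration. It only mimics the three stages (map, shuffle, reduce) that the framework distributes across machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every token in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all counts for one word into a single total."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["the cat sat", "the dog sat down"])
print(counts)
```

    On a real cluster, the map and reduce functions run in parallel on different machines and the shuffle step moves data between them; the logic of each function stays this simple.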

  • Apache Mahout. This software implements several machine learning algorithms using the MapReduce framework. Includes Naive Bayes, KNN, others. Should work on our Hadoop setup. Read more here. (January 09)

  • Encodings. If you have a document in UTF-8 and need it in Latin-1 encoding, use Unix's utf2latin1. To go in the other direction, use iconv. (January 09)
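
    If neither tool is at hand, the same conversion can be sketched in a few lines of Python. This is an illustrative alternative, not what the command-line tools do internally, and the helper name transcode is made up for this example.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)

utf8_bytes = "café".encode("utf-8")

# UTF-8 to Latin-1: the accented character shrinks from two bytes to one.
latin1_bytes = transcode(utf8_bytes, "utf-8", "latin-1")

# Latin-1 back to UTF-8 recovers the original bytes.
round_trip = transcode(latin1_bytes, "latin-1", "utf-8")

print(len(utf8_bytes), len(latin1_bytes), round_trip == utf8_bytes)
```

    Text containing characters that Latin-1 cannot represent raises a UnicodeEncodeError in the UTF-8 to Latin-1 direction, so check your data before converting in bulk.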

  • Octet. Handy add-on for emacs that inserts latex code as you write. Helps you avoid leaving off that table end tag and that sort of thing. Puts keystroke bindings with common latex tags. Ask Crystal for help using this handy feature. (January 09)

  • LaPrint. Save your Matlab figures in a way that makes them show up nicer in your latex documents, including adding latex tags to the text (labels, axes, titles) on the figure.

  • LiveJournal on SLaTe. For those wishing to work with blog data, we have zipped-up versions of three months' worth of LiveJournal webpages on the slate server. This is a standard data set for working with blogs. Talk to Eric or Chris if you're interested in getting started with it. (January 09)

  • Higher Order Perl. A new book is now available about programming elegantly in everyone's favorite scripting language. Order it or download it for free. (January 09)

  • OpenFST. An open source finite state toolkit from the same folks who brought us the AT&T finite state toolkit. Has many of the same features, some new ones, and searchable source code! Find it here. (March 08)

  • rename. A unix command that will change the extension of a bunch of files all at once. For instance, "rename .raw .au *.raw" will change all of the files within the current directory that have a .raw extension to have a .au extension. Quicker than writing a script. (March 08)
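
    On systems where rename is missing or takes different arguments, a short script does the same job. The sketch below is one possible approach, with a hypothetical helper name; it is demonstrated on a temporary directory so that it touches no real files.

```python
from pathlib import Path
import tempfile

def change_extension(directory, old_ext, new_ext):
    """Rename every file in directory ending in old_ext so it ends in new_ext."""
    renamed = []
    # sorted() collects the matches first, so renaming does not disturb the scan.
    for path in sorted(Path(directory).glob(f"*{old_ext}")):
        target = path.with_suffix(new_ext)
        path.rename(target)
        renamed.append(target.name)
    return renamed

# Demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.raw", "b.raw", "notes.txt"):
        (Path(tmp) / name).touch()
    result = change_extension(tmp, ".raw", ".au")
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(result, remaining)
```

    Note that on some platforms path.rename silently overwrites an existing file with the target name, so it is worth checking for collisions before running this on real data.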

  • Speed Reading. To practice speed reading, find a freely available program called RSVP. It will take any webpage or document and present it to you, word by word, at the speed you set. Then increase the speed as you get better. (Feb 08)

  • PDFs on the Internets. If you use a Mozilla browser, you can download a plug-in that lets you choose how to view a PDF: in a new browser or tab, by opening Acrobat on your desktop, or by just downloading it to your computer without opening. Pretty handy. Find it on the Mozilla or Firefox plug-ins page. (Feb 08)

  • Adobe Plug-In 8. The newest version of Adobe Acrobat, at least for Linux, begins to display long documents incrementally as it downloads, rather than waiting for the download to complete. (Jan 08)

  • Machine Learning Bootcamp. At http://videolectures.net/bootcamp07_vilanova there are various video lectures with synchronized slides that some people might be interested in. The main topics covered are
    • Basic Math and TCS for Machine Learning
    • Useful existing software for Machine Learning
    • Introduction to Machine Learning
    • Theoretical frameworks and foundations
    • Experimental Machine Learning
    • Feature extraction and model selection
    • Graphical models
    • Kernel methods and linear predictors
    • Clustering
    • General view of application areas
    • Machine learning in vision
    • Machine learning in user interfaces
    • Machine learning for data mining
    (Jan 08)

  • Google N-gram search. First off, the Google English n-gram data is available to those with access to the ling dep't server. Find it at /home/corpora/EN/WebIT. Software that searches these n-grams efficiently on the web is also available. I lost that reference, but will update when it's found. (Dec 07)

  • Penn Discourse Treebank. This is also currently available on the linguistic corpora server. (Dec 07)

  • Stinkpot. A repository of helpful hints on all kinds of tools we tend to use to do our work: Emacs, Python, Latex, Matlab... it's a personal blog of a grad student at MIT who works on silly things like evolution. His version of the Paul Davis moment is something you might find helpful. (Dec 07)

  • Semantic. If you'd like to be able to create your own math symbols in latex, specifically those with ligatures, try installing this package. (Dec 07)

  • Anti-Word. If you use a Linux machine pretty much exclusively, but get email attachments from people who use Windows products, then you might be interested in Anti-Word, which will convert .doc files to plain text. (Nov 07)

  • MIT Workshop on Syntax. It's not up as of this writing, but check on mitworld.mit.edu for a video of their one-day workshop titled "Where Does Syntax Come From? Have We All Been Wrong?", with guest speakers Sandiway Fong, Chris Manning, and Noam Chomsky, among others. (Nov 07)

  • Machine Learning Slides. UC Berkeley's RAD Lab has made slides and videos available on the web from a recent two-day short course on applied machine learning for its industrial affiliates: (Nov 07)
    Video
    Slides

  • PrimoPdf. You can make PDFs of your MS Office documents for free with this nifty app. Get it here. (Oct 07)

  • TigerSearch. This software for searching through syntactically annotated corpora is now available on the Mac portion of the ling machines in Oxley 201. It has a java interface, and allows you to search for examples of general or specific syntactic constructions within many corpora. Ask Detmar or Adriane if you need help. (Oct 07)

  • sshfs. This Unix application allows you to mount an entire remote filesystem over SSH, which makes it easier to access your ling files from home. This website has details. It should be available on most Linux installations: try 'apt-get install sshfs'. (May 07)

  • Subversion. If you missed Scott's presentation on version control using SVN, or if you'd like to see it again, you can access his slides via the LCC tutorials webpage or Scott's webpage (May 07).

  • Firefox tip. It seems like the new version of Firefox doesn't let you close all your tabs with one button. In fact, that's just the default, but you can set it to do as you like if you look closely at the settings options. (May 07).

  • SQLite. This is a good database system to use because it is portable, keeps your data in a single file, works in the user space, and has good software carpentry, that is, it was built intelligently so that you can build on top of it. (May 07).
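
    As a quick illustration of how little setup SQLite needs, here is a minimal sketch using Python's built-in sqlite3 module. The table, column names, and rows are invented for the example. It uses an in-memory database so that nothing is written to disk, but passing a filename instead keeps all the data in that single file.

```python
import sqlite3

# ":memory:" keeps the example self-contained; pass a filename such as
# "papers.db" (a hypothetical name) to store everything in a single file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT, year INTEGER)")

# Placeholder rows, made up for illustration.
conn.executemany(
    "INSERT INTO papers (title, year) VALUES (?, ?)",
    [("Parsing with CCG", 2006), ("Blog corpora", 2009)],
)
conn.commit()

# Parameterized query: the ? placeholder is filled in safely by the driver.
rows = conn.execute(
    "SELECT title FROM papers WHERE year > ? ORDER BY year", (2007,)
).fetchall()
print(rows)
conn.close()
```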

  • Wikipedia Downloads. It is possible to download all of Wikipedia, or various portions of it, for use in NLP tasks. The website can be a bit hard to find, so Adriane found it for us: Get Wikipedia here (April 07)

  • Website Accessibility. In constructing a website, it's recommended (required at OSU, in fact), to make it accessible to the disabled. That means to make sure that vision-impaired folks will be able to get your information by using a screen reader. To make sure your website is compliant, use a tool like Fangs to get an idea of what your website "sounds" like. (April 07)

  • RSS News Feeds. If you wish to work with current news documents, and are looking for a standard, uniform format in which to work, RSS is a good choice. To obtain a news article in RSS format, you can use URLs of the form:
    • http://news.google.com/news?q=Ohio+State&output=rss
    Where "Ohio State" was the search term; to restrict it to specific news sites, use the "source:" operator, i.e.
    • http://news.google.com/news?q=Ohio+State+source:new_york_times&output=rss
    Other formats are available. (April 07)
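
    URLs of this form can also be assembled programmatically. The sketch below only builds the address strings in the format shown above; the function name is hypothetical, and whether the endpoint still serves feeds in this form is not checked here.

```python
from urllib.parse import urlencode

BASE_URL = "http://news.google.com/news"  # endpoint quoted in the entry above

def news_rss_url(query, source=None):
    """Build a feed URL; `source` adds the "source:" operator to the query."""
    if source:
        query = f"{query} source:{source}"
    # safe=":" keeps the colon of the "source:" operator unescaped.
    return BASE_URL + "?" + urlencode({"q": query, "output": "rss"}, safe=":")

plain = news_rss_url("Ohio State")
restricted = news_rss_url("Ohio State", source="new_york_times")
print(plain)
print(restricted)
```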

  • Corpora. The BRENT corpus is available at /home/corpora/EN/childes/Brent. Ask Anton for details on using this corpus, or the Stephanie corpus.

  • Pseudocode in Latex. The style file clrscode.sty works well and produces very pretty pseudocode, same as in the Introduction to Algorithms book. Get the code and the documentation here.

  • Arabtex. It's also possible to type in Arabic using Latex. Correct right-to-left formatting is included in the arabtex package. It's a biggie, and complicated, so you're best off using the package that's already installed on bardolph.

  • Machine Learning Toolkit. YALE: Yet Another Learning Engine. Available on SourceForge, among other things, it can do word vector processing. (Mar 07)

  • Text Editing. The creator of the vim text editor gave a talk to the Google folks on efficient text editing: how to identify when you're doing things inefficiently, and how to fix that. Emacs users can benefit, too. Find the talk at Google Video. (Mar 07)

  • Semantic Annotation. RST Tool, available from wagsoft, is a pointy-clicky, slightly non-intuitive but easy to install tool for doing semantic annotation according to the discourse theory of your choice, especially Rhetorical Structure Theory. Also installed on /home/compling (Mar 07)

  • Website User Authentication. If you are building an OSU website for which you wish to require users to identify and/or authenticate themselves before accessing the material, you can use the library's proxy service to accomplish this. Ask Detmar for details.

  • Carmen Tip. Keep backups. The system can go down, and it can take you with it. Exporting and importing is relatively simple. (Feb 07)

  • Finite State Software. Been using the AT&T Finite State Toolkit? Looking for a similar product with the option of looking at the source code? Try the MIT Finite State Toolkit, which is open source, and has many of the same functionalities as the former.

  • Google Books. With a Google account, you can use their service to search through many books. You can't necessarily read them from cover to cover, but it can be a helpful resource if you need to search for particular topics within a text. (Feb 07)

  • CL Olympiad. High school students nationwide are encouraged to participate in the Computational Linguistics Olympiad. Students are given traditional linguistic problems, and problems involving computational thinking and issues regarding natural language processing. As of Feb 2, the organization is looking for suggestions for contest problems. (Feb 07)

  • ICE. For inter-process communication, collaborating on projects across universities, etc. This is also called middleware. Read more about ICE here. Competing software is OAA: Open Agent Architecture, and Multiplatform: Multiple Language / Target Integration Platform for Modules (Jan 07).

  • Firefox browser. Version 2 supports many standards, incl. SVG and there are nice, free extensions available, including:
    • Webdeveloper (live editing of html, css, etc.)
    • Aardvark (modify what's displayed on any webpage, for doing screenshots etc.)
    • Greasemonkey: various neat user scripts
    • Firebug (Debugger and network traffic profiler)

  • mechanize. This perl module will fill in form values in html documents automatically. (Jan 07)

  • Anonymous Feedback. Teachers might find it useful to allow their students to send them anonymous feedback. See Detmar's example, and if you'd like, copy his on your own website. To do that, copy the entire directory on our department network: ~dm/public_html/feedback . Don't forget to change all instances of the name and email address! (Jan 07)

  • Permanent URLs. A permanent URL will allow your website to retain a single, simple address, regardless of whether you change your employment or web-hosting position. purl.org provides a good service for this. tinyurl.com has a slightly different service, allowing you to create a very short URL that links to a website you may have with a long address. For an example of purl, you can find the OSU ICALL group and its projects at http://purl.org/net/icall. (Jan 07)

  • AJAX. Not just a cleaning solution, it can solve your messy, slow, database-driven web page problems as well. For an overview, examples, and tutorial of how to use AJAX, see Scott's slides (Jan 07).

  • HeVeA. A utility for converting very simple tex files into webpages. Appropriate for text-heavy, graphics-poor websites like online syllabi, course descriptions, etc. Already installed on the Linguistics department computers. (Jan 07)

  • Prefuse. A Java visualization toolkit. This software can help you make web-ready graphics of parse trees, etc. Could be useful for teaching parsing, grammar, syntax, etc. Find it at www.prefuse.org, and similar tools at graphviz.org. (Jan 07)

  • SVG. Scalable Vector Graphics are a great idea if you think your graphics might be seen on a wide variety of monitors - there is no distortion in size when going from movie screen to cell phone screen. Use SVG to build representations of xml documents, or any other node-based structure. See croczilla.com for examples. (Jan 07)

  • Version Control. It's a good idea to use Version Control to keep track of your work. Two options are CVS, which is easy to use within Emacs, and Subversion, which is newer and has some extra useful options. Version Control is important if you are working on a large project on your own, to keep a running log of work you've done and changes you've made. This applies both to papers you may be writing, or code you are developing. Version Control is even more important if you are working on a team project: keep track of everyone's contributions, avoid duplication of effort and mistaken overwriting of the team's work. Ask around in the department if you need help getting started, or keep an eye out for the upcoming tutorial. (Jan 07)

  • Syncing Your Files. Along with version control, it is a good idea to keep the many files you may have on the various computers in your life synced up. You can use programs such as unison or rsync to help you do this. Keep your home directory at school and at home looking the same, and avoid reduplicating your own work, or overwriting your own files. Also helpful if a server goes down - you have your work, ready to use, elsewhere. (Jan 07)

  • Picture Naming Database. The International Picture Naming Project at CRL-UCSD contains a database of black-and-white drawings along with norms for what names they are given, in a variety of languages. Also given are norms for things including naming time. It contains some pictures published in an earlier set collected by Snodgrass & Vanderwart, which is used in a lot of studies, so you might want to use those pictures to duplicate prior results. If any of those pictures are used, the following paper should be cited (this is their condition of use):
    • Snodgrass, J.G., & Vanderwart, M. (1980). JEP: Human Learning and Memory, 6:3, 174-215.
    The S&V pictures are black and white. If you use the colored versions, you need to cite both Snodgrass & Vanderwart and Rossion & Pourtois, who modified them to make them full color:
    • Rossion, B. & Pourtois, G. (2001). Revisiting Snodgrass and Vanderwart's Object database: Color and Texture improve Object Recognition. 1st Vision Conference, Sarasota, FL.

  • bibdesk. A point-and-click interface for creating your very own BibTex file. Reduces typos. Find it on SourceForge, at least for Mac. (Jan 07)

  • latex2rtf. Have a latex file and need a Windows document? Try this resource, which works with fair accuracy. Another option is to use OpenOffice, from which documents can be directly exported to pdf, or presentations to Flash or .ppt - but use with caution, fonts can get messy. (Jan 07)

  • BibTex Yourself. When you list a citation to one of your own papers on your website, be sure to put a BibTex entry right next to it. That way, others won't mis-cite your work. (Jan 07)

  • Google's BibTex resource. If you use Google Scholar to find academic articles, change the Preferences to have it provide a BibTex entry for the various resources it finds. Use with caution - a quick sample done in our meeting showed some errors - but it's a good start. (Jan 07)

  • pdflatex. This is an easy way to embed pdf files within your own latex files. Find details in this document. Or, try Googling 'pdfpages'. (October 06)

  • CCG Parser. A new CCG parser and supertagger is available from Clark and Curran. You can find the software and related literature at: The CCG site. (September 06)

  • yab2web. This facility allows easy publication of bibtex entries into html, ideal for listing your publication list on your website. See Donna Byron's website for an example. (March 06)

  • Statistics Primer. A good introductory text to basic statistics can be found at http://faculty.vassar.edu/lowry/webtext.html. If you follow the link for VassarStats, you will find tools for calculating various statistics. (Sept 06)

  • Enron Corpus. Interested in naturally occurring language in the electronic domain? Search inter-office emails sent by employees of the Enron Corporation before the company's downfall. Scripts are available that filter out emails repeated throughout the corpus. On the linguistics server, see /home/corpora/EN/enron. (Feb 06)

  • General Language Ontology. There is a recently acquired ontology on the ling server that may be useful for those who need a basic semantic representation of general concepts. Read more about the ontology, what concepts it encodes, and what it may be useful for at http://research.cyc.com, and find the resource itself at /home/corpora/EN/cyc. You can also contact Stacey (s.bailey @ ling) for further information. (Feb 06)

  • Corpora Search Tool. The tools xkwic and cqp, both found on the linguistics department computers, are useful for decoding corpora such as the BNC, and for running complex queries on them. Ask Adriane for details on how to use these effectively (adriane @ ling). (Jan 06)

  • Internships. It's high time to start thinking about summer internships in CL. If you're interested in working someplace like Microsoft or elsewhere on the West Coast, have a chat with Chris (cbrew @ ling), Eric (fosler @ ling), or Donna (dbyron @ ling) for information and contacts. (Jan 06)

  • Assessment in Academic Pursuits. When it becomes necessary to list your achievements in the academic arena, it is useful to have some information on hand that goes beyond publication titles and dates. Other data to collect as your publication list lengthens:
    • Acceptance rate of papers at each venue/journal. Available in front matter of conference proceedings, journal issues.
    • Your percentage of contribution to a paper. Include actual research/project work, amount of writing, and creative input when you make this calculation.
    • Citation rate. Consult Google, ISI Database, Citeseer for information on how often your papers have been cited. These resources use different metrics for determining citation rates, so you may need to defend the actual citation rate that you choose to report.
    • Relative impact of the venue. Ratings given in ISI database (available in OSCAR).
    Also, let colleagues know what you're doing, what you've published and where, and make sure your publications get into the hands of people who you think should read them. Be annoying if necessary. (Jan 06)

  • Publication Strategy. To get the ball rolling on getting published, consider taking the advice in Publication, Publication by Gary King. Main ideas:
    • Build on someone else's previous research by making one change.
    • Be able to defend the reasons for that change, and the impact it makes.
    • Clearly describe exactly the work that you did.
    • Make your data available. (Jan 06)

  • Idea Solicitation. What kind of project management tools would you like to see in the Linguistics department? How can we make group projects more manageable? Bring your suggestions to Clippers, or email the CL list. (Sept 05)

  • text2onto. Automatically extracts a candidate concept hierarchy and instances from a corpus of plain text. Not fully functional, but possibly handy for small projects, or getting started with ontologies. See the website for more details. (Oct 05)

  • IPA in HTML. The two following websites will get you started in publishing pages on the web with IPA fonts included:

  • Web Download Tool. The Unix tool 'wget' will download each page and all included content from a specified website. This can be helpful, if, for instance, a corpus is available for download only as a large series of small files. The tool will drill down through all links from the specified start page with the '-r' option. Example: $> wget -r http://www.ling.ohio-state.edu will download everything from the linguistics website (not recommended).

  • Language Generator. This website includes Perl code that will randomly generate 'pointy-hair-boss mission statements', as well as a link (near the end) to similar random language generators. (Oct 05)

  • Boolistic. Not just another search engine, this website may come in handy for those teaching boolean logic: www.boolistic.com. Enter your search terms, then click on different parts of the Venn diagram to alter the search query. (Sept 05)

  • Corpora Mailing List. Sign up here to receive email regarding new corpora and corpus tools. (Sept 05)


Last modified: 7 October 2010