The Paul Davis Moment
Each week at our computational linguistics discussion group, Clippers, we spend the first few minutes
discussing interesting, time-saving, or just plain nifty tools that
we've found. Here is a listing of recently shared resources:
Mendeley. "iTunes for research papers"
Mendeley provides a nice GUI for interacting with your collection of research papers. The parallel to a playlist
is a collection: create collections sorted however you choose, and put papers in multiple collections. Mendeley
does an alright job of importing the correct metadata for many papers, especially if they exist in a public archive
or are text-based (as opposed to scanned pages). You can highlight text and take notes, either on the PDFs themselves or in a separate comment.
You can also sync your library with the Mendeley website to have access to your papers from anywhere. (January 2011)
PGF and TikZ. "A TeX macro package for generating graphics"
When you need to show an MT model or generate any other graphics for your papers or presentations, PGF and TikZ will help you out. See
the TeXample.net page
for some example images generated using PGF and TikZ. (October 2010)
beamerposter.
A LaTeX package for creating scientific posters. beamerposter allows you to create beautiful posters for your conference presentations. (October 2010)
Looglefight. A tool to help you find the right phrasing for your comp ling papers.
Takes two words or phrases input by the user and returns their frequencies in the ACL Reference Corpus to help you determine which phrasing works "better". (October 2010)
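The same kind of comparison can be tried on any plain-text corpus you have on hand. The sketch below is not how Looglefight itself works; it simply counts case-insensitive whole-phrase matches in a string. The function name `phrase_count` and the toy corpus are made up for illustration.

```python
import re

def phrase_count(phrase, text):
    """Count case-insensitive, whole-word occurrences of phrase in text."""
    # Allow any run of whitespace between the words of the phrase.
    words = [re.escape(w) for w in phrase.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Toy corpus standing in for a real collection of papers (hypothetical data).
corpus = (
    "We evaluate on held-out data. Results on held-out data improve. "
    "We also evaluate on unseen data."
)

for candidate in ("held-out data", "unseen data"):
    print(candidate, phrase_count(candidate, corpus))
```

A higher count only says which phrasing is more common in the corpus you searched, not which one is correct.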
PDFMiner. Extracts
meaningful information out of PDF documents. PDFMiner is
written in Python, has support for preserving layout, and could be
useful the next time you're processing PDFs. (January 2010)
Zotero. Collect, manage,
and cite your research resources. Watch the video featured on
the main page for a quick introduction to the tool. Their beta
product is web based for easy accessibility. (October 2009)
Recaptcha. Know how when you buy from Ticketmaster, you
have to type in the words that appear all squirrely in the picture?
Now you can use that same technology to hide your own email address on
your webpage. This can help
stop spam. (February 09)
Bibtex Citations. IEEE Xplore, which is available to OSU
students on campus or if you log in to the library from off campus,
now allows you to download the bibtex citation of the articles you're
looking at. Check for the option on the menu on the left. (February
09)
IR Systems. Two IR systems that are available for research
purposes are Galago and Terrier. Each has its ups and downs; both are
worth exploring. Talk to Chris for more info. (Feb 09)
Version Control. There are a few new programs out there
that improve on software like Subversion by doing distributed version
control. For instance, Git and Mercurial offer features that
make collaboration easier, better branching capabilities, and more
intuitive command line incantations. For advice, contact Jon
Dehdari. (Feb 2009)
Digital Union. For teachers and students at OSU, check out
the Digital Union for all
kinds of technologically enhanced classroom supplies or study aids.
Computers, speakers, cameras, and pedagogical tools you'd never
thought of. (Feb 2009)
Hadoop. For those wishing to try out MapReduce code on a
computing cluster, a la Google, Hadoop is set up on the Slate
machines. You can learn about MapReduce by reading this paper, or
if you want something more in-depth, you can watch this lecture series. The
distributed file system that comes with Hadoop is called HDFS. When you're ready to get
started, Chris or Ilana can show you where to go on our
system. (January 09)
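If you want a feel for the MapReduce idea before touching the cluster, here is a minimal single-machine sketch of a word count in plain Python. It does not use Hadoop or its API at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented for illustration, and a real framework would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["the cat sat", "the dog sat down"])
print(counts)
```

The point of the split is that each mapper only sees one document and each reducer only sees one key, which is what lets the work be spread over a cluster.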
Apache Mahout. This software implements several machine
learning algorithms using the MapReduce framework. Includes Naive
Bayes, KNN, others. Should work on our Hadoop setup. Read more here. (January 09)
Encodings. If you have a document in UTF-8, and need it to
be in Latin-1 encoding, use unix's utf2latin1. But if you need to go
back the other direction, use iconv. (January 09)
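If neither tool is at hand, the same conversion can be sketched in a few lines of Python. This assumes "Latin" above means Latin-1 (ISO-8859-1); the function names are made up for illustration.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as Latin-1 (ISO-8859-1)."""
    # Raises UnicodeEncodeError if a character has no Latin-1 equivalent.
    return data.decode("utf-8").encode("latin-1")

def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8; every Latin-1 byte is convertible."""
    return data.decode("latin-1").encode("utf-8")

original = "café naïve".encode("utf-8")
converted = utf8_to_latin1(original)
round_trip = latin1_to_utf8(converted)
print(len(original), len(converted), round_trip == original)
```

Going from UTF-8 to Latin-1 can fail, because most Unicode characters have no Latin-1 equivalent; the reverse direction always succeeds.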
Octet. Handy add-on for emacs that inserts latex code as
you write. Helps you avoid leaving off that table end tag and that
sort of thing. Puts keystroke bindings with common latex tags. Ask
Crystal for help using this handy feature. (January 09)
LaPrint. Save
your Matlab figures in a way that makes them show up nicer in your
latex documents, including adding latex tags to the text (labels,
axes, titles) on the figure.
LiveJournal on SLaTe. For those wishing to work with blog
data, we have zipped-up versions of three months' worth of LiveJournal
webpages on the slate server. This is a standard data set for working
with blogs. Talk to Eric or Chris if you're interested in getting
started with it. (January 09)
Higher Order Perl. A new book is now available about
programming elegantly in everyone's favorite scripting language.
Order it or download it for
free. (January 09)
OpenFST. An open source finite state toolkit from the same
folks who brought us the AT&T finite state toolkit. Has many of the
same features, some new ones, and searchable source code! Find it here. (March 08)
rename. A unix command that will change the extension of a
bunch of files all at once. For instance, "rename .raw .au *.raw"
will change all of the files within the current directory that have a
.raw extension to have a .au extension. Quicker than writing a
script. (March 08)
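On a machine where rename is missing or takes different arguments, the script really is short. A minimal Python sketch follows; the function name `change_extension` is made up, and the demo runs in a temporary directory so it touches none of your own files.

```python
import tempfile
from pathlib import Path

def change_extension(directory, old_ext, new_ext):
    """Rename every file ending in old_ext within directory to end in new_ext."""
    renamed = []
    # sorted() builds the full list first, so renaming cannot disturb iteration.
    for path in sorted(Path(directory).glob(f"*{old_ext}")):
        target = path.with_suffix(new_ext)
        path.rename(target)
        renamed.append(target.name)
    return renamed

# Demo in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.raw", "b.raw", "notes.txt"):
        (Path(tmp) / name).touch()
    print(change_extension(tmp, ".raw", ".au"))
    demo_listing = sorted(p.name for p in Path(tmp).iterdir())
    print(demo_listing)
```

Like the rename command, this only looks in the one directory and does not descend into subdirectories.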
Speed Reading. To practice speed reading, find a freely
available program called RSVP. It will take any webpage or document
and present it to you, word by word, at the speed you set. Then
increase the speed as you get better. (Feb 08)
PDFs on the Internets. If you use a Mozilla browser, you
can download a plug-in that lets you choose how to view a PDF: in a
new browser or tab, by opening Acrobat on your desktop, or by just
downloading it to your computer without opening. Pretty handy. Find
it on the Mozilla or Firefox plug-ins page. (Feb 08)
Adobe Plug-In 8. The newest version of Adobe Acrobat, at
least for Linux, begins to display long documents incrementally as it
downloads, rather than waiting for the download to complete. (Jan 08)
Machine Learning Bootcamp. At http://videolectures.net/bootcamp07_vilanova
there are various video lectures with synchronized slides that some people might be interested in. The main topics covered
are
- Basic Math and TCS for Machine Learning
- Useful existing software for Machine Learning
- Introduction to Machine Learning
- Theoretical frameworks and foundations
- Experimental Machine Learning
- Feature extraction and model selection
- Graphical models
- Kernel methods and linear predictors
- Clustering
- General view of application areas
- Machine learning in vision
- Machine learning in user interfaces
- Machine learning for data mining
(Jan 08)
Google N-gram search. First off, the Google English n-gram
data is available to those with access to the ling dep't server. Find
it at /home/corpora/EN/WebIT. Some software is also available
that searches these n-grams efficiently on the web. I lost that
reference, but will update when it's found. (Dec 07)
Penn Discourse Treebank. This is also currently available
on the linguistic corpora server. (Dec 07)
Stinkpot. A repository of helpful hints on all kinds of
tools we tend to use to do our work: Emacs, Python, Latex,
Matlab... it's a personal blog of a grad student at MIT who works on
silly things like evolution. His version of the Paul
Davis moment is something you might find helpful. (Dec 07)
Semantic. If you'd like to be able to create your own math
symbols in latex, specifically those with ligatures, try installing
this
package. (Dec 07)
Anti-Word. If you use a linux machine pretty much
exclusively, but get email attachments from people who use Windows
products, then you might be interested in Anti-Word, which will convert
.doc files to plain text. (Nov 07)
MIT Workshop on Syntax. It's not up as of this writing, but
check on mitworld.mit.edu for a
video of their one day workshop titled "Where Does Syntax Come From?
Have We All Been Wrong?", with guest speakers Sandiway Fong, Chris
Manning, and Noam Chomsky, among others. (Nov 07)
Machine Learning Slides. UC Berkeley's RAD Lab has made
slides and videos available on the web from a recent two-day short
course on applied machine learning for its industrial affiliates:
Video, Slides. (Nov 07)
PrimoPdf. You can make PDFs of your MS Office documents for
free with this nifty app. Get it
here. (Oct 07)
TigerSearch. This software for searching through
syntactically annotated corpora is now available on the Mac portion of
the ling machines in Oxley 201. It has a java interface, and allows
you to search for examples of general or specific syntactic
constructions within many corpora. Ask Detmar or Adriane if you need
help. (Oct 07)
sshfs. This unix application allows you to mount a remote
filesystem over SSH. Then it's easier to access your ling files from home. This website has
details. It should be available on most linux installations: try
'apt-get install sshfs'. (May 07)
Subversion. If you missed Scott's presentation on version
control using SVN, or if you'd like to see it again, you can access
his slides via the LCC tutorials
webpage or Scott's webpage (May
07).
Firefox tip. It seems like the new version of Firefox
doesn't let you close all your tabs with one button. In fact, that's
just the default, but you can set it to do as you like if you look
closely at the settings options. (May 07).
SQLite. This is a good database system to use because it is
portable, keeps your data in a single file, works in the user space,
and has good software carpentry, that is, it was built intelligently
so that you can build on top of it. (May 07).
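For a quick taste, Python ships with a `sqlite3` module in its standard library. The sketch below uses an in-memory database and a made-up `tokens` table so that it leaves nothing behind; the table, column names, and data are purely illustrative.

```python
import sqlite3

# ":memory:" keeps the demo self-contained; pass a filename such as
# "lexicon.db" (hypothetical) to keep everything in one file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (word TEXT, tag TEXT)")
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?)",
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"), ("the", "DT")],
)
conn.commit()

# Count how often each word occurs, most frequent first, ties alphabetical.
rows = conn.execute(
    "SELECT word, COUNT(*) FROM tokens GROUP BY word "
    "ORDER BY COUNT(*) DESC, word"
).fetchall()
print(rows)
conn.close()
```

No server process or account setup is involved, which is what "works in the user space" buys you.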
Wikipedia Downloads. It is possible to download all of
Wikipedia, or various portions of it, for use in NLP tasks. The
website can be a bit hard to find, so Adriane found it for us:
Get
Wikipedia here (April 07)
Website Accessibility. In constructing a website, it's
recommended (required at OSU, in fact) to make it accessible to the
disabled. That means to make sure that vision-impaired folks will be
able to get your information by using a screen reader. To make sure
your website is compliant, use a tool like Fangs
to get an idea of what your website "sounds" like. (April 07)
RSS News Feeds. If you wish to work with current news
documents, and are looking for a standard, uniform format in which to
work, RSS is a good choice. To obtain a news article in RSS format,
you can use URLs of the form:
- http://news.google.com/news?q=Ohio+State&output=rss
where "Ohio State" is the search term. To restrict it to specific
news sites, use the "source:" operator, e.g.
- http://news.google.com/news?q=Ohio+State+source:new_york_times&output=rss
Other formats are available. (April 07)
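URLs of the form given in this entry can be built, and the resulting feed read, with nothing but the Python standard library. This is a sketch only: the helper names `news_rss_url` and `item_titles` are made up, the sample feed is hand-written so the example runs offline, and nothing here checks whether the service still answers at that address.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

def news_rss_url(query, source=None):
    """Build a search URL of the form given in the entry above."""
    if source:
        query = f"{query} source:{source}"
    # safe=":" leaves the colon of the source: operator unescaped.
    return "http://news.google.com/news?" + urlencode(
        {"q": query, "output": "rss"}, safe=":"
    )

def item_titles(rss_text):
    """Return the <title> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title") for item in root.iter("item")]

# Hand-written sample feed (not real data), so no network access is needed.
sample = """<rss version="2.0"><channel><title>Demo</title>
<item><title>First story</title></item>
<item><title>Second story</title></item>
</channel></rss>"""

print(news_rss_url("Ohio State"))
print(news_rss_url("Ohio State", source="new_york_times"))
print(item_titles(sample))
```

Because every item in a feed has the same fields, the same few lines work on any source, which is the appeal of RSS as a uniform format.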
Corpora. The BRENT corpus is available at
/home/corpora/EN/childes/Brent. Ask Anton for details on using this
corpus, or the Stephanie corpus.
Pseudocode in Latex. The style file clrscode.sty works well
and produces very pretty pseudocode, same as in the Introduction to
Algorithms book. Get the code and the documentation here.
Arabtex. It's also possible to type in Arabic using Latex.
Correct right-to-left formatting is included in the arabtex package.
It's a biggie, and complicated, so you're best off using the package
that's already installed on bardolph.
Machine Learning Toolkit. YALE: Yet Another Learning
Engine. Available on SourceForge, among other things, it can do word
vector processing. (Mar 07)
Text Editing. The creator of the vim text editor gave a
talk to the Google folks on efficient text editing: how to identify
when you're doing things inefficiently, and how to fix that. Emacs
users can benefit, too. Find the talk at
Google
Video. (Mar 07)
Semantic Annotation. RST Tool, available from wagsoft, is a
pointy-clicky, slightly non-intuitive but easy to install tool for
doing semantic annotation according to the discourse theory of your
choice, especially Rhetorical Structure Theory. Also installed on
/home/compling (Mar 07)
Website User Authentication. If you are building an OSU
website for which you wish to require users to identify and/or
authenticate themselves before accessing the material, you can use the
library's proxy service to accomplish this. Ask Detmar for details.
Carmen Tip. Keep backups. The system can go down, and it
can take you with it. Exporting and importing is relatively simple.
(Feb 07)
Finite State Software. Been using the AT&T Finite State
Toolkit? Looking for a similar product with the option of looking at
the source code? Try the MIT Finite State Toolkit, which is open
source, and has many of the same functionalities as the former.
Google Books. With a Google account, you can use their
service to search through many books. You can't necessarily read them
from cover to cover, but it can be a helpful resource if you need to
search for particular topics within a text. (Feb 07)
CL Olympiad. High school students nationwide are encouraged
to participate in the Computational
Linguistics Olympiad. Students are given traditional linguistic
problems, and problems involving computational thinking and issues
regarding natural language processing. As of Feb 2, the organization
is looking for suggestions for
contest problems. (Feb 07)
ICE. For inter-process communication, collaborating on
projects across universities, etc. This is also called middleware.
Read more about ICE here.
Competing software includes OAA: Open Agent Architecture, and Multiplatform:
Multiple Language / Target Integration Platform for Modules (Jan 07).
Firefox browser. Version 2 supports many standards, including SVG,
and there are nice, free extensions available, such as:
- Webdeveloper (live editing of html, css, etc.)
- Aardvark (modify what's displayed on any webpage, for doing screenshots etc.)
- Greasemonkey: various neat user scripts
- Firebug (Debugger and network traffic profiler)
mechanize. This perl module will fill in form values in html
documents automatically. (Jan 07)
Anonymous Feedback. Teachers might find it useful to allow
their students to send them anonymous feedback. See Detmar's example, and if
you'd like, copy his on your own website. To do that, copy the
entire directory on our department network: ~dm/public_html/feedback .
Don't forget to change all instances of the name and email address!
(Jan 07)
Permanent URLs. A permanent URL will allow your website to
retain a single, simple address, regardless of whether you change
your employment or web-hosting position. purl.org provides a good service for
this. tinyurl.com has a slightly
different service, allowing you to create a very short URL that
links to a website you may have with a long address. For an example
of purl, you can find the OSU ICALL group and its projects
at http://purl.org/net/icall. (Jan 07)
AJAX. Not just a cleaning solution, it can solve your
messy, slow, database-driven web page problems as well. For an
overview, examples, and tutorial of how to use AJAX, see Scott's
slides (Jan 07).
HeVeA. A utility for converting very simple tex files into
webpages. Appropriate for text-heavy, graphics-poor websites like
online syllabi, course descriptions, etc. Already installed on the
Linguistics department computers. (Jan 07)
Prefuse. A Java visualization toolkit. This software can
help you make web-ready graphics of parse trees, etc. Could be useful
for teaching parsing, grammar, syntax, etc. Find it at www.prefuse.org, and similar tools
at graphviz.org. (Jan 07)
SVG. Scalable Vector Graphics are a great idea if you think
your graphics might be seen on a wide variety of monitors - there is
no distortion in size when going from movie screen to cell phone
screen. Use SVG to build representations of xml documents, or any
other node-based structure. See croczilla.com for
examples. (Jan 07)
Version Control. It's a good idea to use Version Control to
keep track of your work. Two options are CVS, which is easy to use
within Emacs, and Subversion, which is newer and has some extra useful
options. Version Control is important if you are working on a large
project on your own, to keep a running log of work you've done and
changes you've made. This applies both to papers you may be writing,
or code you are developing. Version Control is even more important if
you are working on a team project: keep track of everyone's
contributions, avoid duplication of effort and mistaken overwriting of
the team's work. Ask around in the department if you need help
getting started, or keep an eye out for the upcoming tutorial. (Jan
07)
Syncing Your Files. Along with version control, it is a
good idea to keep the many files you may have on the various computers
in your life synced up. You can use programs such as unison
or rsync to help you do this. Keep your home directory at
school and at home looking the same, and avoid reduplicating your own
work, or overwriting your own files. Also helpful if a server goes
down - you have your work, ready to use, elsewhere. (Jan 07)
Picture Naming Database. The International
Picture Naming Project at CRL-UCSD contains a database of
black-and-white drawings along with norms for what names they are
given, in a variety of languages. Also given are norms for things
including naming time. It contains some pictures published in an
earlier set collected by Snodgrass & Vanderwart, which is used in a lot
of studies, so you might want to use those pictures to duplicate prior
results. If any of those pictures are used, the following paper should
be cited (this is their condition of use):
- Snodgrass, J.G., & Vanderwart, M. (1980). JEP: Human Learning and Memory, 6:3, 174-215.
The S&V pictures are black and white; if you use the colored versions,
you need to cite both Snodgrass & Vanderwart, and Rossion & Pourtois,
who modified them to make them in full color:
- Rossion, B. & Pourtois, G. (2001). Revisiting Snodgrass and
Vanderwart's Object database: Color and Texture improve Object
Recognition. 1st Vision Conference, Sarasota, FL.
bibdesk. A point-and-click interface for creating your very
own BibTex file. Reduces typos. Find it on SourceForge, at least for
Mac. (Jan 07)
latex2rtf. Have a latex file and need a Windows document?
Try this resource, which works with fair accuracy. Another option is
to use OpenOffice, from which documents can be directly exported to
pdf, or presentations to Flash or .ppt - but use with caution, fonts
can get messy. (Jan 07)
BibTex Yourself. When you list a citation to one of your own
papers on your website, be sure to put a BibTex entry right next to
it. That way, others won't mis-cite your work. (Jan 07)
Google's BibTex resource. If you use Google Scholar to find
academic articles, change the Preferences to have it provide a BibTex
entry for the various resources it finds. Use with caution - a quick
sample done in our meeting showed some errors - but it's a good
start. (Jan 07)
pdflatex. This is an easy way to embed pdf files within
your own latex files. Find details in this
document. Or, try Googling 'pdfpages'. (October 06)
CCG Parser. A new CCG parser and supertagger is available
from Clark and Curran. You can find the software and related literature at: The CCG site.
(September 06)
yab2web. This facility allows easy publication of bibtex
entries into html, ideal for listing your publication list on your
website. See Donna
Byron's website for an example. (March 06)
Statistics Primer. A good introductory text to basic
statistics can be found at http://faculty.vassar.edu/lowry/webtext.html.
If you follow the link for VassarStats, you will find tools for
calculating various statistics. (Sept 06)
Enron Corpus. Interested in naturally occurring language in
the electronic domain? Search inter-office emails sent by employees
of the Enron Corporation before the company's downfall. Scripts are
available that filter out emails repeated throughout the corpus. On
the linguistics server, see /home/corpora/EN/enron. (Feb 06)
General Language Ontology. There is a recently acquired ontology
on the ling server that may be useful for those who need a basic
semantic representation of general concepts. Read more about the
ontology, what concepts it encodes, and what it may be useful for at
http://research.cyc.com, and find
the resource itself at /home/corpora/EN/cyc. You can also contact
Stacey (s.bailey @ ling) for further information. (Feb 06)
Corpora Search Tool. The tools xkwic and cqp, both found on
the linguistics department computers, are useful for decoding corpora
such as the BNC, and for running complex queries on them. Ask Adriane
for details on how to use these effectively (adriane @ ling). (Jan 06)
Internships. It's high time to start thinking about summer
internships in CL. If you're interested in working someplace like
Microsoft or elsewhere on the West Coast, have a chat with Chris
(cbrew @ ling),
Eric (fosler @ ling), or Donna (dbyron @ ling) for information and
contacts. (Jan 06)
Assessment in Academic Pursuits. When it becomes necessary
to list your achievements in the academic arena, it is useful to have
some information on hand that goes beyond publication titles and
dates. Other data to collect as your publication list lengthens:
- Acceptance rate of papers at each venue/journal. Available in
front matter of conference proceedings, journal issues.
- Your percentage of contribution to a paper. Include actual
research/project work, amount of writing, and creative input when you
make this calculation.
- Citation rate. Consult Google, ISI Database, Citeseer for
information on how often your papers have been cited. These resources
use different metrics for determining citation rates, so you may need
to defend the actual citation rate that you choose to report.
- Relative impact of the venue. Ratings given in ISI database (available
in OSCAR).
Also, let colleagues know what you're doing, what you've published and
where, and make sure your publications get into the hands of people
who you think should read them. Be annoying if necessary. (Jan 06)
Publication Strategy. To get the ball rolling on getting
published, consider taking the advice in Publication,
Publication by Gary King. Main ideas:
- Build on someone else's previous research by making one change.
- Be able to defend the reasons for that change, and the impact it
makes.
- Clearly describe exactly the work that you did.
- Make your data available. (Jan 06)
Idea Solicitation. What kind of project management tools
would you like to see in the Linguistics department? How can we make
group projects more manageable? Bring your suggestions to Clippers,
or email the CL list. (Sept 05)
text2onto. Automatically extracts a candidate concept
hierarchy and instances from a corpus of plain text. Not fully
functional, but possibly handy for small projects, or getting started
with ontologies. See the website for more
details. (Oct 05)
IPA in HTML. The two following websites will get you
started in publishing pages on the web with IPA fonts included:
Web Download Tool. The Unix tool 'wget' will download each
page and all included content from a specified website. This can be
helpful, if, for instance, a corpus is available for download only as
a large series of small files. The tool will drill down through all
links from the specified start page with the '-r' option. Example:
$> wget -r http://www.ling.ohio-state.edu
will download everything from the linguistics website (not recommended).
Language Generator. This website includes Perl code that
will randomly generate 'pointy-hair-boss mission statements', as well as
a link (near the end) to similar random language generators. (Oct 05)
Boolistic. Not just another search engine, this website may
come in handy for those teaching boolean logic: www.boolistic.com. Enter your
search terms, then click on different parts of the Venn diagram to
alter the search query. (Sept 05)
Corpora Mailing List. Sign up here to receive email
regarding new corpora and corpus tools. (Sept 05)
Last modified: 7 October 2010